Jan K.

Berlin, Germany

Experience

Mar 2023 - May 2024
1 year 3 months

Development of a comprehensive data strategy and governance framework for a data management platform on Databricks

Intralogistics

In this “lighthouse” project, I led the development of a robust data strategy and governance framework aimed at optimizing the organization’s data processing capabilities. The core of the project was building a high-performance data management platform on Databricks, complemented by the design and implementation of an efficient data-hub ingest platform.

  • Led the design and establishment of an enterprise-wide data strategy aligned with business goals and technological advances
  • Developed a comprehensive data governance framework to ensure data quality, privacy, and compliance with industry standards
  • Oversaw the deployment and customization of the data management platform on Databricks, enhancing data processing, analytics, and reporting capabilities with Power BI
  • Built a robust data hub with high-performance ingest pipelines based on AWS EventBridge, optimizing data flow from various sources to centralized storage (Data Lakehouse on Azure); a minimal sketch follows the technology list below
  • Collaborated with cross-functional teams to integrate the data management platform into existing IT infrastructure and business processes
  • Conducted training sessions and workshops for new teams, promoting a data-driven culture and improving data literacy across the organization
  • Azure Databricks
  • Databricks Data Catalog
  • AWS EventBridge
  • Kinesis
  • Event Hubs
  • Structured Streaming (Apache Spark)
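
To give a flavor of the EventBridge-based ingest pipelines referenced above, here is a minimal Python sketch. The bus name, event source, and detail type are illustrative assumptions, not the project's actual configuration.

```python
import json
import boto3

events = boto3.client("events")

def forward_to_data_hub(records):
    """Push source records onto a central EventBridge bus (all names are assumptions)."""
    entries = [
        {
            "Source": "warehouse.scanner",   # hypothetical producer
            "DetailType": "ItemScanned",     # hypothetical event type
            "Detail": json.dumps(record),
            "EventBusName": "data-hub",      # assumed custom bus name
        }
        for record in records
    ]
    # put_events accepts at most 10 entries per call
    for i in range(0, len(entries), 10):
        events.put_events(Entries=entries[i : i + 10])
```

EventBridge rules would then route these events onward toward the lakehouse storage targets.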
Jan 2022 - Feb 2023
1 year 2 months

Innovative integration and analysis of logistics data streams with PySpark Structured Streaming and Data Mesh implementation

Logistics

This project focused on the complex integration of logistics data streams with Event Hubs and Kafka using PySpark Structured Streaming. Our approach transformed how logistics data could be captured, processed, and linked in real time through a graph-based model. Using technologies like GraphFrames, Azure Synapse Analytics, Apache Spark, and Power BI, we built a robust system that ensured high data quality, smooth transmission, and compliance with IT governance principles.

  • Integrated the logistics data stream with Kafka using PySpark Structured Streaming (a minimal sketch follows the technology list below)
  • Defined required data structures for the data stream
  • Built a robust and efficient integration of the logistics data stream with Event Hubs
  • Enabled real-time use of logistics data for analysis and further processing
  • Designed and implemented pipelines for capturing, processing, and routing the data stream
  • Performed efficient data processing with PySpark Structured Streaming
  • Configured and initiated the PySpark streaming job
  • Implemented comprehensive testing and monitoring mechanisms
  • Ensured smooth data transfer and high data quality
  • Azure Synapse Analytics
  • Purview Data Catalog
  • Event Hubs
  • GraphFrames
  • Power BI
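
A minimal PySpark Structured Streaming sketch of the Kafka integration named above. The broker address, topic, and event schema are assumptions; Event Hubs also exposes a Kafka-compatible endpoint, which is one way such a stream can be consumed.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("logistics-stream").getOrCreate()

# Assumed shape of a logistics event; the real data structures were defined in the project.
schema = StructType([
    StructField("shipment_id", StringType()),
    StructField("station", StringType()),
    StructField("status", StringType()),
    StructField("event_time", TimestampType()),
])

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder address
    .option("subscribe", "logistics.events")           # placeholder topic
    .option("startingOffsets", "latest")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("evt"))
    .select("evt.*")
)

# Write to console here for illustration; the project routed events onward
# for graph-based linking and analysis.
query = stream.writeStream.outputMode("append").format("console").start()
```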
Sep 2021 - Jan 2022
5 months

Enhanced data processing and integration systems for e-commerce with serverless and distributed Data Mesh architectures

E-Commerce

In this project, my main task was to lead and support various internal e-commerce product teams in developing, implementing, and maintaining powerful data processing and integration systems. The focus was on migrating existing data services and pipelines to a new, improved architecture, emphasizing the development of an event-based system using serverless technologies and big data frameworks.

  • Supported and guided the migration of existing data services, pipelines, and assets to a new and enhanced architecture
  • Developed an event-based system using Lambda functions and PySpark, integrated with Kafka (a minimal sketch follows the technology list below)
  • Handled design and architecture planning
  • Implemented Lambda functions and PySpark jobs
  • Configured and connected to Kafka
  • Built serverless architecture for scalability and availability
  • Processed and analyzed event data in real time
  • Performed PySpark transformations, filters, and aggregations
  • Achieved efficient and reliable Kafka integration
  • Set up security configurations and integration with other components
  • Carried out extensive testing and monitoring mechanisms
  • Delivered a high-performance and scalable event system
  • Extracted valuable insights from event data
  • Enabled data-driven decision making
  • Implemented distributed Data Mesh architectures so different product teams could work with data efficiently
  • Processed data with big data frameworks and database technologies
  • Developed serverless/elastic cloud architecture (AWS)
  • Deployed the architecture following DevOps best practices and Infrastructure as Code (AWS CDK & Terraform)
  • AWS Glue
  • Apache Spark
  • Data Catalog
  • Athena
  • Redshift
  • Lambda
  • ECS
  • Step Functions
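
A minimal sketch of the event-based Lambda side referenced above, assuming a Kafka event source mapping as the trigger; the payload handling is illustrative only.

```python
import base64
import json

def handler(event, context):
    """Hypothetical Lambda consuming records from a Kafka event source mapping."""
    processed = 0
    # The Kafka trigger groups records per topic-partition under "records".
    for records in event.get("records", {}).values():
        for record in records:
            payload = json.loads(base64.b64decode(record["value"]))
            # ... filter/enrich the event before handing it to downstream consumers ...
            processed += 1
    return {"processed": processed}
```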
Apr 2020 - Sep 2021
1 year 6 months

Migration and enhancement of the e-commerce data platform to an AWS Data Lakehouse architecture

E-Commerce

This project involved the strategic development and migration of existing analytics data pipelines into a Data Lakehouse architecture using AWS services. A key focus was improving the big data lake environment and ensuring strict data quality and compliance standards, especially regarding GDPR.

  • Advanced the big data lake environment in AWS
  • Designed and implemented a Data Lakehouse
  • Enabled exploratory analysis and algorithm development by provisioning and preparing data (AWS Glue, Spark, Lambda)
  • Managed data ingestion
  • Developed data pipelines and ETL jobs to deliver ready-to-consume data sources (AWS Glue, AWS Redshift, Spark, PySpark); a skeleton sketch follows this list
  • Performed regression testing and quality checks on data paths and the data lake
  • Orchestrated and connected data sources
  • Automated deployments using DevOps best practices (AWS CodeBuild & CodePipeline, GitHub Actions)
  • Built infrastructure with IaC (AWS CDK)
  • Maintained the system (including cost monitoring)
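
A skeleton of a Glue ETL job of the kind referenced above, publishing a ready-to-consume dataset. The database, table, field names, and S3 path are placeholders, not the project's actual names.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a raw table from the Glue Data Catalog (database/table names are assumptions).
raw = glue_context.create_dynamic_frame.from_catalog(database="raw", table_name="orders")

# Light cleansing before publishing a curated dataset.
cleaned = raw.drop_fields(["_corrupt_record"]).resolveChoice(
    specs=[("order_id", "cast:string")]
)

glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://lakehouse/curated/orders/"},  # placeholder path
    format="parquet",
)
job.commit()
```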
Feb 2019 - Apr 2020
1 year 3 months

Development of an architecture and implementation of a big data environment for enterprise-wide standardized platform services

Transport & Logistics

This project covered the development and implementation of a standardized big data architecture for enterprise-wide platform services in the transport and logistics sector using various Azure services. My role was crucial in ensuring data transparency and data quality, establishing DataOps, complying with data regulations, and implementing agile methodologies.

  • Developed, presented, and discussed Azure solutions and automation projects
  • Key Azure services: Azure Data Catalog, Azure Synapse Analytics, Azure Data Factory, Azure Databricks
  • Automated infrastructure setup with Infrastructure as Code (Terraform) and Ansible
  • Tooling and methods: Scrum, JIRA, GitLab, Docker
  • Implemented real-time data transfer with Apache Kafka (a minimal sketch follows the technology list below)
  • Advised on Azure platform strategy and reference architectures
  • Created mechanisms and automation to proactively eliminate weaknesses in Azure and Kubernetes components based on standardized clusters (security by default)
  • Conceptually advanced the platform’s architectural and technological roadmap for container orchestration with Kubernetes and CI/CD
  • Designed user and permission concepts according to corporate guidelines
  • Managed operations of the offered services
  • Worked in an agile team
  • Azure Data Catalog (Purview)
  • Azure Synapse Analytics Workspace
  • Azure Data Factory
  • Azure Databricks
  • Terraform
  • GitLab Runner
  • Azure DevOps
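
A minimal sketch of real-time data transfer with Kafka, as referenced above, using the confluent-kafka client; the broker address and topic are placeholders.

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker:9092"})  # placeholder address

def delivery_report(err, msg):
    """Log failed deliveries so data loss is visible to monitoring."""
    if err is not None:
        print(f"delivery failed: {err}")

def publish(topic, record):
    producer.produce(topic, json.dumps(record).encode("utf-8"), callback=delivery_report)
    producer.poll(0)  # serve delivery callbacks

publish("platform.telemetry", {"service": "ingest", "status": "ok"})  # illustrative event
producer.flush()
```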
Sep 2018 - Feb 2019
6 months

AWS infrastructure consulting and implementation for global process operations in transport and logistics

Transport & Logistics

This project involved consulting and hands-on implementation of an AWS infrastructure to support a process operations team responsible for multiple international transport and logistics applications. My role was key in identifying and implementing optimizations, developing and maintaining critical system infrastructure, and providing extensive support and training to internal teams.

  • Provisioned and managed servers, OS environments, and database systems in AWS
  • Identified optimization potential from both commercial and technical perspectives
  • Developed and presented optimized processes
  • Implemented optimizations (AWS Lambda with boto3); a hypothetical sketch follows this list
  • Acted independently and presented findings to the team and project stakeholders
  • Administered and maintained the provided systems
  • Developed maintenance and monitoring concepts for these systems
  • Supported and advised development projects on using, configuring, and optimizing the provided systems
  • Consulted on architectures and operational concepts using AWS Cloud infrastructures
  • Trained internal staff on new AWS services and updated workflows
  • Migrated applications for a business unit (Transport & Logistics), including setting up AD
  • Provided AWS infrastructure: SQL databases & EC2 instances, plus Lambda services
  • Deployed via Terraform
  • Planned and executed the application migration
  • Rolled out permissions
  • Provisioned infrastructure with AWS CloudFormation
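
As a hypothetical example of the boto3-based optimizations referenced above: a Lambda that stops EC2 instances outside office hours. The tagging scheme and the specific optimization are assumptions for illustration, not the project's actual logic.

```python
import boto3

ec2 = boto3.client("ec2")

def handler(event, context):
    """Hypothetical cost optimization: stop instances tagged for off-hours shutdown."""
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Schedule", "Values": ["office-hours"]},  # assumed tag
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        inst["InstanceId"]
        for reservation in response["Reservations"]
        for inst in reservation["Instances"]
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return {"stopped": instance_ids}
```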

Setup of and consulting on a marketplace platform based on Microsoft Azure services

  • Integrated payment providers
  • Planned and architected Microsoft Azure services
  • Consulted on and implemented back end and front end
  • Built user management with Active Directory
  • Created a tool for uploading very large files directly from the web browser (a sketch follows this list)
  • Carried out security engineering
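
One common pattern behind such a browser upload tool, sketched here under assumptions: the back end issues a short-lived SAS URL, and the browser then uploads blocks directly to Azure Blob Storage. The account, container, and helper names are illustrative.

```python
from datetime import datetime, timedelta, timezone
from azure.storage.blob import BlobSasPermissions, generate_blob_sas

def browser_upload_url(account, key, container, blob_name):
    """Issue a short-lived SAS URL so the browser can upload blocks directly to Blob Storage."""
    sas = generate_blob_sas(
        account_name=account,
        container_name=container,
        blob_name=blob_name,
        account_key=key,
        permission=BlobSasPermissions(create=True, write=True),
        expiry=datetime.now(timezone.utc) + timedelta(hours=1),
    )
    return f"https://{account}.blob.core.windows.net/{container}/{blob_name}?{sas}"
```

Keeping the upload path out of the back end avoids proxying multi-gigabyte payloads through the application servers.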

Implementation of process automation: building an automatic ticket generator based on vulnerability scans

  • Designed and planned cost-effective infrastructure components
  • Implemented Python logic in AWS Lambda
  • Provisioned infrastructure with AWS CloudFormation using YAML templates
  • Optimized code
  • Sent notification emails via AWS SNS
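
A minimal sketch of the SNS notification step; the topic ARN is a placeholder.

```python
import boto3

sns = boto3.client("sns")

def notify(ticket_id, summary):
    """Send a notification mail once a ticket has been generated (topic ARN is a placeholder)."""
    sns.publish(
        TopicArn="arn:aws:sns:eu-central-1:123456789012:vulnerability-tickets",
        Subject=f"New ticket {ticket_id}",
        Message=summary,
    )
```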

Implementation and technical project support of a web application for managing the certification process

Renowned Automotive Manufacturer

  • Provisioned infrastructure on AWS (MySQL server)
  • Front-end development with React.js
  • Back-end implementation with Java EE / GlassFish
  • Refactored and optimized code
  • Agile work using Scrum with Jira
  • CI/CD with Jenkins
  • Documentation in Confluence

Summary

  • Big Data specialist with a focus on Big Data cloud architecture and data management platforms
  • Specialist in Big Data platforms on Amazon Web Services & Microsoft Azure
  • ETL processes/pipelines & data engineering
  • Architecture of data management platforms in large enterprises
  • Building data lakes & data lakehouses
  • Application migrations using cloud services
  • Consulting & implementation of automation concepts, especially DataOps & DevOps
  • Integration of Active Directory security concepts and compliance requirements
  • Python, SQL, TypeScript, Golang
  • Big Data cloud architectures (AWS & Microsoft Azure)
  • Data Engineering (Databricks, Synapse Analytics, Fabric, Apache Spark, AWS Glue, Athena, Redshift & EMR)
  • Infrastructure as Code (Terraform, Pulumi, AWS CDK, ARM)

Languages

German: Native
English: Advanced
Polish: Advanced

Certifications & licenses

AWS Business Professional (AWS)
AWS Certified Cloud Practitioner (AWS)
AWS Certified Machine Learning – Specialty (AWS)
AWS Certified Solutions Architect – Associate (AWS)
AWS Technical Professional (AWS)
AZ-300: Microsoft Azure Architect Technologies (Microsoft)
AZ-301: Microsoft Azure Architect Design (Microsoft)
Azure Solutions Architect Expert (Microsoft)
Databricks Certified Associate Developer for Apache Spark 3.0 (Databricks)
HashiCorp Certified: Terraform Associate (HashiCorp)
