Jan K.


Berlin, Germany

Experience

Mar 2023 - May 2024
1 year 3 months

Development of a comprehensive data strategy and governance framework for a data management platform on Databricks

Intralogistics

In this "lighthouse" project, I led the development of a robust data strategy and governance framework aimed at optimizing and enhancing the organization's data processing capabilities. The core of the project was to build a high-performance data management platform on Databricks, complemented by the design and implementation of an efficient data hub ingest platform.

  • Leading the design and establishment of an enterprise-wide data strategy aligned with business goals and technological advances
  • Developing a comprehensive data governance framework to ensure data quality, privacy, and compliance with industry standards
  • Overseeing the deployment and customization of the data management platform on Databricks, improving data processing, analytics, and reporting capabilities with Power BI
  • Building a robust data hub with high-performance ingest pipelines based on AWS EventBridge, optimizing data flow from various sources to centralized storage systems (Data Lakehouse on Azure); see the sketch after this list
  • Collaborating with cross-functional teams to integrate the data management platform into existing IT infrastructure and business processes
  • Conducting training and workshops for new teams, promoting a data-driven culture and enhancing data literacy across the organization
  • Azure Databricks
  • Databricks Data Catalog
  • AWS EventBridge
  • Amazon Kinesis
  • Structured Streaming (Apache Spark)
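
A minimal sketch of what such an ingest pipeline can look like in PySpark Structured Streaming on Databricks. The broker address, topic, schema, and table names are illustrative placeholders, and a Kafka-compatible source stands in for the EventBridge-fed feed described above:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("data-hub-ingest").getOrCreate()

    # Illustrative event schema; the real payloads were project-specific.
    event_schema = StructType([
        StructField("event_id", StringType()),
        StructField("source", StringType()),
        StructField("payload", StringType()),
        StructField("event_time", TimestampType()),
    ])

    # Read the raw event stream (the Kafka source is built into Databricks).
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
           .option("subscribe", "datahub-events")             # placeholder topic
           .load())

    # Parse the JSON payload and append it to a Delta table in the Lakehouse.
    parsed = (raw
              .select(from_json(col("value").cast("string"), event_schema).alias("e"))
              .select("e.*"))

    (parsed.writeStream
           .format("delta")
           .option("checkpointLocation", "/mnt/lake/_checkpoints/datahub-events")
           .outputMode("append")
           .toTable("bronze.datahub_events"))
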
Jan 2022 - Feb 2023
1 year 2 months

Innovative integration and analysis of logistics data streams with PySpark Structured Streaming and Data Mesh implementation

Logistics

This project focused on the challenging integration of logistics data streams with Event Hubs and Kafka using PySpark Structured Streaming. Our approach revolutionized how logistics data could be captured, processed, and linked in real time using a graph-based approach (a small GraphFrames sketch follows the list below). By leveraging technologies like GraphFrames, Azure Synapse Analytics, Apache Spark, and Power BI, we established a robust system that not only ensured high data quality and seamless transmission but also met IT governance principles.

  • Integrating a logistics data stream with Kafka using PySpark Structured Streaming
  • Defining the necessary data structures for the data stream
  • Building a robust and efficient integration of the logistics data stream with Event Hubs
  • Using logistics data in real time for analysis and further processing
  • Designing and implementing pipelines to capture, process, and forward the data stream
  • Efficient data processing with PySpark Structured Streaming
  • Configuring and initializing the PySpark streaming job
  • Implementing comprehensive testing and monitoring mechanisms
  • Ensuring smooth data transmission and high data quality
  • Azure Synapse Analytics
  • Purview Data Catalog
  • Azure Event Hubs
  • GraphFrames
  • Power BI
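
As a rough illustration of the graph-based linking, the following sketch builds a GraphFrame from logistics locations and shipment movements and uses motif finding to trace transport chains. The data and node types are invented for the example, and it requires the graphframes Spark package:

    from graphframes import GraphFrame
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("logistics-graph").getOrCreate()

    # Illustrative vertices (locations) and edges (shipment movements).
    vertices = spark.createDataFrame(
        [("HUB-1", "hub"), ("HUB-2", "hub"), ("DC-1", "distribution_center")],
        ["id", "type"])
    edges = spark.createDataFrame(
        [("HUB-1", "HUB-2", "shipment"), ("HUB-2", "DC-1", "shipment")],
        ["src", "dst", "relationship"])

    g = GraphFrame(vertices, edges)

    # Motif finding: all two-leg transport chains a -> b -> c.
    chains = g.find("(a)-[e1]->(b); (b)-[e2]->(c)")
    chains.show()
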
Sep 2021 - Jan 2022
5 months

Enhanced data processing and integration systems for e-commerce with serverless and distributed Data Mesh architectures

E-Commerce

In this project, my main role was to lead and support various internal e-commerce product teams in developing, implementing, and maintaining powerful data processing and integration systems. The focus was on migrating existing data services and pipelines to a new, improved architecture, emphasizing the development of an event-based system using serverless technologies and big data frameworks.

  • Supporting and guiding the migration of existing data services, pipelines, and assets to a new and enhanced architecture
  • Developing an event-based system (see the handler sketch after this list)
  • Using Lambda functions and PySpark
  • Integrating with Kafka
  • Design and architecture planning
  • Implementing Lambda functions and PySpark jobs
  • Configuring and connecting to Kafka
  • Serverless architecture for scalability and availability
  • Processing and analyzing event data in real time
  • PySpark transformations, filtering, and aggregations
  • Efficient and reliable Kafka integration
  • Configuration, security settings, and integration with other components
  • Extensive testing and monitoring mechanisms
  • High-performance and scalable event system
  • Extracting valuable insights from event data
  • Data-driven decision making
  • AWS Glue
  • Apache Spark
  • Data Catalog
  • Athena
  • Redshift
  • Lambda
  • ECS
  • Step Functions
  • Implementing distributed Data Mesh architectures so that different product teams can work with data efficiently
  • Data processing with big data frameworks and database technologies
  • Developing serverless/elastic cloud architecture (AWS)
  • Deploying the architecture following DevOps best practices and Infrastructure as Code (AWS CDK & Terraform)
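
The handler sketch below shows the shape of such an event-based Lambda consuming Kafka records via an MSK event source mapping. The payload fields and routing logic are illustrative assumptions, not the project's actual code:

    import base64
    import json

    def handler(event, context):
        """Process a batch of Kafka records delivered by an aws:kafka event source."""
        processed = 0
        for topic_partition, records in event.get("records", {}).items():
            for record in records:
                # Record values arrive base64-encoded.
                payload = json.loads(base64.b64decode(record["value"]))
                # Illustrative routing step; the real logic covered
                # filtering, enrichment, and forwarding.
                if payload.get("event_type") == "order_placed":
                    print(f"order {payload.get('order_id')} from {topic_partition}")
                processed += 1
        return {"processed": processed}
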
Apr 2020 - Sep 2021
1 year 6 months

Migration and enhancement of the e-commerce data platform to an AWS Data Lakehouse architecture

E-Commerce

This project involved the strategic development and migration of existing analytics data pipelines into a Data Lakehouse architecture using AWS services. A key aspect was improving the big data lake environment and ensuring strict data quality and compliance standards, especially regarding GDPR.

  • Evolving the big data lake environment in AWS
  • Designing and implementing a Data Lakehouse
  • Exploratory analysis and algorithm development through data provisioning and preparation (AWS Glue, Spark, Lambda)
  • Data ingestion
  • Developing data pipelines and ETL jobs to provide ready-to-use data sources (AWS Glue, AWS Redshift, Spark, PySpark); a job skeleton follows this list
  • Regression testing and quality checks in data flows and the data lake
  • Orchestrating and connecting data sources
  • Implementing automated deployments using DevOps best practices (AWS CodeBuild & CodePipeline, GitHub Actions)
  • Building the infrastructure using IaC (AWS CDK)
  • System maintenance (including cost monitoring)
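
A skeleton of such a Glue ETL job in PySpark, with placeholder database, table, and bucket names; the null and duplicate checks stand in for the project's far more extensive quality and GDPR rules:

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Standard Glue job initialization.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a raw source registered in the Glue Data Catalog (placeholder names).
    orders = glue_context.create_dynamic_frame.from_catalog(
        database="raw", table_name="orders").toDF()

    # Simplified quality gate: drop records without a customer reference
    # and deduplicate on the business key.
    clean = (orders
             .filter(orders["customer_id"].isNotNull())
             .dropDuplicates(["order_id"]))

    # Publish curated data for downstream consumers (e.g. Athena, Redshift).
    clean.write.mode("overwrite").parquet("s3://example-lake/curated/orders/")

    job.commit()
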
Feb 2019 - Apr 2020
1 year 3 months

Development of an architecture and implementation of a big data environment for company-wide standardized platform services

Transport & Logistics

This project included the development and implementation of a standardized big data architecture for company-wide platform services in the transport and logistics sector using various Azure services. My role was central to ensuring data transparency, data quality, DataOps, compliance with data regulations, and the adoption of agile methodologies.

  • Developing solutions and automation projects in Azure and presenting them for discussion
  • Azure services: Azure Data Catalog, Azure Synapse Analytics, Azure Data Factory, Azure Databricks
  • Automated infrastructure provisioning with Infrastructure as Code (Terraform) and Ansible
  • Scrum, JIRA, GitLab, Docker
  • Implementing real-time data streaming with Apache Kafka (see the consumer sketch after this list)
  • Advising on Azure platform strategy regarding reference architectures
  • Developing mechanisms and automations for proactive remediation of vulnerabilities in Azure and Kubernetes components based on standardized clusters (security by default)
  • Conceptual development of the architectural and technological platform for container orchestration based on Kubernetes, Continuous Integration & Continuous Deployment
  • Creating user and permission concepts in line with corporate policies
  • Operating the provided services
  • Agile team
  • Azure Data Catalog (Purview)
  • Azure Synapse Analytics
  • Azure Data Factory
  • Azure Databricks
  • Terraform
  • GitLab Runner
  • Azure DevOps
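
A minimal consumer sketch for the real-time Kafka streaming mentioned above, using the kafka-python client; broker address, topic, and group id are placeholders:

    import json

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "telemetry",                        # placeholder topic
        bootstrap_servers=["broker:9092"],  # placeholder broker
        group_id="platform-services",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        auto_offset_reset="earliest",
    )

    for message in consumer:
        # Real processing was project-specific (validation, routing,
        # persistence); here we only print the envelope and payload.
        print(message.topic, message.partition, message.offset, message.value)
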
Sep 2018 - Feb 2019
6 months

AWS infrastructure consulting and implementation for global process operations in transport and logistics

Transport & Logistics

This project involved consulting and hands-on implementation of an AWS infrastructure to support a process operations team responsible for several international applications in the transport and logistics sector. My role was key in identifying and implementing optimizations, developing and maintaining the critical system infrastructure, and providing comprehensive support and training to internal teams.

  • Provisioning and operating servers, operating system environments, and database systems on AWS
  • Identifying optimization opportunities from both business and technical perspectives
  • Developing and presenting optimized processes
  • Implementing optimizations with AWS Lambda and boto3 (see the sketch after this list)
  • Acting autonomously and presenting the results to the team and project managers/clients
  • Administering and maintaining the provided systems
  • Creating maintenance and monitoring concepts for these systems
  • Supporting and advising development projects on the use, configuration, and optimization of the provided systems
  • Consulting on architectures and operating concepts using AWS cloud infrastructures
  • Training internal staff on new AWS services and changed workflows
  • Application migration for a business unit (Transport & Logistics), including setting up Active Directory
  • Providing AWS infrastructure: databases (SQL) & EC2 instances, as well as Lambda services
  • Deploying using Terraform
  • Planning and executing the application migration
  • Rolling out permissions
  • Provisioning infrastructure with AWS CloudFormation
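
One example of the kind of boto3-based automation referenced above, as a hedged sketch: a Lambda that stops running development instances outside business hours. The tag convention is an assumption for the example:

    import boto3

    ec2 = boto3.client("ec2")

    def handler(event, context):
        """Stop running instances tagged environment=dev (illustrative rule)."""
        response = ec2.describe_instances(
            Filters=[
                {"Name": "tag:environment", "Values": ["dev"]},
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )
        instance_ids = [
            instance["InstanceId"]
            for reservation in response["Reservations"]
            for instance in reservation["Instances"]
        ]
        if instance_ids:
            ec2.stop_instances(InstanceIds=instance_ids)
        return {"stopped": instance_ids}
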

Setup and consulting for a marketplace platform based on Microsoft Azure services

  • Integrating payment providers
  • Planning and architecting Microsoft Azure services
  • Implementing/advising on back end and front end
  • Creating user management using Active Directory
  • Implementing a tool for uploading very large files directly from the web browser (see the SAS sketch after this list)
  • Security engineering
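
One common pattern for browser uploads of very large files is to have the back end issue a short-lived, write-only SAS URL so the client can upload in blocks directly to Blob Storage. The sketch below shows that pattern with the azure-storage-blob SDK; account, container, and lifetime values are illustrative:

    from datetime import datetime, timedelta, timezone

    from azure.storage.blob import BlobSasPermissions, generate_blob_sas

    def make_upload_url(account_name: str, account_key: str,
                        container: str, blob_name: str) -> str:
        """Return a one-hour, write-only SAS URL for a direct browser upload."""
        sas = generate_blob_sas(
            account_name=account_name,
            container_name=container,
            blob_name=blob_name,
            account_key=account_key,
            permission=BlobSasPermissions(create=True, write=True),
            expiry=datetime.now(timezone.utc) + timedelta(hours=1),
        )
        return (f"https://{account_name}.blob.core.windows.net/"
                f"{container}/{blob_name}?{sas}")
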

Process automation implementation: building an automatic ticket generator based on vulnerability scans

  • Designing and planning the most cost-efficient infrastructure components
  • Implementing Python logic in AWS Lambda (see the sketch after this list)
  • Provisioning infrastructure with AWS CloudFormation using YAML templates
  • Code optimization
  • Sending notification emails via AWS SNS
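
A condensed sketch of the ticket generator's Lambda logic: it filters scan findings by severity and publishes one SNS notification per ticket. The finding format, severity threshold, and topic ARN are illustrative assumptions:

    import json

    import boto3

    sns = boto3.client("sns")

    # Placeholder ARN; the real topic lived in the project account.
    TOPIC_ARN = "arn:aws:sns:eu-central-1:123456789012:vulnerability-tickets"

    def handler(event, context):
        """Create ticket notifications for high and critical findings."""
        tickets = 0
        for finding in event.get("findings", []):
            if finding.get("severity") in ("high", "critical"):
                subject = f"[{finding['severity'].upper()}] {finding.get('title', 'finding')}"
                sns.publish(
                    TopicArn=TOPIC_ARN,
                    Subject=subject[:100],  # SNS limits subjects to 100 characters
                    Message=json.dumps(finding, indent=2),
                )
                tickets += 1
        return {"tickets_created": tickets}
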

Implementation and technical project support of a web application for managing the certification process

Well-known automotive manufacturer

  • Infrastructure provisioning on AWS (MySQL Server)
  • Front end development in React.js
  • Back end implementation with Java EE | GlassFish
  • Refactoring and code optimization
  • Agile methodology using Scrum with Jira
  • CI/CD with Jenkins
  • Documentation in Confluence

Summary

  • Big Data specialist with a focus on Big Data cloud architecture and data management platforms
  • Specialist in Big Data platforms on Amazon Web Services & Microsoft Azure
  • ETL processes/pipelines & data engineering
  • Architecture of data management platforms in large enterprises
  • Building data lakes & Data Lakehouses
  • Application migrations using cloud services
  • Consulting & implementation of automation concepts, especially DataOps & DevOps
  • Integration of Active Directory security concepts and compliance requirements
  • Python, SQL, TypeScript, Golang
  • Big Data cloud architectures (AWS & Microsoft Azure)
  • Data Engineering (Databricks, Synapse Analytics, Fabric, Apache Spark, AWS Glue, Athena, Redshift & EMR)
  • Infrastructure as Code (Terraform, Pulumi, AWS CDK, ARM)

Languages

German
Native
English
Advanced
Polish
Advanced

Certifications & licenses

AWS Business Professional

AWS

AWS Certified Cloud Practitioner

AWS

AWS Certified Machine Learning – Specialty

AWS

AWS Certified Solutions Architect – Associate

AWS

AWS Technical Professional

AWS

AZ-300: Microsoft Azure Architect Technologies

Microsoft

AZ-301: Microsoft Azure Architect Design

Microsoft

Azure Solutions Architect Expert

Microsoft

Databricks Certified Associate Developer for Apache Spark 3.0

Databricks

HashiCorp Certified: Terraform Associate

HashiCorp
