Jan Krol

Development of a comprehensive data strategy and governance framework for a data management platform on Databricks

Berlin, Germany

Experience

Mar 2023 - May 2024
1 year 3 months

Development of a comprehensive data strategy and governance framework for a data management platform on Databricks

Intralogistics

In this “lighthouse” project, I led the development of a robust data strategy and governance framework to optimize and enhance the organization’s data processing capabilities. The core of the project was building a high-performance data management platform on Databricks, complemented by designing and implementing an efficient data hub ingest platform.

  • Led the design and establishment of an enterprise-wide data strategy aligned with business objectives and technological advancements
  • Developed a comprehensive data governance framework to ensure data quality, privacy, and compliance with industry standards
  • Oversaw deployment and tuning of the Databricks data management platform, improving data processing, analytics and reporting capabilities with Power BI
  • Built a robust data hub with high-performance ingest pipelines based on AWS EventBridge, optimizing data flow from multiple sources to centralized storage systems (Data Lakehouse on Azure)
  • Collaborated with cross-functional teams to integrate the data management platform into existing IT infrastructure and business processes
  • Conducted training and workshops for new teams, fostering a data-driven culture and enhancing data literacy across the organization
  • Azure Databricks
  • Databricks Data Catalog
  • AWS EventBridge
  • Kinesis Event Hub
  • Structured Streaming (Apache Spark)
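An ingest pipeline of the kind described above might publish source records to EventBridge roughly as follows. This is a minimal sketch: the bus name, event source, and detail type are illustrative assumptions, not taken from the project.

```python
# Hypothetical sketch of an EventBridge-based ingest step; the bus name,
# source prefix, and detail type are illustrative assumptions.
import json
from datetime import datetime, timezone

def build_ingest_entry(record: dict, source_system: str) -> dict:
    """Wrap a source record in an EventBridge PutEvents entry."""
    return {
        "Source": f"datahub.{source_system}",   # assumed naming scheme
        "DetailType": "ingest.record",          # assumed detail type
        "Detail": json.dumps(record),
        "EventBusName": "data-hub-ingest",      # assumed bus name
        "Time": datetime.now(timezone.utc),
    }

def publish(records: list[dict], source_system: str) -> None:
    """Send records to EventBridge in batches of 10 (the PutEvents limit)."""
    import boto3  # imported lazily so the module loads without AWS credentials
    client = boto3.client("events")
    entries = [build_ingest_entry(r, source_system) for r in records]
    for i in range(0, len(entries), 10):
        client.put_events(Entries=entries[i:i + 10])
```

Batching at 10 entries matches the PutEvents per-call limit; downstream rules on the bus would then route each detail type to the lakehouse landing zone.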
Jan 2021 - Feb 2023
2 years 2 months

Innovative integration and analysis of logistics data streams with PySpark Structured Streaming and Data Mesh implementation

Logistics

This project focused on the complex integration of logistics data streams with Event Hub and Kafka using PySpark Structured Streaming. Our approach transformed how logistics data could be captured, processed, and linked in real time through a graph-based method. By leveraging technologies such as GraphFrame, Azure Synapse Analytics, Apache Spark, and Power BI, we established a robust system that ensured high data quality and seamless transmission while adhering to IT governance principles.

  • Integrated a logistics data stream with Kafka using PySpark Structured Streaming
  • Defined the required data structures for the data stream
  • Built a robust, efficient integration of the logistics data stream with Event Hubs
  • Enabled real-time use of logistics data for analysis and further processing
  • Designed and implemented pipelines to collect, process, and route the data stream
  • Processed data efficiently with PySpark Structured Streaming
  • Configured and initialized the PySpark streaming job
  • Implemented comprehensive testing and monitoring mechanisms
  • Ensured smooth data transmission and high data quality
  • Azure Synapse Analytics
  • Purview Data Catalog
  • Event Hub
  • GraphFrame
  • Power BI
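The graph-based linking of scan events described above could be sketched roughly as follows. The field names (`shipment_id`, `location`, `ts`), the Kafka topic, the broker address, and the sink paths are assumptions for illustration only.

```python
# Hypothetical sketch of graph-based linking of logistics scan events.
# Field names, topic, broker address, and paths are illustrative assumptions.
def scan_edges(events: list[dict]) -> list[tuple[str, str]]:
    """Turn a shipment's scan events into (from_location, to_location)
    edges ordered by timestamp, suitable as a GraphFrame edge list."""
    ordered = sorted(events, key=lambda e: e["ts"])
    return [(a["location"], b["location"]) for a, b in zip(ordered, ordered[1:])]

def run_stream():
    """Read the logistics stream with PySpark Structured Streaming
    (imported lazily so the module loads without a Spark installation)."""
    from pyspark.sql import SparkSession, functions as F
    spark = SparkSession.builder.appName("logistics-stream").getOrCreate()
    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
           .option("subscribe", "logistics.scans")            # assumed topic
           .load())
    scans = (raw.select(
        F.from_json(F.col("value").cast("string"),
                    "shipment_id STRING, location STRING, ts TIMESTAMP")
         .alias("scan"))
        .select("scan.*"))
    # Assumed Delta Lake sink; checkpointing makes the stream restartable.
    scans.writeStream.format("delta").option(
        "checkpointLocation", "/chk/scans").start("/lake/scans")
```

The pure `scan_edges` helper shows the linking idea in isolation; in the streaming job the same edge construction would run per shipment over the parsed scan records.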
Sep 2021 - Jan 2022
5 months

Enhanced data processing and integration systems for e-commerce with serverless and distributed Data Mesh architectures

E-Commerce

In this project, my primary responsibility was to lead and support various internal e-commerce product teams in developing, implementing, and maintaining powerful data processing and integration systems. The focus was on migrating existing data services and pipelines to a new, improved architecture, emphasizing the development of an event-based system using serverless technologies and big data frameworks.

  • Guided and supported the migration of existing data services, pipelines, and assets to a new, advanced architecture
  • Designed and planned the architecture of an event-based system
  • Implemented Lambda functions and PySpark jobs
  • Configured and integrated with Kafka
  • Built a serverless architecture for scalability and availability
  • Enabled real-time event data processing and analytics
  • Implemented PySpark transformations, filtering, and aggregations
  • Delivered an efficient and reliable Kafka integration
  • Handled configuration, security settings, and integration with other components
  • Established extensive testing and monitoring mechanisms
  • Delivered a high-performance, scalable event system
  • Derived valuable insights from event data
  • Enabled data-driven decision making
  • AWS Glue
  • Apache Spark
  • Data Catalog
  • Athena
  • Redshift
  • Lambda
  • ECS
  • Step Functions
  • Implemented distributed Data Mesh architectures to enable efficient collaboration across product teams
  • Data processing with big data frameworks and database technologies
  • Developed serverless/elastic cloud architecture (AWS)
  • Deployed architecture following DevOps best practices and Infrastructure-as-Code (AWS CDK & Terraform)
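An event-based Lambda setup like the one above might consume Kafka records roughly like this. The handler sketch assumes the MSK-trigger event shape (base64-encoded record values keyed by topic-partition); the payload field names are illustrative assumptions.

```python
# Hypothetical sketch of a Kafka-triggered Lambda handler (MSK trigger
# event shape); payload field names are illustrative assumptions.
import base64
import json

def handler(event: dict, context=None) -> dict:
    """Decode Kafka records from the trigger payload and count orders
    per status; downstream this could feed a PySpark aggregation job."""
    counts: dict[str, int] = {}
    for records in event.get("records", {}).values():
        for record in records:
            order = json.loads(base64.b64decode(record["value"]))
            status = order.get("status", "unknown")
            counts[status] = counts.get(status, 0) + 1
    return {"statusCode": 200, "counts": counts}
```

Keeping the handler a pure function of the event payload makes it trivially unit-testable and cheap to run at scale, which is the main appeal of the serverless approach described above.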
Apr 2020 - Sep 2021
1 year 6 months

Migration and enhancement of the e-commerce data platform to an AWS Data Lakehouse architecture

E-Commerce

This project involved the strategic development and migration of existing analytics data pipelines into a Data Lakehouse architecture using AWS services. A key aspect was improving the big data lake environment and ensuring strict data quality and compliance standards, particularly regarding GDPR.

  • Evolved the AWS Big Data Lake environment
  • Designed and implemented a Data Lakehouse
  • Exploratory analysis and algorithm development through data provisioning and preparation (AWS Glue, Spark, Lambda)
  • Data ingestion
  • Developed data pipelines and ETL jobs to deliver consumption-ready data sources (AWS Glue, AWS Redshift, Spark, PySpark)
  • Regression testing and quality checks in data pipelines and the data lake
  • Orchestrated and connected data sources
  • Implemented automated deployments using DevOps best practices (AWS CodeBuild + CodePipeline, GitHub Actions)
  • Built infrastructure using IaC (AWS CDK)
  • System administration (including cost monitoring)
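GDPR-compliant pipelines of this kind typically pseudonymize personal fields during ETL. A rough sketch follows; the salted-hash approach, the field list, and the truncation length are assumptions (in practice the salt would come from a secrets manager, not a function argument).

```python
# Hypothetical sketch of GDPR pseudonymization inside an ETL step.
# The salted-hash approach and field list are illustrative assumptions;
# a real pipeline would load the salt from a secrets manager.
import hashlib

PII_FIELDS = ("email", "customer_name")  # assumed set of personal fields

def pseudonymize(row: dict, salt: str) -> dict:
    """Replace PII fields with a salted SHA-256 digest so records stay
    joinable across tables without exposing the raw values."""
    out = dict(row)
    for field in PII_FIELDS:
        if field in out and out[field] is not None:
            digest = hashlib.sha256((salt + str(out[field])).encode()).hexdigest()
            out[field] = digest[:16]  # truncated for readability in the lake
    return out
```

In a Glue/PySpark job the same function could be applied per row (e.g. via a UDF) before records land in the consumption-ready zone of the lakehouse.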
Feb 2019 - Apr 2020
1 year 3 months

Development of architecture and implementation of a Big Data environment for company-wide, standardized platform services

Transport & Logistics

This project covered the development and implementation of a standardized Big Data architecture for company-wide platform services in the transport and logistics sector using various Azure services. My role was crucial in ensuring data transparency, data quality, DataOps, regulatory compliance, and implementing agile methodologies.

  • Developed, presented, and discussed solutions for Azure and automation projects
  • Azure services: Azure Data Catalog, Azure Synapse Analytics, Azure Data Factory, Azure Databricks
  • Automated infrastructure provisioning with Infrastructure as Code (Terraform) and Ansible
  • Scrum, JIRA, GitLab, Docker
  • Implemented real-time data transfer with Apache Kafka
  • Advised on Azure platform strategy regarding reference architectures
  • Developed mechanisms and automations for proactive remediation of Azure and Kubernetes component vulnerabilities based on standardized clusters (Security by default)
  • Conceptual advancement of the architectural and technological platform in container orchestration based on Kubernetes, continuous integration & continuous deployment
  • Created user and permissions concepts following corporate guidelines
  • Operated the offered services
  • Worked in an agile team
  • Azure Data Catalog (Purview)
  • Azure Synapse Analytics Workspace
  • Azure Data Factory
  • Azure Databricks
  • Terraform
  • GitLab Runner
  • Azure DevOps
Sep 2018 - Feb 2019
6 months

AWS infrastructure consulting and implementation for global process operations in the transport and logistics sector

Transport & Logistics

This project involved consulting and hands-on implementation of an AWS infrastructure to support a process operations team responsible for multiple international applications in the transport and logistics sector. My role was pivotal in identifying and implementing optimizations, developing and maintaining critical system infrastructure, and providing comprehensive support and training to internal teams.

  • Provisioned and operated servers, operating system environments, and database systems in AWS
  • Identified optimization opportunities from both business and technical perspectives
  • Developed and presented optimized processes
  • Implemented optimizations (AWS Lambda, boto3)
  • Acted autonomously and represented deliverables within the team and to project managers/stakeholders
  • Administered and maintained the provisioned systems
  • Developed maintenance and monitoring concepts for these systems
  • Supported and advised development projects on using, configuring, and optimizing the provisioned systems
  • Consulted on architectures and operations concepts leveraging AWS cloud infrastructures
  • Trained internal staff on new AWS services adoption and changed workflows
  • Application migration for a business unit (Transport & Logistics), including Active Directory setup
  • Provisioned AWS infrastructure: databases (SQL) & EC2 instances, as well as Lambda services
  • Deployed via Terraform
  • Planned and executed the application migration
  • Rolled out permission models
  • Infrastructure provisioning with AWS CloudFormation

Marketplace platform setup and consulting based on Microsoft Azure services

  • Payment provider integration
  • Planning and architecture of Microsoft Azure services
  • Back end and front end consulting/implementation
  • User management setup using Active Directory
  • Implemented an upload tool for uploading very large files directly from the web browser
  • Security engineering

Process automation implementation: automatic ticket generator based on vulnerability scans

  • Designed and planned the most cost-efficient infrastructure components
  • Implemented Python logic in AWS Lambda
  • Provisioned infrastructure via AWS CloudFormation using YAML templates
  • Code optimization
  • Sent notification emails via AWS SNS

Implementation and technical project support of a web application for managing the certification process

Renowned Automotive Manufacturer

  • Provisioned infrastructure on AWS (MySQL Server)
  • Front end development in React.js
  • Back end implementation with Java EE | GlassFish
  • Refactoring and code optimization
  • Agile methodology using Scrum with Jira
  • CI/CD with Jenkins
  • Documentation in Confluence

Summary

  • Big Data specialist with a focus on Big Data cloud architecture and data management platforms
  • Specialist in Big Data platforms on Amazon Web Services & Microsoft Azure
  • ETL processes/pipelines & data engineering
  • Architecture of data management platforms in large enterprises
  • Building data lakes & data lakehouses
  • Application migrations using cloud services
  • Consulting & implementation of automation concepts, especially DataOps & DevOps
  • Integration of Active Directory security concepts and compliance requirements
  • Python, SQL, TypeScript, Golang
  • Big Data cloud architectures (AWS & Microsoft Azure)
  • Data engineering (Databricks, Synapse Analytics, Fabric, Apache Spark, AWS Glue, Athena, Redshift & EMR)
  • Infrastructure as Code (Terraform, Pulumi, AWS CDK, ARM)

Languages

German
Native
English
Advanced
Polish
Advanced

Certifications & licenses

AWS Business Professional

AWS

AWS Certified Cloud Practitioner

AWS

AWS Certified Machine Learning – Specialty

AWS

AWS Certified Solutions Architect – Associate

AWS

AWS Technical Professional

AWS

AZ-300: Microsoft Azure Architect Technologies

Microsoft

AZ-301: Microsoft Azure Architect Design

Microsoft

Azure Solutions Architect Expert

Microsoft

Databricks Certified Associate Developer for Apache Spark 3.0

Databricks

HashiCorp Certified: Terraform Associate

HashiCorp