Jan Krol

Development of a comprehensive data strategy and governance framework for a data management platform on Databricks

Berlin, Germany

Experience

Mar 2023 - May 2024
1 year 3 months

Development of a comprehensive data strategy and governance framework for a data management platform on Databricks

Intralogistics

In this “lighthouse” project, I led the development of a robust data strategy and governance framework to optimize and enhance the organization’s data processing capabilities. The core of the project was building a high-performance data management platform on Databricks, complemented by designing and implementing an efficient data hub ingest platform.

  • Led the design and establishment of an enterprise-wide data strategy aligned with business objectives and technological advancements
  • Developed a comprehensive data governance framework to ensure data quality, privacy, and compliance with industry standards
  • Oversaw deployment and tuning of the Databricks data management platform, improving data processing, analytics and reporting capabilities with Power BI
  • Built a robust data hub with high-performance ingest pipelines based on AWS EventBridge, optimizing data flow from multiple sources to centralized storage systems (Data Lakehouse on Azure)
  • Collaborated with cross-functional teams to integrate the data management platform into existing IT infrastructure and business processes
  • Conducted training and workshops for new teams, fostering a data-driven culture and enhancing data literacy across the organization
  • Azure Databricks
  • Databricks Data Catalog
  • AWS EventBridge
  • Kinesis Event Hub
  • Structured Streaming (Apache Spark)
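An ingest pipeline of the kind described above might publish source records to EventBridge roughly as follows. This is a minimal sketch: the bus name, event source, and detail type are illustrative assumptions, not taken from the project.

```python
# Hypothetical sketch of an EventBridge-based ingest step; the bus name,
# source prefix, and detail type are illustrative assumptions.
import json
from datetime import datetime, timezone

def build_ingest_entry(record: dict, source_system: str) -> dict:
    """Wrap a source record in an EventBridge PutEvents entry."""
    return {
        "Source": f"datahub.{source_system}",   # assumed naming scheme
        "DetailType": "ingest.record",          # assumed detail type
        "Detail": json.dumps(record),
        "EventBusName": "data-hub-ingest",      # assumed bus name
        "Time": datetime.now(timezone.utc),
    }

def publish(records: list[dict], source_system: str) -> None:
    """Send records to EventBridge in batches of 10 (the PutEvents limit)."""
    import boto3  # imported lazily so the module loads without AWS credentials
    client = boto3.client("events")
    entries = [build_ingest_entry(r, source_system) for r in records]
    for i in range(0, len(entries), 10):
        client.put_events(Entries=entries[i:i + 10])
```

Batching at 10 entries matches the PutEvents per-call limit; downstream rules on the bus would then route each detail type to the lakehouse landing zone.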
Jan 2021 - Feb 2023
2 years 2 months

Innovative integration and analysis of logistics data streams with PySpark Structured Streaming and Data Mesh implementation

Logistics

This project focused on the complex integration of logistics data streams with Event Hub and Kafka using PySpark Structured Streaming. Our approach transformed how logistics data could be captured, processed, and linked in real time through a graph-based method. By leveraging technologies such as GraphFrame, Azure Synapse Analytics, Apache Spark, and Power BI, we established a robust system that ensured high data quality and seamless transmission while adhering to IT governance principles.

  • Integrated a logistics data stream with Kafka using PySpark Structured Streaming
  • Defined the required data structures for the data stream
  • Built a robust, efficient integration of the logistics data stream with Event Hubs
  • Enabled real-time use of logistics data for analysis and further processing
  • Designed and implemented pipelines to collect, process, and route the data stream
  • Processed data efficiently with PySpark Structured Streaming
  • Configured and initialized the PySpark streaming job
  • Implemented comprehensive testing and monitoring mechanisms
  • Ensured smooth data transmission and high data quality
  • Azure Synapse Analytics
  • Purview Data Catalog
  • Event Hub
  • GraphFrame
  • Power BI
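The graph-based linking of scan events described above could be sketched roughly as follows. The field names (`shipment_id`, `location`, `ts`), the Kafka topic, the broker address, and the sink paths are assumptions for illustration only.

```python
# Hypothetical sketch of graph-based linking of logistics scan events.
# Field names, topic, broker address, and paths are illustrative assumptions.
def scan_edges(events: list[dict]) -> list[tuple[str, str]]:
    """Turn a shipment's scan events into (from_location, to_location)
    edges ordered by timestamp, suitable as a GraphFrame edge list."""
    ordered = sorted(events, key=lambda e: e["ts"])
    return [(a["location"], b["location"]) for a, b in zip(ordered, ordered[1:])]

def run_stream():
    """Read the logistics stream with PySpark Structured Streaming
    (imported lazily so the module loads without a Spark installation)."""
    from pyspark.sql import SparkSession, functions as F
    spark = SparkSession.builder.appName("logistics-stream").getOrCreate()
    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
           .option("subscribe", "logistics.scans")            # assumed topic
           .load())
    scans = (raw.select(
        F.from_json(F.col("value").cast("string"),
                    "shipment_id STRING, location STRING, ts TIMESTAMP")
         .alias("scan"))
        .select("scan.*"))
    # Assumed Delta Lake sink; checkpointing makes the stream restartable.
    scans.writeStream.format("delta").option(
        "checkpointLocation", "/chk/scans").start("/lake/scans")
```

The pure `scan_edges` helper shows the linking idea in isolation; in the streaming job the same edge construction would run per shipment over the parsed scan records.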
Sep 2021 - Jan 2022
5 months

Enhanced data processing and integration systems for e-commerce with serverless and distributed Data Mesh architectures

E-Commerce

In this project, my primary responsibility was to lead and support various internal e-commerce product teams in developing, implementing, and maintaining powerful data processing and integration systems. The focus was on migrating existing data services and pipelines to a new, improved architecture, emphasizing the development of an event-based system using serverless technologies and big data frameworks.

  • Guided and supported the migration of existing data services, pipelines, and assets to a new, advanced architecture
  • Designed and planned the architecture of an event-based system
  • Implemented Lambda functions and PySpark jobs
  • Configured and integrated with Kafka
  • Built a serverless architecture for scalability and availability
  • Enabled real-time event data processing and analytics
  • Implemented PySpark transformations, filtering, and aggregations
  • Delivered an efficient and reliable Kafka integration
  • Handled configuration, security settings, and integration with other components
  • Established extensive testing and monitoring mechanisms
  • Delivered a high-performance, scalable event system
  • Derived valuable insights from event data
  • Enabled data-driven decision making
  • AWS Glue
  • Apache Spark
  • Data Catalog
  • Athena
  • Redshift
  • Lambda
  • ECS
  • Step Functions
  • Implemented distributed Data Mesh architectures to enable efficient collaboration across product teams
  • Data processing with big data frameworks and database technologies
  • Developed serverless/elastic cloud architecture (AWS)
  • Deployed architecture following DevOps best practices and Infrastructure-as-Code (AWS CDK & Terraform)
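An event-based Lambda setup like the one above might consume Kafka records roughly like this. The handler sketch assumes the MSK-trigger event shape (base64-encoded record values keyed by topic-partition); the payload field names are illustrative assumptions.

```python
# Hypothetical sketch of a Kafka-triggered Lambda handler (MSK trigger
# event shape); payload field names are illustrative assumptions.
import base64
import json

def handler(event: dict, context=None) -> dict:
    """Decode Kafka records from the trigger payload and count orders
    per status; downstream this could feed a PySpark aggregation job."""
    counts: dict[str, int] = {}
    for records in event.get("records", {}).values():
        for record in records:
            order = json.loads(base64.b64decode(record["value"]))
            status = order.get("status", "unknown")
            counts[status] = counts.get(status, 0) + 1
    return {"statusCode": 200, "counts": counts}
```

Keeping the handler a pure function of the event payload makes it trivially unit-testable and cheap to run at scale, which is the main appeal of the serverless approach described above.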
Apr 2020 - Sep 2021
1 year 6 months

Migration and enhancement of the e-commerce data platform to an AWS Data Lakehouse architecture

E-Commerce

This project involved the strategic development and migration of existing analytics data pipelines into a Data Lakehouse architecture using AWS services. A key aspect was improving the big data lake environment and ensuring strict data quality and compliance standards, particularly regarding GDPR.

  • Evolved the AWS Big Data Lake environment
  • Designed and implemented a Data Lakehouse
  • Exploratory analysis and algorithm development through data provisioning and preparation (AWS Glue, Spark, Lambda)
  • Data ingestion
  • Developed data pipelines and ETL jobs to deliver consumption-ready data sources (AWS Glue, AWS Redshift, Spark, PySpark)
  • Regression testing and quality checks in data pipelines and the data lake
  • Orchestrated and connected data sources
  • Implemented automated deployments using DevOps best practices (AWS CodeBuild + CodePipeline, GitHub Actions)
  • Built infrastructure using IaC (AWS CDK)
  • System administration (including cost monitoring)
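GDPR-compliant pipelines of this kind typically pseudonymize personal fields during ETL. A rough sketch follows; the salted-hash approach, the field list, and the truncation length are assumptions (in practice the salt would come from a secrets manager, not a function argument).

```python
# Hypothetical sketch of GDPR pseudonymization inside an ETL step.
# The salted-hash approach and field list are illustrative assumptions;
# a real pipeline would load the salt from a secrets manager.
import hashlib

PII_FIELDS = ("email", "customer_name")  # assumed set of personal fields

def pseudonymize(row: dict, salt: str) -> dict:
    """Replace PII fields with a salted SHA-256 digest so records stay
    joinable across tables without exposing the raw values."""
    out = dict(row)
    for field in PII_FIELDS:
        if field in out and out[field] is not None:
            digest = hashlib.sha256((salt + str(out[field])).encode()).hexdigest()
            out[field] = digest[:16]  # truncated for readability in the lake
    return out
```

In a Glue/PySpark job the same function could be applied per row (e.g. via a UDF) before records land in the consumption-ready zone of the lakehouse.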
Feb 2019 - Apr 2020
1 year 3 months

Development of architecture and implementation of a Big Data environment for company-wide, standardized platform services

Transport & Logistics

This project covered the development and implementation of a standardized Big Data architecture for company-wide platform services in the transport and logistics sector using various Azure services. My role was crucial in ensuring data transparency, data quality, DataOps, regulatory compliance, and implementing agile methodologies.

  • Developed, presented, and discussed solutions for Azure and automation projects
  • Azure services: Azure Data Catalog, Azure Synapse Analytics, Azure Data Factory, Azure Databricks
  • Automated infrastructure provisioning with Infrastructure as Code (Terraform) and Ansible
  • Scrum, JIRA, GitLab, Docker
  • Implemented real-time data transfer with Apache Kafka
  • Advised on Azure platform strategy regarding reference architectures
  • Developed mechanisms and automations for proactive remediation of Azure and Kubernetes component vulnerabilities based on standardized clusters (Security by default)
  • Conceptual advancement of the architectural and technological platform in container orchestration based on Kubernetes, continuous integration & continuous deployment
  • Created user and permissions concepts following corporate guidelines
  • Operated the offered services
  • Worked in an agile team
  • Azure Data Catalog (Purview)
  • Azure Synapse Analytics Workspace
  • Azure Data Factory
  • Azure Databricks
  • Terraform
  • GitLab Runner
  • Azure DevOps
Sep 2018 - Feb 2019
6 months

AWS infrastructure consulting and implementation for global process operations in the transport and logistics sector

Transport & Logistics

This project involved consulting and hands-on implementation of an AWS infrastructure to support a process operations team responsible for multiple international applications in the transport and logistics sector. My role was pivotal in identifying and implementing optimizations, developing and maintaining critical system infrastructure, and providing comprehensive support and training to internal teams.

  • Provisioned and operated servers, operating system environments, and database systems in AWS
  • Identified optimization opportunities from both business and technical perspectives
  • Developed and presented optimized processes
  • Implemented optimizations (AWS Lambda, boto3)
  • Acted autonomously and represented deliverables within the team and to project managers/stakeholders
  • Administered and maintained the provisioned systems
  • Developed maintenance and monitoring concepts for these systems
  • Supported and advised development projects on using, configuring, and optimizing the provisioned systems
  • Consulted on architectures and operations concepts leveraging AWS cloud infrastructures
  • Trained internal staff on new AWS services adoption and changed workflows
  • Application migration for a business unit (Transport & Logistics), including Active Directory setup
  • Provisioned AWS infrastructure: databases (SQL) & EC2 instances, as well as Lambda services
  • Deployed via Terraform
  • Planned and executed the application migration
  • Rolled out permission models
  • Infrastructure provisioning with AWS CloudFormation

Marketplace platform setup and consulting based on Microsoft Azure services

  • Payment provider integration
  • Planning and architecture of Microsoft Azure services
  • Back end and front end consulting/implementation
  • User management setup using Active Directory
  • Implemented an upload tool for uploading very large files directly from the web browser
  • Security engineering

Process automation implementation: automatic ticket generator based on vulnerability scans

  • Designed and planned the most cost-efficient infrastructure components
  • Implemented Python logic in AWS Lambda
  • Provisioned infrastructure via AWS CloudFormation using YAML templates
  • Code optimization
  • Sent notification emails via AWS SNS

Implementation and technical project support of a web application for managing the certification process

Renowned Automotive Manufacturer

  • Provisioned infrastructure on AWS (MySQL Server)
  • Front end development in React.js
  • Back end implementation with Java EE | GlassFish
  • Refactoring and code optimization
  • Agile methodology using Scrum with Jira
  • CI/CD with Jenkins
  • Documentation in Confluence

Summary

  • Big Data specialist with a focus on Big Data cloud architecture and data management platforms
  • Specialist in Big Data platforms on Amazon Web Services & Microsoft Azure
  • ETL processes/pipelines & data engineering
  • Architecture of data management platforms in large enterprises
  • Building data lakes & data lakehouses
  • Application migrations using cloud services
  • Consulting & implementation of automation concepts, especially DataOps & DevOps
  • Integration of Active Directory security concepts and compliance requirements
  • Python, SQL, TypeScript, Golang
  • Big Data cloud architectures (AWS & Microsoft Azure)
  • Data engineering (Databricks, Synapse Analytics, Fabric, Apache Spark, AWS Glue, Athena, Redshift & EMR)
  • Infrastructure as Code (Terraform, Pulumi, AWS CDK, ARM)

Languages

German
Native
English
Advanced
Polish
Advanced

Certifications & licenses

AWS Business Professional

AWS

AWS Certified Cloud Practitioner

AWS

AWS Certified Machine Learning – Specialty

AWS

AWS Certified Solutions Architect – Associate

AWS

AWS Technical Professional

AWS

AZ-300: Microsoft Azure Architect Technologies

Microsoft

AZ-301: Microsoft Azure Architect Design

Microsoft

Azure Solutions Architect Expert

Microsoft

Databricks Certified Associate Developer for Apache Spark 3.0

Databricks

HashiCorp Certified: Terraform Associate

HashiCorp