Development of the Data Integration Hub (DIH) platform as part of a Data Governance project. DIH is the central architecture component for sharing data between tenants. It is based on data product descriptions (specifications), data catalogs, and services that represent the shared data.
A typical workflow includes:
A data product description is submitted via the REST API or Swagger UI.
Metadata is written into Kafka topics.
Kafka consumers read the data and perform actions such as creating metadata in Datahub, creating tables in Trino, creating predefined file structures on S3, setting up policies, etc.
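A minimal sketch of such a consumer, assuming a hypothetical data-product-metadata topic and simplified product fields; the actual DIH topic layout and consumer logic are not shown here:

```python
import json

from confluent_kafka import Consumer  # assumes the confluent-kafka client

# Hypothetical topic and group names, for illustration only.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "dih-product-consumer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["data-product-metadata"])

def to_trino_ddl(product: dict) -> str:
    """Build a CREATE TABLE statement from a simplified product description."""
    columns = ", ".join(f'{c["name"]} {c["type"]}' for c in product["columns"])
    return f'CREATE TABLE IF NOT EXISTS {product["schema"]}.{product["name"]} ({columns})'

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        product = json.loads(msg.value())
        # In the real platform this step would also register metadata in Datahub,
        # create the predefined S3 structures and set up access policies.
        print("Would execute on Trino:", to_trino_ddl(product))
finally:
    consumer.close()
```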
My key tasks:
Implement single sign-on in services based on JWT tokens (see the validation sketch after this list).
Develop REST APIs.
Build integration tools between software components (SelfService, Trino, S3, Datahub, Great Expectations, etc.).
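The single sign-on task above is sketched here in Python for illustration only; the production REST services are Java/Spring Boot (see the stack below), so the PyJWT usage, JWKS endpoint and audience are assumptions rather than the actual implementation:

```python
import jwt  # PyJWT
from jwt import PyJWKClient

# Hypothetical identity-provider endpoint and audience, for illustration only.
JWKS_URL = "https://idp.example.com/.well-known/jwks.json"
AUDIENCE = "dih-api"

_jwks_client = PyJWKClient(JWKS_URL)

def verify_token(token: str) -> dict:
    """Validate an RS256-signed JWT against the IdP's published keys and
    return its claims; raises jwt.PyJWTError on any validation failure."""
    signing_key = _jwks_client.get_signing_key_from_jwt(token)
    return jwt.decode(token, signing_key.key, algorithms=["RS256"], audience=AUDIENCE)
```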
Data catalogs and lineage: Datahub, OpenLineage; integrated with Spark, Pandas (implemented in Python).
SQL engines and storage: Trino (with Starburst Web UI), PostgreSQL, Hadoop, DB2, Delta Lake.
Data quality: Great Expectations.
REST API: Java, Swagger, Spring Boot.
Authentication: JWT, OAuth2, Single Sign-On.
Apache Ranger for access policy management.
Monitoring: Prometheus, Grafana.
Certification: AWS Certified Data Engineer - Associate.
Oct 2021 - Apr 2023
1 year 7 months
Hamburg, Germany
Senior DevOps (external)
Otto GmbH & Co KG
Design and implementation of data-driven microservices for Google search engine optimization using AWS services. These services follow an ETL pattern: a typical service takes data from a source (REST API, SQS, DynamoDB, etc.), transforms it (e.g., computes day-over-day changes in a list), and uploads the results to a backend (S3 or a database).
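A hedged sketch of that ETL pattern with boto3, using hypothetical queue and bucket names (message deletion and error handling omitted for brevity):

```python
import json
from datetime import date

import boto3  # AWS SDK for Python

# Hypothetical resource names, for illustration only.
QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123456789012/pages"
BUCKET = "seo-results"

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

def load_previous(key: str) -> set:
    """Read the previous day's list from S3; empty set if it does not exist yet."""
    try:
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        return set(json.loads(body))
    except s3.exceptions.NoSuchKey:
        return set()

def run_once() -> None:
    # Extract: pull a batch of messages from the queue.
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
    today = {json.loads(m["Body"])["url"] for m in resp.get("Messages", [])}

    # Transform: compute the change against the previous day's list.
    previous = load_previous("previous/urls.json")
    added, removed = today - previous, previous - today

    # Load: upload the result to the backend bucket.
    s3.put_object(
        Bucket=BUCKET,
        Key=f"diffs/{date.today().isoformat()}.json",
        Body=json.dumps({"added": sorted(added), "removed": sorted(removed)}),
    )
```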
Service I (MLOps): assesses OTTO pages by extracting keywords that describe page content and matching them against Google search queries. Migrated data transformation, model training, retraining, and deployment from GCP to AWS; designed and implemented the workflows.
Use GitHub Actions for CI/CD pipelines.
Use Terraform to manage cloud resources (container creation, load balancing model instances, etc.).
Implement model validation and testing with Python (see the sketch after this list).
Implement model monitoring with Grafana.
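A hedged illustration of the model validation mentioned above, assuming hypothetical artefact files and an acceptance threshold rather than the actual test suite:

```python
# A pytest-style check that a candidate model does not regress on a held-out set.
import pickle

from sklearn.metrics import accuracy_score  # assumes a scikit-learn-style model

MIN_ACCURACY = 0.85  # hypothetical acceptance threshold

def test_model_meets_accuracy_threshold():
    with open("model.pkl", "rb") as f:          # hypothetical model artefact
        model = pickle.load(f)
    with open("holdout.pkl", "rb") as f:        # hypothetical held-out features/labels
        features, labels = pickle.load(f)
    assert accuracy_score(labels, model.predict(features)) >= MIN_ACCURACY
```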
Service S:
Handles millions of REST API calls per hour using AsyncIO.
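A minimal sketch of the AsyncIO approach, assuming the aiohttp client and a bounded-concurrency pattern; URLs and limits are placeholders:

```python
import asyncio

import aiohttp  # assumes the aiohttp client library

# Bound concurrency so millions of calls per hour do not exhaust sockets.
MAX_CONCURRENT = 200

async def fetch(session: aiohttp.ClientSession, sem: asyncio.Semaphore, url: str) -> int:
    async with sem:                      # limit in-flight requests
        async with session.get(url) as resp:
            await resp.read()            # drain the body so the connection is reused
            return resp.status

async def crawl(urls: list[str]) -> list[int]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, sem, u) for u in urls))

# Usage example with placeholder URLs:
# statuses = asyncio.run(crawl(["https://example.com/api/item/1",
#                               "https://example.com/api/item/2"]))
```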
Designed and implemented hybrid environments on Google Cloud Platform.
Provisioned GCP infrastructure with Terraform and later with Ansible.
Set up redundant connectivity and data encryption between GCP and on-premise systems.
Provisioned MapR and Spark environments on GCP.
Configured real-time data replication from on-premise tables to GCP.
Integrated with REWE services (Active Directory, DNS, Instana, etc.).
Developed REST APIs for machine learning models using Flask (see the sketch after this list).
Implemented persistent storage based on MapR for Kubernetes clusters.
Operated MapR clusters: upgrades, scaling, troubleshooting services and applications.
Synchronized a Kafka cluster with MapR streams using Kafka Connect.
Designed and implemented ETL pipelines, synchronization and integration of MapR clusters with various data sources (e.g., DB2 and Teradata warehouses).
Onboarded new internal REWE customers to MapR platforms.
Advised management on technical topics and future developments in big data.
Proposed security solutions (e.g., constrained delegation on F5 or authentication for OpenTSDB) and conducted PoCs.
Developed solutions in data science projects.
Built market classification models.
Visualized data and predictions with Jupyter and Grafana.
Integrated with JIRA.
Provided 3rd-level support.
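A minimal sketch of the Flask pattern referenced above, assuming a hypothetical pickled model that exposes a predict() method:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical model artefact; any object with a predict() method works here.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    """Score a batch of feature rows sent as JSON under the 'instances' key."""
    payload = request.get_json(force=True)
    predictions = model.predict(payload["instances"])
    # Assumes numeric predictions that can be serialized as floats.
    return jsonify({"predictions": [float(p) for p in predictions]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```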
Sep 2016 - May 2018
1 year 9 months
Munich, Germany
Senior Big Data Architect
Allianz Technology SE
Managed large-scale, multi-tenant, secure and highly available Hadoop infrastructure supporting rapid data growth for a diverse customer base.
Pre-sales: onboarded new customers.
Provided architectural guidance, planned and estimated cluster capacity, and created roadmaps for Hadoop deployments.
Designed, implemented and maintained enterprise-level secure Hadoop environments (Kerberos, LDAP/AD, Apache Sentry, encryption in transit, encryption at rest).
Installed and configured multi-tenant Hadoop environments, applying updates, patches and version upgrades.
Created runbooks for troubleshooting, cluster recovery and routine maintenance.
Troubleshot Hadoop applications, components and infrastructure at large scale.
Provided 3rd-level support (DevOps) for business-critical applications and use cases.
Evaluated and recommended new tools and technologies to meet the needs of the Allianz Group.
Worked closely with infrastructure, network, database, application, business intelligence and data science teams.
Contributed to Fraud Detection projects, including machine learning.
Designed and set up a Microsoft R data science model training platform (Microsoft R Open) on Azure and on-premise for Fraud Detection using Docker and Terraform.
Contributed to Supply Chain Analytics projects (e.g., GraphServer for executing graph queries on data in HDFS).
Transformed internal team processes according to the Agile/Scrum framework.
Developed Kafka-based use cases.
ClickStream:
Producer: an aggregator for URLs clicked on web pages, streamed in via REST API or from other sources (e.g., Oracle).
Consumer: a Flink job that, after pre-processing (sanity checks, extraction of time information), writes the data to HDFS as XML files.
Document classification:
Producer: a custom producer that reads documents from a shared file system and writes them into Kafka.
Consumer: a Spark Streaming job that, after pre-processing, sends the documents to the UIMA platform for classification; the classified data is then stored on HDFS for further batch processing.
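A minimal sketch of the document producer described above, assuming the confluent-kafka client and hypothetical mount and topic names:

```python
from pathlib import Path

from confluent_kafka import Producer  # assumes the confluent-kafka client

# Hypothetical shared mount and topic name, for illustration only.
DOCS_DIR = Path("/mnt/shared/documents")
TOPIC = "documents-raw"

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_documents() -> None:
    """Read each document from the shared file system and write it to Kafka,
    keyed by file name so downstream consumers can trace its origin."""
    for path in DOCS_DIR.glob("*.xml"):
        producer.produce(TOPIC, key=path.name, value=path.read_bytes())
    producer.flush()  # block until all messages are delivered
```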