Development of the Data Integration Hub (DIH) platform within the scope of a Data Governance project. The DIH is the central architectural element for sharing data between tenants. It is primarily based on data product descriptions (specifications), data catalogs, and services used to represent the shared data. A typical workflow has the following steps:
The focus of my tasks was:
The following software stack is used for development:
Certification: AWS Certified Data Engineer - Associate
Design and implementation of data-driven microservices for search engine (Google) optimization using AWS services. The services mainly follow ETL patterns: a typical service reads data from a source (REST API, SQS, DynamoDB, etc.), transforms it (e.g. calculates changes in a list with respect to previous days), and uploads the results to a backend (S3, a database). A minimal sketch of this pattern follows.
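The following is a minimal sketch of such an ETL microservice, assuming an SQS source and an S3 backend; the queue URL, bucket name, field names, and transformation logic are hypothetical placeholders, not the actual service code.

    import json
    import boto3

    # Placeholder configuration - queue URL and bucket name are hypothetical.
    QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123456789012/keyword-events"
    BUCKET = "seo-results-bucket"

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")

    def transform(records):
        # Illustrative transformation: change of a ranking value per keyword
        # compared to the previous day (field names are made up).
        return [
            {"keyword": r["keyword"], "rank_delta": r["rank_today"] - r["rank_yesterday"]}
            for r in records
        ]

    def run_once():
        # Extract: poll a batch of messages from the SQS source.
        response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
        messages = response.get("Messages", [])
        if not messages:
            return
        records = [json.loads(m["Body"]) for m in messages]

        # Transform: apply the domain-specific calculation.
        result = transform(records)

        # Load: upload the result as a JSON object to the S3 backend.
        s3.put_object(
            Bucket=BUCKET,
            Key="deltas/latest.json",
            Body=json.dumps(result).encode("utf-8"),
        )

        # Remove the processed messages from the queue.
        for m in messages:
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])

    if __name__ == "__main__":
        run_once()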
Service I (MLOps): assessment of OTTO pages by extracting related keywords that describe the content of the pages and matching them with searches on Google. Migration of data transformation, model training and retraining, and model deployment from GCP to AWS. Design and implementation of workflows:
Service S:
Technologies used:
Conceptualization and implementation of hybrid environments on Google Cloud Platform:
Development of a REST API for machine learning models using Flask (see the illustrative sketch after this list)
Implementation of persistent storage based on MapR for a Kubernetes cluster
Operation of MapR clusters: upgrades, extensions, troubleshooting of services and applications
Synchronization of a Kafka cluster with MapR Streams using Kafka Connect
Design and implementation of ETL pipelines; synchronization and integration of MapR clusters with different data sources (e.g. DB2 and Teradata warehouses)
Onboarding of new internal REWE customers to MapR platforms
Consulting management on technical topics and future developments in the Big Data field
Proposing solutions and PoCs for security topics (e.g. constrained delegation on F5 or authentication for OpenTSDB)
Developer in data science projects:
3rd-level support
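As referenced above, a minimal sketch of a Flask REST API serving a machine learning model; the model file, endpoint path, and payload format are illustrative assumptions rather than the actual production code.

    import pickle

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Load a pre-trained model once at startup (e.g. a pickled scikit-learn
    # estimator; "model.pkl" is a placeholder file name).
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    @app.route("/predict", methods=["POST"])
    def predict():
        # Expect a JSON body such as {"features": [[1.0, 2.0, 3.0]]}.
        payload = request.get_json(force=True)
        prediction = model.predict(payload["features"])
        return jsonify({"prediction": prediction.tolist()})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)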
Management of a large-scale, multi-tenant, secure, and highly available Hadoop infrastructure supporting rapid data growth for a wide spectrum of innovative customers
Pre-sales: onboarding of new customers
Providing architectural guidance, planning, estimating cluster capacity (see the capacity sketch after this list), and creating roadmaps for Hadoop cluster deployments
Design, implementation, and maintenance of enterprise-level Hadoop security environments (Kerberos, LDAP/AD, Sentry, encryption-in-motion, encryption-at-rest)
Installation and configuration of multi-tenant Hadoop environments, updates, patches, and version upgrades
Creating run books for troubleshooting, cluster recovery and routine cluster maintenance
Troubleshooting Hadoop-related applications, components and infrastructure issues at large scale
3rd-level support (DevOps) for business-critical applications and use cases
Evaluation and proposals of new tools and technologies to meet the needs of the global organization (Allianz Group)
Close collaboration with infrastructure, network, database, application, business intelligence, and data science units
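As referenced above, a back-of-the-envelope sketch of the kind of cluster capacity estimate used during deployment planning; all figures and overhead factors are hypothetical examples.

    def estimate_raw_capacity_tb(daily_ingest_tb, retention_days,
                                 replication_factor=3, compression_ratio=2.0,
                                 temp_overhead=0.25):
        # Logical data kept on the cluster over the retention window.
        logical_tb = daily_ingest_tb * retention_days
        # Apply compression, HDFS replication, and headroom for temporary/intermediate data.
        return logical_tb / compression_ratio * replication_factor * (1 + temp_overhead)

    if __name__ == "__main__":
        # Example: 2 TB/day ingest, one year of retention, default replication factor 3.
        print(round(estimate_raw_capacity_tb(2.0, 365), 1), "TB of raw storage needed")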
Developer in Fraud Detection projects including machine learning
Design and setup of a Microsoft Revolution R (Microsoft R Open) data science model training platform on Microsoft Azure and on-premises for Fraud Detection using Docker and Terraform
Developer in Supply Chain Analytics projects (e.g. GraphServer, which allows executing graph queries on data stored in HDFS)
Transformation of the team's internal processes according to the Agile/Scrum framework
Developer of Kafka-based use cases (an illustrative sketch follows this list):
ClickStream:
Classification of documents:
Graph database (PoC): managing graphs via a Kafka interface:
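An illustrative sketch of a Kafka-based clickstream flow (a producer publishing page-view events, a consumer aggregating clicks per page), assuming the kafka-python client; the topic name, broker address, and event schema are hypothetical placeholders.

    import json
    from collections import Counter

    from kafka import KafkaConsumer, KafkaProducer

    BOOTSTRAP = "localhost:9092"   # placeholder broker address
    TOPIC = "clickstream-events"   # placeholder topic name

    def publish_event(producer, page, user_id):
        # Serialize a click event as JSON and send it to the clickstream topic.
        event = {"page": page, "user": user_id}
        producer.send(TOPIC, json.dumps(event).encode("utf-8"))

    def consume_and_count(max_events=100):
        # Count page views from the topic; a real use case would write the
        # aggregates to a downstream store instead of returning them.
        consumer = KafkaConsumer(
            TOPIC,
            bootstrap_servers=BOOTSTRAP,
            auto_offset_reset="earliest",
            consumer_timeout_ms=5000,
        )
        counts = Counter()
        for i, message in enumerate(consumer):
            counts[json.loads(message.value)["page"]] += 1
            if i + 1 >= max_events:
                break
        return counts

    if __name__ == "__main__":
        producer = KafkaProducer(bootstrap_servers=BOOTSTRAP)
        publish_event(producer, "/product/123", "user-42")
        producer.flush()
        print(consume_and_count())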