Developed Python scripts to automate processes, perform data analysis, consume streaming APIs, process data streams using pandas DataFrame, and stage data for aggregation, cleansing, and building data marts
Implemented predictive analytics on machine learning datasets, along with data visualization and business logic integration
Created ELT pipelines with a visual editor over DynamoDB and Kinesis sources, and computed statistics using Python and Spark Streaming
Used AWS Lambda with the Snowflake engine via ECR container templates; built AWS Glue transformations for Redshift Spectrum, moving data from S3 and external sources
Enabled data scientists to leverage GCP and Azure Data Lake pipelines for research and experimentation
Built automated Glue templates and Lambda scripts on EC2 for batch and streaming data platforms serving global partners
Separated data transfer files from S3, enabling ML and BI components for research and analysis
Environment: Python, Spark, AWS Glue, S3, Databricks, Kinesis, Lambda, CloudFormation, DynamoDB, CodePipeline, CodeBuild, Step Functions, Athena, Snowflake, Autosys, Airflow, NiFi, Glue DataBrew
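A minimal sketch of the streaming-API-to-pandas pattern described in this role, assuming a hypothetical Kinesis stream named orders-stream and standard boto3/pandas calls rather than the actual production code:

```python
import json

import boto3
import pandas as pd

kinesis = boto3.client("kinesis", region_name="us-east-1")


def read_batch(stream_name: str, limit: int = 100) -> pd.DataFrame:
    """Read one batch of records from a Kinesis stream into a pandas DataFrame."""
    shard_id = kinesis.describe_stream(StreamName=stream_name)["StreamDescription"]["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]
    records = kinesis.get_records(ShardIterator=iterator, Limit=limit)["Records"]
    return pd.DataFrame(json.loads(r["Data"]) for r in records)


if __name__ == "__main__":
    df = read_batch("orders-stream")  # hypothetical stream and columns
    daily = df.groupby("order_date", as_index=False)["amount"].sum()  # simple staging aggregate
    print(daily.head())
```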
Analyzed multiple source systems and extracted data using Apache Spark on Databricks
Transformed and loaded data to S3, built ELT pipelines for clients like UMG, Realtor, KFC, McD, and investment partners
Built AWS Glue transformations for Redshift Spectrum and reverse pipelines, enabling data scientists to leverage GCP environments
Coordinated with BI teams to provide reporting data, designed and developed complex data pipelines, and wrote production code for logging and querying
Constructed ETL and ELT pipelines with productivity and data quality checks
Built automated Glue templates and Lambda scripts on EC2 and RDS for batch and streaming data platforms
Exported data catalog, CloudWatch metrics, Step Functions workflows, and versioned code with GitHub and GitLab
Environment: Python, Spark, AWS Glue, S3, Lambda, CloudFormation, DynamoDB, CodePipeline, CodeBuild, Pytest, Step Functions, Athena, Snowflake, Autosys, Shell Scripting
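The data quality checks mentioned in this role, paired with the Pytest entry in the environment list, might look roughly like the sketch below; load_orders and its columns are hypothetical stand-ins for the real extracts:

```python
import pandas as pd
import pytest


def load_orders() -> pd.DataFrame:
    # Hypothetical extract used only for illustration; the real pipeline
    # reads staged data from S3/Glue rather than building a frame inline.
    return pd.DataFrame(
        {"order_id": [1, 2, 3], "amount": [10.0, 25.5, 7.25], "currency": ["USD", "USD", "EUR"]}
    )


@pytest.fixture
def orders() -> pd.DataFrame:
    return load_orders()


def test_no_duplicate_keys(orders):
    assert orders["order_id"].is_unique


def test_amounts_are_positive(orders):
    assert (orders["amount"] > 0).all()


def test_required_columns_present(orders):
    assert {"order_id", "amount", "currency"} <= set(orders.columns)
```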
Extracted data from SQL and Oracle sources and bulk-loaded into AWS S3
Built ETL pipelines for retail clients on big data architecture, migrated metadata and Glue schemas into business layer
Used AWS Glue for transformations, scalable data load into processed layer on data lake, exposed data via Athena views
Coordinated with BI teams for reporting and analysis, designed models and complex data pipelines, wrote production code in Visual Studio Code
Constructed ETL workflows with productivity and data quality checks
Technologies: Python, Spark, AWS Glue, S3, Athena, KMS, RDS
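A minimal sketch of the SQL-source-to-S3 bulk load described in this role, written as an AWS Glue PySpark job; the catalog database, table name, and S3 path are placeholders, not the client's actual configuration:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from a JDBC source table registered in the Glue Data Catalog
# ("retail_raw" / "orders" are placeholder names).
source = glue_context.create_dynamic_frame.from_catalog(
    database="retail_raw",
    table_name="orders",
)

# Land the extract as Parquet in the processed layer of the data lake on S3.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://example-datalake/processed/orders/"},
    format="parquet",
)

job.commit()
```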
Extracted data from SQL sources and bulk-loaded into AWS S3
Migrated metadata and Glue schemas into business layer and used AWS Glue for transformations and data load into processed layer
Exposed processed data via Athena views
Coordinated with BI teams to deliver reporting data, designed models and complex data pipelines
Technologies: Python, Spark, AWS Glue, S3, Athena
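Exposing processed data via Athena views, as above, is typically paired with programmatic queries; a minimal boto3 sketch with placeholder database, view, and results-bucket names:

```python
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")


def run_query(sql: str, database: str, output: str):
    """Run an Athena query and return the result rows."""
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )
    query_id = execution["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]


rows = run_query(
    "SELECT * FROM reporting.orders_view LIMIT 10",  # placeholder view
    database="reporting",
    output="s3://example-athena-results/",
)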
Designed and implemented scalable big data solutions with Hadoop ecosystem tools: Hive, MongoDB, Spark Streaming
Engineered real-time data pipelines using Kafka and Spark Streaming, stored data in Parquet on HDFS
Implemented data transformations with Pig, Hive scripts, Sqoop, and Java MapReduce jobs
Integrated analytics using Apache NiFi and Neo4J, applied Agile methodologies with daily scrums and sprint planning
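A minimal sketch of the Kafka-to-Parquet-on-HDFS pipeline described above, written with Spark Structured Streaming rather than the original DStream-based Spark Streaming API; the broker address, topic, and paths are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

# Read the Kafka topic as a streaming DataFrame
# ("broker:9092" and "clickstream" are placeholders).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clickstream")
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
)

# Continuously append the decoded records as Parquet files on HDFS.
query = (
    events.writeStream.format("parquet")
    .option("path", "hdfs:///data/clickstream/parquet")
    .option("checkpointLocation", "hdfs:///checkpoints/clickstream")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```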
Architected data solutions leveraging AWS Glue, S3, Redshift, and Athena for real-time analytics
Developed and optimized AWS Glue jobs for ETL, implemented data cataloging and metadata management
Reduced ETL execution time by 35% and processing costs by 20%
Mentored junior engineers on AWS Glue best practices
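One common lever behind the kind of Glue runtime reduction cited above is pushing partition filters down to the Data Catalog read; a minimal sketch with placeholder database and table names, not a claim about the exact optimization used here:

```python
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "run_date"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Push the partition filter down so only one day of data is listed and scanned
# ("analytics_raw" / "events" are placeholder catalog names).
events = glue_context.create_dynamic_frame.from_catalog(
    database="analytics_raw",
    table_name="events",
    push_down_predicate=f"dt = '{args['run_date']}'",
)

glue_context.write_dynamic_frame.from_options(
    frame=events,
    connection_type="s3",
    connection_options={"path": "s3://example-datalake/curated/events/"},
    format="parquet",
)
```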
Built ELT pipelines with Airflow, Python, dbt, Stitch, and GCP solutions and guided analysts on dbt modeling and incremental views
Managed ETL processes with AWS Glue, Lambda, Kinesis, and Snowflake using dbt and Matillion
Utilized AWS Glue DataBrew for visual data preparation and self-service wrangling
Worked on MongoDB CRUD, indexing, replication, and sharding
Extensive experience with Apache Airflow and scripting for scheduling and automation
Designed Wherescape RED data flows and mappings, implemented Azure Data Factory and Databricks solutions
Built real-time log pipelines with Cribl, extracted feeds with Kafka and Spark Streaming, wrote Hive and Sqoop jobs on petabyte data
Implemented Apache NiFi topologies, MapReduce jobs, Oozie workflows, and applied Agile/DataOps
Technologies: HIPAA, Hadoop, Hive, Sqoop, Pig, Java, NiFi, MongoDB, Python, Scala, Spark, Oozie, HBase, Cassandra, Trifacta
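A minimal sketch of the Airflow-plus-dbt orchestration mentioned above, invoking dbt through the BashOperator; the DAG id, schedule, and project path are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="elt_dbt_daily",            # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run dbt models (including incremental ones) after upstream loads land.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt/project && dbt run",
    )

    # Validate the models with dbt tests before downstream consumers read them.
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="cd /opt/dbt/project && dbt test",
    )

    dbt_run >> dbt_test
```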
Implemented CI/CD processes with GitLab, Python, and Shell scripting for automation
Developed AWS Lambda functions for nested JSON processing and constructed scalable AWS data pipelines with VPC, EC2, S3, ASG, EBS, Snowflake, IAM, CloudFormation, Route 53, CloudWatch, CloudFront, CloudTrail
Configured ELBs and Auto Scaling for fault tolerance and cost efficiency
Managed metadata and lineage in AWS Data Lake using Lambda and Glue
Integrated Hadoop jobs with Autosys and developed sessionization algorithms for website analytics
Developed RESTful and SOAP APIs with Swagger and tested with Postman
Led data migration projects using HVR, StreamSets, and Oracle GoldenGate for real-time replication
Managed ETL with Informatica PowerCenter and built StreamSets pipelines
Configured AWS DMS and designed AWS API Gateway and Lambda integrations with Snowflake and DynamoDB
Built ETL pipelines from S3 to DynamoDB and Snowflake, performed data format conversions
Used Trifacta for data wrangling and modeled data with star and snowflake schemas, SCD
Created ML POCs, Sqoop imports to HDFS, Hive tables, and Spark applications in Scala
Supported SIT, UAT, and production
Technologies: Hadoop, Hive, Zookeeper, MapR, Teradata, Spark, Kafka, NiFi, MongoDB, Python, AWS, Scala, Oozie
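The Lambda-based nested JSON processing and S3-to-DynamoDB loading described in this role might be combined roughly as below; the table name, event shape, and flatten helper are illustrative assumptions:

```python
import json
import urllib.parse
from decimal import Decimal

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("example-records")  # placeholder table name


def flatten(obj, prefix=""):
    """Flatten nested JSON objects into dot-separated top-level keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, f"{name}."))
        else:
            flat[name] = value
    return flat


def handler(event, context):
    # Triggered by S3 object-created notifications; each object holds a JSON array.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # parse_float=Decimal because DynamoDB rejects native Python floats.
        for item in json.loads(body, parse_float=Decimal):
            table.put_item(Item=flatten(item))
```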
Designed and implemented end-to-end data pipelines on GCP and AWS using Airflow, Docker, and Kubernetes
Built ETL/ELT processes for GCP data ingestion and transformation, deployed cloud functions to load CSVs into BigQuery
Developed Informatica PowerExchange and Data Quality solutions, improving data accuracy by 50%
Processed Google Pub/Sub data to BigQuery with Dataflow and Python
Performed data analysis, migration, cleansing, and integration with Python and PL/SQL
Developed logistic regression models and near real-time Spark pipelines
Implemented Apache Airflow for pipeline orchestration
Technologies: GCP (BigQuery, Cloud Functions, Dataflow, Pub/Sub), AWS, Airflow, Python, Spark, SQL, Docker, Kubernetes, Pandas, NumPy, Scikit-learn
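A minimal sketch of a Cloud Function that loads an uploaded CSV into BigQuery, as described above; the destination table analytics.raw_events is a placeholder:

```python
from google.cloud import bigquery


def load_csv_to_bq(event, context):
    """Cloud Function (GCS finalize trigger) that loads an uploaded CSV into BigQuery."""
    client = bigquery.Client()
    uri = f"gs://{event['bucket']}/{event['name']}"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,                                   # infer the schema from the file
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    # "analytics.raw_events" is a placeholder destination table.
    load_job = client.load_table_from_uri(uri, "analytics.raw_events", job_config=job_config)
    load_job.result()  # wait for completion so failures surface in the function logs
```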
13.1 years of experience in the field of Data Engineering.
Database development covering design and architecture, system integration, infrastructure readiness, implementation, maintenance, and support, with experience in cloud platforms such as AWS, along with Microsoft Security and Azure Data Factory.
Worked on project upgrade and migration functionality using modern tooling and APIs.
Expert in understanding data and in designing and implementing enterprise platforms such as data lakes and data warehouses.
Years of experience with Databricks along with AWS and GCP framework tools, including AWS Glue Studio, Athena, and Spark clusters.
Good understanding of relational databases and hands-on experience writing database applications, including performance tuning and view optimization, with working knowledge of modern on-premises tool frameworks.
Extensive experience working with AWS EMR clusters and building optimized Glue jobs per business requirements.
Developed Spark applications using the Spark SQL, DataFrame, and Dataset APIs, exposed through API Gateway.
Created a Glue job as a reference implementation for de-identifying PHI columns using Glue data tooling.
The objective was to provide a worked-through reference implementation of PHI de-identification so that Data Operations can supply de-identified data for integration.
Authored a PHI de-identification guide covering Glue DataBrew recipes and jobs that de-identify a large sample dataset, serving as the reference implementation for an identified integration client.
Defined DataBrew recipes stored in Git; data was ingested into the main HAP-DEV stack according to the integration, with the reference implementation reading from the de-identified bucket and writing to the proper ingestion location in ingress for the client data type. A script is required for the dbt models to run as expected for the above integration.
Reference implementation reviewed and accepted by DataOps.
Documented the reference implementation and PHI guidelines in the ADO wiki. A sample client integration was identified for the reference implementation; it has hia-hoc in the title, and the AWS ingress bucket is hia-hoc-ingress, partitioned in AWS. Migrated data into a Databricks data warehouse to build training datasets, providing a Spark cluster environment to analyze gigabytes of real-time and batch data with streaming analytics.
Used Kafka Connect with Scala clusters, applying reflection, annotations, and data binding to avoid code duplication and serialization errors; consumed AWS events from client source records, handling sinking and transformation in Lambda with pytest-covered use cases. Built and deployed large-scale data processing pipelines using distributed storage platforms such as HDFS, S3, and NoSQL databases in a production CI/CD environment.
Worked with distributed processing platforms such as Hadoop, Spark, and PySpark.
Built Hive tables as part of end-to-end big data solutions covering data ingestion, data cleansing, ETL, data mart creation, and exposing data to consumers; handled complex data sets from different sources and converged them onto a single compute platform using both static and real-time data ingestion methodologies.
Query authoring (advanced SQL) and working familiarity with NoSQL databases; exchanged data via microservices and API Gateway and across languages such as R and Python, with scripting in Unix commands and Unix shells on servers. Scala-based ETL/ELT extraction, data modeling, and optimal integrations with internal and external business platforms, including handling of missing data, data processing templates, and categorical data in R.
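A minimal sketch of one way the PHI column de-identification described earlier in this summary could be implemented in a Glue PySpark job, hashing values with a salt; the column names, salt parameter, and bucket paths are illustrative and do not reproduce the actual DataBrew recipe:

```python
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME", "salt"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Placeholder paths; the reference implementation reads the client's ingress
# data and writes to the de-identified bucket.
df = spark.read.parquet("s3://example-ingress/raw/")

phi_columns = ["member_name", "ssn", "date_of_birth"]  # illustrative PHI columns
for column in phi_columns:
    # Replace each PHI value with a salted SHA-256 digest.
    df = df.withColumn(
        column,
        F.sha2(F.concat_ws("|", F.lit(args["salt"]), F.col(column).cast("string")), 256),
    )

df.write.mode("overwrite").parquet("s3://example-deidentified/processed/")
```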