Amit P.

Senior AWS Data Engineer

Bokaro Steel City, India

Experience

Jan 2023 - Feb 2025
2 years 2 months
Bengaluru, India

Senior AWS Data Engineer

Keeno Technologies

  • Developed Python scripts to automate processes, perform data analysis, consume streaming APIs, process data streams using pandas DataFrame, and stage data for aggregation, cleansing, and building data marts

  • Implemented analytic predictions on machine learning data, data visualization, and business logic integration

  • Created ELT pipelines with a visual editor over DynamoDB and Kinesis sources, and computed statistics using Python and Spark Streaming

  • Used AWS Lambda with the Snowflake engine in ECR templates and built AWS Glue transformations for Redshift Spectrum, moving data from S3 and external sources (see the sketch after this section)

  • Enabled data scientists to leverage GCP and Azure Data Lake pipelines for research and experimentation

  • Built automated Glue templates and Lambda scripts on EC2 for batch and streaming data platforms serving global partners

  • Segregated data transfer files in S3, enabling ML and BI components for research and analysis

  • Environment: Python, Spark, AWS Glue, S3, Databricks, Kinesis, Lambda, CloudFormation, DynamoDB, CodePipeline, CodeBuild, Step Functions, Athena, Snowflake, Autosys, Airflow, NiFi, Glue DataBrew
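
A minimal sketch of the Glue-to-Redshift-Spectrum pattern referenced above: a Glue job that converts raw S3 data into partitioned Parquet registered in the Glue Data Catalog so Redshift Spectrum can query it. The bucket names, database, table, and partition column are hypothetical placeholders, not the actual client setup.

    # Hypothetical Glue job: convert raw CSV in S3 to partitioned Parquet for Redshift Spectrum.
    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read raw CSV files landed in S3 (placeholder bucket/prefix).
    raw = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://example-raw-bucket/orders/"]},
        format="csv",
        format_options={"withHeader": True},
    )

    # Write partitioned Parquet and update the Glue Data Catalog so Redshift Spectrum
    # (through an external schema over this database) can query the table.
    sink = glue_context.getSink(
        connection_type="s3",
        path="s3://example-curated-bucket/orders/",
        enableUpdateCatalog=True,
        partitionKeys=["order_date"],
    )
    sink.setFormat("glueparquet")
    sink.setCatalogInfo(catalogDatabase="curated_db", catalogTableName="orders")
    sink.writeFrame(raw)

    job.commit()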

Oct 2021 - Dec 2022
1 year 3 months
United States

Analytical Data Engineer

Brillio

  • Analyzed multiple source systems and extracted data using Apache Spark on Databricks

  • Transformed and loaded data to S3 and built ELT pipelines for clients such as UMG, Realtor, KFC, McDonald's, and investment partners

  • Built AWS Glue transformations for Redshift Spectrum and reverse pipelines, enabling data scientists to leverage GCP environments

  • Coordinated with BI teams to provide reporting data, designed and developed complex data pipelines, and wrote production code for logging and querying

  • Constructed ETL and ELT pipelines with productivity and data quality checks (see the test sketch after this section)

  • Built automated Glue templates and Lambda scripts on EC2 and RDS for batch and streaming data platforms

  • Exported the data catalog, CloudWatch metrics, and Step Functions workflows, and versioned code with GitHub and GitLab

  • Environment: Python, Spark, AWS Glue, S3, Lambda, CloudFormation, DynamoDB, CodePipeline, CodeBuild, Pytest, Step Functions, Athena, Snowflake, Autosys, Shell Scripting
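
A small pytest-style sketch of the kind of data quality checks wired into those pipelines (the Environment above lists Pytest). The table path, key column, and checks are illustrative assumptions rather than the actual client tests.

    # Hypothetical pytest data quality checks run after an ELT load (PySpark).
    import pytest
    from pyspark.sql import SparkSession


    @pytest.fixture(scope="session")
    def spark():
        return SparkSession.builder.master("local[2]").appName("dq-checks").getOrCreate()


    @pytest.fixture(scope="session")
    def orders(spark):
        # Placeholder path; in the pipeline this would point at the curated S3 layer.
        return spark.read.parquet("s3a://example-curated-bucket/orders/")


    def test_no_null_primary_keys(orders):
        # Every row must carry an order_id (hypothetical key column).
        assert orders.filter(orders.order_id.isNull()).count() == 0


    def test_row_count_above_zero(orders):
        # Guard against silently empty loads.
        assert orders.count() > 0


    def test_no_duplicate_keys(orders):
        assert orders.count() == orders.select("order_id").distinct().count()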

Jul 2018 - Sep 2021
3 years 3 months
Bengaluru, India

Senior Data Engineer

Enum Informatics Private Ltd

  • Extracted data from SQL and Oracle sources and bulk-loaded into AWS S3

  • Built ETL pipelines for retail clients on a big data architecture and migrated metadata and Glue schemas into the business layer

  • Used AWS Glue for transformations and scalable data loads into the processed layer of the data lake, and exposed data via Athena views (see the sketch after this section)

  • Coordinated with BI teams for reporting and analysis, designed models and complex data pipelines, wrote production code in Visual Studio Code

  • Constructed ETL workflows with productivity and data quality checks

  • Technologies: Python, Spark, AWS Glue, S3, Athena, KMS, RDS
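
A rough illustration of the extract-and-expose pattern above: bulk-loading an Oracle table into the processed S3 layer with Spark over JDBC, then publishing an Athena view. Connection details, table, bucket, and view names are placeholders, and the processed table is assumed to already be registered in the Glue Data Catalog (for example, by a crawler).

    # Hypothetical bulk extract from Oracle into S3 Parquet, then an Athena view over it.
    import boto3
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("oracle-to-s3").getOrCreate()

    # Pull the source table over JDBC (placeholder connection settings).
    orders = (
        spark.read.format("jdbc")
        .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1")
        .option("dbtable", "SALES.ORDERS")
        .option("user", "etl_user")
        .option("password", "REDACTED")
        .option("fetchsize", "10000")
        .load()
    )

    # Land the data in the processed layer of the data lake.
    orders.write.mode("overwrite").parquet("s3a://example-datalake/processed/orders/")

    # Expose a curated Athena view over the catalogued table.
    athena = boto3.client("athena", region_name="us-east-1")
    athena.start_query_execution(
        QueryString=(
            "CREATE OR REPLACE VIEW reporting.v_orders AS "
            "SELECT order_id, customer_id, order_total FROM processed.orders"
        ),
        QueryExecutionContext={"Database": "processed"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )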

Jul 2017 - Jun 2018
1 year
Bengaluru, India

Senior Data Engineer

KPIT

  • Extracted data from SQL sources and bulk-loaded into AWS S3

  • Migrated metadata and Glue schemas into the business layer and used AWS Glue for transformations and data loads into the processed layer

  • Exposed processed data via Athena views

  • Coordinated with BI teams to deliver reporting data, designed models and complex data pipelines

  • Technologies: Python, Spark, AWS Glue, S3, Athena

Jun 2016 - Jun 2017
1 year 1 month
Oakland, United States

Senior Data Engineer

Kaiser Permanente

  • Designed and implemented scalable big data solutions with Hadoop ecosystem tools: Hive, MongoDB, Spark Streaming

  • Engineered real-time data pipelines using Kafka and Spark Streaming and stored data as Parquet on HDFS (see the sketch after this section)

  • Implemented data transformations with Pig, Hive scripts, Sqoop, and Java MapReduce jobs

  • Integrated analytics using Apache NiFi and Neo4J, applied Agile methodologies with daily scrums and sprint planning

  • Architected data solutions leveraging AWS Glue, S3, Redshift, and Athena for real-time analytics

  • Developed and optimized AWS Glue jobs for ETL, implemented data cataloging and metadata management

  • Reduced ETL execution time by 35% and processing costs by 20%

  • Mentored junior engineers on AWS Glue best practices

  • Built ELT pipelines with Airflow, Python, dbt, Stitch, and GCP solutions and guided analysts on dbt modeling and incremental views

  • Managed ETL processes with AWS Glue, Lambda, Kinesis, and Snowflake using dbt and Matillion

  • Utilized AWS Glue DataBrew for visual data preparation and self-service wrangling

  • Worked on MongoDB CRUD, indexing, replication, and sharding

  • Extensive experience with Apache Airflow and scripting for scheduling and automation

  • Designed Wherescape RED data flows and mappings, implemented Azure Data Factory and Databricks solutions

  • Built real-time log pipelines with Cribl, extracted feeds with Kafka and Spark Streaming, and wrote Hive and Sqoop jobs over petabyte-scale data

  • Implemented Apache NiFi topologies, MapReduce jobs, Oozie workflows, and applied Agile/DataOps

  • Technologies: Hadoop, Hive, Sqoop, Pig, Java, NiFi, MongoDB, Python, Scala, Spark, Oozie, HBase, Cassandra, Trifacta (HIPAA-regulated environment)
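
A minimal Spark Structured Streaming sketch of the Kafka-to-Parquet-on-HDFS pipeline described above. The broker, topic, event schema, and paths are hypothetical, and the job assumes the Spark Kafka connector package is available on the cluster.

    # Hypothetical Kafka -> Spark Structured Streaming -> Parquet-on-HDFS pipeline.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    # Assumed event schema for the example.
    schema = StructType([
        StructField("record_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "events-topic")
        .option("startingOffsets", "latest")
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    # Continuously append Parquet files to HDFS, with checkpointing for recovery.
    query = (
        events.writeStream.format("parquet")
        .option("path", "hdfs:///data/events/")
        .option("checkpointLocation", "hdfs:///checkpoints/events/")
        .outputMode("append")
        .start()
    )
    query.awaitTermination()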

Oct 2014 - May 2016
1 year 8 months
Atlanta, United States

Senior Data Engineer

The Home Depot

  • Implemented CI/CD processes with GitLab, Python, and Shell scripting for automation

  • Developed AWS Lambda functions for nested JSON processing and constructed scalable AWS data pipelines with VPC, EC2, S3, ASG, EBS, Snowflake, IAM, CloudFormation, Route 53, CloudWatch, CloudFront, CloudTrail

  • Configured ELBs and Auto Scaling for fault tolerance and cost efficiency

  • Managed metadata and lineage in AWS Data Lake using Lambda and Glue

  • Integrated Hadoop jobs with Autosys and developed sessionization algorithms for website analytics

  • Developed RESTful and SOAP APIs with Swagger and tested with Postman

  • Led data migration projects using HVR, StreamSets, and Oracle GoldenGate for real-time replication

  • Managed ETL with Informatica PowerCenter and built StreamSets pipelines

  • Configured AWS DMS and designed AWS API Gateway and Lambda integrations with Snowflake and DynamoDB

  • Built ETL pipelines from S3 to DynamoDB and Snowflake and performed data format conversions (see the sketch after this section)

  • Used Trifacta for data wrangling and modeled data with star and snowflake schemas and slowly changing dimensions (SCD)

  • Created ML POCs, Sqoop imports to HDFS, Hive tables, and Spark applications in Scala

  • Supported SIT, UAT, and production

  • Technologies: Hadoop, Hive, Zookeeper, MapR, Teradata, Spark, Kafka, NiFi, MongoDB, Python, AWS, Scala, Oozie
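
A condensed sketch of the Lambda-based nested-JSON handling and S3-to-DynamoDB loading mentioned above. The S3 trigger, flattening rule, and table name are illustrative assumptions.

    # Hypothetical Lambda: flatten nested JSON files landed in S3 and load them into DynamoDB.
    import json
    from decimal import Decimal

    import boto3

    dynamodb = boto3.resource("dynamodb")
    s3 = boto3.client("s3")
    TABLE = dynamodb.Table("example-orders")  # placeholder table name


    def flatten(obj, prefix=""):
        """Flatten nested dicts into dot-separated top-level attributes."""
        flat = {}
        for key, value in obj.items():
            name = f"{prefix}{key}"
            if isinstance(value, dict):
                flat.update(flatten(value, f"{name}."))
            else:
                flat[name] = value
        return flat


    def handler(event, context):
        # Triggered by S3 put events; each record points at one JSON object.
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            # parse_float=Decimal keeps numeric values DynamoDB-compatible.
            item = flatten(json.loads(body, parse_float=Decimal))
            TABLE.put_item(Item=item)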

Feb 2012 - Sep 2014
2 years 8 months
Peoria, United States

Data Engineer

Caterpillar

  • Designed and implemented end-to-end data pipelines on GCP and AWS using Airflow, Docker, and Kubernetes

  • Built ETL/ELT processes for GCP data ingestion and transformation and deployed Cloud Functions to load CSVs into BigQuery (see the sketch after this section)

  • Developed Informatica PowerExchange and Data Quality solutions, improving data accuracy by 50%

  • Processed Google Pub/Sub data to BigQuery with Dataflow and Python

  • Performed data analysis, migration, cleansing, and integration with Python and PL/SQL

  • Developed logistic regression models and near real-time Spark pipelines

  • Implemented Apache Airflow for pipeline orchestration

  • Technologies: GCP (BigQuery, Cloud Functions, Dataflow, Pub/Sub), AWS, Airflow, Python, Spark, SQL, Docker, Kubernetes, Pandas, NumPy, Scikit-learn
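
A brief sketch of the Cloud Function pattern referenced above: a Cloud Storage-triggered function that loads a newly uploaded CSV into BigQuery. The project, dataset, and table names are placeholders.

    # Hypothetical Cloud Function: load a newly uploaded CSV from GCS into BigQuery.
    from google.cloud import bigquery

    bq = bigquery.Client()
    TABLE_ID = "example-project.analytics.telemetry"  # placeholder destination table


    def load_csv_to_bigquery(event, context):
        """Background function triggered by a google.storage.object.finalize event."""
        uri = f"gs://{event['bucket']}/{event['name']}"
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,                 # infer the schema from the file
            write_disposition="WRITE_APPEND",
        )
        load_job = bq.load_table_from_uri(uri, TABLE_ID, job_config=job_config)
        load_job.result()  # wait so load errors surface in the function logs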

Summary

Over 13 years of experience in the field of Data Engineering.

Database development spanning design and architecture, system integration, infrastructure readiness, implementation, maintenance, and support, with experience in cloud platforms such as AWS and Microsoft Azure, including Azure security services and Azure Data Factory.

Worked on project upgrade and migration functionality using modern tooling and APIs.

Expert in understanding data and in designing and implementing enterprise platforms such as data lakes and data warehouses.

Years of experience with Databricks alongside AWS and GCP framework tools, including AWS Glue Studio, Athena, and Spark clusters.

Good understanding of relational databases, with hands-on experience writing applications against databases, performance tuning, and view optimization across modern cloud and on-premises tool frameworks.

Extensive experience working on AWS EMR Clusters and building optimized Glue Jobs as per business requirements.

Developed Spark applications using the Spark SQL, DataFrame, and Dataset APIs, integrated with API Gateway.

Created a Glue job as a reference implementation for de-identifying PHI columns, the objective being a worked-through example that Data Operations can use to provide de-identified data for integrations.

Authored a PHI de-identification guide covering Glue DataBrew recipes and jobs that de-identify a large sample for an identified integration client, with the DataBrew recipe definitions stored in Git.

Data is ingested into the main HAP-DEV stack for the reference implementation, which reads from the de-identified bucket and writes to the proper ingestion location in ingress for the client data type; a supporting script is required so the dbt models run as expected for that integration. The reference implementation is reviewed and accepted by DataOps, and both it and the PHI guidelines are documented in the ADO wiki.

The sample client integration identified for the reference implementation has "hia-hoc" in its title; the partitioned hia-hoc-ingress AWS bucket feeds migration into a Databricks data warehouse used to build training datasets for the business, providing a Spark cluster environment to analyze gigabytes of batch and real-time data with streaming analytics. A minimal de-identification sketch follows.
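
A minimal PySpark sketch of the PHI de-identification step behind that reference implementation, hashing identified PHI columns before the data is written to the de-identified bucket. The column list, salt handling, and bucket paths are hypothetical; an actual implementation would follow the DataBrew recipes and PHI guidelines above.

    # Hypothetical PHI de-identification: salted hash of PHI columns before landing de-identified data.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, concat, lit, sha2

    spark = SparkSession.builder.appName("phi-deidentify").getOrCreate()

    PHI_COLUMNS = ["patient_name", "ssn", "date_of_birth"]  # illustrative column list
    SALT = "REPLACE_WITH_SECRET_SALT"  # in practice, retrieved from a secrets store

    raw = spark.read.parquet("s3a://example-raw-bucket/sample/")

    deidentified = raw
    for phi_col in PHI_COLUMNS:
        # One-way, salted SHA-256 hash so joins stay possible without exposing PHI.
        deidentified = deidentified.withColumn(
            phi_col, sha2(concat(lit(SALT), col(phi_col).cast("string")), 256)
        )

    deidentified.write.mode("overwrite").parquet("s3a://example-deidentified-bucket/ingress/")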

Used Kafka Connect for consuming client source records, sinking, and transformation; applied Scala reflection, annotations, and data binding on cluster tooling to avoid code duplication and handle serialization errors when consuming AWS events; wrote pytest use cases for Lambda-based data processing; and built and deployed large-scale data processing pipelines on distributed storage platforms such as HDFS, S3, and NoSQL databases in a production CI/CD environment.

Worked with distributed processing platforms such as Hadoop, Spark, and PySpark.

Built Hive tables as part of end-to-end big data solutions covering data ingestion, data cleansing, ETL, data mart creation, and exposing data to consumers; handled complex data sets from different sources and converged them onto a single compute platform using both static (batch) and real-time ingestion methodologies.

Advanced SQL query authoring and working familiarity with NoSQL databases, exchanging data via microservices and API Gateway and across languages such as R and Python, with Unix command and shell scripting on servers. Skilled in Scala-based ETL/ELT extraction, data modeling, and integrations with internal and external business platforms, including handling missing data, data-processing templates, and categorical data in R.

Languages

English
Advanced
Hindi
Advanced
Marathi
Advanced

Education

Oct 2009 - Jun 2012

AIET College, Rajasthan Technical University

BCA, Specialisation in Computer Science and Mathematics · India

Certifications & licenses

AWS 2.0 Cloud

GCP

Python
