- Data Pipeline Development: Design, build, and maintain scalable ETL/ELT pipelines for batch and real-time data processing.
- Data Ingestion & Integration: Collect and integrate data from various structured and unstructured sources (APIs, databases, IoT, logs).
- Database & Warehouse Management: Optimize data storage solutions, ensuring efficient querying and retrieval.
- Data Transformation: Implement data cleaning, transformation, and enrichment processes to support analytics and ML workloads.
- Performance Tuning: Optimize data processing performance by improving query execution, indexing, and storage formats.
- Monitoring & Troubleshooting: Identify and resolve pipeline failures, data inconsistencies, and system bottlenecks.
- Data Security & Compliance: Ensure encryption, masking, and governance policies are applied to protect sensitive data.
- Collaboration: Work closely with AI/ML engineers, analysts, and business teams to define data needs and solutions.
- Automation & CI/CD: Implement automated testing, deployment pipelines, and infrastructure-as-code practices for data workflows.
- Documentation: Maintain detailed documentation on data schemas, workflows, and best practices.
- Continuous Learning: Stay updated with new technologies, frameworks, and industry best practices to enhance data engineering capabilities.
- 5–8 years of hands-on experience in designing, building, and managing scalable data pipelines.
- Proficiency in Python, SQL, or Java for data processing.
- Technical experience with big data, data science, and public cloud platforms.
- Strong experience with distributed computing frameworks such as Apache Spark or Hadoop.
- Expertise in designing and managing data warehouses on platforms such as Redshift or BigQuery.
- Experience optimizing database queries and stored procedures to improve the performance of AWS Batch jobs.
- Hands-on experience building and optimizing ETL/ELT pipelines with orchestration tools such as Apache Airflow (a minimal DAG sketch appears after this list).
- Experience with AWS (Glue, Redshift, S3), GCP (BigQuery, Dataflow), or Azure. Cloud certifications are a plus.
- Knowledge of real-time data processing tools such as Apache Kafka, Flink, or Kinesis (see the consumer sketch after this list).
- Experience with relational (PostgreSQL, MySQL) and NoSQL (e.g., DynamoDB) databases.
- Understanding of data privacy, compliance (GDPR, HIPAA), and best practices for secure data handling.
- Experience with DevOps practices, version control (Git), and infrastructure as code.
- Ability to optimize SQL queries, storage formats (Parquet, ORC), and processing frameworks for efficiency (see the Parquet sketch after this list).
- Strong ability to work with data scientists, ML engineers, and business teams to meet data needs.
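
To give a concrete flavor of the pipeline orchestration called for above, here is a minimal sketch of a daily ETL DAG. It assumes a recent Airflow 2.x release; the DAG name, schedule, and the extract/transform/load callables are placeholders, not a prescribed implementation.

```python
# Minimal ETL DAG sketch (assumes Airflow 2.4+; names and schedule are illustrative).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: pull raw records from a source API or database.
    ...


def transform():
    # Placeholder: clean and enrich the extracted records.
    ...


def load():
    # Placeholder: write the transformed records to the warehouse.
    ...


with DAG(
    dag_id="example_etl",             # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Run the three stages in sequence.
    extract_task >> transform_task >> load_task
```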
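
For the real-time processing requirement, a streaming job often starts with a simple consumer loop. The sketch below assumes the kafka-python client; the topic name, broker address, and consumer group are hypothetical.

```python
# Minimal Kafka consumer sketch (assumes the kafka-python package; all names are placeholders).
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",                                  # hypothetical topic
    bootstrap_servers="localhost:9092",        # placeholder broker address
    group_id="example-consumer-group",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    # Each message value is a dict after JSON deserialization;
    # route it to the downstream transformation or sink of your choice.
    print(message.value)
```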
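
For the storage-format optimization requirement, converting raw files to partitioned Parquet is a common first step, since columnar layout and partition pruning cut scan costs. The PySpark sketch below uses hypothetical S3 paths and column names.

```python
# Sketch of converting raw CSV to partitioned Parquet with PySpark (paths/columns are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet_optimization").getOrCreate()

# Read raw CSV input; schema inference kept simple for brevity.
raw = spark.read.option("header", True).csv("s3://example-bucket/raw/events/")

# Write as Parquet partitioned by date so downstream queries can prune partitions.
(
    raw.repartition("event_date")
       .write.mode("overwrite")
       .partitionBy("event_date")
       .parquet("s3://example-bucket/curated/events/")
)
```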