Onsite role; candidates must be local.
A valid LinkedIn profile is required.
We are seeking an experienced Data Engineer with 7+ years of expertise in building scalable data pipelines, data streaming, and batch processing solutions.
Key Responsibilities
- Design, build, and maintain real-time data pipelines using Apache Kafka and Spark Streaming for processing large volumes of streaming data.
- Develop and optimize scalable data ingestion and data processing pipelines leveraging AWS data services such as Kinesis, Glue, Redshift, and S3 for both batch and streaming data architectures.
- Use AWS Kinesis to capture, process, and analyze real-time streaming data, integrating it with Kafka and Spark pipelines.
- Build and manage data lakes on AWS S3, designing storage layers that support both structured and unstructured data in real time.
- Implement event-driven architectures using Kafka and AWS Lambda to trigger processing pipelines based on incoming data events.
- Collaborate with data scientists, data analysts, and backend developers to ensure proper data modeling and pipeline design for real-time analytics and business intelligence on AWS.
- Use AWS Redshift for data warehousing and Athena for serverless querying of data, integrating with streaming data pipelines.
- Monitor and optimize the performance of streaming applications and data pipelines to ensure high efficiency, scalability, and low-latency processing, using AWS cloud-native monitoring tools such as CloudWatch and AWS X-Ray.
- Ensure data governance, security, and compliance within AWS services, particularly when working with sensitive data and real-time processing.
- Deploy and manage infrastructure as code using AWS CloudFormation or Terraform to automate the provisioning of data streaming environments.
Requirements
- Strong experience with Apache Kafka, Kafka Streams, and Spark Streaming for building real-time data pipelines.
- Expertise in AWS data services such as Kinesis, S3, Glue, Lambda, Redshift, and Athena for building and scaling data solutions in the cloud.
- Hands-on experience with big data frameworks like Apache Spark for processing large-scale data sets.
- Proficiency in Python, PySpark, and Java for building and maintaining data pipelines.
- Strong understanding of real-time data processing architectures and distributed systems.
- Experience with stream processing frameworks such as Apache Flink, Storm, or NiFi.
- Familiarity with AWS data lake architectures and best practices for storing and querying data in S3.
- Knowledge of database technologies, including NoSQL (e.g., DynamoDB, Cassandra) and SQL databases.
- Experience with containerization and orchestration tools like Docker and Kubernetes for deploying applications in AWS environments.
- Strong experience in monitoring, troubleshooting, and optimizing real-time streaming pipelines using AWS services such as CloudWatch, X-Ray, and Step Functions.
- Experience with data governance, security, and compliance when working with real-time data in cloud environments.