Job Description:
We are seeking a detail-oriented Machine Learning Data Engineer to join our client’s team. As an ML Data Engineer, you will be responsible for designing, building, and maintaining scalable data pipelines that ingest, transform, and load data from various sources into cloud-based systems. You will work closely with machine learning teams to ensure that data is accurate, enriched, reliable, and readily available for analytics and model training.
Responsibilities:
- Design and Build Data Pipelines: Create efficient, reliable, streamable, and scalable data pipelines using industry-standard tools and techniques such as TorchData, WebDataset, Apache Parquet, Python, and SQL.
- Data Ingestion: Develop strategies for ingesting data from data providers while ensuring data quality and consistency.
- Data Pre-processing: Implement parallel pre-processing techniques to clean, transform, de-duplicate, combine, and normalize the data.
- Data Curation and Enrichment: Curate, augment, and enrich existing datasets to improve their quality and provide valuable insights to stakeholders.
- Synthetic Data Generation: Collaborate with synthetic data teams to generate artificial datasets that can be incorporated into existing pipelines.
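As a rough illustration of the pre-processing responsibility above, the following is a minimal sketch in plain Python of a clean → de-duplicate pipeline stage run in parallel. The function names, the single `text` field, and the thread-based parallelism are illustrative assumptions, not part of the role; a production pipeline would more likely use TorchData or WebDataset operators and process-based workers for CPU-bound transforms.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def clean(record: dict) -> dict:
    # Illustrative cleaning step: strip whitespace and lower-case the text field.
    return {**record, "text": record["text"].strip().lower()}

def fingerprint(record: dict) -> str:
    # Hash the cleaned text so duplicate records can be detected cheaply.
    return hashlib.sha256(record["text"].encode("utf-8")).hexdigest()

def preprocess(records: list[dict]) -> list[dict]:
    # Clean records in parallel (use a process pool for CPU-bound work),
    # then de-duplicate by content hash, keeping the first occurrence.
    with ThreadPoolExecutor() as pool:
        cleaned = list(pool.map(clean, records))
    seen, unique = set(), []
    for rec in cleaned:
        h = fingerprint(rec)
        if h not in seen:
            seen.add(h)
            unique.append(rec)
    return unique
```

In a real pipeline the de-duplication set would typically be replaced by an external store or a streaming-friendly structure (e.g. a Bloom filter) so the stage scales beyond memory.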
Your Qualifications Include:
- Bachelor’s degree in Computer Science, Information Technology, or a related field
- A minimum of three years of experience as a Software Engineer or Data Engineer
- Strong proficiency in Python along with excellent software engineering skills
- Experience with data processing tools and formats such as Apache Parquet, WebDataset, TorchData, Pandas, shell scripting, Protobuf, and TFRecord