ML Infrastructure Engineer [28681]

Stealth • Full-time • Remote (San Francisco Bay Area, US) • 2d ago

About Us:

We are a cutting-edge AI company focused on building robust, scalable machine learning solutions that drive innovation across industries. As we grow, we are looking for an ML Infrastructure Engineer to design and maintain the infrastructure that supports our machine learning operations, enabling our data scientists and engineers to build and deploy high-performance models at scale.

Key Responsibilities:

Architect, build, and maintain scalable machine learning infrastructure to support large-scale data processing, model training, and deployment.
Optimize ML workflows, from data ingestion to feature engineering, model training, and model serving in production.
Collaborate with data scientists, ML engineers, and DevOps teams to ensure the seamless integration and deployment of models in a production environment.
Automate and manage ML pipelines, ensuring continuous training and monitoring of models with minimal manual intervention.
Ensure that the infrastructure supports rapid experimentation, tuning, and evaluation of new models.
Stay up to date with the latest cloud technologies and best practices to improve the performance and reliability of ML systems.

Qualifications:

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field from a top-tier institution (e.g., Ivy League schools, Stanford, UC Berkeley, Waterloo).
2-4 years of experience designing and managing machine learning infrastructure.
Strong experience with cloud platforms (AWS, GCP, or Azure), including knowledge of managed ML services like SageMaker, Vertex AI, or Azure ML.
Proficiency in Python, with experience in ML libraries (TensorFlow, PyTorch, Scikit-learn).
Experience with containerization and orchestration technologies (e.g., Docker, Kubernetes) and microservice architecture for deploying scalable ML models.
Strong knowledge of MLOps principles and tools (e.g., MLflow, Kubeflow, Airflow).
Experience with distributed computing frameworks (e.g., Apache Spark, Ray) and parallelization for large-scale model training.
Knowledge of continuous integration/continuous deployment (CI/CD) best practices in ML environments.

Preferred Skills:

Experience with data versioning and experiment tracking tools (e.g., DVC, Weights & Biases).
Background in managing high-performance computing resources (GPUs, TPUs) for training deep learning models.
Familiarity with modern data warehouses and data lakes (e.g., Snowflake, Delta Lake) for large-scale data management.
Experience with security and compliance for ML systems, ensuring data privacy and regulatory adherence.
Strong problem-solving skills and the ability to work effectively in a fast-paced, dynamic environment.

Why Join Us?

Lead the development of cutting-edge ML infrastructure that powers AI solutions with global impact.
Work alongside a world-class team of engineers and data scientists at the forefront of AI technology.
Competitive salary and benefits package, with opportunities for career growth.
Flexible, remote work environment focused on innovation, collaboration, and work-life balance.