Lead Data Architect/Engineer - Machine Learning
- Hybrid On-Site, South San Francisco
- Please include your visa status, if applicable, when applying.
- My email is shane@alldus.com
About My Client:
They are a cutting-edge biotech startup focused on drug discovery, building on years of world-class research in cellular stress from the co-founders and top academic institutions. To that end, they are developing new custom robotic high-throughput wet labs integrated with AI-powered software solutions.
About the Role:
We are looking for an experienced, driven Lead Data Architect/Engineer - ML to join our rapidly growing AI and ML team. In this role, you will be responsible for designing and developing scalable data model and lakehouse infrastructure, and for collaborating with ML scientists to develop data pipelines and manage data curation. The technical stack you'll be working on will support our unique, phenotypic, target-agnostic drug discovery platform.
Key Responsibilities:
- Architect and maintain a cloud-based data lakehouse using a modern stack (AWS preferred), including Python, S3, Batch, Lambda, EKS, IAM, and REST (knowledge of Redshift, Glue, Athena, ECR, and Parquet is advantageous)
- Build and manage ETL processes and real-time data pipelines to collect, organize, and process data from internal and public sources (experience with biopharma-related data like imaging, omics, and molecular datasets is preferred)
- Take ownership of the complete data model lifecycle, including managing both structured and unstructured data for ML training, inference, and statistical analyses
Required Qualifications:
- A Bachelor’s degree in computer science, engineering, statistics, or a related field (Master’s degree or equivalent professional experience is a plus)
- Proven experience and expertise with large-scale datasets, data visualization, and developing optimized data processes for machine learning
- 5+ years of experience working with cloud-based services and data systems, with skills in:
  - Python for data modeling, ETL, and warehousing
  - SQL and NoSQL databases (experience with non-relational databases such as object, graph, or columnar stores is a plus)
  - Automated CI/CD processes and cloud-based build pipelines
- Strong understanding of production environments, including Agile development, version control, and regression testing
- Familiarity with data governance and managing ML workflows (experience with tools like Spark, Databricks, and MLflow is preferred)
Preferred Qualifications:
- Experience developing software in a startup environment
- Background in biotechnology or drug discovery
- Ability to work independently while excelling in a team environment, with a strong focus on data-driven decision-making and clear communication