Lead Data Architect / Engineer

Alldus • Full-time • San Francisco Bay Area, US • $200k - $240k / year • 7h ago

Lead Data Architect/Engineer - Machine Learning

Hybrid On-Site, South San Francisco
Please include your visa status, if applicable, when applying.
My email is shane@alldus.com

About My Client:

They are a cutting-edge biotech startup focused on drug discovery, building on years of world-class research in cellular stress by the co-founders and top academic institutions. To that end, they are developing new custom robotic high-throughput wet lab(s) integrated with AI-powered software solutions.

About the Role:

We are looking for an experienced, driven Lead Data Architect/Engineer - ML to join our rapidly growing AI and ML team. In this role, you will be responsible for designing and developing scalable data model and lakehouse infrastructure, and for collaborating with ML scientists to develop data pipelines and manage data curation. The technical stack you'll be working on will support our unique, phenotypic, target-agnostic drug discovery platform.

Key Responsibilities:

Architect and maintain a cloud-based data lakehouse using a modern stack (AWS preferred), including Python, S3, Batch, Lambda, EKS, IAM, and REST (knowledge of Redshift, Glue, Athena, ECR, and Parquet is advantageous)
Build and manage ETL processes and real-time data pipelines to collect, organize, and process data from internal and public sources (experience with biopharma-related data like imaging, omics, and molecular datasets is preferred)
Take ownership of the complete data model lifecycle, including managing both structured and unstructured data for ML training, inference, and statistical analyses

Required Qualifications:

A Bachelor’s degree in computer science, engineering, statistics, or a related field (Master’s degree or equivalent professional experience is a plus)
Proven experience and expertise with large-scale datasets, data visualization, and developing optimized data processes for machine learning
5+ years of experience working with cloud-based services and data systems, with skills in:
  Python for data modeling, ETL, and warehousing
  SQL and NoSQL databases (experience with non-relational databases such as object, graph, or columnar stores is a plus)
  Automated CI/CD processes and cloud-based build pipelines
Strong understanding of production environments, including Agile development, version control, and regression testing
Familiarity with data governance and managing ML workflows (experience with tools like Spark, Databricks, and MLflow is preferred)

Preferred Qualifications:

Experience developing software in a startup environment
Background in biotechnology or drug discovery
Ability to work independently while excelling in a team environment, with a strong focus on data-driven decision-making and clear communication