Data Engineering & Pipelining Lead

Arrayo • Full-time • Greater Boston, US • 3d ago

We are excited to be expanding our Life Science Data Analytics Platform Group in our Boston and Cambridge offices, and we are looking for a Data Science & Pipelining Lead to join the group.

At Arrayo, we make sure that data assets are available and accessible for advanced analytics so that their inherent value can be realized more readily. Arrayo assists top-tier life sciences clients in implementing effective data analytics strategies.

The Data Science & Pipelining Lead will be responsible for driving the socialization and utilization of technologies, algorithms, models, and methods for science-driven data analytics R&D projects. As an Arrayo team member, you will work to understand users’ requirements and drive the definition, design, implementation, and validation of cutting-edge pipelines and models used to process and analyze diverse sources of data.

Responsibilities:

Develop data flow pipelines to extract, transform, and load data from various data sources in various forms
Work in collaboration with key scientific personnel to build, test, adapt, support, and validate pipelines with integration into production systems
Manage the definition, design, implementation, and validation of data pipelines and models to analyze data from diverse sources to achieve targeted outcomes
Write custom scripts to extract data from unstructured and semi-structured sources
Use advanced pipeline technologies, including Metaflow, Prefect, Nextflow, Airflow, Cromwell, KNIME, Databricks, Luigi, petl, and AWS Data Pipeline
Leverage big-data technologies for data processing, including Apache Spark, Kubernetes, Apache Pulsar, and AWS (Lambda, S3, Athena)
Deliver solutions in a rapid, agile cycle
Contribute to many different projects in a dynamic, fast-moving environment
Deliver on data-driven research projects
Collaboratively translate scientific and business questions into data and analytics requirements
Drive rapid prototyping for further implementation of analytical products
Partner with subject matter experts (SMEs) to translate modeling outputs into business language
Work with IT resources to enable appropriate data flows and data models

Requirements:

· B.S. with 6+ years of experience in Bioinformatics, Genomics, Genetics, Computer Science, or a related field; M.S. with 4+ years or Ph.D. with 2+ years of experience, or equivalent, is preferred

· Knowledge of a subset of analytical approaches (e.g., machine learning, statistical analysis, predictive modeling, visual analytics)

· Proficiency building, running and monitoring pipelines on cloud computing environments

· Experience with commonly used command-line NGS tools (BWA, SAMtools, Bowtie2, Picard, Pindel, GATK, etc.) is a plus

· Ability to understand and communicate statistical measures for interrogating the quality of data manipulation is preferred

· Demonstrated ability to communicate efficiently and work effectively with a team of scientists

· Experience with SQL and relational database modeling; PostgreSQL experience preferred

· Experience using and/or designing web services and REST APIs

· Knowledge of software development best practices: agile, unit/integration testing, Git

· Experience working in cloud computing environments (e.g., AWS, Azure) is preferred

· Preferred: Seasoned data engineer and/or bioinformatician with experience in large-scale healthcare and/or life sciences data and applications.
