Backend Software Engineer / Distributed Systems
Distributed ML Training
Location: Fully Remote
Type: Full-time
Join an innovative Series A Deep Tech company at the forefront of AI and blockchain technology! Backed by top investors with over $50 million in funding, our client is a team of 20 industry experts that plans to grow to 35. They are leveraging blockchain to provide globally accessible computing resources for AI platforms and are seeking world-class engineers to accelerate AI progress. This is a fully remote role offering a high level of autonomy.
Responsibilities:
- ML Orchestration System Design: Develop systems for orchestrating ML execution across decentralized and heterogeneous infrastructure.
- Performance Optimization: Profile and optimize training algorithms continually.
- Implement Novel Research: Build new mechanisms and algorithms to solve unprecedented problems.
- Engineering Support: Collaborate on broader ML issues, such as reproducible training.
- Technical Writing and Engagement: Contribute to technical reports and papers, and engage with the community.
Minimum Requirements:
- Distributed Foundation Model Training: Experience designing or working with training systems on large clusters.
- Networking Proficiency: Understanding of, and troubleshooting experience with, IP, TCP, UDP, and HTTP, as well as communication backends such as NCCL, Gloo, and MPI.
- Open Source Contributions: Experience with large open-source codebases as a maintainer or trusted contributor.
- Rust Enthusiasm: Willingness to learn Rust to work across the codebase.
- Computer Science Background: Solid understanding of computational complexity and broad knowledge of algorithms and data structures.
- Self-motivation and Communication: Highly self-motivated with excellent verbal and written communication skills.
- Applied Research Comfort: Comfortable working in a high-autonomy, unpredictable applied research environment.
Bonus Skills:
- Rust Expertise: Strong experience with systems programming in Rust, including an understanding of lifetimes and the purpose of Pin.
- Research Experience: Published research in distributed systems or ML domains.
- Blockchain Knowledge: Understanding of blockchain fundamentals.
Be part of a team dedicated to democratizing AI, where you can leverage your expertise in distributed ML training, networking, and open-source contributions to make a significant impact. Embrace autonomy, continuous learning, and the drive to push innovative solutions in a highly collaborative and flexible environment.
Apply now to join this cutting-edge team and contribute to the future of AI!