About Bagel:
Imagine a breakthrough on the scale of the discovery of fire or the formulation of gravitational theory. Artificial General Intelligence (AGI) has the potential to reshape our future, accelerating solutions to challenges like poverty and climate change. Achieving AGI with proper alignment is the most important problem to solve in our lifetime.
Bagel leads in privacy-preserving machine learning. We provide a decentralized AI ecosystem that lets enterprises use AI while ensuring data privacy and compliance. By enabling the secure trade, training, and fine-tuning of AI models with public and private datasets, Bagel is paving the way for the future of AI development.
Job Overview:
We seek a Site Reliability Engineer (SRE) proficient in AWS, Google Vertex AI, Terraform, and AWS CDK. The role requires expertise in cloud infrastructure, monitoring, and automation, as well as experience with HuggingFace and machine learning model deployment. You bring more than 5 years of experience and a strong foundation in computer science, and you will join our team to ensure the reliability and performance of our AI solutions.
Key Responsibilities:
- Cloud Infrastructure: Lead the deployment and management of cloud infrastructure on AWS and Google Vertex AI. Ensure high availability, scalability, and security of the platform.
- Automation: Develop and maintain infrastructure as code using Terraform. Implement automated deployment pipelines and monitoring solutions.
- System Monitoring: Set up and manage monitoring and alerting systems to proactively identify and resolve issues. Ensure system performance and reliability.
- AI Model Deployment: Collaborate with AI engineers to deploy and manage machine learning models using HuggingFace and other relevant tools.
- Collaboration: Work closely with software engineers, AI researchers, and product managers to ensure seamless integration and performance of AI solutions.
- Technical Leadership: Provide technical guidance and mentorship to junior engineers. Foster a culture of excellence and continuous improvement.
- Problem Solving: Troubleshoot and resolve technical issues related to cloud infrastructure and AI model deployment. Provide expert support and solutions.
- Innovation: Stay updated on advancements in cloud technologies and AI infrastructure. Implement new technologies and best practices to enhance Bagel's platform.
Qualifications:
- Experience: Minimum of 5 years in site reliability engineering or a related field, with significant experience in AWS, Google Vertex AI, and Terraform. Experience with HuggingFace and AI model deployment is a plus.
- Education: Bachelor's degree in Computer Science, Engineering, or a related field. An advanced degree is preferred.
- Skills:
  - Proficiency in cloud technologies (AWS, Google Vertex AI) and infrastructure as code (Terraform).
  - Experience with monitoring and alerting tools (e.g., Prometheus, Grafana).
  - Excellent problem-solving and analytical skills.
  - Proven ability to design and implement reliable and scalable infrastructure solutions.
  - Strong collaboration and communication skills.
  - Ability to adapt to changing priorities in a dynamic environment.
Why Join Bagel:
- Innovative Environment: Join a company that is revolutionizing the AI landscape with decentralized technologies.
- Impactful Work: Contribute to a platform that addresses challenges in AI development and data privacy.
- Growth Opportunities: Be part of a growing startup that offers career advancement and professional development.
- Collaborative Culture: Work with a dedicated team committed to driving technological innovation.