Site Reliability Engineer (ClickHouse Specialist) - Remote, USA

LangDB • Full-time • Remote (San Francisco Bay Area, US) • 4w ago

Location: Remote (USA or Europe Timezones)

Job Type: Full-Time

About Us

LangDB is a platform that lets businesses build Retrieval-Augmented Generation (RAG) and Generative AI applications using SQL. It integrates with existing data infrastructure such as ClickHouse, Snowflake, and Databricks, and interfaces with AI models including those from OpenAI and Anthropic, Llama, and others. LangDB is backed by Sequoia and Google, positioning us as a leader in the emerging AI and data space.

Job Description

As a Site Reliability Engineer (SRE) specializing in ClickHouse, you will be at the forefront of maintaining and optimizing our data infrastructure. Your primary responsibility will be to ensure the reliability, availability, and performance of our ClickHouse-based systems. You will collaborate closely with our engineering, DevOps, and product teams to develop, deploy, and manage highly scalable and resilient systems.

Key Responsibilities:

Design, implement, and maintain scalable ClickHouse clusters to ensure high availability and fault tolerance.
Monitor the performance and reliability of ClickHouse systems, identifying and resolving bottlenecks and issues.
Automate deployment, scaling, and management of ClickHouse infrastructure using tools like Terraform, Ansible, or Kubernetes.
Collaborate with developers to optimize application queries and data models for better performance.
Implement best practices for monitoring, alerting, and incident response related to ClickHouse.
Participate in on-call rotations to manage and resolve production incidents, ensuring minimal downtime.
Conduct root cause analysis for incidents and implement solutions to prevent recurrence.
Continuously improve the observability and resilience of our systems, focusing on ClickHouse operations.
Document and share knowledge on ClickHouse deployment, monitoring, and optimization practices.

Qualifications:

Proven experience as a Site Reliability Engineer or in a similar role, with a strong focus on ClickHouse.
Deep understanding of ClickHouse architecture, including clustering, replication, and sharding.
Experience with infrastructure-as-code tools (e.g., Terraform, Ansible) and with containers and container orchestration (e.g., Docker, Kubernetes).
Proficiency in scripting languages (e.g., Python, Bash) for automation and system management.
Strong knowledge of monitoring tools and best practices (e.g., Prometheus, Grafana).
Experience with cloud platforms (AWS, GCP, Azure), including running and managing ClickHouse in cloud environments.
Excellent problem-solving skills and the ability to perform under pressure.
Strong communication skills and the ability to work effectively in a collaborative team environment.
Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent work experience).

Preferred Qualifications:

Experience with other columnar databases or data warehouses (e.g., Snowflake, Redshift).
Knowledge of CI/CD pipelines and best practices for deploying ClickHouse in production.
Familiarity with security best practices in data infrastructure management.

Why Join Us?