Site Reliability Engineer (Kubernetes)

TCWGlobal • Full-time • San Jose, CA, US

US citizen or Green Card holder (W2 contract)

San Jose, CA 95134 (Hybrid; local candidates only)

$80-110/hr (weekly pay + benefits)

6-month contract (excellent potential for extension)

Full-time: M-F, 8am-5pm (onsite 2 days a week)

***Please note: This role is only accepting candidates who currently live in San Jose, CA.

Our client is a cloud security company that offers enterprise cloud security services to the world's most established companies. Named a Best Workplace in Technology by Fortune and others, they foster an inclusive and supportive culture that is home to some of the brightest minds in the industry. The team is looking for someone who can thrive in a fast-paced, collaborative environment and who is passionate about building and innovating for the greater good.

About the Role:

We are seeking a skilled and experienced Site Reliability Engineer (SRE) to join our team. The primary focus of this role is to develop and maintain a comprehensive observability solution for our Kubernetes-based applications. The ideal candidate will be proficient in using various monitoring and logging tools to ensure the reliability and scalability of our services.

Key Responsibilities:

● Design and Implementation: Develop and implement observability solutions for Kubernetes-based applications using Fluent Bit, CloudWatch, Stackdriver, Grafana Loki, Grafana Tempo, Prometheus, Envoy health probes, OpenTelemetry, and ArgoCD.

● Monitoring and Logging: Configure and maintain logging pipelines using Fluent Bit to collect, process, and route logs for storage and analysis.

● Metrics and Tracing: Set up Prometheus for metrics collection and Grafana Tempo for distributed tracing. Integrate these with Grafana, via OpenTelemetry, for real-time monitoring and alerting.

● Telemetry: Use OpenTelemetry to instrument applications for better traceability and observability.

● CI/CD: Use ArgoCD for continuous deployment, and integrate the observability tools into the CI/CD pipeline so that the observability suite is deployed through it.

● Observability Optimization: Analyze and optimize the performance of the observability stack to ensure minimal overhead and maximum efficiency.

● Troubleshooting: Proactively identify and resolve issues related to the observability infrastructure. Collaborate with development and operations teams to troubleshoot and resolve incidents.

● Documentation and Training: Document observability processes and best practices. Provide training and support to other team members on the observability tools and techniques.

Required Qualifications:

4+ years of experience as a Site Reliability Engineer, Product Reliability, or similar role in a Kubernetes environment
*US citizenship or Green Card holder
Experience with observability stacks in Kubernetes across multiple workloads
Experience implementing stacks in Kubernetes
Understanding of API gateway files
Experience running multiple app stacks
Understanding of SLIs (service level indicators) in product teams
Telemetry: Experience using OpenTelemetry to instrument applications for better traceability and observability.
Experience with a strong focus on observability in Kubernetes environments, supporting applications running on EKS in AWS.
Kubernetes: In-depth knowledge of Kubernetes and container orchestration.
Experience developing and maintaining a comprehensive observability solution for Kubernetes-based applications.
Technologies: Hands-on experience with Fluent Bit, CloudWatch, Stackdriver, Grafana Loki, Grafana Tempo, Prometheus, Envoy health probes, OpenTelemetry, and ArgoCD.
Scripting and Automation: Proficiency in scripting languages such as Python, Bash, or similar for automation tasks.
Monitoring and Logging: Strong understanding of monitoring, logging, and tracing concepts and best practices.
Collaboration: Strong communication skills and the ability to work effectively in a team environment.

Bonus Qualifications:

Certifications: Relevant certifications such as Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD)
Cloud Platforms: Experience with cloud platforms such as AWS and EKS.
DevOps Practices: Familiarity with DevOps practices and tools.

Please send your resume. Thank you!