Senior Site Reliability Engineer

Lexicon Solutions • Full-time • Portland, Oregon Metropolitan Area, US • $115k - $145k / year • 3w ago

Site Reliability Engineer III - Kubernetes Administration

Full-Time / Hybrid schedule (fully remote possible if outside the area)
The target salary for this role is $115-145K/year + a 10% bonus (paid out annually).
Preferably near a physical office. Portland (OR) is preferred, but Atlanta (GA), Pleasanton (CA), Raleigh (NC), and New York City are also options. Will also consider fully remote for strong candidates in other markets.

The ideal candidate would have the following:

MUST have strong Kubernetes experience.
Experience building Kubernetes clusters and working actively at the admin level.
MUST have experience with Terraform.
This will be on-prem Kubernetes work, using Linux servers.
This role is part of the Chief Technology Office (so it rolls up to the CTO). There are 4 SREs on this team, and there are 3 teams of 4 SREs in total.
As far as career advancement goes, Principal SRE would be the next step on the development side, unless you move into management.
The company currently has 30% market share in their captive market. They are growing rapidly and looking to be the major player in their space.
Like many SaaS companies, they are majority-owned by private equity sponsors. They brought in a new CEO about two years ago, who brought in a new C-suite for the business. Things have been pretty stable since that transition and they are on a nice upward trajectory. They are striving to be a “Rule of 40 SaaS Company”.
SREs don’t get laid off, so there is really strong job stability in this role. SRE is a high-demand skill set and foundational to their business: SREs keep the platform going, and maintaining 99.9% uptime is part of the SRE function.
Would love to find candidates with experience in: Kubernetes (EKS Anywhere), Istio, Flux, GitOps.

Summary:

As a Sr. Site Reliability Engineer, you are instrumental in helping make our petabyte-scale, Kubernetes-centric ProArchive application resilient. This position will coordinate with multiple teams to develop a migration plan for various components and services, as well as implement best practices for our tech stack. A person in this position will have a passion for getting things done across various functions, including automation, CI/CD, infra components, middleware, etc. You’ll work closely with our Dev Engineering, QA, and Platform Engineering groups to manage our current on-prem deployments and our on-prem and cloud-native infrastructures.

How will you contribute?

Help define technology choices, best practices and process for the team.
Develop and maintain documentation standards for the team.
Develop new tools and libraries for broader use by SaaS Operations and Engineering teams. Enable engineering teams to discover and understand problems more quickly.
Work with product architects and make suggestions for architectural changes and design platform component roadmaps.
Act as a subject matter expert (SME) for assigned components and functions. Develop skills as required to become the SME for components that need one.
Assist engineering teams in deep troubleshooting and application code review to find opportunities to improve performance and scalability.
Work closely with Engineering and peer SRE teams to design and use our coding standards and best practices.
Respond to incidents coordinated by SRE and Incident Response teams. Act as an Incident Commander during incidents.
Participate in the escalation and off-hours on-call schedules.
Adopt and embrace the qualities of an SRE as defined in the team charter. Help set that standard for the rest of the team.
Mentor and train junior members of the team. Design training curriculum for the team.

What will you bring?

Minimum of 7 years of industry experience.
BS in CS or equivalent combination of education and experience.
Strong experience operating Kubernetes in production environments – EKS Anywhere is preferred
Experience with middleware systems (Kafka, AMQ, Redis, Memcached, etc.)
Experience managing CI/CD systems (Flux, Concourse)
Experience deploying and/or operating an observability stack (Splunk, Datadog, Grafana)
Experience with large scale systems
Familiarity with working with PostgreSQL and MongoDB
Background working in a multi-platform environment (Linux, Windows)
Familiarity with programming/scripting languages (e.g., Python, Bash, PowerShell, Go)
Familiarity with Agile/Scrum/Kanban methodologies
Strong interpersonal skills with a can-do attitude and a sense of urgency for a high-growth, fast-paced environment
Curious mind, wanting to learn new technologies and share with others.
The ability to think outside of the box to resolve issues and create solutions