Senior Manager - DevOps

Stanley David and Associates • Full-time • New Jersey, United States • $125k - $155k / year • 6d ago

Purpose of your role:

As Senior Manager, DevOps, you’ll play a pivotal role in enhancing the reliability, scalability, and efficiency of our Intelligent Interactions platform, and in curating the pipeline to maximize developer productivity and ensure smooth deployments. Specifically, you’ll work at the innovation edge of our product portfolio on the Workflow engine, the core product of our Intelligent Interactions initiative. You’ll collaborate with our engineering teams to architect, design, and implement solutions that ensure the availability and stability of the platform, with particular attention to constant recovery in a Kubernetes-based environment.

You will be working amid a set of existing products and infrastructure; within the teams, however, the expectation is that of a startup: fast-paced and driven by individual initiative. You will be expected to take ownership of your work and to bring forward insights and proposals that continuously improve our services.

Key Responsibilities:

Develop: Build (think infrastructure as code) a robust, scalable, and secure infrastructure architecture on AWS, centered on a Kubernetes-based environment and tailored to a messaging product.

Architect for scalability: Architect infrastructure that scales with our applications (horizontal scalability in a Kubernetes infrastructure modeled to reflect the application design), planning for both current requirements and future growth, with a deep understanding of the challenges of managing large-scale distributed systems that have real-time, low-latency requirements (voice, WebSocket product).

Constant Recovery: Design and implement systems for near-stateless operation, with services expected to be ephemeral, along with support for disaster recovery (DR) and multi-region deployments.

Monitoring and Performance: Implement monitoring, logging, and alerting tools to identify and address issues proactively. Analyze system performance, identifying bottlenecks and optimizing for peak performance. Understand enough of the software to be able to support developers in engineering services that are optimized for the available infrastructure and vice versa.

Automation and Tooling: Develop and implement CI/CD pipelines, automation scripts, and infrastructure as code (IaC) to streamline deployment and management processes.

Incident Management: Lead incident response and post-mortems, driving continuous improvement in reliability and availability, and participate in the on-call rotation.

Collaboration: Work closely with development, QA, and product teams to enhance the reliability and performance of the platform.

Experience & Qualification:

5+ years of experience in Site Reliability Engineering, DevOps, or similar roles, gained across at least two different companies.

In-depth experience with AWS services, Kubernetes, and Linux systems.

Proven experience designing and implementing large-scale, highly available, and fault-tolerant systems.

Demonstrable experience managing Java-based applications, ideally with the Spring Boot framework.

Extensive experience managing and scaling Kafka clusters.

Extensive, demonstrable experience with Jenkins CI/CD pipelines.

Deep understanding of scalability issues and solutions in distributed environments.

Advanced knowledge of infrastructure automation tools (Terraform, Ansible, etc.).

Experience with search and log analytics tools (Graylog, Elasticsearch, Kibana) used to monitor, analyze, and optimize system performance.

Proficiency in scripting languages (Python, Bash, etc.).