Site Reliability Engineer

Amtex Systems Inc. • Contract • Plano, TX, US • 1w ago

Title: Site Reliability Engineer

Location: Plano, TX

Duration: 6+ months

Local candidates only

Experience Level: 10+ years

• Should be a strong SRE with experience in Java, AWS, DevOps, deployment strategy, and monitoring tools. Candidates should have extensive hands-on experience with Dynatrace, Splunk, CI/CD, Grafana, etc.

• Looking for a candidate with very good application troubleshooting experience. The focus is on core SRE metrics and concepts applied before going to production, such as uptime vs. availability, monitoring vs. observability, incidents and outages, etc.

• Should be familiar with SLO, SLA, SLI, and other SRE terms.

• Experience deploying through CI/CD pipelines, debugging and troubleshooting issues, and coordinating with application teams working in Java, Spring Boot, Python, .NET, etc.

• Ability to perform API performance testing using tools such as JMeter or BlazeMeter.

• Experience performing root cause analysis (RCA) for production issues in an AWS environment with multiple microservices.

• Expertise in Terraform to manage infrastructure as code would be highly desirable.

Job Responsibilities:

• Demonstrates and champions site reliability culture and practices and exerts technical influence throughout your team.

• Leads initiatives to improve the reliability and stability of your team’s applications and platforms using data-driven analytics to improve service levels.

• Collaborates with team members to identify comprehensive service level indicators, and with stakeholders to establish reasonable service level objectives and error budgets with customers.

• Demonstrates a high level of technical expertise within one or more technical domains and proactively identifies and solves technology-related bottlenecks in your areas of expertise.

• Acts as the main point of contact during major incidents for your application and demonstrates the skills to identify and solve issues quickly to avoid financial losses.

• Documents and shares knowledge within your organization via internal forums and communities of practice.

Required Qualifications, Capabilities, and Skills:

• Formal training or certification in software engineering concepts and 5+ years of applied experience.

• Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices with the ability to implement these practices within an application or platform.

• Fluency in Java programming.

• Proficiency and experience in observability, including white-box and black-box monitoring, SLO alerting, and telemetry collection, using tools such as Splunk, Grafana, Dynatrace, Prometheus, and Datadog.

• Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform).

• Experience with containers and container orchestration (e.g., Docker, ECS, Kubernetes).

Preferred Qualifications, Capabilities, and Skills:

• Experience with infrastructure-as-code tools such as Terraform, and with managing and supporting cloud-based applications (AWS preferred).

• Excellent communication skills desired.