Title: Site Reliability Engineer
Location: Plano, TX
Duration: 6+ months
Locals ONLY
Experience Level: 10+ years
• Must be a strong SRE with experience in Java, AWS, DevOps, deployment strategy, and monitoring tools. Candidates should have substantial hands-on experience with Dynatrace, Splunk, CI/CD, Grafana, etc.
• Looking for a candidate with very good application troubleshooting experience and a firm grasp of core SRE metrics and concepts applied before going to production: uptime vs. availability, monitoring vs. observability, incidents and outages, etc.
• Should be familiar with SLOs, SLAs, SLIs, and other SRE terminology.
• Experience deploying through CI/CD pipelines, debugging and troubleshooting issues, and coordinating with application teams working in Java, Spring Boot, Python, .NET, etc.
• Ability to perform API performance testing using tools such as JMeter or BlazeMeter.
• Experience performing root cause analysis (RCA) for production issues in AWS environments with multiple microservices.
• Expertise in Terraform to manage infrastructure as code would be highly desirable.
Job Responsibilities:
• Demonstrates and champions site reliability culture and practices and exerts technical influence throughout your team.
• Leads initiatives to improve the reliability and stability of your team’s applications and platforms using data-driven analytics to improve service levels.
• Collaborates with team members to identify comprehensive service level indicators, and with stakeholders to establish reasonable service level objectives and error budgets with customers.
• Demonstrates a high level of technical expertise within one or more technical domains and proactively identifies and solves technology-related bottlenecks in your areas of expertise.
• Acts as the main point of contact during major incidents for your application and demonstrates the skills to identify and solve issues quickly to avoid financial losses.
• Documents and shares knowledge within your organization via internal forums and communities of practice.
Required Qualifications, Capabilities, and Skills:
• Formal training or certification on software engineering concepts and 5+ years of applied experience.
• Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices with the ability to implement these practices within an application or platform.
• Fluency in Java programming.
• Proficiency and experience in observability, including white-box and black-box monitoring, SLO alerting, and telemetry collection using tools such as Splunk, Grafana, Dynatrace, Prometheus, and Datadog.
• Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform).
• Experience with containers and container orchestration (e.g., ECS, Kubernetes, Docker).
Preferred Qualifications, Capabilities, and Skills:
• Experience with infrastructure-as-code tools such as Terraform, as well as experience managing and supporting cloud-based applications (AWS preferred).
• Excellent communication skills desired.