Key Responsibilities:
· Automate Infrastructure & Operations:
o Develop and implement automation strategies to manage large-scale infrastructure (e.g., provisioning, configuration management, patch management).
o Build and maintain Infrastructure-as-Code (IaC) solutions.
· AI-Driven Monitoring & Incident Response:
o Integrate AI and machine learning models into monitoring systems to predict potential failures and optimize response times.
o Use AI tools and techniques to improve anomaly detection, system health predictions, and proactive incident resolution.
· CI/CD Pipeline Management:
o Automate CI/CD processes using tools such as Jenkins, Bitbucket Pipelines, GitLab CI, or similar.
o Incorporate AI/ML into CI/CD workflows to shorten build and test times and improve code-quality predictions.
o Collaborate with the development team to enhance and optimize deployment pipelines.
· AI-Powered Optimization:
o Utilize AI to perform predictive scaling, system optimization, and capacity planning.
o Implement self-healing capabilities through AI-based predictive analysis and automation tools.
· Monitoring & Alerting Automation:
o Automate monitoring and alerting solutions to detect anomalies, failures, and capacity issues early.
o Implement observability tools like Prometheus, Grafana, and Dynatrace for efficient system monitoring.
· Reliability & Scalability:
o Design and build self-healing, scalable systems that reduce manual intervention.
o Perform capacity planning and optimize system performance through automation.
· Incident Management & Response:
o Build automated runbooks and workflows to address incidents quickly.
o Set up automated playbooks for incident detection, troubleshooting, and remediation.
· Security & Compliance Automation:
o Implement automated security checks and audits within the CI/CD pipeline.
o Automate compliance reporting, vulnerability scanning, and patching.
Required Skills & Qualifications:
· Technical Expertise:
o Hands-on experience with on-premises machines and cloud platforms such as PCF, AWS, and Azure.
o Proficiency in programming languages such as Java, Python, and Bash for scripting automation tasks.
o Strong knowledge of CI/CD tools (e.g., Jenkins, Bitbucket, GitLab) and version control systems.
o Ability to integrate machine learning models into infrastructure for automation and predictive monitoring.
· Infrastructure Automation:
o Expertise in containerization and orchestration tools (e.g., Docker, Kubernetes).
· Monitoring & Observability:
o Familiarity with monitoring tools like Prometheus, Grafana, Dynatrace, and Splunk, and with alerting frameworks.
· Reliability Engineering:
o Experience building and automating scalable, reliable, and self-healing systems.
o Strong troubleshooting skills.
· F5 Knowledge (good to have; not mandatory):
o Understanding of F5 BIG-IP, including LTM (Local Traffic Manager), GTM (Global Traffic Manager), and iRules scripting.
o Understanding of load balancing strategies, SSL termination, and traffic management for high-availability systems.
· Collaboration & Communication:
o Excellent communication and collaboration skills to work cross-functionally with development, operations, and QA teams.
Preferred Qualifications:
· Familiarity with Agile and DevOps practices.
· Experience with automation in large-scale distributed systems.
· Experience working with both microservices and monolithic architectures.
· Familiarity with AI/ML-driven infrastructure optimization.
Soft Skills:
· Problem-solving mindset and analytical thinking.
· Ability to thrive in a fast-paced and high-pressure environment.
· Team player with excellent collaboration skills.