Senior Site Reliability Engineer

Instructor
Talent Insider
30 October 2023
Jakarta, Jakarta Pusat

About the Company:

Talent Insider is an emerging HR consultancy founded in 2021. Our clients include some of the leading brands in Indonesia, and our service continues to expand.

Registered in Singapore and Indonesia, we can assist with your growth plans and strategies. We continue to expand our regional presence, working with strong regional partners to support our clients in recruitment and branding strategy.

Job Description:

  • System and Service Reliability: Take ownership of overall system and service reliability, ensuring that our digital platforms are available and performant for users.
  • Automation: Develop and maintain automation tools and scripts to streamline operations, provisioning, and scaling of systems and services.
  • Incident Management: Lead or participate in incident response, troubleshoot complex issues, and perform root cause analysis to prevent recurrence. Implement corrective actions and preventive measures.
  • Monitoring and Alerting: Design and implement monitoring systems to proactively detect issues, and set up alerting mechanisms for rapid response.
  • Capacity Planning: Analyze system performance and usage patterns to anticipate and plan for capacity needs, ensuring seamless scaling of services.
  • Infrastructure as Code: Work with infrastructure-as-code tools (e.g., Terraform, Ansible, Puppet) to manage and provision infrastructure resources.
  • Deployment and Release Management: Collaborate with development teams to implement continuous integration and continuous deployment (CI/CD) practices to ensure safe and reliable software releases.
  • Security: Implement security best practices, conduct vulnerability assessments, and participate in security incident response efforts.
  • Documentation: Maintain comprehensive documentation of system architecture, configurations, procedures, and incident reports.
  • On-Call Support: Participate in an on-call rotation to address incidents and perform maintenance outside of regular business hours.
  • Performance Optimization: Continuously identify and implement performance improvements, efficiency gains, and cost optimizations.
  • Disaster Recovery: Plan and test disaster recovery procedures to ensure data integrity and service continuity.
  • Collaboration: Collaborate with cross-functional teams to align on objectives, share knowledge, and improve system reliability.

Job Requirements:

  • Bachelor's degree in Computer Science, Information Technology, or a related field (Master's degree is a plus).
  • Extensive experience in a similar role, with a strong background in systems engineering, DevOps, or site reliability.
  • Proficiency in scripting and automation using languages like Python, Shell, or Go.
  • Deep knowledge of cloud computing platforms (e.g., AWS, Azure, Google Cloud) and virtualization technologies.
  • Experience with containerization and orchestration tools like Docker and Kubernetes.
  • Strong understanding of network and infrastructure components (e.g., load balancers, firewalls, DNS).
  • Familiarity with configuration management tools (e.g., Puppet, Ansible, Chef).
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Knowledge of security best practices and the ability to address security concerns.
  • Strong problem-solving and communication skills.
  • Previous experience with incident management and post-incident reviews.

Skills: