Senior Site Reliability Engineer

Instructor
Talent Insider
26 October 2023
Jakarta, Jakarta Selatan

About the Company:

Talent Insider is an up-and-coming HR consultancy founded in 2021. Our clients include some of the leading brands in Indonesia, and our services continue to expand.

Registered in Singapore and Indonesia, we can assist with your growth plans and strategies. We continue to expand our regional presence, working with strong regional partners to support our clients' recruitment and branding strategies.

Job Description:

  • Lead efforts to improve the reliability and availability of our systems through automation, proactive monitoring, and capacity planning.
  • Respond to and manage incidents, identifying the root cause and implementing preventive measures to minimize future incidents.
  • Develop and maintain automation tools and scripts to streamline operational tasks, configuration management, and deployment processes.
  • Analyze system performance and identify bottlenecks, making recommendations for improvements and optimizations.
  • Design and implement scalable architectures to accommodate growth and increased user demand.
  • Use infrastructure-as-code (IaC) tools (e.g., Terraform, Ansible) to provision and manage infrastructure components.
  • Set up and maintain monitoring systems to track system health and performance metrics. Configure alerting and notifications to respond to anomalies.
  • Collaborate with development teams to ensure that new applications and features are designed with reliability and operability in mind.
  • Provide guidance, mentorship, and technical leadership to junior members of the SRE team, fostering their professional growth and ensuring team cohesion.
  • Create and maintain documentation for systems, processes, and best practices.
  • Implement and maintain security best practices and participate in security reviews and audits.
  • Participate in an on-call rotation to provide 24/7 support and incident response.

Job Requirements:

  • Bachelor's degree in Computer Science, Information Technology, or a related field.
  • Several years of experience in a Site Reliability Engineer or DevOps role.
  • Proficiency in scripting and programming languages like Bash, Python, or Go.
  • Strong knowledge of containerization technologies (e.g., Docker, Kubernetes) and cloud platforms (e.g., AWS & Google Cloud).
  • Expertise in Kafka, including setting up, configuring, and managing Kafka clusters for real-time data streaming.
  • Hands-on experience with designing, implementing, and maintaining distributed systems and microservices architectures.
  • Experience with infrastructure-as-code and configuration management tools (e.g., Terraform and Ansible).
  • Deep understanding of networking, databases, and web services.
  • Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, Datadog, ELK Stack).
  • Excellent problem-solving skills and the ability to work well in high-pressure situations.
  • Strong communication and collaboration skills.
  • Relevant certifications (e.g., AWS Certified DevOps Engineer - Professional, Google Cloud Professional Cloud DevOps Engineer) are a plus.

Skills: