
SAFE Security

Site Reliability Engineer II

Posted 17 Days Ago
New Delhi, Delhi
Senior level
The Site Reliability Engineer II will ensure the uptime and scalability of the cloud platform, troubleshoot production issues, automate deployments, and collaborate with development teams for system reliability. Responsibilities include incident management, capacity planning, and promoting SRE best practices across the organization.
The summary above was generated by AI

Our vision is to be the Champions of a Safer Digital Future and the Champions of Change. We believe in empowering individuals and teams with freedom and responsibility to align their goals such that we all row in the same direction. We are uncomfortably transparent, autonomous & accountable; we have zero tolerance for brilliant jerks; we have an unlimited vacation policy and more. For us, our Culture Is Our Strategy - check out our Culture Memo for more details and surprises.


Job Overview:


As a Site Reliability Engineer, you will be responsible for building and running the foundation of our mission-critical cloud platform, which must maintain constant uptime, scale seamlessly, and allow new services and features to flourish.


The successful candidate will be highly self-motivated, with a passion for excellence, quality, and detail. The SRE will not only support operations but also work closely with developers and architects within SAFE to aid in product design and assist with implementation to improve stability, security, and scalability.

Core Responsibilities:

  • Operate, monitor, and triage all aspects of our production environments to meet our SLAs and SLOs as part of a 24x7 on-call team.
  • Troubleshoot complex, cross-platform issues spanning the OS, networking, and databases in a cloud-based SaaS environment; handle live production incidents; debug application and infrastructure issues; and follow and implement SRE best practices.
  • Design, build, and implement innovative solutions for previous, present, and future issues.
  • Prepare alert-handling procedures, runbooks, etc., for common tasks and incidents.
  • Automate deployment and orchestration of services into the cloud environment as well as other routine processes.
  • Actively participate in capacity planning, scale testing, and disaster recovery exercises.
  • Interact with and support partner teams, including engineering, QA, and CSE, to improve system reliability.
  • Conduct thorough RCA (Root Cause Analysis) for all production incidents: Identify root causes, document findings, publish incident summaries, and develop preventative actions to mitigate future occurrences.
  • Contribute to infrastructure architecture and non-functional requirements, ensuring they fit a cohesive vision aligned with the platform's technology roadmap for the launch.
  • Propagate SRE culture across the organization by sharing industry best practices, standards, approaches, documentation, and code with other engineering teams.

Qualifications / Essential Skills / Experience:

  • Demonstrable experience (5+ years) managing and maintaining high-availability services on AWS cloud infrastructure.
  • Demonstrable experience with AWS cloud environments and container technologies such as Docker and Kubernetes.
  • Demonstrable experience in managing and monitoring large-scale queueing technologies such as RabbitMQ or Kafka.
  • Hands-on experience provisioning Infrastructure as Code (IaC) using Terraform Enterprise, OpenTofu, or CDK.
  • Experience building CI/CD pipelines using GitHub Actions and Jenkins.
  • A valid AWS Associate-level (or higher) certification.
  • Experience in AWS networking (VPC, Network Firewall, NACLs, SGs, TGW, Direct Connect), Route 53, HAProxy, and Fargate Firewalls.
  • Experience programming/scripting in Python (3+ years).
  • Experience monitoring and analyzing infrastructure performance using standard performance monitoring tools such as Grafana/Prometheus, Datadog, Splunk, and New Relic.
  • Experience with operational tools such as PagerDuty and Jira Service Management / Zendesk.

Join our rocket ship if you want to learn, make your mark and work with incredible talent!

Top Skills

AWS
Docker
Kubernetes
Python

Similar Jobs

Yesterday
New Delhi, Delhi, IND
Mid level
Information Technology • Cybersecurity
As a Site Reliability Engineer at AlgoSec, you will ensure the reliability and performance of production environments, collaborate with cross-functional teams, manage AWS infrastructure, implement CI/CD procedures, and enhance monitoring capabilities. Responsible for resolving service issues and supporting on-call rotations, you will create automation tools for proactive problem mitigation.
Top Skills: AWS, Bash, CloudFormation, Git, Jenkins, Kubernetes, Linux, Python, Terraform
2 Days Ago
3 Locations
Senior level
Cloud • Information Technology • Consulting
The Senior Site Reliability Engineer at Kyndryl will manage and optimize storage solutions using ZFS and iSCSI, maintain Linux-based systems, implement automation tools, and ensure system reliability. Responsibilities include performance tuning, incident response, and documentation while collaborating with software development teams.
Top Skills: Ansible, Bash, Linux, Python, Rust, Ubuntu
10 Days Ago
Remote
Delhi, Connaught Place, New Delhi, Delhi, IND
Senior level
Information Technology • Internet of Things • Marketing Tech
The Lead Site Reliability Engineer will ensure the availability, performance, and scalability of our systems, collaborating with development and operations teams to enhance reliability and observability, automate processes, and drive cost optimization efforts.
Top Skills: Bash, Python

What you need to know about the Delhi Tech Scene

Delhi, India's capital city, is a place where tradition and progress co-exist. While Old Delhi is known for its rich history and bustling markets, New Delhi is defined by its modern architecture. It's clear the region places a strong emphasis on preserving its cultural heritage while embracing technological advancements, particularly in artificial intelligence, which plays a central role in shaping the city's tech landscape, fueled by investments in research and development.
