Parallel Domain

Principal Site Reliability Engineer

In-Office or Remote
Hiring Remotely in Vancouver, BC
Senior level
The Principal Site Reliability Engineer will oversee cloud infrastructure, enhance reliability, manage AWS/EKS operations, and lead incident response efforts.
The summary above was generated by AI
About the Role

Parallel Domain is looking for a Principal Site Reliability Engineer to own the reliability, scalability, and security of our cloud infrastructure - the backbone that runs simulation workloads for some of the most demanding customers in autonomous vehicle development.

This is a hands-on, high-ownership role. You'll be the primary infrastructure owner across our multi-region AWS/EKS platform, working closely with a small platform engineering team and partnering with engineering leads across simulation and ML as well as our customer-facing teams.

What You'll Do

    Infrastructure Ownership & Cloud Operations

    • Own and evolve our AWS-based infrastructure, improving platform performance and availability today, and building toward deployable configurations that support enterprise customer environments tomorrow.

    • Own EKS cluster operations across production regions: node pool strategy, AMI lifecycle, autoscaling, and Kubernetes workload health.

    • Support the GitOps deployment pipeline - define, deploy, and manage applications across clusters using infrastructure-as-code.

    • Manage complex networking: VPC design, cross-region connectivity, DNS, and load balancing.

    • Lead infrastructure deprecation and migration efforts with minimal disruption.

    Reliability Engineering & Incident Response

    • Own SLO measurement infrastructure; enable proactive triage of emerging issues before they impact customers.

    • Lead incident investigation, root cause analysis and postmortems, driving systemic fixes rather than one-off patches.

    • Design and improve automated remediation systems to reduce MTTR.

    Security & Access Management

    • Review and provide security-conscious feedback on platform architecture decisions.

    • Own cloud IAM governance - roles, policies, and access boundaries across accounts and services.

    • Lead compliance-adjacent work including audit-readiness, partner certification requirements, and supporting responses to customer security questionnaires.

    Cross-Functional Collaboration

    • Partner with application development teams to build an inherently secure platform and drive next-generation deployment architecture.

    • Partner with customer teams to ensure availability for expected utilization.

    • Partner with Finance on cloud cost optimization - lifecycle policies, right-sizing, and spend visibility.

    • Support GPU and batch workloads in collaboration with simulation and ML engineering teams.

    Platform Tooling & Developer Experience

    • Improve CI/CD pipelines and automated infrastructure validation.

    • Support engineering teams with infra-side debugging, log analysis, and environment configuration.

What We're Looking For

    Technical Depth

    • 5+ years in SRE, DevOps, or infrastructure engineering roles.

    • Infrastructure-as-code proficiency - Terraform modules, state management, and multi-environment patterns.

    • Deep AWS experience - EKS, EC2, IAM, S3, Storage Gateway, VPC networking, Transit Gateway, CloudFront, KMS, and IRSA.

    • Kubernetes expertise - cluster operations, node pools, probes, cordoning, pod scheduling, RBAC, Helm, node autoscaling (Karpenter experience a plus); solid understanding of containerization and AMI lifecycle management.

    • CI/CD - experience with GitOps workflows and pipeline tooling (ArgoCD, GitHub Actions, Jenkins).

    • Solid networking fundamentals - CIDR design, security groups, DNS, load balancing, VPN, cross-region connectivity.

    • Experience with monitoring and observability tooling - Prometheus, Grafana, Elasticsearch.

    • Comfort with Python and Bash for tooling and automation.

    • Familiarity working across Linux and Windows environments. Operational familiarity with Windows Server is a meaningful advantage.

    Communication & Ownership

    • You communicate clearly across engineering, product, and customer-facing teams, flagging issues with urgency proportional to customer impact.

    • You advocate for SRE best practices and can effectively operationalize an informed and principled view on security.

    • You take end-to-end ownership of complex, multi-team efforts - from planning through execution and post-change verification.

    • You know when to push for a clean solution vs. when to accept a pragmatic one, and you communicate that tradeoff clearly.

Nice to Have

    • Experience with Windows-based workloads on EKS.

    • Experience supporting simulation, ML, or rendering workloads in cloud infrastructure; running GPU workloads on Kubernetes, including NVIDIA and DirectX device plugin configuration.

    • Experience with AWS Storage Gateway or Transfer Family integrations.

    • Familiarity with Envoy Gateway or similar.

    • Experience with container-optimized OS images and image build tooling (e.g., Bottlerocket, Packer).

    • Experience with cloud cost optimization at scale.

Core Tools

    Terraform · AWS · Kubernetes · Helm · ArgoCD · Kustomize · Grafana · Prometheus · Elasticsearch · VictoriaLogs · Fluent Bit · GitHub Actions · Jenkins · Docker · Python · Bash

Top Skills

AWS
Bash
CI/CD
EKS
Elasticsearch
GitOps
Grafana
Jenkins
Kubernetes
Prometheus
Python
Terraform
