World Wide Technology

Senior HPC Engineer

Posted 14 Days Ago

Remote

Hiring Remotely in IND

Senior level

Lead hands-on deployment and automation of NVIDIA GPU-based HPC/AI clusters. Build and maintain Ansible/Terraform IaC, configure Slurm and Kubernetes integrations, run HPL/NCCL validation, debug GPU/kernel/fabric issues, and lead/mentor an offshore squad ensuring code quality and shift-based delivery aligned with client timezones.

The summary above was generated by AI

Job Summary & Responsibilities

Technical Competencies

Essential Skills

HPC & AI Infrastructure:

Expert-level knowledge of NVIDIA Base Command Manager (BCM) or Metal as a Service (MAAS) provisioning tools.
Deep understanding of Slurm configuration (cgroups, plugin development, accounting).
Proficiency with NVIDIA DGX/HGX hardware architecture and the associated software stack (Drivers, CUDA, DCGM).

Linux & Automation (DevOps for Hardware):

Mastery of Red Hat Enterprise Linux (RHEL) / Ubuntu internals (Systemd, Kernel Tuning, Hugepages).
Advanced proficiency in Ansible (writing custom modules/roles) and Python (automating admin tasks).
Experience with Git workflows (Branching, PRs, CI/CD).

Containerisation:

Hands-on experience with Docker, Singularity/Apptainer, and Kubernetes (specifically NVIDIA GPU Operator and NVIDIA Network Operator).

Desirable Experience

Network Awareness: Ability to troubleshoot basic InfiniBand/RoCEv2 issues (ibstat, perfquery) to distinguish between a "Node Issue" and a "Network Issue."
Storage Integration: Experience mounting high-performance parallel file systems (VAST/Lustre/WEKA/GPFS) and tuning client-side performance.
Certifications:

NVIDIA Certified Associate - AI in the Data Center.
Red Hat Certified Engineer (RHCE).
CKA (Certified Kubernetes Administrator).

Success Metrics (KPIs)

Deployment Velocity: Reduction in "Time-to-Hello-World" (time from power-on to running the first successful GPU job) for new clusters.
Code Quality: >95% of Pull Requests pass automated linting and require fewer than 2 review cycles before merge.
Stability: Zero "Configuration Drift" incidents in production (e.g., manual changes breaking the cluster) due to strict IaC enforcement.
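As an illustration of how the Code Quality KPI above might be checked, here is a minimal Python sketch. It assumes a PR counts toward the target only if it both passes linting and needs fewer than 2 review cycles; the posting does not say whether the two conditions are measured jointly or separately. The PullRequest record and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    # Hypothetical record; field names are assumptions, not a real API.
    lint_passed: bool
    review_cycles: int


def meets_code_quality_target(prs, threshold_pct=95):
    """Return True if strictly more than threshold_pct percent of PRs
    pass linting and needed fewer than 2 review cycles."""
    if not prs:
        return False
    good = sum(1 for pr in prs if pr.lint_passed and pr.review_cycles < 2)
    # Integer comparison avoids floating-point edge cases at exactly 95%.
    return good * 100 > threshold_pct * len(prs)


# Illustrative data: 19 of 20 PRs meet both conditions (exactly 95%).
SAMPLE_PRS = [PullRequest(True, 1)] * 19 + [PullRequest(False, 3)]

if __name__ == "__main__":
    print(meets_code_quality_target(SAMPLE_PRS))
```

Because the KPI is written as ">95%", a squad sitting at exactly 95% does not meet it under this reading.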

Preferred Qualifications

Role Title: Senior HPC Engineer

Reports To: Domain Architect - AI Compute

Location: India (Must align with Client Time Zone)

Employment Type: Full-Time

About the Role

The Senior HPC Engineer is the "Foreman" of the "AI Factory". While the Domain Architect defines the architectural vision, you are responsible for the hands-on build and deployment. You act as the Technical Squad Lead for our offshore engineering teams, bridging the gap between the onshore architectural vision and the hands-on execution.

As a System Integrator, we thrive on velocity and precision. You will not just "maintain" clusters; you will lead the automated deployment of NVIDIA SuperPOD and BasePOD infrastructure for global enterprise clients. You are the "Lieutenant" to the Domain Architect, translating High-Level Designs (HLDs) into executable Ansible playbooks and ensuring your squad of HPC Engineers delivers defect-free infrastructure.

In this role, you are 100% Delivery-Focused, split between Technical Leadership (40%) and Hands-on Engineering (60%). You are the escalation point for complex kernel panics, the guardian of our Infrastructure-as-Code (IaC) repository, and the mentor who unblocks junior engineers when a Slurm job fails to schedule.

CRITICAL REQUIREMENT: This role typically operates on Shift Hours to align with the onshore client's time zone (e.g., early shifts for Australian clients, or split shifts for European clients).

Key Responsibilities

Hands-on Engineering & Automation (60%)

Cluster Provisioning Factory:

Lead the deployment of NVIDIA Base Command Manager (BCM) (formerly Bright Cluster Manager) to provision bare-metal DGX/HGX nodes at scale.
Develop and maintain the Ansible / Terraform library used to configure OS settings, user authentication (LDAP/AD), and storage mounts across hundreds of nodes.
Execute HPL (High-Performance Linpack) and NCCL-tests to validate cluster performance, tuning BIOS and OS parameters to hit "Gold Standard" benchmarks.

Scheduler & Workload Orchestration:

Configure complex Slurm Workload Manager policies, including Fair Share, Preemption, and GPU Partitioning (MIG).
Integrate Kubernetes-based orchestrators (e.g., NVIDIA Base Command, Run:AI, or Red Hat OpenShift) with the underlying HPC hardware.

Deep-Dive Troubleshooting:

Debug "Silent Data Corruption" and "Xid Errors" on GPUs, analysing nvidia-smi logs and kernel message buffers (dmesg).
Diagnose fabric-related performance drops (e.g., identifying a specific flapping link that causes global slowdowns) in collaboration with the Network Squad.
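To give a flavour of the Xid triage mentioned above, the following Python sketch counts Xid events per GPU from kernel log text. It assumes the usual "NVRM: Xid (<PCI address>): <code>, ..." line layout written by the NVIDIA driver; the sample log lines are illustrative, not captured from a real system, and count_xid_events is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches e.g. "NVRM: Xid (PCI:0000:3b:00): 79, ..." with or without "PCI:".
XID_PATTERN = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+)")


def count_xid_events(log_text):
    """Return a Counter keyed by (pci_address, xid_code)."""
    events = Counter()
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events[(match.group(1), int(match.group(2)))] += 1
    return events


# Illustrative sample only; real input would come from `dmesg` or the journal.
SAMPLE_LOG = """\
[ 1034.112233] NVRM: Xid (PCI:0000:3b:00): 79, pid=4242, GPU has fallen off the bus.
[ 1034.998877] usb 1-1: new high-speed USB device number 2 using xhci_hcd
[ 2050.445566] NVRM: Xid (PCI:0000:3b:00): 79, pid=4242, GPU has fallen off the bus.
[ 3071.000001] NVRM: Xid (PCI:0000:af:00): 48, pid=5151, Double Bit ECC Error
"""

if __name__ == "__main__":
    for (pci, xid), n in sorted(count_xid_events(SAMPLE_LOG).items()):
        print(f"{pci} Xid {xid}: {n} event(s)")
```

Grouping by PCI address and Xid code makes a repeat offender (the same GPU logging the same code) stand out from one-off events.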

Squad Leadership & Quality Assurance (40%)

Technical Direction (The "Foreman"):

Translate the Low-Level Design (LLD) provided by the Domain Architect into granular Jira tasks for your squad of HPC Engineers.
Conduct daily stand-ups to unblock engineers, clarifying requirements and making technical decisions on the fly (e.g., "Use Ansible roles, not shell scripts for this").

Code Quality & Governance:

Act as the Primary Gatekeeper for the code repository. Perform mandatory Code Reviews on all Pull Requests (PRs) to ensure idempotency and error handling.
Enforce "Config-as-Code" discipline, ensuring no manual changes are made to production clusters without a committed playbook.

Mentorship:

Guide mid-level and junior engineers on best practices for Linux Systems Administration and HPC environments.

Top Skills

NVIDIA Base Command Manager (BCM), Metal as a Service (MAAS), Slurm, NVIDIA DGX, NVIDIA HGX, CUDA, DCGM, Red Hat Enterprise Linux (RHEL), Ubuntu, systemd, Hugepages, Ansible, Python, Git, Terraform, Docker, Singularity/Apptainer, Kubernetes, NVIDIA GPU Operator, NVIDIA Network Operator, InfiniBand, RoCEv2, ibstat, perfquery, HPL, NCCL, Run:AI, Red Hat OpenShift, VAST, Lustre, WEKA, GPFS, LDAP, Active Directory, nvidia-smi, dmesg, Jira, CI/CD
