The HPC Engineer - Storage focuses on deploying high-performance storage systems, managing configurations, automating installations, and maintaining I/O performance benchmarks in a cluster environment.
Job Summary & Responsibilities
Technical Competencies
Essential Skills
High-Performance Storage:
- Parallel Filesystems: Hands-on operational experience with at least one major AI storage platform: VAST Data, Weka.io, DDN Lustre (EXAScaler), or IBM GPFS (Spectrum Scale).
- Linux I/O Stack: Deep understanding of the Linux VFS (Virtual File System), block devices, and how to debug I/O performance using tools like iostat, iotop, and strace.
- RDMA Storage: Experience configuring NVMe-over-Fabrics (NVMe-oF) or NFS-over-RDMA, understanding the dependency on the underlying InfiniBand/RoCE network.
Automation & Containerisation:
- Ansible Storage: Proficiency in writing Ansible playbooks to automate the installation of storage clients and configuration of mount points.
- Kubernetes Storage: Understanding of StorageClasses, PVCs, and how to debug CSI Driver pods (checking logs for mount failures).
- GPUDirect: Conceptual understanding of NVIDIA GPUDirect Storage (GDS) and the ability to verify whether GDS is active on a given node.
Desirable Experience
- Vendor Specifics: Certification in, or in-depth experience with, Pure Storage (FlashBlade) or NetApp ONTAP AI configurations.
- Object Storage: Experience interacting with S3-compatible object stores via CLI for model weight retrieval.
- Data Migration: Experience using tools like fpsync or rclone to move petabyte-scale datasets between tiers.
Certifications
Highly Desirable:
- NVIDIA-Certified Associate: AI Infrastructure and Operations (NCA-AIIO)
- Vendor Certifications:
  - VAST Certified Administrator (VCP-AD1)
  - WEKA Technical Xpert Certification
  - Red Hat Certified Specialist in Storage Administration
Success Metrics (KPIs)
- I/O Performance: Achieving >95% of the theoretical line-rate throughput on IOR/FIO benchmarks for provisioned clients.
- Mount Stability: Zero "Stale file handle" errors or disconnected mounts across the cluster during the 72-hour burn-in period.
- Ticket Velocity: Consistently meeting SLAs for storage-related support tickets.
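
As an illustration of how the first two KPIs might be checked, here is a minimal Python sketch. It assumes the measured throughput is reported in decimal GB/s and the link speed in Gbit/s, and it treats the raw link speed as the theoretical line rate, ignoring protocol overhead. The function names and the sample figures (a 200 Gbit/s link, 24.0 GB/s measured) are hypothetical and not taken from any specific benchmark run.

```python
import errno
import os


def line_rate_efficiency(measured_gb_s: float, link_gbit_s: float) -> float:
    """Return measured throughput as a fraction of the theoretical line rate.

    Assumption: the theoretical rate is the raw link speed converted from
    Gbit/s to GB/s (divide by 8), with no allowance for protocol overhead.
    """
    theoretical_gb_s = link_gbit_s / 8.0
    return measured_gb_s / theoretical_gb_s


def meets_io_target(measured_gb_s: float, link_gbit_s: float,
                    target: float = 0.95) -> bool:
    """True when efficiency is strictly above the target (the KPI says >95%)."""
    return line_rate_efficiency(measured_gb_s, link_gbit_s) > target


def mount_is_stale(path: str) -> bool:
    """Return True if stat() on the path fails with ESTALE.

    Caveat: on a hung hard-mounted NFS export, stat() can block instead of
    failing, so a real check would need a timeout around this call.
    """
    try:
        os.stat(path)
    except OSError as exc:
        return exc.errno == errno.ESTALE
    return False


if __name__ == "__main__":
    # Hypothetical figures: 200 Gbit/s link (25 GB/s raw), 24.0 GB/s measured.
    print(f"efficiency: {line_rate_efficiency(24.0, 200.0):.1%}")
    print(f"meets target: {meets_io_target(24.0, 200.0)}")
    print(f"current directory stale: {mount_is_stale('.')}")
```

In practice the measured figure would come from the IOR or FIO result, and the stale-handle check would be run against every mount point on every client throughout the burn-in window.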