World Wide Technology

HPC Engineer - Storage

Posted 15 Days Ago
Remote
Hiring Remotely in IND
Entry level
The HPC Engineer - Storage focuses on deploying high-performance storage systems, managing configurations, automating installations, and maintaining I/O performance benchmarks in a cluster environment.
The summary above was generated by AI
Job Summary & Responsibilities

Technical Competencies

Essential Skills

High-Performance Storage:

  • Parallel Filesystems: Hands-on operational experience with at least one major AI storage platform: VAST Data, Weka.io, DDN Lustre (EXAScaler), or IBM GPFS (Spectrum Scale).
  • Linux I/O Stack: Deep understanding of the Linux VFS (Virtual File System) and block devices, and the ability to debug I/O performance using tools like iostat, iotop, and strace.
  • RDMA Storage: Experience configuring NVMe-over-Fabrics (NVMe-oF) or NFS-over-RDMA, with an understanding of their dependency on the underlying InfiniBand/RoCE network.

Automation & Containerisation:

  • Ansible Storage: Proficiency in writing Ansible playbooks to automate the installation of storage clients and configuration of mount points.
  • Kubernetes Storage: Understanding of StorageClasses, PVCs, and how to debug CSI Driver pods (checking logs for mount failures).
  • GPUDirect: Conceptual understanding of NVIDIA GPUDirect Storage (GDS) and the ability to verify whether GDS is active.

Desirable Experience

  • Vendor Specifics: Certification or in-depth experience with Pure Storage (FlashBlade) or NetApp ONTAP AI configurations.
  • Object Storage: Experience interacting with S3-compatible object stores via CLI for model weight retrieval.
  • Data Migration: Experience using tools like fpsync or rclone to move petabyte-scale datasets between tiers.

Certifications

Highly Desirable:

  • NVIDIA-Certified Associate: AI Infrastructure and Operations (NCA-AIIO)
  • Vendor Certifications:
    • VAST Certified Administrator (VCP-AD1)
    • WEKA Technical Xpert Certification
  • Red Hat Certified Specialist in Storage Administration

Success Metrics (KPIs)

  • I/O Performance: Achieving >95% of the theoretical line-rate throughput on IOR/FIO benchmarks for provisioned clients.
  • Mount Stability: Zero "Stale File Handles" or disconnected mounts across the cluster during the 72-hour burn-in period.
  • Ticket Velocity: Consistently meeting SLAs for storage-related support tickets.
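
As a rough illustration of the I/O Performance metric above, the sketch below converts a nominal link speed into a theoretical byte rate and compares it with a measured benchmark result. The function names, the 400 Gbit/s link, and the 48 GB/s measurement are assumptions made for the example, not figures from this posting, and the calculation ignores protocol overhead.

```python
def line_rate_efficiency(measured_gb_per_s: float,
                         link_gbit_per_s: float,
                         links: int = 1) -> float:
    """Return measured throughput as a fraction of the theoretical line rate.

    measured_gb_per_s: benchmark result in gigabytes per second
    link_gbit_per_s:   nominal speed of one network link in gigabits per second
    links:             number of links the client uses (assumed equal speed)
    """
    if link_gbit_per_s <= 0 or links <= 0:
        raise ValueError("link speed and link count must be positive")
    # 8 bits per byte: convert the aggregate link speed to gigabytes per second.
    theoretical_gb_per_s = link_gbit_per_s * links / 8
    return measured_gb_per_s / theoretical_gb_per_s


def meets_kpi(efficiency: float, threshold: float = 0.95) -> bool:
    """True when efficiency is strictly above the threshold (the '>95%' target)."""
    return efficiency > threshold


if __name__ == "__main__":
    # Hypothetical client: one 400 Gbit/s link, 48 GB/s measured.
    eff = line_rate_efficiency(48.0, 400.0)
    print(f"efficiency: {eff:.1%}, meets KPI: {meets_kpi(eff)}")
```

With these assumed numbers the theoretical rate is 50 GB/s, so the hypothetical client passes at 48 GB/s and would fail at 47 GB/s.
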

Preferred Qualifications

1. Storage Integration & Client Configuration

  • Client Provisioning: Execute the deployment of high-performance storage clients (VAST, Weka, GPFS/Spectrum Scale, Lustre) on bare-metal DGX/HGX nodes using Ansible.
  • Protocol Configuration: Configure and tune RDMA-based protocols (NVMe-oF, NFS over RDMA, GPUDirect Storage) to bypass the CPU and deliver data directly to GPU memory.
  • Kubernetes Integration: Install and troubleshoot CSI (Container Storage Interface) drivers to ensure dynamic provisioning of Persistent Volumes (PVs) for AI workloads running in K8s.
  • Mount Management: Manage complex mount maps and automounter configurations to ensure consistent namespace views across thousands of compute nodes.

2. Validation & Performance Benchmarking

  • Throughput Testing: Execute standard I/O benchmarks to validate that the storage subsystem meets the "Gold Standard" read/write targets (e.g., 400 GB/s read throughput).
  • Latency Tuning: Tune client-side kernel parameters (read-ahead buffers, queue depths, sysctl settings) to minimize latency for small-file random I/O patterns common in checkpointing.
  • Acceptance Reporting: Generate "As-Built" storage validation reports, documenting effective throughput and IOPS for client sign-off.

3. Operations & Support

  • Capacity & Quotas: Implement project-level quotas and monitor usage trends to prevent "Disk Full" outages on critical scratch filesystems.
  • Ticket Resolution: Handle L2 support tickets for storage issues, such as "Stale file handles," "Slow dataset loading," or "CSI Driver crashes."
  • Lifecycle Management: Execute non-disruptive client-side driver upgrades and firmware patches during maintenance windows.
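
The capacity monitoring mentioned under Capacity & Quotas could be approximated with a small client-side check like the sketch below. It is only a minimal illustration using the Python standard library: the 90% threshold and the function names are assumptions. Project-level quotas on parallel filesystems are normally managed with vendor-specific tooling, which this sketch does not cover.

```python
import shutil


def usage_fraction(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem containing path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        # Some pseudo-filesystems report zero size; treat them as empty.
        return 0.0
    return usage.used / usage.total


def is_nearly_full(path: str, threshold: float = 0.90) -> bool:
    """True when the filesystem holding path is at or above the threshold.

    The 0.90 default is an illustrative assumption, not a documented limit.
    """
    return usage_fraction(path) >= threshold


if __name__ == "__main__":
    # "." stands in for a scratch mount point in this example.
    mount_point = "."
    print(f"{mount_point}: {usage_fraction(mount_point):.1%} used, "
          f"nearly full: {is_nearly_full(mount_point)}")
```

A check like this could run periodically on each client and feed an alerting system, so that usage trends are visible before a scratch filesystem fills.
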
