[Remote] Staff Site Reliability Engineer

Work from home | Full-time role | Hiring

Note: This is a remote role open to candidates in the USA. Hamilton Barnes is recruiting for a fast-growing AI infrastructure business that operates as a compute liquidity layer for global AI labs and cloud providers. The company is seeking a Staff Site Reliability Engineer to lead incident response and own the production health of multi-thousand GPU fleets, shaping how GPU infrastructure is operated at scale.

Responsibilities

  • Lead P0/P1 incident response across the full GPU stack, owning triage, postmortems and systemic fixes
  • Own production health of multi-thousand GPU fleets across providers including node lifecycle, firmware rollouts and driver upgrades
  • Build and maintain GPU health checks, fabric monitoring, observability and automated remediation tooling
  • Define on-call practices including rotations, runbooks, escalation paths and blameless incident reviews
  • Act as the senior reliability voice in customer-facing incident reviews and architecture deep-dives, and partner with product engineering on SLOs and error budgets

Skills

  • Multiple years of hands-on experience building and operating large-scale GPU infrastructure
  • Deep expertise with NVIDIA H100/H200/B200/GB200 including NVLink/NVSwitch topology and hardware failure modes
  • Production experience with InfiniBand, RoCE and NVLink fabrics alongside NCCL, CUDA and PyTorch distributed training
  • Production-grade Go, Python or Rust with strong Kubernetes and/or Slurm/HPC experience
  • Expert knowledge of Linux internals, covering kernel tuning, the NVIDIA driver/CUDA lifecycle and BPF tooling

Company Overview

  • Hamilton Barnes is an award-winning recruitment consultancy that helps clients and candidates secure the best talent and opportunities in the networking, telecommunications, data center and security space. Founded in 2014, it is headquartered in London, England, and has 51-200 employees. Its website is https://hamilton-barnes.co.uk.