Published: June 4, 2026

Location: Dallas, TX

Responsibilities:

· Maintain reliability of GPU clusters and AI workloads

· Monitor systems (Prometheus, Grafana)

· Automate provisioning and recovery workflows

· Troubleshoot performance bottlenecks

Requirements:

· Strong Linux + scripting (Python/Bash)

· Experience with Kubernetes (production environments)

· Observability tools experience

Preferred:

· GPU workloads / HPC clusters

· Slurm or distributed training systems

JOBID: 12358

AI Infrastructure SRE (GPU Cloud / Kubernetes)