AI Infrastructure SRE (GPU Cloud / Kubernetes)

Virtual Tech Gurus
Published: June 4, 2026
Location: Dallas, TX
Category: Default

Description

Responsibilities:

· Maintain reliability of GPU clusters and AI workloads
· Monitor systems (Prometheus, Grafana)
· Automate provisioning and recovery workflows
· Troubleshoot performance bottlenecks

Requirements:

· Strong Linux and scripting skills (Python/Bash)
· Experience with Kubernetes in production environments
· Experience with observability tools

Preferred:

· GPU workloads / HPC clusters
· Slurm or distributed training systems

JOBID: 12358
