Virtual Tech Gurus
Description
Responsibilities:
· Maintain reliability of GPU clusters and AI workloads
· Monitor systems (Prometheus, Grafana)
· Automate provisioning and recovery workflows
· Troubleshoot performance bottlenecks
Requirements:
· Strong Linux and scripting skills (Python/Bash)
· Experience with Kubernetes in production environments
· Experience with observability tools
Preferred:
· GPU workloads / HPC clusters
· Slurm or distributed training systems
JOBID: 12358
