Job Description
NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology—and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.
NVIDIA is seeking a passionate, motivated, and technical Engineer to join its multifaceted and fast-paced Infrastructure, Planning, and Processes organization as a Senior SRE Engineer. You will own and scale our internal CI as a Service platform. This platform includes the shared GitLab CI and GitHub Actions infrastructure used daily by thousands of engineers. You will manage this platform like a product: highly available, self-service, observable, and elastic to handle build and test workloads across the company. The position is part of a fast-paced team that develops and maintains complex build and test environments. These environments support various hardware platforms, including NVIDIA GPUs and Tegra Processors, as well as multiple operating systems like Windows, Linux, and Android. The team collaborates with other NVIDIA Software units such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, Robotics, and Autonomous cars to meet their infrastructure and system needs.
What you'll be doing:
Develop, operate, and scale a multi-tenant CI platform built on GitLab CI and GitHub Actions, encompassing runner fleets, shared caches, artifact storage, and secrets brokering.
Own the underlying Kubernetes substrate end-to-end. This includes cluster lifecycle, upgrades, and autoscaling. Manage node pools for GPU, CPU, and ARM workloads. Handle network and storage policy. Operate the controllers and operators that schedule runner pods on demand.
Drive reliability and capacity engineering: SLOs and error budgets for queue time, job success, and runner availability; on-call, incident response, postmortems, and structural fixes that keep toil flat as usage grows.
Build the self-service layer: pipeline templates, reusable workflows, golden images, policy-as-code, and guardrails, so product teams onboard in hours, not weeks, with secure-by-default pipelines.
Improve developer experience continuously: faster cold-starts, smarter caching, hermetic builds, test sharding and flakiness reduction, and deep observability into pipeline performance and cost per team.
What we need to see:
5+ years in SRE/platform roles with strong fundamentals: SLO/SLI design, incident command, capacity planning, performance tuning, and production Linux administration at scale.
Deep Kubernetes administration experience: CRDs and operators, HPA/VPA/cluster autoscaling, ingress, service mesh, RBAC, network policies, and storage classes, along with strong troubleshooting skills.
Hands-on expertise with GitLab CI and GitHub Actions at scale: runner architecture, executor tuning, self-hosted runner controllers (ARC, GitLab-runner Helm chart), cache and artifact strategy, and DAG-based pipeline development, or equivalent experience.
Strong scripting and automation skills in Python, Go, bash scripting or equivalent. You should have production experience with IaC and configuration management tools like Terraform, Helm, and Ansible. Experience with GitOps tools such as Argo CD and Flux is also required.
BS/MS in CS or equivalent experience. Experience building observability with tools like Prometheus, Grafana, Loki/ELK, OpenTelemetry, or similar products. You have shipped platforms that other specialists enjoy using.
Ways to stand out from the crowd:
Strong understanding of containerization and microservices architecture. Certified Kubernetes Administrator (CKA), Certified Kubernetes Security Specialist (CKS) & Certified Kubernetes Application Developer (CKAD) preferred.
Built or extended the CI control plane itself: custom runner schedulers, autoscaling, webhook routers, or pipeline orchestration on top of GitLab/GitHub APIs.
Thrives in a multi-tasking environment with continuously evolving priorities.
Ability to break complex problems into simple subproblems and reuse available solutions for most of them. Ability to build simple systems that run efficiently without needing much support.
Prior experience on a large-scale operations team. Experience using and improving data centers. Background in computer algorithms and the ability to choose the most suitable algorithms to meet scaling challenges.
You will also be eligible for equity and benefits.
This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
Frequently asked questions
Is the Senior SRE Engineer position at NVIDIA remote?
The Senior SRE Engineer role at NVIDIA is an on-site or hybrid position.
What type of employment is the Senior SRE Engineer role?
NVIDIA is hiring for a full-time Senior SRE Engineer position.
What skills are needed for the Senior SRE Engineer job at NVIDIA?
Key skills for this role include Python, Kubernetes, Go, GPU, Terraform.
How do I apply for the Senior SRE Engineer position at NVIDIA?
You can apply for the Senior SRE Engineer role directly through NVIDIA's official application link provided on this page.