The Path to Level 4 Autonomy

I lead autonomous vehicle research at NVIDIA and I am a professor at Stanford. Over the past few years I have watched the field of autonomous driving shift from steady, incremental advances to a period of rapid transformation. A massive influx of new technologies, primarily from the field of artificial intelligence, has revolutionized essentially all of the building blocks we use to design self-driving systems. In this article I want to report on where we are, why Level 4 autonomy is within reach, and how we are designing systems that are safe, reliable, and ready for real-world deployment.

My goal in this piece is to give a clear, practical account of the technical and systems-level progress that makes Level 4 feasible, to explain the foundation-model approach we are pursuing, and to describe the safety architecture and validation strategies that must be in place if these systems are to earn public trust. I will be direct and concrete, and I will draw on years of research and development experience to highlight the opportunities and the challenges that lie ahead.

What Level 4 Autonomy Means 🚗

When people ask me what Level 4 autonomy means I start with a simple definition. Level 4 autonomy refers to a self-driving system that can handle all driving tasks within a limited geographical area and within specific operating conditions without any human intervention. In other words, the vehicle can go from origin to destination on its own, make decisions, and respond to unexpected events as long as it stays within a predefined domain of operation.

I often say that "L4 is really the pinnacle of vehicle autonomy" within those constrained domains. It is not the catch-all dream of full unbounded autonomy that people sometimes call Level 5. Instead, L4 is the practical and societally useful milestone: driverless operation in well-defined regions and conditions, such as geo-fenced urban neighborhoods, campus shuttles, designated highways, or commercial fleets operating in known corridors.

To contrast briefly with other levels of automation:

Levels 2 and 3 still depend on a human driver. At Level 2 the driver must supervise continuously, and at Level 3 the driver must be ready to take over when the system requests it. The system can handle some tasks but not all, and responsibility remains shared or with the human.
Level 4 is the point where the vehicle can take full responsibility for the driving task in its operational design domain, with no human intervention required.
Level 5 would be full driving automation everywhere and in all conditions, and that remains a much longer term goal.

Understanding this distinction is important because it guides engineering priorities. If you aim for L4 you can optimize for safety and efficiency within a known set of conditions. That specialization is not a limitation; it is a practical strategy that enables deployment and real societal benefit sooner rather than later.

Why Now: AI Breakthroughs Fueling a Leap 🤖

If I had spoken to students three or four years ago, I would have told them that progress in autonomy felt incremental. Perception was improving a little, planning a bit, and mapping and localization were getting better, but the advances felt additive. Over the past few years the landscape changed. A massive influx of new technologies from artificial intelligence has reshaped the core building blocks of autonomy.

Key advances include:

Large-scale, self-supervised learning that enables models to extract structure from raw sensory streams without needing thousands of manual labels for every scenario.
Transformer architectures and foundation models that can learn general representations across modalities and then be fine-tuned for specific tasks.
Huge improvements in compute including GPUs and specialized accelerators, making it possible to train and run powerful models in vehicles and in simulation.
Improved simulation and synthetic data, which allow us to explore rare corner cases and to iterate on complex scenarios at scale.

These capabilities combine to make it possible to reason about perception, prediction, planning, and control in ways that were infeasible a few years ago. The integration of data-driven learning with classical control and safety engineering techniques has created an opportunity to build systems that are both intelligent and dependable.

Another crucial element is the rise of multimodal representation learning. When models can ingest camera images, lidar point clouds, radar returns, and map priors and form a coherent representation of the environment, the downstream tasks of predicting other agents' behaviors and planning safe maneuvers become much more robust.

Foundation Models and Cosmos 🌌

One of the most exciting developments in AI for physical systems is the emergence of foundation models. These are large, general-purpose models trained on diverse and extensive datasets to create versatile internal representations that can be specialized for many downstream tasks. For autonomous vehicles this idea is extremely powerful: a foundation model can learn a structured understanding of physical scenes that generalizes across locations, sensor types, and tasks.

At NVIDIA we developed a family of foundation models specifically tailored for physical AI applications. We call this family Cosmos. Cosmos models are trained to understand multi-sensor, spatiotemporal data with the kinds of priors that matter for embodied agents. They are not generic language models; they are physically-grounded models that can represent geometry, motion, semantics, and intent across modalities.

The advantages of such foundation models for autonomy are manifold:

Transfer learning: A model trained as a general perceptual and physical reasoning engine can be adapted rapidly to new cities, new sensors, and new operational domains with far less labeled data.
Multimodal fusion: Cosmos is designed to fuse camera, lidar, and radar data into coherent latent representations, improving robustness to sensor failure and adverse conditions.
Prediction and planning synergy: A representation that captures physical dynamics and intent naturally supports better trajectory prediction for other agents and more informed planning decisions.
Consistency and efficiency: A shared foundation model across perception, prediction, and planning reduces inconsistencies that can arise when these components are developed independently.

These models are trained on very large datasets that include real-world driving logs, synthetic data produced from simulation, and diverse geographic and environmental conditions. The key to success here is not simply the size of the model or data, but the care taken to align the modeling approach with the physics and semantics of the problem we are trying to solve.

Co-Development and the Data Flywheel 🔁

One of the things I find most promising is the ability to co-develop the foundation models and the application itself. When you develop the underlying model family and the autonomous driving application together you create a virtuous cycle. The model improves the application and the deployed application generates new data that improves the model. I often refer to this as a data flywheel.

In practice the data flywheel works like this:

We deploy vehicles and simulations that generate sensory data across a variety of conditions.
That data is used to update foundation models like Cosmos, improving perceptual capabilities and generalization.
The improved models are then integrated back into the driving stack, producing better behavior and exposing new edge cases more efficiently.
The new data gathered from those deployments and tests further enriches the training set, and the cycle continues.

This self-reinforcing loop is one of the key opportunities that NVIDIA is seizing to accelerate development toward L4 autonomy. The flywheel is powerful because it allows learning to be continuous and fleet-wide, enabling incremental improvements that compound over time. When models and applications are tightly coupled, you can close the loop fast and target the areas where the deployed system is weakest.

A crucial enabler here is a robust data infrastructure that can collect, filter, label, and simulate scenarios at scale. Without high-quality data pipelines and efficient simulation, the flywheel will stall. Building those pipelines requires attention to privacy, data governance, and the operational logistics of fleet operations.

Full Stack Safety: Principles of Design 🛡️

Achieving L4 autonomy is not just a machine learning problem. It is a systems engineering challenge that requires careful architectural decisions, validation procedures, and safety engineering. The NVIDIA full-stack system for L4 builds safety from the ground up on three core principles: diversity, monitoring, and validation.

These principles are not buzzwords. They are concrete design tenets that guide how we select sensors, design perception algorithms, implement run-time checks, and develop validation pipelines. Let me unpack each principle and give practical examples of how they manifest in a deployed system.

Diversity in sensors, algorithms, and data

Redundancy and diversity reduce the likelihood of common mode failures. For sensing, that means combining cameras, lidar, and radar. Each sensor modality has strengths and weaknesses. Cameras provide rich semantic content and color information but struggle at night or in glare. Lidar provides precise 3D geometry and excels at range measurement but can be affected by weather and reflective surfaces. Radar is robust to weather and provides direct relative velocity estimates, but it offers lower resolution. When fused correctly, these modalities complement each other.

Algorithmic diversity is equally important. Relying on a single model architecture or a single dataset introduces common failure modes. We deploy ensembles of models trained on different objectives and datasets. Some models emphasize geometric accuracy, others emphasize semantic correctness, and still others focus on behavior prediction. When these models disagree, the monitoring system knows to escalate and to activate conservative fallback behaviors.

Monitoring the state of the system as it operates

Monitoring is about knowing what the system knows and what it does not know. It is critical that a vehicle can detect when sensors are degraded, when perception confidence is low, or when the scene falls outside the operational design domain. Runtime monitoring includes health checks at the sensor level, uncertainty estimates from perception and prediction models, and sanity checks on the planned trajectories.

Monitoring is also about human-understandable diagnostics. If a vehicle needs to transition to a minimal risk condition, the system should be able to explain why. Those explanations will be crucial for regulators and for public acceptance.

Validation at scale across real and simulated scenarios

Validation is where statistical guarantees meet the messy reality of driving. You need massive offline validation across recorded datasets and simulation, targeted testing for rare edge cases, and structured deployment plans that gradually expand the operational domain. Validation must be both comprehensive and continuous.

Combining diversity, monitoring, and validation creates a robust safety architecture. Diversity reduces blind spots, monitoring detects anomalies and contextual uncertainty, and validation demonstrates robustness through extensive testing. In practice these three principles interact and reinforce each other.

Principle of Diversity 🧩

Diversity is not redundancy for redundancy's sake. It is intentional heterogeneity in sensors, software, and data sources to ensure that a single failure mode cannot compromise the entire system. Let me describe the main forms of diversity we build into an autonomous driving system and why each matters.

Sensor diversity

As I mentioned, we combine cameras, lidar, and radar to cover complementary sensing needs. For example, lidar provides accurate 3D geometry that is invaluable for precise localization and obstacle detection. Cameras provide rich semantic cues that help classify road signs, traffic lights, and pedestrians. Radar offers robustness in adverse weather and can detect the relative speed of objects reliably. The fusion of these sensors gives a system resilience to environmental variation.

Algorithmic diversity

We deploy multiple perception and prediction pipelines that are trained with different objectives. One model might be optimized for accurate bounding boxes, another for dense scene flow, and yet another for long-term trajectory forecasting. The outputs of these models are checked against one another. Discrepancies trigger conservative behavior, so if one method fails we rely on a complementary approach.
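
The cross-check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the names Estimate, max_disagreement, and choose_mode, the 0.5 m tolerance, and the idea of comparing 2D positions are all hypothetical and are not taken from any production stack.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Estimate:
    """Position estimate (metres, vehicle frame) of one object from one pipeline."""
    source: str  # hypothetical pipeline name, e.g. "boxes" or "scene_flow"
    x: float
    y: float


def max_disagreement(estimates: list[Estimate]) -> float:
    """Largest pairwise distance between the pipelines' estimates of the same object."""
    return max(
        (hypot(a.x - b.x, a.y - b.y)
         for i, a in enumerate(estimates)
         for b in estimates[i + 1:]),
        default=0.0,
    )


def choose_mode(estimates: list[Estimate], tolerance_m: float = 0.5) -> str:
    """Return "nominal" when the pipelines agree, "conservative" otherwise."""
    if len(estimates) < 2:
        # With fewer than two independent estimates there is nothing to cross-check.
        return "conservative"
    if max_disagreement(estimates) <= tolerance_m:
        return "nominal"
    return "conservative"
```

The design choice worth noting is that a missing estimate is treated the same way as a disagreement: when the system cannot cross-check, it assumes the less favorable case.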

Data diversity

Training on diverse datasets matters. This means geographic diversity, weather diversity, and collecting data across different traffic cultures and road typologies. It also means mixing real-world data with synthetic scenes from simulation to stress-test rare conditions. The Cosmos foundation models are designed to absorb and generalize from this diverse data mixture.

Hardware and compute diversity

Even compute heterogeneity has a role. Multiple compute paths and watchdogs ensure that a single hardware fault does not silently disable perception or planning. We design fault-tolerant compute architectures that can degrade gracefully without abrupt disengagements.

Principle of Monitoring 👀

Monitoring is the system's introspective capability. It allows the vehicle to quantify confidence, detect sensor degradation, and recognize when it is handling an unfamiliar situation. I view monitoring as having three complementary aspects: perception confidence, system health monitoring, and operational-domain awareness.

Perception confidence

Modern neural networks can be overconfident. We must complement raw model outputs with calibrated uncertainty estimates. Techniques such as ensemble methods, Bayesian approximations, or explicit uncertainty heads can provide actionable measures of confidence. When confidence falls below thresholds, the planner can choose safer options or request a fallback procedure.
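
To make the ensemble idea concrete, here is a minimal sketch that turns the detection probabilities of several ensemble members into a mean and a spread, then gates on both. The function names and the thresholds (0.8 and 0.1) are assumptions chosen for illustration. Note that ensemble spread measures disagreement between members; it does not by itself guarantee calibration, which would need a separate check against held-out data.

```python
from statistics import mean, pstdev


def ensemble_confidence(member_probs: list[float]) -> tuple[float, float]:
    """Mean probability and spread (population std dev) across ensemble members."""
    if not member_probs:
        raise ValueError("need at least one ensemble member")
    return mean(member_probs), pstdev(member_probs)


def is_confident(member_probs: list[float],
                 min_mean: float = 0.8,
                 max_spread: float = 0.1) -> bool:
    """True only if the members are both confident on average and in agreement."""
    avg, spread = ensemble_confidence(member_probs)
    return avg >= min_mean and spread <= max_spread
```

With these illustrative thresholds, three members reporting 0.90, 0.92, and 0.88 pass the gate, while members reporting 0.99, 0.95, and 0.50 fail it even though their average is above 0.8, because the spread reveals that one member sees something different.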

System health monitoring

Every sensor and compute module runs health checks. These include signal-to-noise metrics for cameras, lidar beam integrity checks, cross-checks among sensors, and runtime performance monitoring. Health monitors are designed to detect both gradual degradation and sudden failures.
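
One simple way to catch both failure patterns with a single monitor is to compare each reading against a running average (for sudden drops) and compare the running average against a known-good baseline (for slow drift). The sketch below assumes a scalar health metric where higher is better, such as a signal-to-noise ratio; the class name HealthMonitor, the status strings, and all thresholds are hypothetical.

```python
class HealthMonitor:
    """Tracks one scalar health metric (higher is better) for one sensor."""

    def __init__(self, baseline: float, drop_limit: float = 0.5,
                 drift_limit: float = 0.2, alpha: float = 0.1) -> None:
        self.baseline = baseline        # expected healthy value
        self.drop_limit = drop_limit    # fractional one-step drop treated as sudden failure
        self.drift_limit = drift_limit  # fractional long-run drop treated as degradation
        self.alpha = alpha              # smoothing factor of the running average
        self.average = baseline

    def update(self, reading: float) -> str:
        """Ingest one reading and return "ok", "degraded", or "sudden_failure"."""
        # Sudden failure: this reading is far below the recent running average.
        sudden = reading < self.average * (1.0 - self.drop_limit)
        # Exponentially weighted moving average of the metric.
        self.average = self.alpha * reading + (1.0 - self.alpha) * self.average
        if sudden:
            return "sudden_failure"
        # Gradual degradation: the running average has drifted below the baseline.
        if self.average < self.baseline * (1.0 - self.drift_limit):
            return "degraded"
        return "ok"
```

A real monitor would also cross-check sensors against each other, as described above; this sketch covers only the per-sensor part.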

Operational domain awareness

Monitoring also includes understanding whether the present conditions are within the vehicle's operational design domain. This involves checking weather conditions, road type, GNSS availability, and map coverage. If the vehicle determines it is outside the allowed domain it can initiate a safe transition, reroute to a supported area, or come to a controlled stop.
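
A domain check of this kind can be expressed as a small, auditable function. The following is a sketch under assumptions: the field names, the category strings, and the simple policy (reroute if a supported route exists, otherwise stop) are hypothetical and deliberately simplified. Returning the list of violations, rather than a bare yes or no, is what makes the decision explainable afterwards.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    REROUTE = "reroute"
    CONTROLLED_STOP = "controlled_stop"


@dataclass(frozen=True)
class Conditions:
    weather: str          # e.g. "clear", "light_rain", "snow" (illustrative labels)
    road_type: str        # e.g. "urban", "highway"
    gnss_available: bool
    map_covered: bool


@dataclass(frozen=True)
class OperationalDomain:
    allowed_weather: frozenset[str]
    allowed_road_types: frozenset[str]


def odd_violations(c: Conditions, odd: OperationalDomain) -> list[str]:
    """List every reason the current conditions fall outside the domain."""
    violations = []
    if c.weather not in odd.allowed_weather:
        violations.append(f"weather:{c.weather}")
    if c.road_type not in odd.allowed_road_types:
        violations.append(f"road_type:{c.road_type}")
    if not c.gnss_available:
        violations.append("gnss_unavailable")
    if not c.map_covered:
        violations.append("no_map_coverage")
    return violations


def decide(c: Conditions, odd: OperationalDomain,
           supported_route_exists: bool) -> Action:
    """Simplified policy: continue inside the domain, else reroute or stop."""
    if not odd_violations(c, odd):
        return Action.CONTINUE
    return Action.REROUTE if supported_route_exists else Action.CONTROLLED_STOP
```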

Monitoring systems are the bridge between perception and action. They translate internal diagnostics into safe behaviors and provide interpretable signals for engineers and regulators.

Principle of Validation ✅

I cannot emphasize enough that validation is the bedrock of safe deployment. Validation answers the question: how do we know the system is safe and ready for the world? The answer is not a single test, but a layered strategy that combines offline metrics, simulation, targeted real-world testing, and statistical reasoning.

Offline dataset validation

We begin with large-scale offline evaluations on diverse datasets to measure performance on standard metrics and to identify failure modes. These datasets include annotated scenes from real-world driving, event logs from deployed vehicles, and synthetic data designed to stress specific scenarios. Offline validation helps us measure detection accuracy, prediction error, planning acceptability, and more.

Simulation at massive scale

Simulation is indispensable for exploring rare events and tuning behavior in a risk-free environment. With high-fidelity simulators we can create millions of scenarios, vary lighting and weather, and inject adversarial agents. Simulation allows us to test how the stack behaves under orchestrated edge cases and to iterate rapidly.

Targeted real-world testing

No amount of simulation fully replaces real-world testing. We perform staged deployments that start in restricted environments and gradually expand coverage. Each deployment is instrumented for data collection and triggers focused retraining and model updates. This staged approach ensures that the system accumulates real-world experience without exposing the public to unnecessary risk.

Statistical validation and safety cases

Validation must be connected to a safety case, a structured argument that the system meets certain safety requirements. This often requires statistical reasoning about rare events. The combination of large-scale simulation, fleet data, and conservative operational constraints allows us to make statistically meaningful claims about safety.
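
To give a sense of why rare events make this hard, consider the standard zero-event bound: if safety-critical events follow a Poisson process with a constant rate, and none is observed over N miles, then with confidence c the true rate is at most -ln(1 - c) / N (at 95 percent this is roughly 3 / N, often called the rule of three). The sketch below computes that bound. The Poisson assumption, the function names, and the choice of miles as the exposure unit are assumptions for illustration, and real safety cases rest on much more than this one calculation.

```python
from math import log


def failure_rate_upper_bound(exposure_miles: float, confidence: float = 0.95) -> float:
    """Upper bound on events per mile after observing zero events in exposure_miles.

    Assumes events follow a Poisson process with a constant rate.
    """
    if exposure_miles <= 0:
        raise ValueError("exposure_miles must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return -log(1.0 - confidence) / exposure_miles


def miles_needed(target_rate: float, confidence: float = 0.95) -> float:
    """Event-free miles required to claim the rate is below target_rate."""
    if target_rate <= 0:
        raise ValueError("target_rate must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return -log(1.0 - confidence) / target_rate
```

Under these assumptions, one million event-free miles supports a bound of about three events per million miles at 95 percent confidence. Demonstrating a rate a hundred times lower takes a hundred times the exposure, which is why simulation and conservative operational constraints carry so much of the argument.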

Validation is continuous. Even after deployment we keep validating and updating models using fleet telemetry and newly generated scenarios.

Building Toward an L4 Product: Systems Integration 🏗️

Turning research prototypes into a deployable L4 product requires systems integration at many levels. The software stack needs to harmonize perception, prediction, mapping, localization, planning, control, user interface, and fleet management. The hardware must provide reliable compute, robust sensors, and functional safety measures. The company must have operational processes for fleet rollout, monitoring, and incident handling.

Key components of a practical L4 stack include:

Perception modules that detect and track obstacles and classify road elements.
Prediction modules that forecast the trajectories and intentions of other agents.
Planning and decision making that synthesize safe maneuvers considering dynamic constraints and traffic rules.
Control systems that execute trajectories with smoothness and safety margins.
High-definition maps and localization that provide priors and redundancy for scene understanding.
Vehicle management software for health checks, logging, and telemetry.
Human-machine interfaces that communicate intent and status to riders and operators.

From a product perspective, ease of integration and modularity matter. The foundation model approach simplifies integration because a shared representation can be reused across modules, reducing inconsistency. But modularity remains crucial so that components can be updated independently and validated incrementally.

Functional safety standards such as ISO 26262 and new safety frameworks specific to autonomy shape system design. The vehicle must remain fail-operational, meaning that it can continue to operate safely after certain types of faults. That requirement affects redundancy, graceful degradation, and emergency behaviors.

How We Test Safety at Scale 🧪

Testing safety at scale is a combination of careful experiment design, massive compute, and disciplined product processes. We implement a multi-tiered testing pipeline that grows in complexity and fidelity as a system matures.

The testing pipeline typically includes:

Unit and integration tests for software correctness and robustness.
Offline validation on held-out datasets and corner-case collections.
Hardware-in-the-loop tests that run software on the target hardware to expose latency and timing issues.
Simulation-based stress testing for millions of scenarios and adversarial maneuvers.
Closed-course testing in controlled environments that mimic complex traffic situations.
Incremental public deployments with conservative operational designs, continuous monitoring, and rollback procedures.

Each layer yields data that feeds the model training pipeline and the validation metrics. The data flywheel expedites this process because every test or deployment produces additional training signals that improve the foundation models.

One practical detail worth highlighting is scenario prioritization. Not all scenarios are equally important. We use risk-based prioritization to decide which scenarios to address first. High-probability, high-consequence scenarios are top priority. This risk-oriented approach ensures that engineering effort yields the maximum possible reduction in residual risk.
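
A minimal way to express risk-based prioritization is to score each scenario as probability times consequence and sort by that score. The sketch below does exactly that; the Scenario fields, the 0-to-1 severity scale, and the product as the risk measure are assumptions for illustration, and a real program would use richer exposure and severity models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    probability: float  # estimated chance of encountering the scenario per trip (assumed unit)
    severity: float     # consequence if mishandled, on an assumed 0-to-1 scale

    @property
    def risk(self) -> float:
        """Expected-consequence score: probability times severity."""
        return self.probability * self.severity


def prioritize(scenarios: list[Scenario]) -> list[Scenario]:
    """Return scenarios ordered from highest to lowest risk score."""
    return sorted(scenarios, key=lambda s: s.risk, reverse=True)
```

For example, with made-up numbers, a scenario with probability 2e-4 and severity 1.0 ranks ahead of one with probability 1e-3 and severity 0.05, even though the second is encountered five times as often.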

Challenges and Open Questions ⚠️

Despite the progress, many challenges remain. I want to be candid about them because acknowledging limitations is the first step toward solving them.

Rare events and long-tail distribution

Driving involves a long-tail distribution of events. Most of the time driving is routine, but rare, unexpected events tend to dominate risk. Finding and fixing those rare cases is difficult because they are, by definition, scarce. Using simulation to create synthetic rare events helps, but ensuring realism remains a challenge.

Generalization across geography and culture

A model trained in one city may not transfer seamlessly to another with different traffic patterns, signage, and driver behavior. Foundation models like Cosmos mitigate this by learning broad priors, but careful collection of geographically diverse data and targeted adaptation strategies are essential.

Regulatory and legal frameworks

Regulation lags behind technology. We are working closely with regulators to craft practical approval and oversight processes. For L4 deployment, regulators will want clear safety cases, explainable monitoring, and robust incident response processes. Global harmonization of rules would help accelerate adoption, but that is a complex political task.

Public trust and human factors

Public acceptance depends on trust. Trust comes from reliability, transparency, and effective communication. When an autonomous vehicle behaves unexpectedly or is involved in a high-profile incident, public perception can shift quickly. Building user interfaces that explain behavior, and mechanisms for clear incident analyses, are essential.

Economics and business models

Deploying L4 systems at scale requires viable economics. That includes sensor cost, compute cost, insurance, maintenance, and fleet operations. The value proposition for different use cases varies. For ride-hailing and logistics there is clear upside. For private ownership, it is more complex. The market will likely prioritize commercial deployments that can achieve scale and recoup investment.

Impacts and Opportunities for Society 🌍

If we get Level 4 right, the benefits for society are profound. I will list some of the most significant potential impacts and then highlight what we need to do to maximize the upside while minimizing harm.

Safety: The outcome depends on how we design and operate these systems, but the potential is enormous. Reducing human-driver errors could save many lives every year.
Mobility for underserved populations: Self-driving shuttles and on-demand services could expand access for the elderly, the vision-impaired, and those without a driver's license.
Urban efficiency and congestion: With intelligent routing and platooning, autonomous fleets could improve traffic flow and reduce congestion.
Environmental benefits: Better route optimization and the integration of electric vehicles into autonomous fleets can reduce carbon emissions.
Economic transformation: Logistics and freight applications are particularly promising. Autonomous trucks and delivery robots can improve productivity and reduce costs.

Realizing these benefits requires careful policy, regulatory engagement, and ethical design. Accessibility, privacy, and fairness must be designed into systems from the start.

A Vision for the Future and Call to Action 🌟

In closing, I remain optimistic. The confluence of foundation models, rich data and simulation, and careful systems engineering has brought Level 4 autonomy within reach. But it will not happen by accident. It will require rigorous engineering, responsible deployment practices, and broad collaboration between industry, academia, regulators, and communities.

"Ensuring that a machine always does the right thing even in the most complex situations is extremely impactful. It will really change for the better the lives of mankind."

That quote captures the motivating force behind much of my work. When an autonomous vehicle can reliably perform complex driving tasks without putting people at risk, the social benefits will be enormous. The engineering challenge is to make that reliability demonstrable and robust.

Here are the priorities I believe we must focus on over the next phase of development:

Invest in foundation models tailored to physical reasoning: The better our underlying representations of space, motion, and semantics, the more robust the downstream behavior will be.
Commit to rigorous, continuous validation: Use simulation, targeted real-world testing, and statistical safety arguments to build confidence.
Design with safety principles at the core: Diversity, monitoring, and validation are not optional add-ons. They must be integral to system architecture.
Collaborate across sectors: Regulators, municipalities, and communities must be partners, not passive recipients. Transparent communication builds trust.
Focus on societally valuable deployments: Target use cases with clear benefits such as freight corridors, campuses, last-mile delivery, and ride-hailing in constrained urban zones.

The path to Level 4 autonomy is a marathon of engineering, validation, and human-centered design. The new advances in AI give us a competitive advantage unlike anything I have seen in my career. They enable more general, more robust reasoning about the physical world and they accelerate the data flywheel that helps systems get better over time.

I believe the near-term future will be characterized by more targeted deployments that deliver clear value, accompanied by continuous improvement and careful expansion of operational domains. If we stick to the fundamentals of safe design and invest in foundation models that understand the physical world, we can bring Level 4 benefits into real streets and communities in a responsible way.

For anyone building in this space, my invitation is straightforward: focus on safety, build systems that can explain themselves, and lean into data-driven iteration while preserving rigorous validation. The technological pieces are falling into place. Now the hard work is to make them trustworthy and useful at scale.