The performance of machine learning models is typically measured by analyzing:
- Recall: the percentage of objects that the system can correctly detect. For example, if there are nine pedestrians on the road and a system only sees eight of them, the recall of that system for pedestrians is 8/9, or about 89%.
- Precision: the percentage of a system’s detections that are labeled correctly. For example, in a scene with five parked cars, a system reports six, but one of them is really a dumpster. In this case, the precision for parked vehicles is 5/6, or about 83%.
Both are critical for self-driving. With low recall, the system might not see something important, which could be a serious safety concern. With low precision, passengers might be subjected to unnecessary maneuvers that make the ride uncomfortable and can present safety concerns of their own. For example, if the system mislabels a cloud of exhaust as an obstacle, the vehicle might stop suddenly.
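The two worked examples above reduce to simple ratios of detection counts. Here is a minimal sketch in Python; the function names and count arguments are illustrative, not part of any particular perception stack:

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of the real objects that the system actually detected."""
    return true_positives / (true_positives + false_negatives)


def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of the system's detections that were correct."""
    return true_positives / (true_positives + false_positives)


# Pedestrians: 9 on the road, 8 detected -> recall = 8/9, about 89%
pedestrian_recall = recall(true_positives=8, false_negatives=1)

# Parked cars: 6 reported, 1 is really a dumpster -> precision = 5/6, about 83%
parked_car_precision = precision(true_positives=5, false_positives=1)
```

Note that the two metrics trade off against each other: a detector tuned to never miss anything (high recall) tends to report more false positives (lower precision), and vice versa.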
The long-tail problem
We improve precision by continually fine-tuning the ability of our machine learning models to classify objects. But even if they could perfectly identify all of the more common objects (bicyclists, pedestrians, buses, etc.), it’s virtually impossible to train the system on every unique object that can occur on the road. That means even the most advanced perception systems may encounter things they simply don’t recognize. Trailers pulling stacks of antique living room furniture, boxes falling off the back of a truck: these are just some of the many situations that could confuse an otherwise well-trained model.
The occurrence of random, infrequent, and unique road events is called the long-tail phenomenon, and it’s one of the largest hurdles in perception. Systems that aren’t designed to handle the long-tail phenomenon could ignore or misclassify unusual data. This is not a risk we’re willing to accept.
No Measurement Left Behind
To account for the long-tail phenomenon, we developed our perception system with the mantra: No Measurement Left Behind.
Essentially, we designed our system to ensure that all of our sensor measurements have an explanation. We do this by detecting and tracking all of the object types we understand, including known obscurants (e.g., exhaust), sensor artifacts, and parts of the static world. Then, we don’t just ignore what’s left over. We explain remaining data by tagging it as one or more generic objects, and if those objects are moving, we track them. This involves combining state-of-the-art machine learning with advanced state estimation to predict each object’s trajectory without knowing what it is.
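One common way to track an object without knowing its class is a recursive position-and-velocity estimator. The sketch below uses an alpha-beta filter, a simplified cousin of the Kalman filter, in one dimension. It illustrates the general technique of predicting a generic object’s trajectory from raw measurements alone; it is not Aurora’s actual estimator, and the class name, gains, and units are all assumptions for the example:

```python
from dataclasses import dataclass


@dataclass
class GenericTrack:
    """Alpha-beta tracker for an unclassified ('generic') moving object.

    Hypothetical illustration: estimates position and velocity from raw
    position measurements, with no notion of what the object is.
    """
    x: float            # estimated position (meters, 1-D for simplicity)
    v: float = 0.0      # estimated velocity (m/s)
    alpha: float = 0.85  # position correction gain (assumed value)
    beta: float = 0.3    # velocity correction gain (assumed value)

    def update(self, z: float, dt: float) -> None:
        # Predict the state forward by dt, then correct both position and
        # velocity using the residual between the measurement z and the
        # prediction.
        x_pred = self.x + self.v * dt
        residual = z - x_pred
        self.x = x_pred + self.alpha * residual
        self.v += (self.beta / dt) * residual

    def predict(self, horizon: float) -> float:
        # Extrapolate the track into the future without knowing the
        # object's class.
        return self.x + self.v * horizon


# Track a box sliding across the road at 2 m/s, observed every 0.1 s.
track = GenericTrack(x=0.0)
for k in range(1, 20):
    track.update(z=2.0 * 0.1 * k, dt=0.1)
predicted = track.predict(horizon=1.0)  # estimated position 1 s from now
```

A full tracker would of course work in three dimensions with uncertainty estimates, but the core idea is the same: the filter converges on the object’s motion purely from where the measurements land.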
This improves safety because when the Aurora Driver knows something is there, it can react (e.g., slowing down). Further, it’s designed to approach unknown objects cautiously, much like a human driver would. You can see this in the video below, which shows how the Driver reacts when a box suddenly falls out of the back of a pickup truck.
Notice that the environment in the video contains a series of cuboids. The colorful ones correspond to objects our perception system recognizes: other vehicles (blue) and pedestrians (red). The gray ones correspond to objects that are either static scenery or what our system has deemed generic moving objects. When the box falls, it’s immediately tagged with tiny gray cuboids, meaning our perception system saw it, couldn’t place it in an existing category, and still tagged it as a generic object. Perception then reports the box’s current and estimated future position to the motion planner, which instructs the vehicle to stop.
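As a rough illustration of that perception-to-planner handoff, the sketch below passes an object’s current position and velocity to a stopping check that predicts whether the gap to the vehicle will close below a safety buffer. The message fields, parameter names, and thresholds are all invented for the example and are not Aurora’s interfaces:

```python
from dataclasses import dataclass


@dataclass
class ObjectReport:
    """Hypothetical perception-to-planner message for one tracked object."""
    position_m: float    # current distance ahead of the vehicle, along its lane
    velocity_mps: float  # object speed along the lane (+ means moving away)


def should_stop(report: ObjectReport, ego_speed_mps: float,
                horizon_s: float = 3.0, buffer_m: float = 5.0) -> bool:
    # Step through the planning horizon, predicting both the object's and
    # the vehicle's positions, and command a stop if the gap between them
    # ever closes below the safety buffer.
    steps = int(horizon_s * 10) + 1
    for i in range(steps):
        t = i * 0.1
        obj_pos = report.position_m + report.velocity_mps * t
        ego_pos = ego_speed_mps * t
        if obj_pos - ego_pos < buffer_m:
            return True
    return False


# A box comes to rest 20 m ahead while we travel at 10 m/s: stop.
falling_box = ObjectReport(position_m=20.0, velocity_mps=0.0)

# The same box 100 m ahead leaves plenty of room over the horizon: no stop.
distant_box = ObjectReport(position_m=100.0, velocity_mps=0.0)
```

Crucially, this check needs only a position and a predicted trajectory; the planner can react to the box without perception ever deciding what the box is.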