MIT Study Showing High "Functional Vigilance" Among Autopilot Users Comes With Massive Caveats

Elon Musk adds the MIT-AVT study to his list of misleading Autopilot safety claims by glossing over key factors, both disclosed by the authors and not.

MIT-AVT study, via arXiv

Ever since the public found out that Josh Brown's Tesla Model S had slammed into a turning semi truck at high speed while the car's Autopilot system was activated, Tesla and its CEO Elon Musk have thrown a variety of statistics at the public in hopes of proving that the driver assistance system is in fact safe. These statistical "proofs" of Autopilot's safety have run the gamut from unaudited numbers tossed out by Musk via Twitter and media calls to an official NHTSA study, but all have had one thing in common: they have been debunked by a variety of critics, including Randy Whitfield of Quality Control Systems, Tim Lee of Ars Technica, Tom Simonite of MIT Technology Review, yours truly at The Daily Beast, The Daily Kanban, and here at The Drive.

Now Musk is touting yet another set of data that purports to prove that Autopilot doesn't prevent drivers from maintaining "a high degree of functional vigilance," seemingly invalidating the concerns raised by the NTSB's investigation into Josh Brown's fatal crash and by other critics of Autopilot's safety. Once again, the latest "proof" of Autopilot's safety comes from a seemingly unimpeachable source, MIT researcher Lex Fridman, but as in the past, a critical examination reveals that this latest study falls well short of proving that safety concerns about Autopilot are unfounded. If anything, it only proves how problematic an over-reliance on machine learning, and an avoidance of the tricky human factors at the heart of human-automation interaction, can be.

Before diving into the substance of Fridman's latest paper [PDF], which Musk recently quoted extremely selectively on Twitter as evidence of Autopilot's safety, a quick caveat is in order. The MIT "Autonomous Vehicle Technology Study" [PDF] that Fridman leads and which he drew on for his findings about Autopilot does attempt to introduce new data into an important and complex area of study, and I would not want my criticisms of it to be read as totally invalidating the relevance of its data. Indeed, Fridman and his co-authors are very much upfront about some of its limitations, and it's not their fault that Musk quoted it so selectively and failed to mention that their own clear caveats caution against drawing sweeping conclusions from it about Autopilot's overall safety. Nor is it strictly speaking the fault of the MIT team that Musk has so regularly pushed misleading and scientifically unsound data in an attempt to prove the safety of Autopilot features that Tesla sold as "convenience features" prior to Josh Brown's death, although given Fridman's persistent focus on Tesla and Autopilot this eventuality should have been easily predictable.

It's also important to acknowledge any factors that may affect my own perception of Fridman and his work, particularly in light of the fact that he has not been given space in this piece to respond (I do welcome any response he wishes to provide and would be happy to publish it). For one thing, he blocked me on Twitter some time ago in apparent response to my (admittedly superficial) criticism of what I saw as a simplistic and technocentric approach called "Deep Traffic" [PDF]. More bafflingly, Fridman also blocked NVIDIA's Director of Research Anima Anandkumar for suggesting that he pursue peer review, and Alex Roy for asking pointed questions about his research, all while soliciting notably pro-Tesla forums for their recommendations of "objective" journalists to cover this latest paper. Though I believe it is important to look at the substance of Fridman's work itself, these factors and other data points suggesting that he is biased toward pro-Tesla outcomes have undoubtedly affected my perspective on his work. 

These concerns carry over into even the most high-level assessments of Fridman's work, starting with the questions he seeks to answer in his research. "We launched the MIT Autonomous Vehicle Technology (MIT-AVT) study," Fridman says, "to understand, through large-scale real-world driving data collection and large-scale deep learning based parsing of that data, how human-AI interaction in driving can be safe and enjoyable" [emphasis added]. This is echoed in Fridman's introduction to the Tesla Autopilot survey, in which he describes himself as "part of a team of engineers at MIT who are looking to understand and build semi-autonomous vehicles that save lives while creating an enjoyable driving experience." This goal fundamentally undercuts the study's pretensions of objectivity, particularly when seen in the context of Fridman's obvious fondness for Tesla and the problematic nature of the study's data collection. 

At first glance this data collection regime seems as objective as can be: a set of cameras captures driver behavior and syncs it with data from the vehicle, allowing Fridman and his team to correlate conditions on the road with the driver's state and the Autopilot system's functions. Through this data collection, the authors sought to measure what they call "functional vigilance," which they describe as "incorporat[ing] both the ability of the driver to detect critical signals and the implicit ability of the driver to self-regulate when and where to switch from the role of manually performing the task to the role of supervising the automation as it performs the task." By comparing a driver's decision to take over control from Autopilot with road conditions, specifically what they call "tricky situations" that could theoretically result in a crash, the authors claim that their dataset (which notably had no crashes in it) demonstrated that "(1) drivers elect to use Autopilot for a significant percent of their driven miles and (2) drivers do not appear to over-trust the system to a degree that results in significant functional vigilance degradation in their supervisory role of system operation."

These data collection techniques themselves seem quite promising, given that driver awareness can be fairly well quantified via machine learning and the relative risk of road conditions can be more or less accurately assessed by human labelers. But the data they produce are undercut by unanswered questions about the pool of drivers they draw from. Short-term participants used MIT-owned vehicles, were screened by a criminal background search and a review of their driving record, and were prepared with more extensive training than a typical consumer receives when they purchase a car with Autopilot. This extensive screening and grooming of the MIT-AVT short-term study participants tilts data from the short-term portion of the study toward the outcome found in the Tesla-specific study, to an extent that the authors do not fully explore in their caveats.

Long-term participants used Tesla vehicles that they owned, and according to the authors were recruited "through flyers, social networks, forums, online referrals, and word of mouth." Both the participants in the instrumented study and the validating survey data were collected in a decidedly non-random fashion from forums like Tesla Motors Club, which is notorious for its intensely pro-Tesla culture and its active Tesla investor forum. This approach makes the MIT-AVT subject pool anything but a random sample, especially given that a high percentage of Tesla owners (particularly those most likely to respond to solicitations in Tesla-centric online forums) also own stock in the company. Whether they own stock or are simply fans of the company, self-selected subjects drawn from this pool would be highly likely to "perform" high levels of safety for a study that they saw as validating one of Tesla's key competitive features. Again, the authors do not appear to have considered these factors in either the results of the Autopilot "functional vigilance" study or the MIT-AVT study more broadly.

These problems make it highly unlikely that the MIT-AVT study's data set is broadly representative of how Autopilot is used in the real world, undercutting the paper's findings as well as its claim that "this work takes an objective, data-driven approach toward evaluating functional vigilance in real-world AI-assisted driving by analyzing naturalistic driving data in Tesla vehicles." On top of these specific experimental design issues, which are not directly addressed in any of the resulting papers, the authors do (to their credit) bring up a variety of other issues that further undercut the value of the study as a validation of Autopilot's safety. "Subject sample characteristics and demographic generalizability" concerns are raised by the authors, but only in terms of the "tech savvy" nature of Tesla owners and the lack of "high risk demographics such as teenage drivers" in the study. There is also an acknowledgement that safety issues may creep in over time periods longer than those studied, along with issues with the annotation of "tricky situations," the way "functional vigilance" was measured, and the concept's applicability to actual safety.

Even more significantly, the authors admit that the imperfection of Autopilot's performance and experimentation by drivers outside of proven operational design domains (ODDs) may provide a level of familiarity with the system's shortcomings that contributes to the high level of "functional vigilance" seen in the study. Between over-the-air updates that improve system performance and the natural increase in user familiarity over time, it stands to reason that "functional vigilance" would decline over a longer time horizon, particularly among a more random sample of users who received no special screening or training. Such a decline in "functional vigilance" would be more in line with peer-reviewed research into human monitoring of automation, the real-world findings of autonomous vehicle developers like Waymo, the results of the NTSB's investigation into the Josh Brown crash, and even Elon Musk's own admission that

When there is a serious accident, it is almost always, in fact, maybe always the case that it is an experienced user. And the issue is what -- more one of complacency, like we get too used to it, that tends to be more of an issue. It's not a lack of understanding of what Autopilot can do. It's actually thinking they know more about Autopilot than they do, like quite a significant understanding of it, but there's less, [maybe less].

Though the authors' awareness of some of their study's shortcomings is encouraging, it still falls well short of encompassing all the issues evident in their experiment. Moreover, the entire effort seems to almost completely ignore the actual problems outlined in the NTSB investigation of the Josh Brown crash, which appear to have contributed to several other fatal crashes exhibiting similar characteristics. The lack of hard ODD limits that the NTSB investigation fingers as a contributor to Brown's death is not only left unaddressed but actually presented as something that increases the "functional vigilance" of Autopilot users by allowing them to experiment and find the limits of the system's capabilities. The other factor singled out by the NTSB in the Brown case, the lack of a driver monitoring system (DMS), is alluded to as something that could increase "functional vigilance," but that recommendation is also implicitly undercut by the overall finding highlighted by Musk: that "functional vigilance" levels in the study were high. Given the thoroughness of the NTSB investigation, and the high likelihood that similar factors played a role in other deaths caused by inattentive drivers allowing Autopilot to drive into stationary objects at high speed, it's surprising that the authors decided to forgo testing the value of the NTSB's conclusions in favor of a vague and broad investigation of "functional vigilance" that seems to have invited more scope for experimental bias.

Holistically, both the "Human Side of Tesla Autopilot" paper and the MIT-AVT study on which it draws seem far more like the product of someone setting out to prove that "human-AI interaction in driving," specifically in Tesla's Autopilot, "can be safe and enjoyable" than an actual scientific effort to understand the very real safety concerns that emerged from the Josh Brown incident. This helps explain Fridman's decision to block Anandkumar for merely suggesting that he pursue peer review for this work, as it seems unlikely that either the study or the Autopilot-specific paper would withstand such scrutiny. At the same time, Fridman's apparent priors, the study's conclusions, his rejection of peer review, and his blocking of critical voices combine to create an impression far removed from what one might expect of such a vaunted scientific institution as MIT. That impression is all the more troubling given that it further escalates the pattern of misleading exonerations of Autopilot's safety cited by Musk, a pattern that now not only centers on a man whose fame ostensibly flows from his scientific acumen but has also embroiled a major scientific institution as well as the nation's auto safety regulator.

Beyond the overwhelming impression that the study's authors hoped to obtain a positive headline finding for Tesla, another conclusion emerges from the MIT-AVT study and related work: artificial intelligence specialists seem highly unlikely to solve the problems they have created. The authors appear to have been hyper-focused on using technology to collect "objective" data about users and feed that data into machine learning algorithms in hopes of finding meaningful correlations, at the expense of critical human-research factors like sample selection, potential subject biases such as stock ownership, and the effects that their screening and training may have had on the subject pool. The result seems to prove, more than anything else, that AI researchers are extremely prone to collecting datasets that seem "objective" to them but are in fact subject to a wide variety of influences that a technical or engineering background doesn't prepare them to understand. In other words, this study comes across as a classic example of the "AI bias" phenomenon that has produced troubling outcomes in fields like facial recognition and is garnering growing public attention.

After years of incredible advances in artificial intelligence, we seem to be coming up against some of the thorniest limitations of the technology. In the realm of human-automation interaction specifically, we can collect a great wealth of physically measurable data to feed into machine learning, but what goes on inside the human brain remains frustratingly opaque and resistant to measurement and analysis by AI systems. Reducing any study of human behavior to purely physically measurable data, without carefully controlling for subtle psychological factors like education, perceptions, and biases, risks negating the entire value of an otherwise sophisticated study. Especially when studying a hotly debated subject that inspires broad and intense affinities, like the safety of Autopilot, researchers need to be extremely vigilant against both their personal biases toward a specific outcome and an overreliance on data collection and machine-learning analysis that can mask strong biases with a veneer of objectivity.