A Lesson of Tesla Crashes? Computer Vision Can’t Do It All Yet

Yet the recent advances, while impressive, have been mainly in image recognition. The next frontier, researchers agree, is general visual knowledge — the development of algorithms that can understand not just objects, but also actions and behaviors.

Computing intelligence often seems to mimic human intelligence, so computer science understandably invites analogy. In computer vision, researchers offer two analogies to describe the promising paths ahead: a child and the brain.

The model borrowed from childhood, many researchers say, involves developing algorithms that learn as a child does, with some supervision but mostly on its own, without relying on vast amounts of hand-labeled training data, which is the current approach. “It’s early days,” Dr. Malik said, “but it’s how we get to the next level.”

In computing, the brain has served mainly as an inspirational metaphor rather than an actual road map. Airplanes don’t flap their wings, artificial intelligence experts often say. Machines do it differently than biological systems.

But Tomaso Poggio, a scientist at the McGovern Institute for Brain Research at M.I.T., is building computational models of the visual cortex of the brain, seeking to digitally emulate its structure, even how it works and learns from experience.

If successful, the outcome could be a breakthrough in computer vision and machine learning in general, Dr. Poggio said. “To do that,” he added, “you need neuroscience not just as an inspiration, but as a strong light.”


Jitendra Malik of the University of California, Berkeley, and Fei-Fei Li of Stanford, researchers in computer vision, an area of technology being applied in driverless cars. Credit Carlos Chavarria for The New York Times

The big gains in computer vision owe much to all the web’s raw material: countless millions of online photos used to train the software algorithms to identify images. But collecting and tagging that training data have been a formidable undertaking.

ImageNet, a collaborative effort led by researchers at Stanford and Princeton, is one of the most ambitious projects. Initially, nearly one billion images were downloaded. Those were sorted, labeled and winnowed to more than 14 million images in 22,000 categories. The database, for example, includes more than 62,000 images of cats.

For a computer-age creation, ImageNet has been strikingly labor intensive. At one point, the sorting and labeling involved nearly 49,000 workers on Mechanical Turk, Amazon’s global online marketplace.

Vast image databases like ImageNet have been employed to train software that uses neuron-like nodes, known as neural networks. The concept of computing neural networks stretches back more than three decades, but has become a powerful tool only in recent years. “The available data and computational capability finally caught up to these ideas of the past,” said Trevor Darrell, a computer vision expert at the University of California, Berkeley.

If data is the fuel, then neural networks constitute the engine of a branch of machine learning called deep learning. It is the technology behind the swift progress not only in computer vision, but also in other forms of artificial intelligence like language translation and speech recognition. Technology companies are investing billions of dollars in artificial intelligence research to exploit the commercial potential of deep learning.

Just how far neural networks can advance computer vision is uncertain. They emulate the brain only in general terms — the software nodes receive digital input and send output to other nodes. Layers upon layers of these nodes make up so-called convolutional neural networks, which, with sufficient training data, have become better and better at identifying images.
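To make the idea of stacked layers concrete, here is a minimal sketch in plain Python of what a single convolutional "node" computes and how one layer's output feeds the next. It is an illustration only, not code from any system mentioned in the article. The function names (`convolve2d`, `relu`), the toy 4x4 image and both kernels are assumptions chosen for clarity. In a real network the kernel values are learned from training data rather than picked by hand.

```python
def convolve2d(image, kernel):
    """Slide a small kernel over a 2D image (no padding, stride 1).

    Deep learning libraries call this operation "convolution", although
    strictly speaking it is cross-correlation (the kernel is not flipped).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0.0
            # Weighted sum of the image patch under the kernel.
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


def relu(feature_map):
    """Keep positive responses, zero out the rest (a common nonlinearity)."""
    return [[max(0.0, v) for v in row] for row in feature_map]


# Hypothetical toy image: dark left half (0), bright right half (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked kernel that responds to a dark-to-bright vertical edge.
edge_kernel = [
    [-1, 1],
    [-1, 1],
]

# Hand-picked second kernel that sums responses over a 2x2 neighborhood.
sum_kernel = [
    [1, 1],
    [1, 1],
]

# Layer 1 receives pixels; layer 2 receives layer 1's output.
layer1 = relu(convolve2d(image, edge_kernel))
layer2 = relu(convolve2d(layer1, sum_kernel))

print(layer1)  # strongest response in the middle column, where the edge is
print(layer2)
```

The first layer lights up only where the image changes from dark to bright, and the second layer combines those responses over a larger area. Stacking many such layers, with learned rather than hand-picked kernels, is the general pattern the article describes.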

Fei-Fei Li, the director of Stanford’s computer vision lab, was a leader of the ImageNet project, and her research is at the forefront of data-driven advances in computer vision. But the current approach, she said, is limited. “It relies on training data,” Dr. Li said, “and so much of what we humans possess as knowledge and context are lacking in this deep learning technology.”

Facebook recently encountered the contextual gap. Its algorithm took down the image, posted by a Norwegian author, of a naked, 9-year-old girl fleeing napalm bombs. The software code saw a violation of the social network’s policy prohibiting child pornography, not an iconic photo of the Vietnam War and human suffering. Facebook later restored the photo.

Or take a fluid scene like a dinner party. A person carrying a platter will serve food. A woman raising a fork will stab the lettuce on her plate and put it in her mouth. A water glass teetering on the edge of the table is about to fall, spilling its contents. Predicting what happens next and understanding the physics of everyday life are inherent in human visual intelligence, but beyond the reach of current deep learning technology.

At the major annual computer vision conference this summer, there was a flurry of research representing encouraging steps, but not breakthroughs. For example, Ali Farhadi, a computer scientist at the University of Washington and a researcher at the Allen Institute for Artificial Intelligence, showed off ImSitu.org, a database of images identified in context, or situation recognition. As he explains, image recognition provides the nouns of visual intelligence, while situation recognition represents the verbs. Search “What do babies do?” The site retrieves pictures of babies engaged in actions including “sucking,” “crawling,” “crying” and “giggling” — visual verbs.

Recognizing situations enriches computer vision, but the ImSitu project still depends on human-labeled data to train its machine learning algorithms. “And we’re still very, very far from visual intelligence, understanding scenes and actions the way humans do,” Dr. Farhadi said.

But for cars that drive themselves safely, several years of continuous improvement — not an A.I. breakthrough — may well be enough, scientists say. It will take not just steady advances in computer vision, they say, but also more high-definition digital mapping and gains in radar and lidar, which uses laser light to scan across a wider field of vision than radar and in greater detail.

Millions of miles of test driving in varied road and weather conditions, scientists say, should be done before self-driving cars are sold. Google has been testing its vehicles for years, and Uber is beginning a pilot program in Pittsburgh.

Carmakers around the world are developing self-driving cars, and 2021 seems to be the consensus year for commercial introduction. The German auto company BMW recently announced plans to deliver cars by 2021, in a partnership with Intel and Mobileye, an Israeli computer vision company. The cars would allow hands-free driving first in urban centers, and everywhere a few years later. And last week, Ford announced its driverless car plan with a similar timetable.

“We’re not there yet, but the pace of improvement is getting us there,” said Gary Bradski, a computer vision scientist who has worked on self-driving vehicles. “We don’t have to wait years and years until some semblance of intelligence arrives, before we have self-driving cars that are safer than human drivers and save thousands of lives.”

NYT > Technology
