Richard Socher appeared nervous as he waited for his artificial intelligence program to answer a simple question: “Is the tennis player wearing a cap?”
The word “processing” lingered on his laptop’s display for what felt like an eternity. Then the program offered the answer a human might have given instantly: “Yes.”
Mr. Socher, who clenched his fist to celebrate his small victory, is the founder of one of a torrent of Silicon Valley start-ups intent on pushing variations of a new generation of pattern recognition software, which, when combined with increasingly vast sets of data, is revitalizing the field of artificial intelligence.
His company, MetaMind, which occupies crowded offices just off the Stanford University campus in Palo Alto, Calif., was founded in 2014 with $8 million in financial backing from Marc Benioff, chief executive of the business software company Salesforce, and the venture capitalist Vinod Khosla.
MetaMind is now focusing on one of the most daunting challenges facing A.I. software. Computers are already on their way to identifying objects in digital images or converting sounds uttered by human voices into natural language. But the field of artificial intelligence has largely stumbled in giving computers the ability to reason in ways that mimic human thought.
Now a variety of machine intelligence software approaches known as “deep learning” or “deep neural nets” are taking baby steps toward solving problems like a human.
On Monday, MetaMind is expected to publish a paper describing advances its researchers have made in creating software capable of answering questions about the contents of both textual documents and digital images.
The new research is intriguing because it indicates that steady progress is being made toward “conversational” agents that can interact with humans. The MetaMind results also underscore how far researchers have to go to match human capabilities.
Other groups have previously made progress on discrete problems, but generalized systems that approach human levels of understanding and reasoning have not been developed.
Five years ago, IBM’s Watson system demonstrated that it was possible to outperform humans on “Jeopardy!”
Last year, Microsoft developed a “chatbot” program known as Xiaoice (pronounced Shao-ice) that is designed to engage humans in extended conversation on a diverse set of general topics.
To add to Xiaoice’s ability to offer realistic replies, the company developed a huge library of human question-and-answer interactions mined from social media sites in China. This made it possible for the program to respond convincingly to typed questions or statements from users.
In 2014, computer scientists at Google, Stanford and other research groups made significant advances in what is described as “scene understanding,” the ability to understand and describe a scene or picture in natural language, by combining the output of different types of deep neural net programs.
These programs were trained on images that humans had previously described. The approach made it possible for the software to examine a new image and describe it with a natural-language sentence.
While even machine vision is not yet a solved problem, steady, if incremental, progress continues to be made by start-ups like Mr. Socher’s; giant technology companies such as Facebook, Microsoft and Google; and dozens of research groups.
In their recent paper, the MetaMind researchers argue that the company’s approach, known as a dynamic memory network, holds out the possibility of simultaneously processing inputs including sound, sight and text.
The design of MetaMind software is evidence that neural network software technologies are becoming more sophisticated, in this case by adding the ability both to remember a sequence of statements and to focus on portions of an image. For example, a question like “What is the pattern on the cat’s fur on its tail?” might yield the answer “stripes” and show that the program had focused only on the cat’s tail to arrive at its answer.
“Another step toward really understanding images is, are you actually able to answer questions that have a right or wrong answer?” Mr. Socher said.
MetaMind is using the technology for commercial applications like automated customer support, he said. For example, insurance companies have asked if the MetaMind technology could respond to an email with an attached photo, perhaps of damage to a car or other property, he said.
There is still significant debate within the research community about the best technical approach and even about the best way to measure progress.
“We are excited to see them joining the fray in question answering, but we think the data sets they chose are not ideal,” said Oren Etzioni, a computer scientist who is chief executive of the Allen Institute for Artificial Intelligence, in Seattle.
In contrast, his laboratory is focusing on creating software that can answer questions taken from standardized elementary school science tests.