Language is a big ambiguous soup. Words have different meanings in different contexts, homographs exist, and humans regularly warp the established lexicon with slang. It’s almost a miracle that we’ve built computers that can understand us at all.
Systems for understanding human language have advanced significantly in the last few years, as researchers have improved approaches to artificial intelligence like deep learning. These methods set algorithms crawling through immense troves of data to draw connections between words and phrases. One core task is “parsing,” or identifying each word and its role in the sentence.
Today Google is publishing the code for its language system called SyntaxNet, as well as an already-trained program for English called Parsey McParseface. Google’s tests put Parsey McParseface’s accuracy for correctly understanding words at more than 94 percent—close to Google’s internal benchmarks of 96 percent for the humans they employ for the same task. With SyntaxNet, researchers outside of Google will be able to train and implement their own language understanding systems for other languages, or try to beat Google’s score.
The system works by taking multiple passes at each sentence, forming hypotheses about each potential connection between words. These hypotheses are based on sentences and words the algorithm has been shown in the past, called the training data. The system ranks the hypotheses as it works through variations of each word’s potential meaning, keeps only the most promising ones, and finally settles on the interpretation with the highest overall probability. Researchers call this a “beam search,” a term coined at Carnegie Mellon in 1976.
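To make the idea of ranking and pruning hypotheses concrete, here is a minimal beam search sketch in Python. It is a toy, not SyntaxNet's code: it only assigns a part-of-speech label to each word rather than building a full parse, and the function names, the label set, and every probability in it are invented for illustration.

```python
import math

# Hypothetical per-word label probabilities for "I saw her duck".
# These numbers are made up for the example, not learned from data.
WORDS = ["I", "saw", "her", "duck"]
CANDIDATES = [
    {"PRON": 1.0},
    {"VERB": 0.8, "NOUN": 0.2},
    {"DET": 0.5, "PRON": 0.5},
    {"NOUN": 0.6, "VERB": 0.4},
]

# Hypothetical probabilities of one label following another.
TRANSITIONS = {
    ("PRON", "VERB"): 0.8,
    ("PRON", "NOUN"): 0.2,
    ("VERB", "DET"): 0.6,
    ("VERB", "PRON"): 0.4,
    ("DET", "NOUN"): 0.9,
    ("DET", "VERB"): 0.1,
}


def transition(prev_label, label):
    """Look up how likely `label` is after `prev_label` (0.5 if unknown)."""
    return TRANSITIONS.get((prev_label, label), 0.5)


def beam_search(candidates, beam_width=2):
    """Return the `beam_width` best label sequences as (log_prob, labels)."""
    beam = [(0.0, [])]  # start with one empty hypothesis
    for options in candidates:
        expanded = []
        for score, labels in beam:
            prev = labels[-1] if labels else None
            # Extend each surviving hypothesis with every possible label.
            for label, prob in options.items():
                step = prob * transition(prev, label)
                expanded.append((score + math.log(step), labels + [label]))
        # Rank the hypotheses and keep only the most probable few.
        expanded.sort(key=lambda hyp: hyp[0], reverse=True)
        beam = expanded[:beam_width]
    return beam


result = beam_search(CANDIDATES, beam_width=2)
for log_prob, labels in result:
    print(round(math.exp(log_prob), 5), list(zip(WORDS, labels)))
```

With these invented numbers, the top hypothesis reads “her duck” as a determiner and a noun (she owns a duck), while the runner-up reads it as a pronoun and a verb (she ducked). A wider beam keeps more alternatives alive at each step, at the cost of more computation.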
Language understanding is a no-brainer for Google, whose entire platform depends on understanding what users want to see from their searches. SyntaxNet is built within TensorFlow, Google’s open source machine learning platform. For more information, check out Google’s blog post on the announcement.