How is it that a program like OpenAI’s GPT-3 neural network can answer multiple choice questions or write a poem in a particular style, even though it was never programmed for those specific tasks?
It may be because human language has statistical properties that lead a neural network to expect the unexpected, according to new research from DeepMind, Google’s AI unit.
Natural language, seen from a statistical point of view, has qualities that are “non-uniform.” Some words can represent multiple things, a property known as “polysemy”: the word “bank” can mean a place where you deposit money or the raised ground along a river. And words that sound the same can stand for different things, such as “here” and “hear,” known as homophones.
Those language qualities are the focus of a paper published on arXiv this month, “Data Distributional Properties Drive Emergent In-Context Learning in Transformers,” by DeepMind scientists Stephanie C. Y. Chan, Adam Santoro, Andrew K. Lampinen, Jane X. Wang, Aaditya Singh, Pierre H. Richemond, Jay McClelland, and Felix Hill.
Also: What is GPT-3? Everything your business needs to know about OpenAI’s revolutionary AI language program
The authors began by asking how programs like GPT-3 can solve tasks in which they are presented with query types for which they have not been explicitly trained, known as “few-shot learning.”
For example, GPT-3 can answer multiple-choice questions without ever being explicitly programmed for that question format, simply because a human user types an example of a multiple-choice question-and-answer pair into the prompt.
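This kind of prompting can be illustrated with a hypothetical prompt. The questions below are invented for illustration and are not taken from the paper:

```python
# A hypothetical few-shot prompt: one worked example primes the model
# to answer a new question in the same multiple-choice format.
prompt = (
    "Q: Which planet is closest to the Sun?\n"
    "(a) Venus (b) Mercury (c) Mars\n"
    "A: (b) Mercury\n\n"
    "Q: Which gas makes up most of Earth's atmosphere?\n"
    "(a) Oxygen (b) Carbon dioxide (c) Nitrogen\n"
    "A:"
)
# The model is never fine-tuned on this format; the single worked
# example in the context window is enough to establish the task.
print(prompt)
```

The model is expected to continue the text after the final “A:”, completing the pattern established in context.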
“Large transformer-based language models can perform few-shot learning (also known as in-context learning) without having been explicitly trained for it,” they write, referring to the popular Transformer neural network, introduced by Google, that is the basis of GPT-3 and Google’s BERT language program.
As they explain, “we hypothesized that specific distributional properties of natural language might drive this emergent phenomenon.”
The authors speculate that such large language models behave like another type of machine learning program, known as meta-learning. Meta-learning programs, which DeepMind has explored in recent years, work by modeling patterns that span different data sets. These programs model not a single distribution of data, but a distribution over data sets, as explained in previous research by team member Adam Santoro.
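The distinction can be sketched in a few lines of Python. This is an illustrative toy, not the authors’ code: each meta-learning episode draws a fresh task from a pool, so the learner is exposed to a distribution over data sets rather than one fixed data set (the task names and structure here are invented):

```python
import random

# Illustrative sketch (not the authors' code): meta-learning samples a
# fresh task per episode, so the model sees a distribution over data
# sets rather than a single fixed data set.
def sample_episode(task_pool, n_support=4, n_query=1, rng=random):
    task = rng.choice(task_pool)  # draw one data set / task
    examples = rng.sample(task["examples"], n_support + n_query)
    return examples[:n_support], examples[n_support:]

# Two toy "data sets," each with its own input-label mapping.
task_pool = [
    {"name": "parity", "examples": [(n, n % 2) for n in range(20)]},
    {"name": "sign",   "examples": [(n - 10, int(n >= 10)) for n in range(20)]},
]
support, query = sample_episode(task_pool)
```

Ordinary supervised learning would train on one entry of `task_pool` forever; meta-learning resamples the task every episode, forcing the model to infer the current mapping from the support examples.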
Also: OpenAI’s gigantic GPT-3 hints at the limits of language models for AI
The key here is the idea of different data sets. They conjecture that the non-uniformities of language, such as polysemy and its “long tail,” the fact that speech contains many words used relatively infrequently, each resemble a distribution over separate data sets rather than a single distribution of data.
Language, they write, sits somewhere between supervised training data, with its regular patterns, and meta-learning, with its many different data sets:
As in supervised training, items (words) recur, and item-label mappings (e.g., word meanings) are somewhat fixed. At the same time, the long-tailed distribution ensures that there are many rare words that recur only infrequently across context windows, but can be bursty (appear multiple times) within a context window. We can also view synonyms, homonyms, and polysemy as weaker versions of the completely unfixed item-label mappings used in few-shot meta-training, where the mappings change every episode.
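The long-tail property the authors describe can be simulated with a toy Zipfian sampler. This is a minimal sketch assuming a 1/rank frequency law; the vocabulary is synthetic:

```python
import random
from collections import Counter

# Illustrative sketch: drawing tokens from a Zipfian (long-tailed)
# distribution. Most draws come from a few head words, while tail
# words are rare overall but can "burst" (repeat) within one window.
random.seed(0)
vocab = [f"word{i}" for i in range(1000)]
weights = [1 / (rank + 1) for rank in range(len(vocab))]  # ~1/rank

window = random.choices(vocab, weights=weights, k=64)  # one context window
counts = Counter(window)
print(counts.most_common(5))
```

Running this repeatedly shows the pattern the paper leans on: head words dominate every window, while a given tail word is absent from most windows yet can appear several times in the one where it occurs.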
To test the hypothesis, Chan and her colleagues take a surprising approach: they don’t actually work with language tasks. Instead, they train a Transformer neural network to solve a visual task called Omniglot, introduced in 2015 by academics from NYU, Carnegie Mellon, and MIT. Omniglot challenges a program to assign the correct classification label to 1,623 handwritten character glyphs.
In the case of Chan et al.’s work, they make the Omniglot labeling challenge a few-shot task by randomly shuffling the glyph labels, so that the neural network must learn the label mapping anew with each “episode”:
Unlike in training, where labels were fixed across all sequences, the labels for these two image classes were randomly reassigned for each sequence. […] Because the labels were randomly reassigned for each sequence, the model must use the context in the current sequence to make a label prediction for the query image (a two-way classification problem). Unless otherwise stated, few-shot learning was always evaluated on held-out image classes that were never seen in training.
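The episode construction described in the excerpt can be sketched as follows. This is an illustrative reconstruction, not the authors’ code; strings stand in for image data:

```python
import random

# Illustrative sketch (not the authors' code): one few-shot evaluation
# episode. The two classes keep their images, but the labels 0/1 are
# randomly reassigned each episode, so only the in-context pairs reveal
# which label goes with which class.
def make_episode(class_a_images, class_b_images, rng=random):
    labels = [0, 1]
    rng.shuffle(labels)  # fresh label mapping every episode
    context = [(img, labels[0]) for img in class_a_images[:-1]]
    context += [(img, labels[1]) for img in class_b_images]
    rng.shuffle(context)
    # The held-out query image must be classified using only the context.
    query_image, query_label = class_a_images[-1], labels[0]
    return context, query_image, query_label

context, query_image, query_label = make_episode(
    ["a1", "a2", "a3"], ["b1", "b2"])
```

Because the mapping is rerolled per episode, memorizing a fixed image-to-label table is useless; the model can only succeed by reading the context, which is exactly the few-shot behavior being measured.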
In this way, the authors are manipulating visual data, the glyphs, to capture the non-uniform qualities of language. “At training time, we place the Omniglot images and labels into sequences with various language-inspired layout properties,” they write. For example, they gradually increase the number of class labels that can be assigned to a given glyph, to approximate the quality of polysemy.
“In evaluation, we tested whether these properties give rise to few-shot learning abilities.”
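The “polysemy factor” manipulation can be sketched similarly. This is a minimal illustration, not the authors’ implementation, in which each image class is mapped to several candidate labels:

```python
import random

# Illustrative sketch: a "polysemy factor" of k maps each image class
# to k possible labels, mimicking words with several meanings. During
# training, any one of a class's k labels may be used for it.
def assign_polysemous_labels(n_classes, polysemy_factor, rng=random):
    labels = list(range(n_classes * polysemy_factor))
    rng.shuffle(labels)
    return {c: labels[c * polysemy_factor:(c + 1) * polysemy_factor]
            for c in range(n_classes)}

mapping = assign_polysemous_labels(n_classes=5, polysemy_factor=3)
# Each of the 5 classes now has 3 candidate labels, so a class's
# label is no longer fully predictable from the class alone.
```

Raising the factor loosens the item-label mapping, pushing the training data away from fixed supervised labels and toward the per-episode relabeling of meta-training.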
What they found is that as they multiplied the number of labels for a given glyph, the neural network got better at few-shot learning. “We see that increasing this ‘polysemy factor’ (the number of labels assigned to each word) also increases few-shot learning,” as Chan and her colleagues put it.
“In other words, making the generalization problem harder actually made few-shot learning emerge more strongly.”
At the same time, there’s something about the specific structure of the Transformer network that helps it achieve few-shot learning, Chan and her colleagues find. They test “a vanilla recurrent neural network,” they write, and find that such a network never develops few-shot ability.
“Transformers show a significantly greater bias toward few-shot learning than recurrent models.”
The authors conclude that both the qualities of the data, such as the long tail of the language, as well as the nature of the neural network, such as the structure of the Transformer, are important. It is not one or the other but both.
The authors list a number of avenues to explore in the future. One is the connection to human cognition, as infants demonstrate what appears to be few-shot learning.
For example, babies quickly learn the statistical properties of language. Could these distributional features help infants acquire the capacity for rapid learning or serve as useful pre-training for later learning? And might similar non-uniform distributions in other domains of experience, such as vision, also play a role in this development?
It should be apparent that the current work is not a language test at all. Rather, its goal is to emulate the supposed statistical properties of language by recreating the non-uniformities in the visual data, the Omniglot images.
The authors do not address whether this translation from one modality to another affects the significance of their findings. Instead, they write that they hope to extend their work to more aspects of language.
“The above results suggest interesting lines of future research,” they write, including: “How do these data distribution properties interact with reinforcement learning versus supervised losses? How might results differ in experiments replicating other aspects of language and language modeling, for example, using symbolic inputs, training on predicting the next token or masked token, and having the meaning of words determined by their context?”