PaLM 🌴, DALL-E 2 👨🎨, Chinchilla 🐭, Chain-of-thought prompting ⛓💭✍️, Values and Culture in NLP 🏛
This newsletter covers PaLM, DALL-E 2, and Chinchilla, chain-of-thought prompting, and the role of values and culture in NLP.
This edition is somewhat delayed as I've been busy with planning a move (I'll be flying 🛫 to Germany tomorrow; say hi 👋 if you're in Berlin) and exhausted by current events. I hope that you are all staying safe in these trying times 🇺🇦.
I really appreciate your feedback, so let me know what you love ❤️ and hate 💔 about this edition. Simply hit reply on the issue.
This Model Can Understand Your Jokes 🤪
The emergence of large pre-trained models has fundamentally changed the face and nature of progress in ML and NLP. The underlying methods have not changed dramatically; neural networks were already being pre-trained more than 15 years ago. However, the recent scale of models and data has enabled unprecedented—and indeed unexpected—capabilities.
Two recent models showcase the impressive progress in vision and NLP: OpenAI's DALL-E 2 and Google's PaLM. Both can be seen as the most recent milestone in a line of ever larger pre-trained models such as DALL-E, T5 and GPT-3, among others.
DALL-E 2 consists of two components: a prior that generates a CLIP image embedding based on a text description and a diffusion-based decoder that generates an image conditioned on an image embedding (see this overview of CLIP and diffusion models). The paper, unfortunately, is light on details regarding the composition and amount of training data. The resulting model produces more photorealistic and faithful images than its predecessor (see below).
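The two-stage structure can be sketched as a simple data-flow pipeline. This is purely illustrative: the real prior and decoder are large learned models, and the placeholder functions below only show how the pieces connect. The 512-dimensional embedding size is an assumption (it matches the public CLIP ViT-B/32, but the paper's exact configuration may differ).

```python
# Structural sketch of the DALL-E 2 pipeline: text -> CLIP text embedding
# -> prior -> CLIP image embedding -> diffusion decoder -> image.
# All three stages are random placeholders illustrating the data flow only.
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 512  # assumed CLIP embedding dimensionality (512 in ViT-B/32)

def clip_text_encoder(caption: str) -> np.ndarray:
    # Placeholder: a real CLIP encoder maps the caption into a joint
    # text-image embedding space.
    return rng.standard_normal(EMB_DIM)

def prior(text_embedding: np.ndarray) -> np.ndarray:
    # Stage 1: generate a CLIP *image* embedding from the text embedding.
    return rng.standard_normal(EMB_DIM)

def diffusion_decoder(image_embedding: np.ndarray) -> np.ndarray:
    # Stage 2: generate pixels conditioned on the image embedding.
    return rng.standard_normal((64, 64, 3))  # a small RGB image

image = diffusion_decoder(prior(clip_text_encoder("a corgi playing a trumpet")))
print(image.shape)  # (64, 64, 3)
```

The key design choice this mirrors is that the decoder never sees the text directly; it is conditioned only on the image embedding produced by the prior.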
While DALL-E 2 is able to generate impressive images, it still has weaknesses. As it relies on image–caption pairs for training, it may perform poorly when generating images that require more fine-grained visual reasoning such as counting.
In NLP, there has been a debate on whether a language model trained only on unsupervised data can ever truly understand natural language (see this overview and this recent principled paper). Multi-modal models are grounded by definition. So are there any intrinsic limitations to the capabilities of language-and-vision models trained on image–caption alignments? One could argue that in order to learn truly multi-modal representations, a model must not only learn from depictions of the real world but must be able to interact with it.
On the language side, PaLM is a 540B parameter decoder-only pre-trained Transformer model trained on multilingual—but heavily skewed towards English—data from the web as well as GitHub code. The model is evaluated in a few-shot setting on a battery of tasks where it generally outperforms the prior SOTA. When fine-tuned on SuperGLUE, the model handily outperforms the best decoder-only model and is competitive with encoder-decoder models (which generally perform better in such a fine-tuned setting).
What I found most impressive, however, are some of the qualitative examples of model behaviour. For instance, the model is exceptionally good at explaining jokes. You can judge for yourself below. In each case, the model was prompted with just two example joke explanations and then had to generate its own.
It will be interesting to see what this means for tasks such as sarcasm and irony detection, which have been mainstays of competitions such as SemEval. I had previously considered these tasks to be still far out of reach of current model capabilities. Such anecdotal evidence naturally does not mean that these tasks are solved but that we may need more sophisticated and robust benchmarks to assess model performance.
Similarly, explaining jokes is not something that I would have expected current models to be able to do. Consequently, there may be an array of applications that have so far been infeasible where models might be able to add value. We can thus expect to see more work that explores how we can leverage such models for previously unexplored applications. For a large publicly available model to experiment with, check out GPT-NeoX-20B.
Buoyed by these latest advances in NLP, there is a wave of new NLP startups that tackle a diverse set of applications, from search to writing assistants, content moderation, and many more.
Training Compute-Efficient LMs 🐭
While large language models are becoming more powerful, they are also becoming increasingly hard to use due to their huge size. In conjunction with scaling models, it is thus key to make advances in a) compressing large models to smaller sizes and b) training more compute-efficient models to begin with.
Regarding the latter, researchers from DeepMind recently observed that current large language models are significantly under-trained. They found that for the most compute-efficient training, the number of training tokens should be doubled whenever the model size is doubled. This is a much larger scaling rate than that predicted by previous scaling laws. Their new 70B-parameter model, Chinchilla, outperforms models of up to 530B parameters by training on much more data (1.4T vs 300B tokens).
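A back-of-the-envelope sketch makes the scaling rule concrete. It assumes the common approximation that training compute is C ≈ 6·N·D FLOPs for N parameters and D tokens, and uses a ratio of roughly 20 tokens per parameter, which is an approximate rule of thumb consistent with Chinchilla's 70B/1.4T configuration rather than an exact law:

```python
# Compute-optimal model/data sizing under the assumptions C = 6*N*D
# and D = k*N (Chinchilla-style linear scaling of tokens with params).
import math

def compute_optimal(flops_budget: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that spend the budget with D = k*N."""
    # C = 6 * N * (k * N) = 6k * N^2  =>  N = sqrt(C / (6k)), D = k * N
    n_params = math.sqrt(flops_budget / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly the Gopher/Chinchilla training budget of ~5.76e23 FLOPs
params, tokens = compute_optimal(5.76e23)
print(f"{params / 1e9:.0f}B parameters, {tokens / 1e12:.1f}T tokens")
# → 69B parameters, 1.4T tokens
```

At the same compute budget, the rule trades a smaller model for far more data, which is exactly why the 70B Chinchilla can overtake much larger but under-trained models.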
Such an under-training phenomenon of large models is not entirely new. For RoBERTa, the authors similarly observed that BERT was significantly under-trained and that longer training improves its performance.
Given the non-linear improvement and emergence of new capabilities with large model sizes, it will be key to investigate what is necessary to retain such impressive few-shot and reasoning capabilities at smaller model sizes.
Another direction I am excited about is the modularization of huge models: For most practical applications, not all capabilities of a huge model are truly relevant. How then can we isolate and compress a huge model's domain and task-specific knowledge in a small model that excels on the downstream task? Similarly, how can we efficiently leverage only the parts of the pre-trained model that are necessary for the downstream setting or bootstrap a strong small model using a large pre-trained model? For more thoughts on such a modular perspective of model development, check out Colin Raffel's call to build open-source models like we build software.
Chain-of-Thought Prompting ⛓💭✍️
Despite the increasing power and capabilities of pre-trained models, the way we use and interact with them has not changed much. In-context prompting, pioneered by GPT-3 (see this overview), has been one of the most significant recent developments. However, we are still only scratching the surface of how to best extract information from pre-trained LMs and how to prime them for downstream tasks.
A method that enabled PaLM to perform particularly well on reasoning tasks is chain-of-thought prompting. Rather than training a model to predict the answer, chain-of-thought prompting augments the prompt with an explanation of the reasoning steps to arrive at the answer as can be seen below.
In a few-shot setting, these explanations can be manually written for a few examples. Prompted this way, the model learns to generate a similar explanation, which is particularly useful on more challenging reasoning problems.
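Mechanically, this is just careful prompt assembly. The sketch below shows one way such a prompt might be built; the exemplar is illustrative, written in the style of the arithmetic word problems used in this line of work, and the exact formatting (the "Q:"/"A:" scaffolding, the "The answer is" suffix) is an assumption, not the papers' verbatim template:

```python
# Assemble a chain-of-thought prompt: each few-shot exemplar includes
# hand-written reasoning steps *before* the final answer, priming the
# model to generate its own reasoning for the query at the end.
EXEMPLARS = [
    {
        "question": "Roger has 5 tennis balls. He buys 2 cans with 3 balls "
                    "each. How many tennis balls does he have now?",
        "reasoning": "Roger started with 5 balls. 2 cans of 3 balls is "
                     "6 balls. 5 + 6 = 11.",
        "answer": "11",
    },
]

def build_cot_prompt(query: str) -> str:
    parts = []
    for ex in EXEMPLARS:
        parts.append(f"Q: {ex['question']}\n"
                     f"A: {ex['reasoning']} The answer is {ex['answer']}.")
    # The prompt ends mid-answer so the model continues with reasoning.
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

print(build_cot_prompt("A pencil costs 2 dollars. How much do 4 pencils cost?"))
```

Compared with standard few-shot prompting, the only change is the reasoning text inserted before each exemplar's answer; the model and decoding procedure stay the same.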
Chain-of-thought prompting can be seen as in line with several prior research areas. While explanations have most commonly been used to improve interpretability, Rajani et al. (2019) train a model to automatically generate explanations during training and inference, achieving a new SOTA on a commonsense reasoning dataset.
In a similar vein, Nye et al. (2021) train a model to write the intermediate computation steps of an arithmetic problem to a "scratchpad". For summarization, Narayan et al. (2021) train a model to generate an entity chain (an ordered sequence of entities mentioned in the reference summary). At test time, the model first generates the entity chain before generating the summary.
There are other ways to improve learning with such intermediate outputs. Wang et al. (2022) exploit the diversity of reasoning paths by sampling multiple chains of thought and then ensembling the final model predictions. As obtaining explanations for a large number of examples is expensive, Zelikman et al. (2022) generate explanations for a large dataset by bootstrapping a model in the few-shot setting and only retaining explanations that lead to correct answers.
Using explanations, rationales, or a description of reasoning steps works empirically, but a more principled theory of how models leverage such rationales is still missing. In particular, it would be interesting to investigate to what extent a model's reasoning conforms to the reasoning steps preferred by humans (although the model can also be trained to perform more human-like reasoning, similar to InstructGPT).
Beyond interpretability, generating an intermediate output enables the user to intervene on a model's predictions. Narayan et al. (2021) demonstrate this by removing entities from the entity chain that were not seen in the original input, which improves the faithfulness of the generated summary. As a side-effect, such intermediate-output methods provide an interface and the potential to modulate and steer the predictions of otherwise black-box models. We can thus expect work focusing on whether such rationales truly explain model behaviour, similar to the debate around the explainability of attention.
Overall, chain-of-thought prompting and related methods offer a glimpse of the untapped potential of current models. They also present an important and relatively compute-efficient research direction that can bring large improvements on top of state-of-the-art models. In this research, domain expertise is particularly important as it enables the development of strategies, reasoning steps, or alternative input methods that are particularly suited to an application. Prompts also do not need to be restricted to input–output pairs or explanations and can be much richer, including things to avoid, rules of thumb, positive or negative examples, etc., as in the schema of Mishra et al. (2022) below.
Values and Culture in NLP 🏛
One of the most important and far-reaching recent insights is that language models inherit the biases of the data they are trained on (see this overview). Over time, our understanding of such biases has become more nuanced. Beyond generating toxic language when conditioned with certain prompts, recent work has turned to investigating a model's ideology and values.
Gururangan et al. (2022) aim to identify the language ideology encoded in GPT-3 by analyzing what type of language the quality filter used in GPT-3 is biased against. They replicate it and apply it to a corpus of U.S. high school newspapers (augmented with demographic information). They find that the filter favours text from authors who originate from regions with better educational attainment, urban centres, larger schools, and higher valued homes.
Looking at specific cultural values, Arora et al. (2022) convert the questions of Hofstede's cultural dimensions survey and of the World Values Survey into prompts that are presented to multilingual language models. They find that the models exhibit differences in cultural values and that the values exhibited by the models are not in line with the values of the survey participants.
Johnson et al. (2022) investigate the values of GPT-3 by prompting it to summarize a range of culturally diverse texts. They examine the generated summaries and highlight problematic summaries and ones where the expressed values conflict with the original text.
These works demonstrate that beyond investigating bias related to specific lexical terms in current models, we also must be aware of the underlying values encoded in the model and expressed in the generated text. After all, we would not want our models to hold views that are outdated or disrespectful in certain cultural settings. However, the best way to investigate and robustly identify such values is still an open question.
The impact of culture in NLP goes beyond ideology and values. For a great overview of the cultural dimensions that are relevant in NLP, have a look at this survey. The authors define four cultural dimensions of relevance: linguistic form and style (how things are expressed in language), common ground (shared knowledge based on which people reason and communicate), "aboutness" (what information is relevant or meaningful to people), and objectives or values (what people strive for).