EMNLP 2020 & ICLR 2021, Toxicity detection, Data augmentation and adversarial examples

Nov 13, 2020

Hi all,

With EMNLP 2020 coming up next week and ICLR 2021 reviews just out, this newsletter contains a primer on interesting papers for both. In addition, I discuss high-level trends in two topics with exciting recent work, toxicity detection and data augmentation (including adversarial examples).

I really appreciate your feedback, so let me know what you love ❤️ and hate 💔 about this edition. Simply hit reply on the issue.

If you were referred by a friend, click here to subscribe. If you enjoyed this issue, give it a tweet 🐦.

1. EMNLP 2020 primer 🏛

The 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020) will take place from November 16–18 (with workshops and tutorials from November 19–20). Registration is still open ($200 regular and $75 for students). The schedule and proceedings are online. If this is your first (virtual) conference, check out the last newsletter with some tips on attending virtual conferences. See you there!

Another thing to note is that we now have Findings of EMNLP 2020, a collection of 447 long and short papers that in many cases focus on more niche areas. Personally, I enjoyed reading many Findings papers focusing on under-studied areas and am excited about this new venue. Here are the EMNLP papers that I enjoyed most so far:

What Can We Do to Improve Peer Review in NLP? 👩‍🏫 This meta-research Findings paper discusses the pros and cons of peer review. It lucidly highlights many of the points that are being raised in the ongoing debate. I found their characterisation of peer review as an "apples-to-oranges comparison" particularly compelling: reviewers are forced to weigh papers with completely different merits against each other. They also highlight lessons from psychology: For instance, the proclivity of reviewers to resort to heuristics can be explained by the human tendency to "unconsciously substitute [a difficult question] with an easier one (Kahneman, 2013)".

MultiCQA: Zero-Shot Transfer of Self-Supervised Text Matching Models on a Massive Scale 📚 One thing that I generally appreciate is that as a community, we have increasingly moved to evaluating methods on multiple tasks or domains. This paper takes this to the extreme by considering transfer from 140 (!) English StackExchange domains using adapters. The models are evaluated on answer selection and question similarity datasets and largely outperform IR baselines. What I found particularly interesting is that neither domain similarity nor training data size consistently predicted the best models—instead, having access to diverse source models is important. Indeed, combining information from all source domains performs best. The pre-trained adapters are available at AdapterHub and can be easily downloaded in a plug-and-play fashion.

Which *BERT? A Survey Organizing Contextualized Encoders 🦍 This survey is a great starting point to catch up on what has been going on in Transformer land. It synthesises many important points and take-aways from the recent literature and makes a number of recommendations. Specifically, I second their suggestion to publish and publicise negative results. Venues such as the Workshop on Insights from Negative Results in NLP or even Findings should be a good fit. Alternatively, if you have a method that works, consider including a section in the appendix describing what did not work. I also really like the idea of leaderboard owners periodically publishing surveys of their received submissions (stay tuned for an update on XTREME). Overall, choosing which BERT to use requires trading off task performance and efficiency for your application, deciding whether leaderboard performance reflects that of your downstream task, opting for monolingual or multilingual models, etc.

On Losses for Modern Language Models 📄 A new pre-training objective has been one of the most common modelling contributions in papers that seek to advance self-supervised pre-training. To date, however, it has not been clear how these different losses interact and if they provide any substantial benefit over the now standard masked language modelling (MLM). This paper conducts a thorough study of both existing and new pre-training objectives (including next-sentence prediction). They find that next-sentence prediction does not help as it is too easy but identify several auxiliary tasks that outperform just doing MLM, including predicting sentence order or adjacency, predicting tf-idf statistics, and efficiently predicting sentence continuations. Combining them makes for more data-efficient pre-training. Overall, besides better pre-training objectives, future pre-trained models may thus rely on many objectives in order to be more sample-efficient.

Identifying Elements Essential for BERT’s Multilinguality 🌍 One of the most intriguing problems in recent NLP for me has been the mystery of how pre-trained multilingual models can generalise effectively across languages without any explicit cross-lingual supervision (see our recent study as well as others). This paper sheds more light on this problem through a controlled study in a synthetic setting: learning representations between English and Fake-English (where token indices are shifted by a constant). They find that underparameterisation, shared special tokens, shared position embeddings, and masked language modelling with random word replacement all contribute to multilinguality. Perhaps most interestingly, the model completely fails when the word order of English is inverted, which indicates a challenge for multilingual representation learning. Overall, while such a synthetic setting can only approximate the messiness of real-world multilingual data, any approach that fails here won't be successful under realistic circumstances. Recent papers that employ a similar synthetic setting based on modifying English are (Hartmann et al., 2018; Ravfogel et al., 2019; K et al., 2020; Vulić et al., 2020; disclaimer: I'm a co-author on the last one).
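To make the Fake-English construction concrete, here is a minimal sketch of shifting token indices by a constant while keeping special tokens shared. The toy vocabulary and the helper name `to_fake_english` are my own assumptions for illustration and are not taken from the paper's code.

```python
def to_fake_english(token_ids, vocab_size, num_special=0):
    """Shift every non-special token id by vocab_size.

    Ids below num_special are treated as special tokens and left unchanged,
    so they stay shared between English and Fake-English.
    """
    return [t if t < num_special else t + vocab_size for t in token_ids]


# Hypothetical toy vocabulary: ids 0 and 1 are special tokens.
vocab = {"[CLS]": 0, "[SEP]": 1, "the": 2, "cat": 3, "sat": 4}

english = [vocab[w] for w in ["[CLS]", "the", "cat", "sat", "[SEP]"]]
fake_english = to_fake_english(english, vocab_size=len(vocab), num_special=2)

print(english)       # [0, 2, 3, 4, 1]
print(fake_english)  # [0, 7, 8, 9, 1]
```

The two "languages" have identical structure but disjoint word ids, so any alignment a model learns between them has to come from structure and not from shared words.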

Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision 🖼 There has recently been a flurry of pre-trained multimodal approaches. These models reported substantial gains on multimodal tasks such as image captioning or visual question answering. Grounded language agents have also been observed to encode spatial relations (Ramalho et al., 2018) and to be capable of fast generalisation (Hill et al., 2020). What has remained elusive so far, however, are gains on standard text-only tasks in NLP with multimodal models. This paper proposes to ground a language model via token-level supervision of token-related images (visualised tokens, or vokens). Specifically, the authors pre-train a BERT model to additionally classify the relevant image for each token (which is retrieved via a token-image retrieval model trained on image captioning data). They report gains on GLUE, SQuAD and SWAG. In a sense, this paper also demonstrates the usefulness of multi-view learning over learning from unrelated datasets in multiple modalities. I hope this paper will lead to more creative uses of visual data for text, extensions to other modalities, and to a multilingual setting.

With Little Power Comes Great Responsibility 💪 This paper studies an under-appreciated concept in the NLP literature: statistical power, the probability that a test will detect a true effect. The power depends on both the sample size (e.g. the number of examples in a test set) and the expected difference in performance. The authors find that many standard tasks such as WNLI, MRPC, and SST-2 in GLUE are underpowered, i.e. their test sets are not large enough to conclusively detect whether a new method actually improves over an existing one. They also find that the most common design for human rating studies (3 workers, 100 items) is underpowered. Overall, this study highlights that as our models become more powerful and improvements on tasks become slimmer, we need to design tests with larger sample sizes in order to decisively detect advances.
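The dependence of power on test set size can be illustrated with a small simulation. This is only a rough sketch under simplifying assumptions of my own: the two models' predictions are treated as independent, significance is checked with a two-proportion z-test, and the accuracies (90% vs. 92%) and test set sizes are hypothetical. It is not the procedure used in the paper.

```python
import random
from statistics import NormalDist


def simulated_power(n, acc_a, acc_b, alpha=0.05, trials=300, seed=0):
    """Estimate how often a two-sided z-test detects acc_a != acc_b on n examples."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    detected = 0
    for _ in range(trials):
        # Simulate the number of correct predictions of each model on one test set.
        correct_a = sum(rng.random() < acc_a for _ in range(n))
        correct_b = sum(rng.random() < acc_b for _ in range(n))
        pooled = (correct_a + correct_b) / (2 * n)
        std_err = (2 * pooled * (1 - pooled) / n) ** 0.5
        if std_err > 0 and abs(correct_a - correct_b) / n / std_err > z_crit:
            detected += 1
    return detected / trials


# Same true improvement (2 points), two hypothetical test set sizes.
power_small = simulated_power(n=250, acc_a=0.90, acc_b=0.92)
power_large = simulated_power(n=4000, acc_a=0.90, acc_b=0.92)
print(f"n=250:  power ~ {power_small:.2f}")
print(f"n=4000: power ~ {power_large:.2f}")
```

With the small test set, the real improvement goes undetected in most simulated experiments, which is what it means for a benchmark to be underpowered.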

2. ICLR 2021 round-up 🤖

ICLR 2021 submissions and reviews are available on openreview. Many of the papers that I found interesting relate to pre-trained models, fixing issues in fine-tuning, representation learning, and multilingual models.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale 🌄 Recently, self-supervised pre-training has made inroads into computer vision, among other domains. However, architectures based on ResNets have so far still reigned supreme, and for good reason: Compared to Transformers, convolutional neural networks have an inductive bias that promotes translation equivariance and locality, which is universally helpful for images. This paper is another data point in the tug-of-war between inductive bias and data scale. Their observation: "Large scale training trumps inductive bias". Specifically, a Transformer that is applied to image patches by treating them as tokens and pre-trained on a large amount of supervised data (300M images) outperforms state-of-the-art ResNets on many standard image tasks (ImageNet, CIFAR, etc). Another interesting finding is that such large-scale supervised pre-training outperforms self-supervised pre-training using a masked patch prediction objective similar to BERT.
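The idea of treating image patches as tokens can be sketched in plain Python. The function name `image_to_patches` and the dummy 224 x 224 input are assumptions for illustration; the sketch only covers the patch extraction and omits the learned linear projection and position embeddings that a full model would apply to each patch.

```python
def image_to_patches(image, patch_size=16):
    """Split an H x W x C image (nested lists) into a sequence of flattened patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten all pixel values inside the current patch into one vector.
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches


# A dummy 224 x 224 RGB image becomes (224 / 16)^2 = 196 "tokens",
# each a vector of 16 * 16 * 3 = 768 values.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
patches = image_to_patches(image)
print(len(patches), len(patches[0]))  # 196 768
```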

What Makes Instance Discrimination Good for Transfer Learning? 🏙 This paper studies more in-depth when self-supervised pre-training is superior to supervised pre-training on images. They find that visual contrastive learning—where positive examples are discriminated from negative examples, often parts of a different image—transfers mainly low-level and mid-level representations. In contrast, supervised pre-training mainly focuses on patterns that are discriminative across classes. They propose to improve supervised pre-training by learning to separate a class instance from its negatives rather than learning the same representation for each member of a class, which gives more flexibility. Overall, this paper highlights that when designing future pre-training objectives, we should not try to make the same prediction for each instance of a class but to allow the model more flexibility, e.g. by discriminating examples on an instance level.
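For readers less familiar with contrastive learning, the sketch below shows one common form of an instance-discrimination loss (often called InfoNCE): the query should be more similar to its positive than to any negative. The toy 2-d embeddings and the temperature value are assumptions for illustration, and this is a generic formulation, not necessarily the exact loss analysed in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce_loss(query, positive, negatives, temperature=0.1):
    """Cross-entropy of picking the positive among the positive and all negatives."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]


# Hypothetical embeddings: another view of the same image vs. views of other images.
query = [1.0, 0.0]
same_image_view = [0.9, 0.1]
other_images = [[0.0, 1.0], [-1.0, 0.0]]

loss_aligned = info_nce_loss(query, same_image_view, other_images)
loss_misaligned = info_nce_loss(query, [0.0, 1.0], [same_image_view, [-1.0, 0.0]])
print(loss_aligned < loss_misaligned)  # True
```

Note that no class labels appear anywhere: each instance only has to be told apart from other instances, which is the flexibility the paper argues for.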

On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines 📉 This paper is part of a line of recent papers that try to analyse and fix instabilities during fine-tuning of large pre-trained models. So far, many people had suspected that the large variance during fine-tuning runs was mainly due to small fine-tuning datasets. This paper analyses possible reasons in more depth. They find that the variance is mainly due to properties of the optimisation landscape that lead to vanishing gradients. To prevent this, they recommend using smaller learning rates and bias correction as well as simply training for longer. In non-pre-trained models, initialisation strategies typically help to prevent vanishing gradients early in training. Instead of having to worry about getting the fine-tuning just right, it would be nice if we could make the weights of a pre-trained model vanishing gradient–proof by subtly perturbing or normalising them.
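To see why bias correction matters early in training, here is a minimal single-parameter sketch of an Adam update with and without it. I am assuming that "bias correction" refers to the standard correction of Adam's moment estimates; the hyperparameter values and the gradient are hypothetical.

```python
import math


def adam_step(param, grad, m, v, step, lr=2e-5, beta1=0.9, beta2=0.999,
              eps=1e-6, bias_correction=True):
    """One Adam update for a single scalar parameter (step counts from 1)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    if bias_correction:
        # Both averages start at zero, so they are rescaled in early steps.
        m_hat = m / (1 - beta1 ** step)
        v_hat = v / (1 - beta2 ** step)
    else:
        m_hat, v_hat = m, v
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# First update from the same starting point, with and without bias correction.
corrected, _, _ = adam_step(1.0, grad=0.5, m=0.0, v=0.0, step=1)
uncorrected, _, _ = adam_step(1.0, grad=0.5, m=0.0, v=0.0, step=1,
                              bias_correction=False)
print(1.0 - corrected)    # close to the learning rate
print(1.0 - uncorrected)  # roughly 3x larger with these beta values
```

In this toy setting, dropping bias correction inflates the very first updates, which acts like a larger learning rate at the start of fine-tuning.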

Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models 📊 Multi-task learning has become a common technique in the ML and NLP toolkit but different tasks are typically still combined by weighting losses uniformly. Different adaptive loss weighting strategies have been proposed but none of these fared substantially better than the naive approach. This paper studies multi-task learning with language pairs as different tasks in the massively multilingual translation setting. They first observe that more similar gradients (evaluated with the same model across language pairs) correlate with better model quality. Based on this observation, they propose a method that adaptively makes gradients between tasks more similar during training and leads to improvements on translation and multi-task fine-tuning. Overall, I really like the emphasis on loss geometry, an under-appreciated aspect of multi-task learning, and hope we'll see more principled approaches to multi-task, multilingual optimisation.
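As a rough illustration of what "gradient similarity across language pairs" means, the sketch below compares flattened gradient vectors with cosine similarity. Both the choice of cosine similarity and the toy gradient values are assumptions on my part, and the sketch only covers the measurement, not the proposed method itself.

```python
import math


def cosine_similarity(grad_a, grad_b):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm_a = math.sqrt(sum(a * a for a in grad_a))
    norm_b = math.sqrt(sum(b * b for b in grad_b))
    return dot / (norm_a * norm_b)


# Hypothetical gradients of the same model computed on three language pairs.
grads = {
    "en-de": [0.20, -0.10, 0.40],
    "en-nl": [0.25, -0.05, 0.35],
    "en-zh": [-0.30, 0.20, 0.10],
}

sim_related = cosine_similarity(grads["en-de"], grads["en-nl"])
sim_distant = cosine_similarity(grads["en-de"], grads["en-zh"])
print(f"en-de vs en-nl: {sim_related:.2f}")
print(f"en-de vs en-zh: {sim_distant:.2f}")
```

A negative similarity means that an update that helps one language pair is expected to hurt the other, which is the kind of conflict that uniform loss weighting ignores.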

Language-Agnostic Representation Learning of Source Code from Structure and Context 👩‍💻 This paper studies multilinguality in the context of programming languages. They train a Transformer model with relative position embeddings (inspired by XLNet) on source code and language-agnostic features (distances in the Abstract Syntax Tree of the code) to predict the docstring of a function. The model is trained jointly on the data of multiple programming languages (Python, JavaScript, Go, and Ruby) and outperforms models trained on a single programming language. Overall, this is a nice result that shows that programming languages share many of the similarities that we also see across natural languages.

3. Toxicity detection 🐍

Identifying and removing toxic language and hate speech online is one of the biggest challenges for current NLP. Luckily, there are ongoing research efforts to address these problems, which are already finding their way into real-world applications at Facebook (building on XLM-R) and at StackOverflow (building on ULMFiT). The latter blog post in particular is a nice case study of what it takes to address such a problem in the wild—from estimating its impact, evaluating existing solutions, to building, iterating, and evaluating a model.

Toxicity may not just be expressed in text alone but can be multimodal. The Hateful Memes Challenge tasks models to identify the latter kind of toxicity in the form of 'mean memes' that are paired with benign confounders, minimal pairs where either text or image is replaced for a more amicable effect (see below).

Mean memes (left), benign image confounder (middle), and benign text confounder (right)

Toxicity is also an issue in a model's output: Generative models, e.g. for dialogue, have often been found to inadvertently generate text that is inappropriate or offensive. To this end, past winners of the Amazon Alexa Prize have implemented simple profanity filters or API-based strategies to detect profanity. Recently, Gehman et al. (2020) created a dataset of 100k naturally occurring prompts to test state-of-the-art language models (a demo is available here). They find that when exposed to an innocuous prompt such as "So, I'm starting to think she's full...", models like GPT-2 often generate toxic content. Overall, they find that no current method is failsafe against generating toxic output.

Current models in NLP learn from massive amounts of data. Better understanding what parts of their pre-training data lead to inappropriate predictions is thus one of the most important directions in order to make such models less toxic. Gehman et al. analyse the toxicity in GPT-2's training data and find that about 4.3% of the content is toxic.

Toxic content, however, is not the only output that may hurt people. Content that is upsetting or provides bad advice may have even worse consequences in the real world. Ehud Reiter highlights some examples in this blog post. Ultimately, we should not only be concerned with a model's worst-case generalisation but also with its worst-case generation performance. Particularly in these times, we want to support and not upset people. It is no surprise, then, that the latest winner of the Amazon Alexa Prize is a chatbot that tries to care for their conversation partner.

4. Data augmentation and adversarial examples 🧩

Data augmentation is successful in computer vision but its uses in NLP have been limited, except for the ubiquitous back-translation in machine translation. The main challenge for data augmentation in NLP is that generating diverse, semantically invariant perturbations is a lot more difficult for text than for images.

This survey by Amit Chaudhary provides a nice visual overview of most existing data augmentation methods for NLP, including lexical substitution, text-surface transformation, random noise injection, and many others. But why is it that such techniques are not used much in NLP? Longpre et al. (2020) test two popular data augmentation methods with recent pre-trained Transformer models. They find that these methods generally don't improve results. They hypothesise that common data augmentation methods confer the same benefits as pre-training, which makes them largely redundant with current methods. In the future, data augmentation may be most useful for settings where current models fail or that currently represent blind spots. Examples of such settings are negation or malformed input (Ettinger et al., 2019; Rogers et al., 2020).
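To give a flavour of the simplest methods in this family, here is a minimal sketch of two random noise injection operations, random deletion and random swap. The function names and the example sentence are my own, and these are not necessarily the exact methods tested by Longpre et al.

```python
import random


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p (never return an empty list)."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]


def random_swap(tokens, rng):
    """Swap the tokens at two randomly chosen positions."""
    tokens = list(tokens)
    if len(tokens) >= 2:
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


rng = random.Random(0)
sentence = "the movie was not good at all".split()
deleted = random_deletion(sentence, p=0.2, rng=rng)
swapped = random_swap(sentence, rng=rng)
print(" ".join(deleted))
print(" ".join(swapped))
```

The example sentence also hints at the core difficulty: if deletion happens to remove "not", the label of a sentiment example flips, so the perturbation is no longer semantically invariant.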

Another source of data augmentation is adversarial examples. A great way to get started is to use the TextAttack library, the CleverHans for NLP. The library provides implementations of numerous adversarial attacks and tools to train models with data augmentation, with built-in support for transformers models. A recent paper in this area that I particularly enjoyed is by Tan et al. (2020) who create adversarial examples by perturbing the inflectional morphology of words to probe for bias against English dialects in pre-trained language models. As pre-trained methods become more powerful, it will be key to ensure that they are robust to a wide range of possible variations in language.

NLP News
