EMNLP 2020 recap, Minimum viable datasets, Efficiency, Roguelikes for RL Research
I sincerely hope that this year was an extreme outlier and that 2021 will again be more in-distribution. I hope you have time to relax and spend time with your loved ones. On a personal note, growing up I never understood why "May you live in interesting times" (Interesting Times is the title of a book by a favourite author of mine) may be considered a curse. I think I have a better sense of it now.
While personal interactions have been limited mostly to the online setting this year, I am grateful for everyone I've had the chance to interact with or meet this year. Thank you for brightening my 2020. I hope I get to meet some of you in person again next year.
This newsletter includes a (very) brief recap of my EMNLP 2020 and a discussion of datasets that are minimally viable to evaluate the capabilities of a model. I highlight some interesting research areas related to efficiency and—on a more fun note—recent research in reinforcement learning that leverages roguelikes and procedural generation. I also cover miscellaneous blog posts and articles, among them a 385-page (!) ML compendium.
I really appreciate your feedback, so let me know what you love ❤️ and hate 💔 about this edition. Simply hit reply on the issue.
If you were referred by a friend, click here to subscribe. If you enjoyed this issue, give it a tweet 🐦.
EMNLP 2020 recap 🏛
EMNLP 2020 took place last month. Overall, the interactions in the virtual conference venue made it feel closer to an actual conference for me. I particularly enjoyed a number of serendipitous encounters in the corridors while trying to navigate the conference venue, roaming around the posters picking up on nuggets of conversation, and being able to have one-on-one conversations with poster authors.
I covered some of the papers that initially caught my eye in the last newsletter, so I'll share others' recommendations this time:
Gaurav Maheswari gives an extensive overview of papers focusing on bias detection and mitigation. He identifies three main types of bias that papers deal with: gender bias, political bias, and annotator bias. Regarding mitigation, papers either focus on mitigation with regard to a specific task or on debiasing embeddings in general.
Sabrina Mielke and Naomi Saphra shared recommendations on Twitter, which can be found here and here.
Minimum viable datasets 💽
One of the most important factors for making progress on an idea in ML is the speed of iteration, i.e. how long it takes to try a hypothesis or a set of hyper-parameters on a dataset and to obtain results. When we are still validating an idea and testing potential hypotheses, we would ideally like to work with a minimum viable dataset (MVD) for our setting, i.e. a dataset that is a) small so that models can be trained efficiently, b) diagnostic in that it can differentiate good from bad models, and c) representative of the capabilities that we'd like our models to learn in more realistic settings.
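As a concrete starting point, one simple way to construct such a dataset is to subsample an existing one while keeping every label represented, so that the small set stays diagnostic. A minimal sketch in plain Python (the data and sizes are hypothetical):

```python
import random
from collections import defaultdict

def stratified_subsample(examples, labels, n_per_class, seed=42):
    """Draw n_per_class examples per label so the small set keeps
    the full label inventory (criterion b: still diagnostic)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in zip(examples, labels):
        by_label[y].append(x)
    subset = []
    for y, xs in sorted(by_label.items()):
        picked = rng.sample(xs, min(n_per_class, len(xs)))
        subset.extend((x, y) for x in picked)
    rng.shuffle(subset)
    return subset

# Toy usage: 1,000 examples over 4 classes -> a 40-example subset
data = [f"text-{i}" for i in range(1000)]
labels = [i % 4 for i in range(1000)]
mvd = stratified_subsample(data, labels, n_per_class=10)
```

Fixing the seed keeps the subset stable across runs, so results on it remain comparable while you iterate.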
MNIST ✍️ MNIST is a classic minimum viable dataset in computer vision but is less used for validating recent approaches, as most models achieve 99+% accuracy on it. In addition, its inputs are 784-dimensional and thus require a non-trivial amount of computation. Popular more recent datasets, primarily used for meta-learning, are mini-ImageNet, a down-sampled version of a subset of ImageNet classes, and a number of other datasets used in Meta-Dataset. Sam Greydanus also recently proposed the MNIST-1D dataset in this blog post as a more efficient, minimum viable alternative to MNIST.
SQuAD and MultiNLI 🙋♀️ In this context, I think it is interesting to consider what minimum viable datasets exist for current NLP models. What dataset do you turn to when you quickly want to validate an idea? In my impression, SQuAD and MultiNLI have taken on this role for pre-trained models to some extent. Good performance on them demonstrates that a model has learned certain things about natural language such as a broad understanding of semantics. However, both are far from efficient to train.
Beyond these, minimum viable datasets may often be task-specific. Some common datasets are not challenging or realistic enough to differentiate between classic and current methods: On MLDoc, a cross-lingual document classification dataset, word embedding-based approaches and deep models achieve similar performance (Artetxe et al., 2020) while n-gram and deep models perform similarly on public test sets for language identification (Caswell et al., 2020).
Overall, I think it's worthwhile to think more about how we can test for certain capabilities of natural language understanding in an efficient and minimum viable way. Subsampling existing datasets, leveraging analysis methods (see this ACL 2020 tutorial), and evaluating sample efficiency (Yogatama et al., 2019) are all viable options. Given how expensive pre-training is, being able to diagnose model performance early is particularly important. In other words, what is the MNIST for pre-training?
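For instance, evaluating sample efficiency can be as simple as training on nested subsets of increasing size and recording the score at each size. A rough sketch (the majority-class "model" and the data below are placeholders, not a real NLP setup):

```python
import random
from collections import Counter

def learning_curve(train_fn, eval_fn, train_set, sizes, seed=0):
    """Train on nested subsets of increasing size and record the
    evaluation score at each size (a basic sample-efficiency probe)."""
    rng = random.Random(seed)
    shuffled = train_set[:]
    rng.shuffle(shuffled)
    curve = []
    for n in sizes:
        model = train_fn(shuffled[:n])
        curve.append((n, eval_fn(model)))
    return curve

# Placeholder "model": a majority-class baseline over (x, y) pairs.
def train_majority(pairs):
    return Counter(y for _, y in pairs).most_common(1)[0][0]

data = [(i, "pos" if i % 3 else "neg") for i in range(300)]

def eval_acc(pred):  # accuracy of the constant prediction
    return sum(pred == y for _, y in data) / len(data)

curve = learning_curve(train_majority, eval_acc, data, sizes=[10, 50, 200])
```

A model that climbs the curve faster is more sample-efficient; comparing curves rather than single scores makes the comparison cheap to run on small subsets.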
Efficiency 🌱

Efficiency was important in the previous section, so let's dig a bit deeper into progress on efficiency in 2020. In ML research, there will always be work that seeks to leverage scale and to grow models bigger. This year, it felt like such progress was nicely juxtaposed with progress in making models smaller.
Reporting energy efficiency ⚡️ An important paper early in the year was one by Henderson et al. (2020), which introduced a tool that calculates the total energy of an experiment—building on the power usage effectiveness values used by Strubell et al. (2019). The latter paper appeared only last year but has already gathered 445 citations (according to Google Scholar), demonstrating substantial interest in efficiency. Henderson et al.'s tool can be plugged into an experiment and enables calculating its energy efficiency on the fly. I hope that reporting such numbers will become standard going forward.
Efficient Transformers 🤖 Within the broader topic of efficiency, another active research area has been the development of efficient Transformer models. For a great overview of this topic, check out this survey on efficient Transformers. For an empirical comparison, have a look at the Long Range Arena (see below; note: I'm a co-author).
Workshops and competitions 🥇 Another great development was the emergence of relevant workshops and competitions. The SustaiNLP workshop organized a shared task at EMNLP 2020, which evaluated models in terms of their energy consumption (following the Henderson et al. paper above) and their SuperGLUE score. The Efficient QA competition at NeurIPS 2020 evaluated open-domain question answering models based on the model size. Remarkably, there is a system smaller than 30 MB, which achieved >25% accuracy on the task—with the best systems achieving ~53% and weighing in around 5–6 GB.
Roguelikes for RL Research 👾
or: The Promise of Procedural Generation
According to Wikipedia,
Roguelike (or rogue-like) is a subgenre of role-playing video games characterized by a dungeon crawl through procedurally generated levels [...].
Compared to deterministic environments for reinforcement learning (RL), procedurally generated environments are interesting as agents are forced to generalise and are not able to memorise sequences of actions that lead to previously visited states (see e.g. Go-Explore on Montezuma's Revenge).
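To make this concrete, here is a tiny, purely illustrative sketch of seeded procedural generation (the grid format and parameters are made up, not those of any real environment): a fixed seed reproduces the same level, while a new seed yields a new layout, so a memorised action sequence from one episode is useless in the next.

```python
import random

def generate_level(seed, width=8, height=8, wall_prob=0.25):
    """Deterministically generate a grid level from a seed:
    '#' walls, '.' floor, with start 'S' and goal 'G' corners."""
    rng = random.Random(seed)
    grid = [['#' if rng.random() < wall_prob else '.'
             for _ in range(width)] for _ in range(height)]
    grid[0][0] = 'S'
    grid[height - 1][width - 1] = 'G'
    return [''.join(row) for row in grid]

# Same seed -> identical level; different seed -> a different layout,
# so the agent has to generalise rather than replay a fixed trajectory.
level_a = generate_level(seed=1)
level_b = generate_level(seed=2)
```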
NetHack 🧝♂️ A recent environment in this line of work is the NetHack Learning Environment by Facebook researchers, a learning environment for the roguelike NetHack. The goal of NetHack is to ascend, i.e. to attain the status of a demigod by offering the Amulet of Yendor to a deity. That sounds like your typical RPG but is arguably a more compelling objective than, say, hitting bricks with a ball. NetHack has complex RPG elements and dynamics as well as actions and effects displayed in natural language. These components make it an interesting learning environment for anyone working on RL + NLP.
Slay the Spire 🧘♂️ Another roguelike that has been very popular is Slay the Spire, which blends roguelike progression with a deck-building card game. Slay the Spire would make for a similarly challenging learning environment as humans have achieved high performance but optimal play is still very challenging. If you want to dive deeper into this game, the community has recently released a 27 GB dataset of 77 million runs, which can be found here and could be used for training models via imitation learning.
Procedurally generated environments are difficult to master due to the sheer number of possible combinations; a model will likely never encounter the exact same situation twice. Slay the Spire has 1.84∗10^19 possible seeds, which still pales in comparison to the 10^300 possible conformations of a typical protein (as estimated by Levinthal). Arguably, one of the few domains that may be even more challenging is natural language, which, as is well known, makes "infinite use of finite means".
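Incidentally, the seed-count figure lines up exactly with a 64-bit seed space, which is easy to check:

```python
# 1.84 * 10^19 possible seeds is just the size of a 64-bit seed space:
seeds = 2 ** 64
print(seeds)           # 18446744073709551616
print(f"{seeds:.2e}")  # 1.84e+19
```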
Miscellaneous blog posts 📑
ML & DL Compendium 2017-2020 📚 Ori Cohen has been curating notes on a myriad of topics related to machine learning over the last three years. You thought David Abel's conference notes were extensive? Check out this ~385-page document about all things ML. If you don't know where to start on any ML-related topic (see the partial table of contents above), it's a great jumping-off point. It's also great for skimming: go through it until something catches your eye. Even on a brief skim, I found many resources on semi-supervised learning, active learning, calibration, etc. that I wasn't aware of. The notes on probability and information theory are also good for revision before ML interviews.
Learn Natural Language Processing the practical way 👩🎓 Vered Zimmerman reached out to me after this newsletter. She pointed out that many of the resources that tend to get shared to help people get started with NLP are of little help, as each item, such as the popular Stanford NLP course, takes weeks or months of dedicated work. As most of the value that NLP brings in practice also requires domain expertise, Vered charts a different, more practically oriented course. It initially teaches you how to scrape text and experiment with it using spaCy. Through shallow and subsequently deeper reading, you learn about different concepts, tasks, and libraries. The learning culminates in your first small project. That, however, is only the start of your NLP journey.
Experiments with the ICML 2020 Peer-Review Process ✍️ ICML 2020 conducted a number of studies and analyses of its reviewing process. Specifically, they investigated the following topics:
Resubmission bias. Q: Are reviewers biased when they know that the paper they are reviewing was previously rejected from a similar venue? Findings: Yes. Reviewers give a score that is almost one point lower.
Herding Effects in Discussions. Q: Does the final decision of the paper depend on the order in which reviewers join the discussion? Findings: No.
Novice Reviewer Recruiting. Q: Can junior researchers be recruited as reviewers without compromising the quality of the process? Findings: Yes. They often write reviews of even higher quality.
You Can’t Escape Hyperparameters and Latent Variables: Machine Learning as a Software Engineering Enterprise 🧑🎄 If you watch one thing from NeurIPS 2020, then make it this keynote. Charles Isbell and Michael Littman take you on A Christmas Carol trip that sheds light on the risks of ML algorithms—past, present, and future. The talk features guest appearances by many ML researchers who discuss problems related to bias in algorithms and how to mitigate them.
Seven recommendations for machine translation evaluation 🗣 Matthias Müller makes seven recommendations for MT evaluation in a series of blog posts. Many of them, such as those on evaluating with realistic low-resource datasets, are also relevant for other domains:
When reporting BLEU scores, compute BLEU with translations that are fully postprocessed. Leave tokenization to the standard BLEU tool.
Train several models of the same kind with different random seeds. Report and discuss the variance.
Work on realistic low-resource data sets instead of simulated ones.
Use the most recent training and test sets available for a particular language pair and domain.
Don’t rely too much on statistical significance testing for automatic metrics. Focus on gains in automatic translation quality beyond standard deviation, or ideally human evaluation.
When designing your human evaluation, follow current best practices, for instance the ones laid out in Läubli et al. (2020).
Do not succumb to the urge of ignoring best practices just for the sake of comparing to previous work.
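The second recommendation above, reporting variance across random seeds, can be as lightweight as a mean and standard deviation over a handful of runs. A minimal sketch (the BLEU scores below are made up):

```python
import statistics

# Hypothetical BLEU scores from five identical runs with different seeds
bleu_scores = [27.4, 27.9, 26.8, 27.5, 27.1]

mean = statistics.mean(bleu_scores)
stdev = statistics.stdev(bleu_scores)  # sample standard deviation
print(f"BLEU: {mean:.1f} +/- {stdev:.1f} (n={len(bleu_scores)})")
# BLEU: 27.3 +/- 0.4 (n=5)
```

Reporting the spread alongside the mean makes it immediately visible whether a claimed gain (say, +0.3 BLEU) exceeds run-to-run noise.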