QA—how did we get here? 🙋♀️, Adapting to time ⏳, Data detectives 🕵️♀️
A struggle for me these days is the lack of human connection and interaction. I think this is especially challenging for students without an established network. The last newsletter highlighted some resources on how to connect with others during these isolating times. In the same spirit, here are some upcoming virtual events that allow you to connect with like-minded people—the first two mainly focusing on students:
6–9 May: StuTS + TaCoS (Joint Student Conference on Linguistics and Computational Linguistics). Registration deadline: 19 April.
26 July – 13 August: ESSLLI (European Summer School in Logic, Language, and Information). Deadline for submission to the student session: 4 April.
7–8 October: TTO (Conference for Truth and Trust Online). Paper submission deadline: 30 July. Talk proposal deadline: 13 August.
Are there any other accessible events that you would like me to share?
This newsletter covers three high-level topics: a brief history of question answering (QA) including lessons we can learn from QA pre-Deep Learning; how we can design models that can adapt to new words; and some excellent data detective work on multilingual corpora.
I really appreciate your feedback, so let me know what you love ❤️ and hate 💔 about this edition. Simply hit reply on the issue.
If you were referred by a friend, click here to subscribe. If you enjoyed this issue, give it a tweet 🐦.
Question answering—how did we get here? 🙋♀️
A journey of shooting for the moon—and answering questions about moons and moonshots in the process.
Writing this section, I realised that there is already a lot of high-quality information on the topic available. My favourite resource: The Open-Domain QA Tutorial at ACL 2020. I hope I can nevertheless provide some new insights in a (relatively) condensed format.
Question answering (QA) is one of the most universally useful applications of NLP and omnipresent in services such as Google Search and virtual assistants. It is also perhaps one of the tasks that is closest to artificial general intelligence. As such, QA has influenced many representations of AI in popular culture such as Douglas Adams' Deep Thought, which not only gave us the number 42 but arguably inspired many "Deep" successors. As NLP capabilities have grown more powerful in recent years, question answering has also grown in importance. Among individual NLP tasks, it is only rivalled by machine translation in terms of the number of conference submissions (see the EMNLP 2020 opening remarks).
The first time I was amazed by the QA capabilities of AI was when IBM Watson defeated its human opponents in the popular quiz show Jeopardy! in convincing fashion. Since then, QA has gone from strength to strength 💪. So let's explore together how QA developed, from meticulously hand-crafted parsers to the giant black-box models of today. You can see a rough overview of this trajectory in the figure below.
Similar to information extraction, the unrestricted open-domain setting proved too challenging in the early days of QA research. Consequently, most early systems focused on closed, well-defined domains. Baseball (Green et al., 1961) was an early system that could answer questions about, well, baseball. Later, LUNAR (Woods et al., 1972; Woods, 1978), developed in collaboration with NASA, answered questions about moon rocks and soil gathered by the Apollo 11 mission. Both systems relied on parsing a question into a query language that was then executed on a database—similar to today's semantic parsing methods. While Baseball used a simple dictionary for analysis, LUNAR's parser was hand-engineered and expensive to build. In 1978, Wendy Lehnert outlined a theory of question answering (I couldn't find a free version of the book, but you can read reviews for the high-level ideas) that detailed core components of a QA system: 1) different knowledge sources; 2) algorithms that generate inferences based on the text and map the question to a question type; 3) a parser; and 4) a taxonomy of question types. The systems at the time, however, fell short of executing on this grand vision and failed to combine these components in a general setting.
Another notable example was WOLFIE (WOrd Learning From Interpreted Examples; Thompson and Mooney, 1998), which used a heuristic algorithm to learn a semantic lexicon for geographical database queries from example input–output pairs—a step closer to the learning algorithms of today. In 1999, TREC, the competition series that revitalised information retrieval, introduced a question answering track. TREC-8 brought question answering from the area of semantic parsing closer to information retrieval (IR): given a document collection of news articles and a question, a system must identify the document where the answer occurs and the corresponding answer string. The best systems were able to answer about two-thirds of all questions. This may seem striking, but the questions were mostly short and fact-based. Systems generally followed a multi-step process: they 1) identified the type of the answer based on the question word ("who", "when", "where", etc.); 2) used IR to retrieve relevant documents based on question similarity; 3) performed a shallow parse of the documents; and 4) detected entities of the corresponding type in the context; if no entity was found, they fell back on heuristics. This general combination of IR + document processing is still at the heart of open-domain QA systems today (Chen et al., 2017).
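The four-step TREC-era recipe can be sketched in a few lines. This is a toy illustration only: the term-overlap retriever and the regex "entity detector" stand in for the real IR engines and shallow parsers those systems used, and all names here are my own.

```python
import re

# Step 1: map the question word to a coarse answer type.
def answer_type(question):
    q = question.lower()
    if q.startswith("who"):
        return "PERSON"
    if q.startswith("when"):
        return "DATE"
    return "OTHER"

# Step 2: retrieve the document most similar to the question
# (simple term overlap stands in for a real IR engine).
def retrieve(question, documents):
    q_terms = set(re.findall(r"\w+", question.lower()))
    def score(doc):
        return sum(1 for t in re.findall(r"\w+", doc.lower()) if t in q_terms)
    return max(documents, key=score)

# Steps 3-4: "parse" the document shallowly and return the first
# entity of the expected type (regexes stand in for an entity tagger).
PATTERNS = {
    "DATE": r"\b(?:1[0-9]{3}|20[0-9]{2})\b",
    "PERSON": r"\b[A-Z][a-z]+ [A-Z][a-z]+\b",
}

def answer(question, documents):
    doc = retrieve(question, documents)
    pattern = PATTERNS.get(answer_type(question), r"\b[A-Z][a-z]+\b")
    match = re.search(pattern, doc)
    return match.group(0) if match else None  # no heuristic fallback here

docs = [
    "Apollo 11 landed on the Moon in 1969.",
    "Neil Armstrong was the first person to walk on the Moon.",
]
print(answer("When did Apollo 11 land on the Moon?", docs))            # 1969
print(answer("Who was the first person to walk on the Moon?", docs))  # Neil Armstrong
```

Real TREC systems, of course, used far richer answer-type taxonomies and fell back on heuristics when no entity of the right type was found.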
In the early 2000s, Mulder (Kwok et al., 2001) and AskMSR (Brill et al., 2002) broke with this tested formula; instead of using sophisticated linguistic analyses, they made multiple queries to a search engine, trusting that the redundancy of information on the web would allow them to find the correct answer. Among the returned snippets, n-grams of the relevant answer type were filtered, grouped, and scored to produce the final answer. On the whole, such systems are conceptually closer to current systems that perform question rewriting (Elgohary et al., 2019) or that reformulate a question to elicit the best possible answers (Buck et al., 2018).
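The redundancy idea is easy to demonstrate: n-grams that recur across many snippets (and that don't merely echo the question) are likely answers. The sketch below is my own simplification of this scoring scheme, not the actual AskMSR implementation, which additionally filtered candidates by answer type.

```python
import re
from collections import Counter

def ngrams(text, n):
    tokens = re.findall(r"\w+", text)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def redundancy_answer(snippets, question, max_n=3):
    # Collect candidate n-grams from all snippets, dropping those
    # that consist entirely of words from the question itself.
    q_terms = {w.lower() for w in re.findall(r"\w+", question)}
    counts = Counter()
    for snippet in snippets:
        for n in range(1, max_n + 1):
            for gram in ngrams(snippet, n):
                if all(w.lower() in q_terms for w in gram.split()):
                    continue
                counts[gram] += n  # weight longer n-grams more heavily
    return counts.most_common(1)[0][0]

snippets = [
    "Mount Everest is the highest mountain on Earth.",
    "The highest mountain is Mount Everest.",
    "Everest, the highest peak, straddles Nepal and China.",
]
print(redundancy_answer(snippets, "What is the highest mountain?"))  # Mount Everest
```

The answer falls out of sheer repetition across snippets, with no parsing at all, which is exactly what made these systems so surprising at the time.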
In 2011, IBM Watson introduced a much more elaborate QA pipeline, which can be seen above. Importantly, after an initial set of candidate answers was generated, various algorithms and sources were used to score each candidate answer. While evidence aggregation has been used in some recent systems (Wang et al., 2018), current state-of-the-art methods generally do not rerank hypotheses based on additional evidence.
Starting in 2013, researchers proposed to frame QA as a supervised learning problem in a reading comprehension setting: given a document, the system was tasked to answer multiple questions about the document. Notable datasets in this format are MCTest (Richardson et al., 2013), a set of fictional stories with multiple-choice questions; the CNN / Daily Mail datasets (Hermann et al., 2015), automatically constructed from news articles and their summaries; and the Stanford Question Answering Dataset (SQuAD; Rajpurkar et al., 2016), based on Wikipedia. The latter two established span extraction—using a neural network to identify the span in the document that answers the question—as the dominant methodology in this setting.
Later analyses revealed that systems trained on CNN / Daily Mail and SQuAD learned to exploit simple heuristics based on how the questions are formulated (Chen et al., 2016; Weissenborn et al., 2017; Jia & Liang, 2017)—not unlike the answer type matching heuristics of their TREC predecessors.
Later datasets such as MS MARCO (Bajaj et al., 2016) and Natural Questions (Kwiatkowski et al., 2019) rely on real anonymised queries from Bing and Google search and ask annotators to find answers in retrieved documents. As such information-seeking questions are asked without reference to a target document, biases stemming from lexical overlap between the question and the answer paragraph can be avoided. While such annotation is more laborious (many questions are not answered by the retrieved documents), the resulting examples reflect a more realistic open-domain setting that goes beyond the TREC QA setting of prior work. As real search queries are hard to come by, subsequent datasets such as TyDi QA (Clark et al., 2020) instead ask annotators to generate open-domain questions based on a prompt about a specific topic.
Other important recent directions are a) to make QA conversational where questions also depend on the previous context (Choi et al., 2018; Reddy et al., 2019); b) to combine evidence from multiple documents (Yang et al., 2018; Welbl et al., 2018); c) multilingual QA (Clark et al., 2020; Artetxe et al., 2020; Lewis et al., 2020); and d) generating longer answers (Krishna et al., 2021). Methodologically, the most recent advance is to combine retrieval with learned models end-to-end and to generate answers rather than just extract them.
Overall, while QA research has come a long way, current models still struggle with the hardest settings: Answering adversarial questions (Bartolo et al., 2020) in reading comprehension as well as information-seeking questions in the open-domain QA setting (SOTA on identifying the answer span in Natural Questions is only 0.64 F1, a far cry from the SOTA of 0.93 F1 on SQuAD).
I think we can take inspiration from the history of QA to design models that a) deal with different answer types more explicitly; b) use multiple pieces of evidence to reweigh hypotheses; and c) leverage various complementary sources of knowledge.
Adapting to time ⏳
The world is continually changing. So should our models if we expect them to be useful in real-world applications.
Language provides a window into our culture. One particularly compelling way to view culture is through the words that are unique to a language and difficult to translate. These are words like the Danish hygge, a mood of coziness and comfort 🤗, the Portuguese saudade, a sense of nostalgic longing 😔, or the German Schadenfreude, taking pleasure in someone else's misfortune 😬—I contest that the latter is uniquely or even predominantly German (edit: a Japanese slang word for Schadenfreude is meshiuma (メシウマ); thanks for the suggestion!).
Words tell a story of the cultural and social context that gave rise to them. When our existing vocabulary cannot adequately represent a changing world, we create new words. Most recently, the pandemic led to many new and colourful creations, such as Covidiot, a person who deliberately flouts government restrictions. The Leibniz Institute for the German Language has catalogued more than 1,200 new German words that emerged during the pandemic, including many relatable ones such as overzoomed (stressed by too many Zoom calls) and Impfneid (envy of those who have been vaccinated). Are there any new words that emerged in the pandemic in your language? You can share them in a reply to the tweet below.
Living in the pandemic, we can easily grasp the meaning of these words. But can we expect the same from our language models that have been trained on outdated data and may never have seen a mention of COVID-19?
In a recent paper, Hu et al. (2020) investigate this problem in the context of language modelling on Twitter data. They release datasets covering 100M tweets from 1M users over six years. They find it particularly useful to adaptively control the number of gradient steps based on online validation data. In this way, the model is updated more often if it fails to generalise to new data.
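The core idea, taking more update steps when recent data shows the model failing to generalise, can be illustrated with a deliberately tiny example. This is my own toy sketch of the principle, not the authors' algorithm: the "model" is a single parameter tracking a drifting target, and the step count scales with the online error.

```python
# Toy sketch: adaptively control the number of gradient steps based
# on online "validation" error. The model is one parameter theta
# fitting a stream of scalars whose distribution drifts over time.
def online_adapt(stream, lr=0.3, tol=0.5, max_steps=5):
    theta = 0.0
    for x in stream:
        error = abs(x - theta)  # error on the newly arrived data
        # Take more steps the worse the model generalises, capped.
        steps = min(max_steps, max(1, int(error / tol)))
        for _ in range(steps):
            theta -= lr * (theta - x)  # gradient of 0.5 * (theta - x)**2
    return theta

# The data distribution shifts from around 0 to around 10,
# mimicking a sudden change in the world (e.g. a new topic).
stream = [0.1, -0.2, 0.0, 10.2, 9.8, 10.0]
print(online_adapt(stream))
```

With a fixed single step per example, the model would lag far behind the shift; scaling the step count by the observed error lets it catch up within a few examples.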
Overall, in NLP and ML more broadly, we are far too used to a static evaluation paradigm that assumes that nothing about our data changes. Evaluating on this particular type of out-of-distribution data is crucial to enable us to deploy and interact with our systems in the real world.
Data detectives 🕵️♀️
There is no better data than better data.
—Based on tweets from Rada Mihalcea and Marco Guerini
Data is king in machine learning. The unreasonable effectiveness of data has been heralded in the past. More recently, researchers realised the bitter lesson that the scale of data is what ultimately matters most. However, what often goes unsaid is the importance of good data.
Garbage in, garbage out is a familiar concept in computer science: flawed input data leads to flawed outputs. In current ML and NLP, the equivalent concept may be bias in, bias out. There is useful bias, such as the inductive bias that can be encoded via data augmentation. In general, however, bias in the input data not only leads to biased model predictions but may even be amplified by the model.
In NLP, we are painfully aware of this issue. There is a long line of recent papers that have analysed biases in recent models (Blodgett et al., 2020; Shah et al., 2020). The first step in analysing the biases of our models is to look at the input data. Some of my favourite papers take such a data-first approach. For instance, Chen et al. (2016) thoroughly examine the CNN / Daily Mail datasets. Their findings: the datasets are easier than previously thought and can be bested by a simple feature-based model. Given the outsize importance of data, however, such detective work is rare in practice.
As our datasets grow larger, sleuthing becomes even more arduous. Gone are the days when you could hope to inspect every training example. How do you even make sense of 750 GB of text (the amount of data used to train T5)? Recent analyses of pre-training data increasingly rely on automatic classifiers, e.g. of toxicity (Gehman et al., 2020). Identifying biases where such classifiers are unavailable or perform poorly still requires human eyes. And if such analysis is challenging in English, consider how much harder it is for data in hundreds of languages.
Caswell et al. (2021) recently embarked on such an ambitious endeavour. Assembling an exceptionally diverse team of 46 volunteers speaking 41 languages, they performed a manual audit of 231 language-specific subsets of large corpora that have been used to train multilingual models, including the multilingual version of C4 (Xue et al., 2020). Annotating 100 lines in each subset according to whether a sentence is an incorrect translation, in the wrong language, or non-linguistic content, they arrived at many startling observations:
In the automatically aligned WikiMatrix (Schwenk et al., 2019), two-thirds of audited samples were misaligned on average.
CCAligned (El-Kishky et al., 2020), OSCAR, and WikiMatrix suffer from severe quality issues.
12% of the languages apparently covered by JW-300 (Agić & Vulić, 2019) are labelled as sign languages but are in fact just high-resource languages that have been mislabelled.
While some of these problematic samples can be filtered out based on length ratio, LangID, or TF-IDF wordlists (Caswell et al., 2021), there is no easy fix. The authors instead recommend documenting such issues and not releasing datasets with low percentages of in-language content.
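To make the filtering heuristics concrete, here is a minimal sketch of two of them: a length-ratio check for parallel sentence pairs and a crude character-inventory "LangID". Both functions are my own toy stand-ins; real pipelines use trained language-ID classifiers and tuned thresholds.

```python
# Length-ratio filter: translation pairs whose lengths differ wildly
# are likely misaligned.
def length_ratio_ok(src, tgt, max_ratio=2.0):
    ratio = max(len(src), len(tgt)) / max(1, min(len(src), len(tgt)))
    return ratio <= max_ratio

# Crude "LangID" for a Latin-script target corpus: flag sentences
# whose letters are mostly outside the Latin alphabet. A real system
# would use a trained language-ID classifier instead.
LATIN = set("abcdefghijklmnopqrstuvwxyz")

def looks_latin(text, threshold=0.8):
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return False
    return sum(c in LATIN for c in letters) / len(letters) >= threshold

pairs = [
    ("Guten Morgen", "Good morning"),                                 # plausible pair
    ("Hallo", "This sentence is far too long to be a translation."),  # bad length ratio
    ("Привет", "Hello"),                                              # wrong script on the source side
]
kept = [(s, t) for s, t in pairs
        if length_ratio_ok(s, t) and looks_latin(s) and looks_latin(t)]
print(kept)  # [('Guten Morgen', 'Good morning')]
```

As the audit shows, such surface heuristics catch only the grossest errors, which is precisely why the authors argue for documentation over silent filtering.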
On the monolingual side, the C4 dataset used for training T5 (Raffel et al., 2020) was recently made a lot easier to download. In summary, I hope the above studies as well as the availability of this data will inspire new analyses that help broaden our understanding of what goes into our murky piles of linear algebra.