All times are Pacific (PDT).

  • 8:00-8:30 Virtual Breakfast. Informal groups

  • 8:30-8:45 Intro Talk

  • 8:45-10:15 Invited Talks #1: Scalable, Multimodal NLP (I)

    • 8:45-9:15 NLP Intelligence for Microsoft Outlook and Teams (Mei-Yuh Hwang)

I will talk about our work to improve productivity by helping users complete tasks more easily and quickly in Outlook and Teams. This includes smart email search, people search, file search, smart reply, and smart compose, among others. Internationalization is achieved via transfer learning on language-agnostic representations such as mBERT, XLM, and InfoXLM, with machine-translated training data. More delightful experiences will be added to Microsoft Office products, including SharePoint. Feedback is highly appreciated.

    • 9:15-9:45 From Disembodied to Embodied Multimodal Learning (Dhruv Batra)

I will describe recent and ongoing research in embodied language understanding. Consider a navigation instruction such as 'Walk down the stairs and stop at the brown sofa'. Following this instruction requires embodied AI agents (robots and VR assistants) to ground scene elements referenced via language (e.g. 'stairs') to visual content in the environment (pixels corresponding to 'stairs'). We ask the following question -- can we leverage abundant 'disembodied' web-scraped vision-and-language corpora (e.g. Conceptual Captions) to learn visual groundings (what do 'stairs' look like?) that improve performance on a relatively data-starved embodied perception task (Vision-and-Language Navigation)? Specifically, we develop VLN-BERT, a visiolinguistic transformer-based model for scoring the compatibility between an instruction ('...stop at the brown sofa') and a sequence of panoramic RGB images captured by the agent. I will also discuss new capabilities in Habitat, a scalable photo-realistic simulator from FAIR, to support grounded language experiments.

    • 9:45-10:15 NLP-enabled Design Assistance for Visual Communication (Thamar Solorio)

Content authoring and design refers to the interdisciplinary research space that includes Graphic Design and Artificial Intelligence fields such as NLP and machine learning. This area addresses open problems in leveraging AI-empowered models to assist users during creation by modelling the author/audience needs so that the outcome is aesthetically appealing and effectively communicates its intent. Research related to this space is emerging, and we have seen significant improvements in various platforms for generating, formatting, and editing digital text. During this talk, I will present recent efforts in this space. In particular, I will discuss our recent results on predicting word emphasis in short texts. For textual content, word emphasis is used as a powerful tool to better convey the desired meaning of the written text to the audience. In addition, whether on flyers, posters, ads, social media posts or motivational messages, emphasis is usually designed to grab the viewer's attention by being distinct from the rest of the design elements. Due to the subjective nature of the task, where multiple appropriate solutions may exist, we formulate this task as label distribution learning. I will discuss the advantages and disadvantages of this formulation and will conclude with a brief overview of our ongoing work on a related task, emphasis prediction in presentation slides. My goal is to motivate more research in this exciting and relatively unexplored problem space.
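
As a rough illustration of the label distribution learning formulation mentioned above, the sketch below turns several annotators' votes for each word into a target distribution and scores a prediction with KL divergence. The label set, the example votes, and the function names are hypothetical, and KL divergence is only one plausible training signal; the abstract does not specify the model or loss actually used.

```python
import math
from collections import Counter

# Hypothetical two-way label set; the abstract does not list the labels used.
LABELS = ("emphasis", "no_emphasis")

def label_distribution(votes):
    """Turn one word's annotator votes into a probability distribution over LABELS."""
    counts = Counter(votes)
    total = len(votes)
    return [counts[label] / total for label in LABELS]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); labels with zero target mass contribute nothing."""
    return sum(
        t * math.log(t / max(p, eps))
        for t, p in zip(target, predicted)
        if t > 0
    )

# Hypothetical votes from four annotators for the words of "Never give up".
votes_per_word = {
    "Never": ["emphasis", "emphasis", "emphasis", "no_emphasis"],
    "give": ["no_emphasis", "no_emphasis", "emphasis", "no_emphasis"],
    "up": ["emphasis", "emphasis", "no_emphasis", "no_emphasis"],
}

# Instead of a single majority label, each word keeps the full vote distribution.
targets = {word: label_distribution(votes) for word, votes in votes_per_word.items()}

# A model that predicts a uniform distribution is penalised where annotators agree.
uniform = [0.5, 0.5]
losses = {word: kl_divergence(target, uniform) for word, target in targets.items()}
```

Keeping the full distribution means a word on which annotators split evenly (here "up") is treated as genuinely ambiguous rather than forced into one class.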

  • 10:15-10:30 Break. Informal groups

  • 10:30-12:00 Invited Talks #2: Privacy, Safety, Evaluation, Explainability

    • 10:30-11:00 Dialogue Modeling Via Hash Functions: Applications to Psychotherapy (Irina Rish)

We propose a novel dialogue modeling framework using kernelized hashcodes as compressed text representations; unlike traditional deep learning models, this framework works well on relatively small datasets, while also scaling to large ones. We also derive a novel lower bound on mutual information, used as a model-selection criterion favoring representations with better alignment between the utterances of participants in a collaborative dialogue setting, as well as higher predictability of the generated responses. As demonstrated on several real-life datasets, including psychotherapy sessions, the proposed approach significantly outperforms several state-of-the-art neural network based dialogue systems, both in computational efficiency (training time is reduced from days or weeks to hours) and in response quality, achieving an order of magnitude improvement over competitors in frequency of being chosen as the best model by human evaluators.

    • 11:00-11:30 Why Do Good People Design Bad Chatbots? How Can We Help? (Tom Yeh)

One of the proudest accomplishments of the NLP research community is the democratization of tools, algorithms and datasets to allow anyone to easily create interactive natural language applications such as chatbots. By making it easy for anyone to create a chatbot, however, we inadvertently also made it easy for anyone to create a "bad" chatbot. Never were so many chatbots made by so many people with good intentions but with poor results. Why? In this talk, I argue that the reason is that, as we invent and give people tools to create, we forget to also give them tools to test and measure the quality of their creations. If people can't measure it, they can't improve it. To support my reasoning, I will share three insights. First, about the status quo, we do already have tools for testing chatbots for important properties such as relevance, safety, privacy, and empathy. But only experts know how to use those testing tools. Second, about our path forward, as a research community, we need to greatly improve the usability of our testing tools using human-centered research methodologies to allow anyone to test their chatbots easily. As an example, I will present our prototype of such a testing tool, which can automatically provide feedback on a chatbot's quality and offer recommendations to improve it. Third, about the rewards, once good testing tools become available, we can expect to see improvement in the quality of the chatbots created by people, now that people have a way to measure it. To test this hypothesis, we ran a study comparing 10 chatbots designed with and without our testing tool. Our findings suggest that having a testing tool indeed helped people improve the quality of their chatbots. To end, I will raise two questions for the audience to consider. What tools are you already using to test your systems? What can you do to share those tools and make it easy for others to benefit from them?

    • 11:30-12:00 Evaluating and Testing Natural Language Processing Models (Sameer Singh)

Current evaluation of the generalization of natural language processing (NLP) systems, and much of machine learning, primarily consists of measuring the accuracy on held-out instances of the dataset. Since the held-out instances are often gathered using a similar annotation process to the training data, they include the same biases that act as shortcuts for machine learning models, allowing them to achieve accurate results without requiring actual natural language understanding. Thus held-out accuracy is often a poor proxy for measuring generalization, and further, aggregate metrics have little to say about where the problem may lie. In this talk, I will introduce a number of approaches we are investigating to perform a more thorough evaluation of NLP systems. I will first provide a quick overview of automated techniques for perturbing instances in the dataset that identify loopholes and shortcuts in NLP models, including semantic adversaries and universal triggers. I will then describe recent work on creating comprehensive and thorough tests and evaluation benchmarks for NLP using CheckList, which aim to directly evaluate comprehension and understanding capabilities. The talk will cover a number of NLP tasks, including sentiment analysis, textual entailment, paraphrase detection, and question answering.
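
To make the idea of perturbation-based testing concrete, here is a minimal sketch of an invariance test: a label-preserving perturbation is applied to each input, and any change in the model's prediction is recorded as a failure. This is not the CheckList library's API; `toy_sentiment`, `swap_name`, and `invariance_failures` are hypothetical names, and the keyword classifier is only a stand-in for a real model.

```python
def toy_sentiment(text):
    """Hypothetical stand-in for a model: keyword-based sentiment classifier."""
    positive = {"great", "good", "love"}
    negative = {"bad", "terrible", "hate"}
    words = [w.strip(".,!?") for w in text.lower().split()]
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def swap_name(text, old="Mary", new="Ahmed"):
    """Label-preserving perturbation: changing a person's name should not change sentiment."""
    return text.replace(old, new)

def invariance_failures(model, texts, perturb):
    """Return (original, perturbed) pairs whose predictions differ."""
    failures = []
    for original in texts:
        perturbed = perturb(original)
        if model(original) != model(perturbed):
            failures.append((original, perturbed))
    return failures

examples = [
    "Mary said the service was great.",
    "Mary had a terrible flight.",
]
failures = invariance_failures(toy_sentiment, examples, swap_name)
```

An empty `failures` list means the model passed this one test; unlike an aggregate accuracy number, a non-empty list points directly at the inputs where behaviour changed.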

  • 12:00-2:00 Lunch/Poster/Demo Breakouts

  • 2:00-2:30 Lightning Talks*

    • Social Bias Frames: Reasoning about Social and Power Implications of Language

    • They, Them, Theirs: Rewriting with Gender-neutral English

    • Simulated Multiple Reference Training Improves Low-Resource Machine Translation

(* The above papers are nominated for the best paper, which will be announced during the event)

  • 2:30-4:00 Invited Talks #3: Scalable, Multimodal NLP (II)

    • 2:30-3:00 Building Embodied Conversational Agents (Asli Celikyilmaz)

Language understanding and generation are particularly challenging for multimodal tasks such as vision-language navigation, where AI agents must understand natural language instructions to navigate real human environments and reach a goal (e.g., find an object). Success in these tasks requires building multimodal language groundings that allow the agent to successfully navigate while reasoning about vision-language dynamics. We train our agents to execute our commands but do not necessarily teach them how to react when uncertainties in the environment arise. In this talk, I will present our recent work in which we go beyond instruction following and teach the navigating agent to communicate in natural language to get help from another agent (oracle) to reach the goal more efficiently. I will also present our new thinking on the more general problem of understanding how a system asks for and receives assistance, with the goal of exploring techniques that transfer and generalize across vision-language navigation research.

    • 3:00-3:30 Efficient transformers for natural language processing (Hanna Hajishirzi)

In this talk, I present our recent work introducing a deep and light-weight transformer, DeLighT, that delivers similar or better performance than standard transformer-based models with significantly fewer parameters. Our network allocates parameters more efficiently both (1) within each Transformer block, using a deep and light-weight transformation, and (2) across blocks, using block-wise scaling, which allows for shallower and narrower DeLighT blocks near the input and wider and deeper DeLighT blocks near the output. Overall, DeLighT networks are 2.5 to 4 times deeper than standard transformer models and yet have fewer parameters and operations. Experiments on benchmark machine translation and language modeling tasks show that DeLighT matches or improves the performance of baseline Transformers with 2 to 3 times fewer parameters on average.
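
Block-wise scaling, as described above, can be pictured as a schedule that assigns each block a depth that grows from the input side to the output side. The sketch below assumes a linear interpolation between a minimum and a maximum depth; the abstract only states that blocks near the input are shallower and narrower and blocks near the output are deeper and wider, so the linear rule, the function name, and the example values are illustrative assumptions.

```python
def blockwise_depths(num_blocks, min_depth, max_depth):
    """Assign a depth to each block, growing linearly from input to output.

    Linear interpolation is an assumption made for illustration only.
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    if num_blocks == 1:
        return [max_depth]
    step = (max_depth - min_depth) / (num_blocks - 1)
    return [int(round(min_depth + step * b)) for b in range(num_blocks)]

# Hypothetical configuration: 5 blocks, depths ranging from 4 to 8 layers.
depths = blockwise_depths(num_blocks=5, min_depth=4, max_depth=8)

# A uniform network with the same number of blocks at maximum depth, for comparison.
uniform_depths = [8] * 5
```

With these example values the scaled network uses 30 layers in total against 40 for the uniform one, which illustrates how shifting capacity toward the output can reduce the overall budget.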

    • 3:30-4:00 Axes of scalability when training conversational agents (Y-Lan Boureau)

Training conversational agents requires thinking of scalability along multiple axes. Much work has shown that the number of parameters and the number of training examples are important factors in the performance of a model. But scalability can also mean crafting training procedures that can adapt to changing environments without requiring re-training from scratch, devising ways to obtain streams of additional training data in a continual fashion, designing efficient evaluation techniques, and conducting research in a way that makes it easier for teams to build upon the work of others. I will show how these axes of scalability have guided our research towards building open-domain conversational agents, and go over recent results and current challenges.

  • 4:00-4:15 Break. Informal groups

  • 4:15-5:45 Panel: NLP research post COVID-19

    • Jason Williams (moderator), Ruhi Sarikaya, Svitlana Volkova, Alex Rudnicky, Mona Diab

  • 5:45-6:00 Closing Remarks

  • 6:00-7:00 (or later): Virtual Happy Hour