Text2Story 2022
Fifth International Workshop on Narrative Extraction from Texts
held in conjunction with the 44th European Conference on Information Retrieval
Although information retrieval (IR) and natural language processing have made significant progress towards the automatic interpretation of texts, the problem of constructing consistent narrative structures is yet to be solved. The fifth edition of the Text2Story workshop aims to foster discussion of recent advances at the intersection of IR and the formal understanding and representation of narratives in texts. Specifically, it provides a common forum to consolidate multi-disciplinary efforts and to identify the wide-ranging issues involved in the narrative extraction task.
We challenge interested researchers to consider submitting a paper that makes use of the tls-covid19 dataset (published at ECIR'21) within the scope and purposes of the Text2Story workshop. tls-covid19 consists of a set of curated topics related to the COVID-19 outbreak, with associated news articles from Portuguese and English news outlets and their respective reference timelines as gold standard. While it was designed to support timeline summarization research, it can also be used for other tasks, including the study of news coverage of the COVID-19 pandemic.
We invite five kinds of submissions:
Papers must be submitted electronically in PDF format through EasyChair. All submissions must be in English and formatted according to the one-column CEUR-ART style, with no page numbers. Templates, in both Word and LaTeX, can be found in the following zip folder. There is also an Overleaf page for LaTeX users.
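For orientation, the skeleton below sketches what a one-column CEUR-ART submission typically looks like in LaTeX. It is a minimal sketch only: the command names (\documentclass{ceurart}, \conference, \copyrightclause, the keywords environment) follow the sample file distributed with the template zip and the Overleaf page, which should be treated as authoritative, and the conference string and author fields shown here are illustrative placeholders.

\documentclass{ceurart}  % one-column layout is the class default; no page numbers

\begin{document}

%% Copyright and event information (values shown here are illustrative)
\copyrightyear{2022}
\copyrightclause{Copyright for this paper by its authors. Use permitted under
  Creative Commons License Attribution 4.0 International (CC BY 4.0).}
\conference{Text2Story 2022: Fifth International Workshop on Narrative
  Extraction from Texts, held in conjunction with ECIR 2022, Stavanger, Norway}

%% Title, authors, and affiliations
\title{Your Paper Title}
\author[1]{First Author}
\author[1]{Second Author}
\address[1]{Affiliation, City, Country}

\begin{abstract}
  A short abstract of the submission.
\end{abstract}

\begin{keywords}
  narrative extraction \sep information retrieval
\end{keywords}

\maketitle

\section{Introduction}
Body text goes here.

\end{document}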
Submissions will be peer-reviewed by at least two members of the programme committee. Accepted papers will appear in the proceedings, published as CEUR Workshop Proceedings (usually indexed on DBLP).
Abstract: A substantial majority of Americans share the belief that the political discourse in the U.S. has recently become more negative, and more than half of them blame this change on Donald Trump. However, as is often the case in politics, talk is cheap and hard data is difficult to come by. To provide quantitative answers (and distribute blame deservedly) we consider the large-scale extraction and attribution of quotes by politicians for the analysis of political discourse. In the first part of this talk, I introduce Quobert, a transformer-based model that exploits the parallelism in news reporting for the extraction and attribution of quotes from news. Using Quotebank, a comprehensive corpus of 235 million unique quotations that we extracted with Quobert from a decade of news, I then demonstrate how this data can be used to quantify trends in the use of political language. In particular, I will focus on the uptick in negativity in U.S. politicians' language after the end of Obama's tenure, quantify the shifts in language tone, and unravel to whom these shifts could feasibly be attributed.
Bio: Andreas Spitz is an assistant professor and head of the Data and Information Mining lab at the University of Konstanz. He holds a PhD in computer science from Heidelberg University and visited the EPFL Data Science lab as a postdoctoral researcher. Andreas' research interests lie at the intersection of information retrieval, natural language processing, computational social science, and complex network analysis. He is particularly interested in graph representations of natural language and how they can be used to efficiently query, visualize, and explore large corpora.
Abstract: Many documents can only be accessed through digitization. This is notably the case for historical and handwritten documents, but also for many digitally-born documents turned into images for various reasons (e.g., a file conversion, or the intermediate use of an analog form in order to manually sign a document, fill out a form, send it by post, etc.). Analyzing the textual content of such digitized documents requires a conversion phase from the captured image to a textual representation, key parts of which are optical character recognition (OCR) and layout analysis. The resulting text and structure are often imperfect, to an extent that is notably correlated with the quality of the initial medium (which may be stained, folded, aged, etc.) and with the quality of the image taken from it. In this talk, I will present recent advances in AI and natural language understanding that enable this type of corpus to be analyzed in a way that is robust to digitization noise. For example, I will show how, in the H2020 NewsEye project, we were able to achieve state-of-the-art results for the cross-lingual recognition and disambiguation of named entities (names of people, places, and organizations) in large corpora of historical newspapers in four languages, published between 1850 and 1950. This type of result paves the way for large-scale analysis of digitized documents, including across linguistic borders.
Bio: Antoine Doucet has been a tenured full professor in computer science at the L3i laboratory of the University of La Rochelle since 2014. He leads the research group on document analysis, digital contents and images at La Rochelle Université (about 50 people) and directs the ICT department of the Vietnamese-French University of Science and Technology of Hanoi (USTH). Until January 2022, he was the coordinator of the H2020 project NewsEye, focused on augmenting access to historical newspapers across domains and languages. He further leads the effort on semantic enrichment for low-resourced languages in the context of the H2020 project Embeddia. His main research interests lie in the fields of information retrieval, natural language processing, (text) data mining, and artificial intelligence. The central focus of his work is the development of methods that scale to very large document collections and use as few external resources as possible, so as to be applicable to documents of any type written in any language, from news articles to social networks, and from digitized manuscripts to digitally-born documents. Antoine Doucet holds a PhD in computer science from the University of Helsinki (Finland), obtained in 2005, and a French research supervision habilitation (HDR), obtained in 2012.
All times are given in the event timezone (Norway local time).
09h30 – 09h40 | Introduction (Ricardo Campos) in-person | slides

Session Chair: Ricardo Campos
09h40 – 10h20 | Keynote 1: Robust and multilingual analysis of historical documents (Antoine Doucet, University of La Rochelle) in-person | slides
10h20 – 10h40 | Time for some German? Pre-Training a Transformer-based Temporal Tagger for German (Satya Almasian, Dennis Aumiller and Michael Gertz) in-person | video | slides
10h40 – 11h00 | Understanding COVID-19 News Coverage using Medical NLP (Ali Emre, Veysel Kocaman, Hasham Ul Haq and David Talby) in-person | slides

11h00 – 11h30 | Coffee Break

Session Chair: Sumit Bhatia
11h30 – 11h50 | Changing the Narrative Perspective: From Ranking to Prompt-Based Generation of Entity Mentions (Mike Chen and Razvan Bunescu) online | slides
11h50 – 12h10 | EnDSUM: Entropy and Diversity based Disaster Tweet Summarization (Piyush Kumar Garg, Roshni Chakraborty and Sourav Kumar Dandapat) online | video | slides

Session Chair: Marina Litvak
12h10 – 12h30 | Simplifying News Clustering Through Projection From a Shared Multilingual Space (João Santos, Afonso Mendes and Sebastião Miranda) in-person | video | slides
12h30 – 12h50 | Exploring Data Augmentation for Classification of Climate Change Denial: Preliminary Study (Jakub Piskorski, Nikolaos Nikolaidis, Nicolas Stefanovitch, Bonka Kotseva, Irene Vianini, Sopho Kharazi and Jens Linge) online | slides
12h50 – 13h10 | Dynamic change detection in topics based on rolling LDAs (Jonas Rieger, Kai-Robin Lange, Jonathan Flossdorf and Carsten Jentsch) in-person | video | slides

13h10 – 14h00 | Lunch Break

Session Chair: Adam Jatowt
14h00 – 14h50 | Keynote 2: We Have the Best Words: From the Web-scale Extraction and Attribution of Quotes to Analyzing Negativity in U.S. Political Language (Andreas Spitz, University of Konstanz) in-person | slides
14h50 – 15h10 | Text2Icons: representing narratives with icon strips (Joana Valente, Alípio Jorge and Sérgio Nunes) in-person | slides
15h10 – 15h30 | Comprehensive contextual visualization of a news archive (Ishrat Sami, Tony Russell-Rose and Larisa Soldatova) online | video | slides

15h30 – 16h00 | Coffee Break

Session Chair: Alípio Jorge
16h00 – 16h20 | Causality Mining in Fiction (Margaret Meehan, Andrew Piper and Dane Malenfant) online | video | slides
16h20 – 16h40 | Extracting Impact Model Narratives from Social Services’ Text (Bart Gajderowicz and Mark Fox) online | slides

Session Chair: Vasco Campos
16h40 – 17h00 | MARCUS: An Event-Centric NLP Pipeline that generates Character Arcs from Narratives (Sriharsh Bhyravajjula, Ujwal Narayan and Manish Shrivastava) in-person | video | slides

17h00 – 17h30 | Best Paper Award and Reviewers Award (Ricardo Campos, Alípio Jorge, Adam Jatowt, Marina Litvak) in-person
Text2Story 2022 will be held at the 44th European Conference on Information Retrieval (ECIR'22) in Stavanger, Norway.
Registration at ECIR 2022 is required to attend the workshop (don't forget to select the Text2Story workshop).
This project is financed by the ERDF – European Regional Development Fund – through the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and by national funds through the Portuguese funding agency, FCT – Fundação para a Ciência e a Tecnologia – within project PTDC/CCI-COM/31857/2017 (NORTE-01-0145-FEDER-031857).