Mining and Enriching Multilingual Scientific Text Collections


Horacio Saggion

Large Scale Text Understanding Systems Lab
Natural Language Processing Group
Department of Information and Communication Technologies
Universitat Pompeu Fabra


Scientists worldwide are confronted with exponential growth in the number of
scientific documents being made available. For example, Elsevier publishes over
250K scientific articles per year (or one every two minutes) and has over 7
million publications; MedLine, the most important source in biomedical research,
contains 21 million scientific references; and the World Intellectual Property
Organization (WIPO) holds some 70 million records. This unprecedented volume of
information complicates the task of researchers, who face the pressure of
keeping up to date with discoveries in their own disciplines and the challenge
of searching for innovation, finding new and interesting problems to solve,
checking already solved problems or hypotheses, and getting information on past
and current methods, solutions, or techniques. At the same time, with the rise
of open science initiatives and social media, research is more connected and
open, creating new opportunities but also new challenges for the scientific
community.
In this scenario of scientific information overload, natural language processing
has a key role to play. Over the past few years we have seen the development of
a number of tools and methods for the analysis of the structure of scientific
documents (e.g. transforming PDF to XML), for extracting keywords, and for
classifying sentences into argumentative categories. However, deep analysis of
scientific documents, such as finding key claims, assessing the argumentative
quality and strength of the research, or summarizing the key contributions of a
piece of work, is less common. Moreover, most research in scientific text
processing is carried out for the English language, neglecting both the share of
scientific information available in other languages and the fact that scientific
publications are often bilingual.
In this talk, I will present work carried out in our laboratory towards the
development of a system for “deep” analysis and annotation of scientific text
collections. Originally built for the English language, it has now been adapted
to Spanish. After a brief overview of the system and its main components, I will
present the development of a bilingual (Spanish and English) fully annotated
text resource in the field of natural language processing, which we have created
with our system, together with a faceted-search and visualization system for
exploring the created resource.
With this scenario in mind, I will speculate on the challenges and opportunities
that the scientific field brings to our community not only in terms of language
but also from the point of view of social media and science education.

Speaker short bio:

Horacio Saggion is an Associate Professor at the Department of Information and
Communication Technologies, Universitat Pompeu Fabra (UPF), Barcelona. He is the
head of the Large Scale Text Understanding Systems Lab, associated with the
Natural Language Processing group (TALN), where he works on automatic text
summarization, text simplification, information extraction, sentiment analysis,
and related topics. Horacio obtained his PhD in Computer Science from
Université de Montréal, Canada, in 2000. He obtained his BSc in Computer Science
from Universidad de Buenos Aires in Argentina, and his MSc in Computer Science
from UNICAMP in Brazil. He was the Principal Investigator for UPF in the EU
projects Dr Inventor and Able-to-Include and is currently Principal Investigator
of the national project TUNER and the María de Maeztu project Mining the
Knowledge of Scientific Publications. Horacio has published over 150 works in
leading scientific journals, conferences, and books in the field of human
language technology. He organized four international workshops in the areas of
text summarization and information extraction and was scientific Co-chair of
STIL 2009 and scientific Chair of SEPLN 2014. He is a regular programme
committee member for international conferences such as ACL, EACL, COLING, EMNLP,
IJCNLP, and IJCAI and is an active reviewer for international journals in computer
science, information processing, and human language technology. Horacio has
given courses, tutorials, and invited talks at a number of international events