Test data
ATTENTION: Before you can download the data, please contact ggianna AT iit DOT demokritos DOT gr to get your team's username and password.
Test data downloadable from this link.
Note that three of the test data topics (M001, M002, M003) will not be taken into account during evaluation, since they have been used as training data (see below).
Training data - ** NEW **
Training data downloadable from this link, using the participant username and password provided via e-mail.
Task overview
This MultiLing task aims to evaluate the application of (partially or fully) language-independent summarization algorithms on a variety of languages. Each system participating in the task will be asked to provide summaries for a range of different languages, based on corresponding corpora. The MultiLing Pilot of 2011 used 7 languages, while MultiLing 2015 will use 8. Participating systems are required to apply their methods to at least two languages. Evaluation will favour systems that apply their methods to more languages.
The MultiLing task requires generating a single, fluent, representative summary from a set of documents describing an event sequence. The language of each document set will be one of a given range of languages, and all documents in a set share the same language. The output summary must be in the same language as its source documents. It must be at most 250 words long (for non-Chinese languages) or at most 750 bytes long in UTF-8 encoding (for Chinese).
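The length limit can be checked before submission with a short script. The following is a minimal sketch, not the official evaluation procedure: the function names are hypothetical, and it assumes that a "word" is a whitespace-separated token, which may differ from how the organizers count words.

```python
MAX_WORDS = 250          # limit for non-Chinese languages (from the task description)
MAX_BYTES_CHINESE = 750  # limit for Chinese, measured on the UTF-8 encoding


def within_limit(summary: str, is_chinese: bool = False) -> bool:
    """Return True if the summary respects the task's length limit.

    Assumption: words are whitespace-separated tokens.
    """
    if is_chinese:
        return len(summary.encode("utf-8")) <= MAX_BYTES_CHINESE
    return len(summary.split()) <= MAX_WORDS


def truncate(summary: str, is_chinese: bool = False) -> str:
    """Cut a summary down to the limit (hypothetical helper).

    For Chinese, the cut is made on the UTF-8 bytes; a trailing
    partial multi-byte character is dropped rather than emitted
    as invalid UTF-8.
    """
    if is_chinese:
        data = summary.encode("utf-8")[:MAX_BYTES_CHINESE]
        return data.decode("utf-8", errors="ignore")
    # Note: this also collapses runs of whitespace into single spaces.
    return " ".join(summary.split()[:MAX_WORDS])
```

Truncating on bytes rather than characters matters for Chinese because most Chinese characters take three bytes in UTF-8, so 750 bytes corresponds to roughly 250 characters.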
Sample input and output
The input and output data samples are based on the MultiLing 2013 equivalent task.
Sample input files for several languages (UTF-8 encoded, plain text files). Unzip the provided file to see the sample input files.
Sample output files for several languages (UTF-8 encoded, plain text files, 250 words max for non-Chinese languages or 750 bytes max for Chinese language). Unzip the provided file to see the sample output files.
References: