		= Online Forum Summarization (OnForumS), MultiLing 2015 =

			= README for the sample data release =


The sample data consists of one news article from The Guardian and a selected set of readers' comments.

There are five files constituting the sample data release:
 1. 81043636.ofs.in.xml
 2. 81043636.ofs.out.xml
 3. 81043636.utf8.txt
 4. outputFormatOFS.txt
 5. ofs.dtd

Participants are expected to take file 1 as input and produce file 2 as output
by populating the <links></links> section accordingly. File 3 is provided as an
auxiliary text version of the input, file 4 is a sketch of the XML format with
comments, and file 5 is a DTD specification of the XML format. The text in file 1
is sentence-split and pre-tokenised (i.e., with spaces between tokens), whereas the
text in file 3 is neither.
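
The input-to-output step described above can be sketched in Java roughly as
follows. This is a minimal sketch under stated assumptions, not part of the
release: the class name OfsSketch and its methods are hypothetical, it assumes
that file 1 already contains an empty <links> element, and it assumes that
output files follow the sample naming (*.ofs.in.xml becomes *.ofs.out.xml).
The elements that go inside <links> are deliberately left out; they should be
taken from outputFormatOFS.txt and ofs.dtd.

```java
import java.io.File;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class OfsSketch {
    private static final String IN_SUFFIX = ".ofs.in.xml";
    private static final String OUT_SUFFIX = ".ofs.out.xml";

    /** Maps an input name to an output name (assumed convention, taken from the sample). */
    static String outputNameFor(String inputName) {
        if (!inputName.endsWith(IN_SUFFIX)) {
            throw new IllegalArgumentException("unexpected input name: " + inputName);
        }
        return inputName.substring(0, inputName.length() - IN_SUFFIX.length()) + OUT_SUFFIX;
    }

    /** Parses XML without DTD validation. */
    static Document parse(InputSource source) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
    }

    /** Returns the first <links> element, or null if the document has none. */
    static Element findLinks(Document doc) {
        NodeList nodes = doc.getElementsByTagName("links");
        return nodes.getLength() > 0 ? (Element) nodes.item(0) : null;
    }

    /** Serialises the document, re-emitting the DOCTYPE system identifier if one was present. */
    static String serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        if (doc.getDoctype() != null && doc.getDoctype().getSystemId() != null) {
            transformer.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, doc.getDoctype().getSystemId());
        }
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String inputName = args[0];
        Document doc = parse(new InputSource(new File(inputName).toURI().toString()));
        Element links = findLinks(doc);
        if (links == null) {
            throw new IllegalStateException("no <links> element found in " + inputName);
        }
        // A system would add its link elements to `links` here,
        // following outputFormatOFS.txt and ofs.dtd.
        Files.write(Paths.get(outputNameFor(inputName)),
                serialize(doc).getBytes(StandardCharsets.UTF_8));
    }
}
```

Run as "java OfsSketch 81043636.ofs.in.xml" from the directory holding the
sample files, this would write 81043636.ofs.out.xml with the <links> section
still empty.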

The test data to be handed out for the final evaluation will consist of a set of
news articles. Each article will come as a pair of files: one XML file like
file 1 above and one auxiliary text file like file 3 above.

In addition to the data, participants will receive a validation program that they
can run over their outputs to make sure these conform to the OnForumS format
expectations (the DTD plus some specific checks; see * below for DTD validation).

Please note that the links provided in file 2 to illustrate the task are a
non-exhaustive set, obtained from pre-pilot crowdsourcing evaluations using
CrowdFlower.


For any questions on OnForumS please contact malexa @ essex.ac.uk or jstein @ kiv.zcu.cz.

--
* A Java DTD validator that can be used is the DOMValidator class at the following link:

 http://www.herongyang.com/XML/DTD-Validation-of-XML-with-DTD-Using-DOM.html

Download the class, compile it and run it as follows:

 java -Xmx1000M -Xms1000M -cp <YOUR-CLASSPATH> DOMValidator 81043636.ofs.out.xml
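
For reference, the same kind of DTD check can be sketched with the standard
JAXP classes that ship with the JDK. This is a minimal sketch, not the
DOMValidator class linked above and not the OnForumS validation program: the
class name DtdCheck and its methods are hypothetical, it covers DTD validity
only (none of the OnForumS-specific checks), and it assumes the XML file names
its DTD in a DOCTYPE declaration that resolves to ofs.dtd.

```java
import java.io.File;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper class, for illustration only.
class DtdCheck {
    /** Parses with DTD validation switched on and returns the validity errors found. */
    static List<String> validate(InputSource source) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // validate against the DTD named in the DOCTYPE
        DocumentBuilder builder = factory.newDocumentBuilder();
        final List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) {
                // warnings are ignored in this sketch
            }
            @Override public void error(SAXParseException e) {
                // validity errors are collected so that all of them get reported
                errors.add("line " + e.getLineNumber() + ": " + e.getMessage());
            }
            @Override public void fatalError(SAXParseException e) throws SAXParseException {
                throw e; // the document is not well-formed; parsing cannot continue
            }
        });
        builder.parse(source);
        return errors;
    }

    /** Convenience wrapper for XML held in a string. */
    static List<String> validateString(String xml) throws Exception {
        return validate(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Passing a file URI lets a relative DTD reference resolve next to the XML file.
        List<String> errors = validate(new InputSource(new File(args[0]).toURI().toString()));
        for (String error : errors) {
            System.out.println(error);
        }
        System.out.println(errors.isEmpty() ? "valid" : errors.size() + " validity error(s)");
    }
}
```

It would be compiled with javac and run as "java DtdCheck 81043636.ofs.out.xml".
A file without a DOCTYPE declaration will typically be reported as invalid,
since the parser then has no DTD to validate against.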

