

		= Crowdsourcing-based System Evaluation for OnForumS'15 =


 * NEW *

Using the gold data set release 20150520 we were able to compute
Precision, Recall and F1 figures for every link-label per system run,
macro-averaged over the full set of documents. Results are attached
as extra worksheets in the original evaluation spreadsheets circulated
previously (see below).
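
The macro-averaging mentioned above can be sketched as follows. This is a minimal illustration, not the actual evaluation script; the per-document true-positive/false-positive/false-negative counts are hypothetical:

```python
# Hypothetical per-document counts of true positives, false positives
# and false negatives for one link-label and one system run.
docs = [
    {"tp": 8, "fp": 2, "fn": 2},
    {"tp": 3, "fp": 1, "fn": 3},
]

def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Macro-averaging: compute P/R/F1 per document, then average the three
# scores across documents (as opposed to micro-averaging, which would
# sum the counts first).
per_doc = [prf(**d) for d in docs]
macro_p, macro_r, macro_f1 = (sum(vals) / len(per_doc) for vals in zip(*per_doc))
```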


---

Submissions to OnForumS were evaluated via crowdsourcing on CrowdFlower.
The evaluation results are attached in this tarball in the file EN_OFS_SystemEvaluationAndRanking.xlsm.

The crowdsourcing HIT was designed as a validation task: each system-proposed
link, together with its labels, is presented to a crowd worker for validation.

The approach used for the OnForumS evaluation is IR-inspired and based on the
concept of 'pooling' used in TREC, under the assumption that possible links
not proposed by any system are deemed irrelevant. The links that systems did
propose are then divided into four categories:
a. links proposed in 4 or more system runs
b. links proposed in 3 system runs
c. links proposed in 2 system runs
d. links proposed only once
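
The categorization above amounts to counting, for each distinct link, how many runs proposed it. A minimal sketch, with purely illustrative run names and link IDs:

```python
from collections import Counter

# Hypothetical proposed links per system run; a link is a
# (comment_id, article_sentence_id) pair -- IDs are illustrative only.
runs = {
    "run1": [("c1", "s1"), ("c2", "s3"), ("c5", "s5")],
    "run2": [("c1", "s1"), ("c2", "s3"), ("c4", "s2")],
    "run3": [("c1", "s1"), ("c4", "s2")],
    "run4": [("c1", "s1"), ("c4", "s2")],
}

# Count in how many runs each link appears (set() guards against a run
# proposing the same link twice).
counts = Counter(link for links in runs.values() for link in set(links))

def category(n):
    """Pooling category from the number of runs proposing the link."""
    if n >= 4:
        return "a"
    if n == 3:
        return "b"
    if n == 2:
        return "c"
    return "d"

pooled = {link: category(n) for link, n in counts.items()}
```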

Due to the volume of links proposed by systems, a stratified random sample was extracted
for evaluation based on the following strategy:
- all of the 'a' links
- all of the 'b' links
- randomly select one third of 'c' links from each system run
- randomly select one third of 'd' links from each system run
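
The stratified sampling strategy above can be sketched as follows. The link IDs are hypothetical, and taking at least one link per run when a third rounds down to zero is an assumption of this sketch:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical pooled links: 'a' and 'b' are kept whole; 'c' and 'd'
# are grouped per system run because sampling is done per run.
by_category = {
    "a": ["a1", "a2"],
    "b": ["b1", "b2", "b3"],
    "c": {"run1": ["c1", "c2", "c3"], "run2": ["c4", "c5", "c6"]},
    "d": {"run1": ["d1", "d2", "d3"], "run2": ["d4", "d5", "d6"]},
}

# Keep all 'a' and 'b' links.
sample = list(by_category["a"]) + list(by_category["b"])

# Randomly select one third of 'c' and 'd' links from each run
# (at least one per run -- an assumption for tiny runs).
for cat in ("c", "d"):
    for run_links in by_category[cat].values():
        k = max(1, len(run_links) // 3)
        sample.extend(random.sample(run_links, k))
```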

Then the sample was loaded onto CrowdFlower and validated via crowdsourcing
(the resulting aggregated files are attached under the subdirectory CrowdFlowerFiles/
in the tarball).

Subsequently, the links validated via crowdsourcing were mapped back onto the full
set of system-proposed links (see columns L-R in the spreadsheets named after the
documents from which the links were extracted). The primary key for the mapping is
a unique ID assigned to each link (column F).
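
The mapping step is essentially a lookup by that unique link ID. A minimal sketch with made-up IDs and verdict values:

```python
# Hypothetical aggregated crowd verdicts, keyed by unique link ID
# (the equivalent of column F in the spreadsheets).
verdicts = {"L001": "correct", "L002": "incorrect"}

# Full set of system-proposed links; links outside the stratified
# sample were never shown to the crowd and so have no verdict.
proposed = [
    {"link_id": "L001", "run": "run1"},
    {"link_id": "L002", "run": "run1"},
    {"link_id": "L003", "run": "run2"},  # not sampled, hence no verdict
]

for link in proposed:
    link["verdict"] = verdicts.get(link["link_id"])  # None if not validated
```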

Finally, correct and incorrect links were counted, first for the linking task alone.
Then, over the links validated as correct, the correct and incorrect argument and
sentiment labels were counted. From these counts, precision scores were computed,
akin to precision at rank n under pooling to depth n in TREC. System runs were then
ranked by these precision scores, resulting in the following table for English:

ENGLISH

LINKING
GroupAndRun        Precision (LINK)
BASE-overlap       0.928571429
USFD_UNITN-run2    0.892561983
JRC-run1           0.85786802
UWB-run1           0.85106383
JRC-run2           0.829192547
USFD_UNITN-run1    0.818461538
BASE-first         0.738878143
CIST-run2          0.709424084
CIST-run1          0.702239789

ARGUMENT STRUCTURE
GroupAndRun        Precision (ARGM)
CIST-run2          0.990601504
CIST-run1          0.988527725
UWB-run1           0.974358974
BASE-first         0.915531335
JRC-run2           0.896153846
USFD_UNITN-run1    0.891891892
JRC-run1           0.884848485
BASE-overlap       0.881578947
USFD_UNITN-run2    0.859813084

SENTIMENT DETECTION
GroupAndRun        Precision (SENTM)
CIST-run1          0.946050096
CIST-run2          0.933837429
BASE-first         0.927027027
BASE-overlap       0.922077922
UWB-run1           0.897435897
JRC-run2           0.895752896
USFD_UNITN-run2    0.885714286
USFD_UNITN-run1    0.88030888
JRC-run1           0.874251497
---

And for Italian:

ITALIAN

LINKING
GroupAndRun        Precision (LINK)
BASE-overlap       0.590909091
UWB-run1           0.25
USFD_UNITN-run1    0.2
JRC-run1           0.152380952
CIST-run1          0.084269663
CIST-run2          0.033333333
BASE-first         0.010309278

ARGUMENT STRUCTURE
GroupAndRun        Precision (ARGM)
CIST-run2          1
UWB-run1           1
CIST-run1          0.777777778
BASE-first         0.75
BASE-overlap       0.692307692
JRC-run1           0.44
USFD_UNITN-run1    0

SENTIMENT DETECTION
GroupAndRun        Precision (SENTM)
CIST-run1          0.666666667
BASE-overlap       0.5
JRC-run1           0.375
BASE-first         0.333333333
UWB-run1           0.25
CIST-run2          0
USFD_UNITN-run1    0
---
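
The two-stage counting described earlier (link precision first, then argument and sentiment label precision computed only over links validated as correct) can be sketched as follows, with entirely hypothetical verdicts for one system run:

```python
# Hypothetical judged links for one run: each has a link verdict and,
# where the link was judged correct, argument/sentiment label verdicts.
judged = [
    {"link": True,  "argm": True,  "sentm": False},
    {"link": True,  "argm": True,  "sentm": True},
    {"link": False, "argm": None,  "sentm": None},  # labels not counted
    {"link": True,  "argm": False, "sentm": True},
]

# Link precision over all judged links for this run.
correct_links = [j for j in judged if j["link"]]
p_link = len(correct_links) / len(judged)

# Label precision only over links validated as correct.
p_argm = sum(1 for j in correct_links if j["argm"]) / len(correct_links)
p_sentm = sum(1 for j in correct_links if j["sentm"]) / len(correct_links)
```

Runs would then simply be sorted by each of these scores to produce the ranking tables above.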

More details on the results and the evaluation method will be provided in due course.

For any questions, please email:
jstein at kiv.zcu.cz
malexa at essex.ac.uk

