The SEMILAR Corpus: The SEMantic SimILARity Corpus
The goal of the SEMILAR corpus is to offer human judgments of word-level semantic similarity that can then be linked to the semantic similarity of the larger texts. The corpus contains 701 pairs of annotated sentences taken from the Microsoft Research Paraphrase Corpus (MSRP). The SEMILAR corpus complements the MSRP corpus by providing human annotations that describe relations between lexical tokens or phrases in the two texts that form a paraphrase instance in MSRP.
As in MSRP, the SEMILAR corpus offers an overall judgment on whether a paraphrase relation exists between the texts in an instance. These judgments were made following our protocol for annotating word-level semantic similarity relations. More details about the protocol are available in the following paper:
Vasile Rus, Mihai Lintean, Cristian Moldovan, William Baggett, Nobal Niraula, and Brent Morgan. The SIMILAR Corpus: A Resource to Foster the Qualitative Understanding of Semantic Similarity of Texts. In Semantic Relations II: Enhancing Resources and Applications, The 8th Language Resources and Evaluation Conference (LREC 2012), May 23-25, 2012, Istanbul, Turkey.