SEMILAR: A Semantic Similarity Toolkit

(and Other Resources)

*** Downloaded 4,518 times from 121 different countries and 49 different US states (as of July 22, 2015) ***

Why The SEMILAR Project?

The goal of the SEMantic simILARity software toolkit (SEMILAR; pronounced the same way as the word 'similar') is to promote productive, fair, and rigorous research advancements in the area of semantic similarity.

Semantic similarity is a practical, widely used approach to natural language understanding in many core NLP tasks and applications, such as paraphrase identification, Question Answering, Natural Language Generation, and Intelligent Tutoring Systems. The alternative, the full understanding approach, is desirable. However, because full language understanding requires world knowledge, it is more challenging and currently less practical for large-scale use and real-world applications.

In the semantic similarity approach, the meaning of a target text is inferred by assessing how similar it is to another text, called the benchmark text, whose meaning is known. If the two texts are similar enough, according to some measure of semantic similarity, the meaning of the target text is deemed similar to the meaning of the benchmark text. For instance, in dialogue-based Intelligent Tutoring Systems in which learners interact with a tutoring system through dialogue, students' natural language answers to, say, science problems are assessed by comparing them to ideal responses provided by experts. The students' answers are deemed correct if they are similar enough to experts' responses, which are deemed correct.
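The decision rule described above can be sketched in a few lines of Python. This is only an illustration, not SEMILAR's API: the function names, the choice of Jaccard word overlap as the similarity measure, and the 0.5 threshold are all assumptions made for the example.

```python
import re

def tokens(text):
    """Lowercase a text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard word overlap between two texts, a value in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)

def is_correct(student_answer, expert_answer, threshold=0.5):
    """Deem the student answer correct if it is similar enough to the
    expert (benchmark) answer. The threshold is an assumed value."""
    return jaccard(student_answer, expert_answer) >= threshold

# Hypothetical benchmark text and two hypothetical student answers.
expert = "The net force on the object is zero"
good = "the net force on the object is zero newtons"
bad = "gravity pulls the ball down"

print(jaccard(good, expert))     # 7 shared words out of 8 distinct: 0.875
print(is_correct(good, expert))  # True
print(is_correct(bad, expert))   # False
```

Any other similarity measure with values in [0, 1] could be swapped in for `jaccard` without changing the decision rule itself.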

The development of SEMILAR has been motivated by the lack of an integrated environment that would provide

  • Easy access to the various implementations of the semantic similarity approach from the same interface and/or library
  • Easy access to semantic similarity methods that work at different levels of text granularity: word-to-word, sentence-to-sentence, paragraph-to-paragraph, document-to-document, or a combination of the various granularities such as word-to-sentence, sentence-to-paragraph, etc.
  • A common environment for the systematic comparison of the various semantic similarity methods

Introducing SEMILAR

The SEMILAR software environment offers users, researchers, and developers easy access to fully implemented semantic similarity methods in one place, through both a GUI-based interface and a library. Besides productivity advantages, SEMILAR provides a framework for the systematic comparison of various semantic similarity methods.

The automated methods offered by SEMILAR range from simple lexical overlap methods, to methods that rely on word-to-word similarity metrics, to more sophisticated methods that derive the meaning of words and sentences in a fully unsupervised way, such as Latent Semantic Analysis and Latent Dirichlet Allocation, to kernel-based methods for assessing similarity.
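To illustrate how a word-to-word similarity metric can be lifted to the sentence level, here is a minimal Python sketch. It is not taken from SEMILAR: the tiny synonym table, the 0.8 synonym score, and the greedy best-match averaging scheme are assumptions chosen for the example, and a real system would plug in an actual word-to-word metric.

```python
import re

# Toy, hand-made synonym pairs (hypothetical); they stand in for a real
# word-to-word similarity metric.
SYNONYMS = {frozenset(("car", "automobile")), frozenset(("quick", "fast"))}

def word_sim(w1, w2):
    """Toy word-to-word similarity: 1.0 if identical, 0.8 if listed as
    synonyms, 0.0 otherwise."""
    if w1 == w2:
        return 1.0
    if frozenset((w1, w2)) in SYNONYMS:
        return 0.8
    return 0.0

def words(text):
    """Lowercase a text and return its list of word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def directed_sim(src, dst):
    """Average, over the words in src, of the best match found in dst."""
    if not src or not dst:
        return 0.0
    return sum(max(word_sim(w, v) for v in dst) for w in src) / len(src)

def sentence_sim(a, b):
    """Symmetric sentence similarity: mean of the two directed scores."""
    wa, wb = words(a), words(b)
    return (directed_sim(wa, wb) + directed_sim(wb, wa)) / 2

s1 = "The quick car"
s2 = "the fast automobile"
print(sentence_sim(s1, s2))  # about 0.87, although only "the" overlaps lexically
```

The example pair shares a single word, so a pure lexical overlap method would score it low, while the word-to-word approach credits the synonym pairs.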

Besides automated ways of assessing the semantic similarity of texts, the toolkit offers facilities for manual assessment by experts. The manual assessment and annotation component offers GUI-based facilities for experts to assess and annotate the semantic similarity of texts. This component is called SEMILAT, the SEMantic simILarity Annotation Tool. SEMILAT is available for download. The SEMILAR corpus built by our research group is also available for download. The SEMILAR corpus offers qualitative word-level similarity judgments by human experts, which can be used to further the understanding of the various word-to-word semantic similarity methods and their impact on the similarity of larger texts, e.g., sentences or paragraphs.

Some of the most important features of SEMILAR are listed below:

  • Easy-to-use GUI
  • Data management
  • Preprocessing
  • Lexical and syntactic feature extraction
  • Visualization
  • GUI-based data assessment and annotation (SEMILAT: The SEMantic simILarity Annotation Tool)
  • Performance reports (if data is accompanied by expert judgments)

Latest News

@ Download the SemAligner Application

@ Download the SEMILAR Library

@ Download the SEMILAR Application

July 27, 2016 - SemAligner Tool v. 1.0 has been released - click here to go to the download page.

July 22, 2015 - The SEMILAR Application, which offers GUI-based access to the SEMILAR library, is now available for download. Please click here for download instructions.

November 28, 2014 - SEMILAR API has been downloaded from 102 different countries and 46 US states.

May 26, 2014 - LSA models developed using the full set of Wikipedia articles and the TASA corpus are available for download. To download the LSA models, please click here.

October 24, 2013 - LSA models have been developed using the full set of Wikipedia articles. An LSA-based similarity demo is available here.

August 4, 2013 - The SEMILAR online demo has been added.

July 30, 2013 - SEMILAR API 1.0 has been released. Details on how to download it are available here.

May 15, 2013 - SEMILAR will be presented at ACL 2013. To access the 6-page paper, click here.

June 22, 2012 - The first version of SEMILAT, the semantic similarity annotation tool, has been released - click here for downloading instructions.

The SEMILAR Corpus has also been released - click here to download.