SEMILAR: A Semantic Similarity Toolkit


By downloading the SEMILAR software you agree to the following LICENSE terms and to reference SEMILAR by citing the following paper:

Rus, V., Lintean, M., Banjade, R., Niraula, N., and Stefanescu, D. (2013). SEMILAR: The Semantic Similarity Toolkit. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, August 4-9, 2013, Sofia, Bulgaria. Available here

SEMILAR is available as a Java API (to download it, see the next section on this page) and as a Java application with a graphical user interface. This page covers the Java API; details about the GUI tool are available here.

An online demo of sentence-to-sentence similarity is available here.

LSA models have been built from the full text of the Wikipedia articles. An online demo of word-to-word similarity is available here.

How do I start using the SEMILAR library?

Please see the "Getting started with SEMILAR API" section of the User Manual, download the following files (last updated on July 30, 2013), and get the sample code running. For more information, please also see the FAQ page and the References page.

1. SEMILAR main package (897 MB)

2. Example code files

3. LSA Models (163 MB)
  LSA models built from the full text of the Wikipedia articles and from the TASA corpus are also available for download. To download the models, please click here.

4. LDA Models (17.5 MB)

5. Word to word Similarity test data

6. LDA tool test data

7. PMI data calculated from the full Wikipedia text (as of January 2013). Clean Wikipedia articles are also available.
  Clean Wikipedia articles (January 2013 snapshot): (~3 GB, ~8 GB when unzipped)
  The PMI data are available at: (1.06 GB)

8. Any problems? E-mail Rajendra Banjade at rbanjade@ and Dr. Vasile Rus at vrus @ memphis. edu.

More details about the SEMILAR API 1.0

SEMILAR API 1.0 (requires Java 1.7) contains many similarity assessment methods.

So, what similarity methods are available in SEMILAR?
The SEMILAR API comes with various similarity methods based on WordNet, Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), BLEU, METEOR, Pointwise Mutual Information (PMI), dependency-based methods, optimized methods based on Quadratic Assignment, etc. The similarity methods work at different granularities: word to word, sentence to sentence, or larger texts. Some methods have variations of their own which, combined with parameter settings and your choice of preprocessing steps, can yield a very large space of possible instances of the same basic method.
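
To illustrate how a word-to-word measure can be lifted to the sentence-to-sentence level, here is a minimal sketch of one common approach, greedy best-match averaging. It does not use the SEMILAR API: the class GreedySimilaritySketch, the WordSimilarity interface, and the EXACT_MATCH toy measure are all hypothetical names introduced for this example. In practice the toy measure would be replaced by a WordNet-, LSA-, or PMI-based score; the actual classes and method names are documented in the User Manual.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; none of these names belong to the SEMILAR API.
final class GreedySimilaritySketch {

    /** Hypothetical plug-in point for any word-to-word measure in [0, 1]. */
    interface WordSimilarity {
        double score(String first, String second);
    }

    /** Toy measure (an assumption): 1.0 for identical words, 0.0 otherwise. */
    static final WordSimilarity EXACT_MATCH = new WordSimilarity() {
        @Override
        public double score(String first, String second) {
            return first.equalsIgnoreCase(second) ? 1.0 : 0.0;
        }
    };

    /** Very simple preprocessing: lowercase and split on non-alphanumerics. */
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<String>();
        for (String token : sentence.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Averages, over the source words, the best score found in the target. */
    static double directed(List<String> source, List<String> target,
                           WordSimilarity measure) {
        if (source.isEmpty() || target.isEmpty()) {
            return 0.0;
        }
        double total = 0.0;
        for (String sourceWord : source) {
            double best = 0.0;
            for (String targetWord : target) {
                best = Math.max(best, measure.score(sourceWord, targetWord));
            }
            total += best;
        }
        return total / source.size();
    }

    /** Symmetric sentence score: the mean of the two directed scores. */
    static double similarity(String first, String second,
                             WordSimilarity measure) {
        List<String> firstTokens = tokenize(first);
        List<String> secondTokens = tokenize(second);
        return (directed(firstTokens, secondTokens, measure)
                + directed(secondTokens, firstTokens, measure)) / 2.0;
    }
}
```

Calling GreedySimilaritySketch.similarity("the cat", "the dog", GreedySimilaritySketch.EXACT_MATCH) scores one matching word out of two in each direction. Swapping the word-level measure, the tokenizer, or the aggregation step each produces a different instance of the same basic method, which is the kind of variation described above.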

More questions? Details about the SEMILAR API and references are in the User Manual; see also the FAQ page and the References page.