This library encapsulates our statistico-semantic algorithm to derive term similarity from a corpus of texts.
Term similarity is defined here as any kind of relation between terms that could cause them to appear in the same context.
To estimate how likely such a relation is, we first extract a specified number of the most recurring terms in the corpus. Stopwords can be defined (and may include corpus-specific recurring words); they are excluded from this list.
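As a rough illustration of this first step, here is a minimal Python sketch. It is not the library's own code: the function name most_recurring_terms, its parameters, and the choice to rank terms by raw occurrence count over already-tokenized blocks are all assumptions made for the example.

```python
from collections import Counter

def most_recurring_terms(blocks, n, stopwords=frozenset()):
    """Return the n most frequent terms across all blocks, skipping stopwords.

    Hypothetical helper: `blocks` is assumed to be a list of token lists.
    """
    counts = Counter(
        term
        for block in blocks
        for term in block
        if term not in stopwords
    )
    # most_common(n) yields (term, count) pairs sorted by decreasing count
    return [term for term, _ in counts.most_common(n)]

blocks = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "cat", "sat"],
]
# "the" is the most frequent token but is excluded as a stopword
top_terms = most_recurring_terms(blocks, 2, stopwords={"the"})
print(top_terms)  # ['cat', 'sat']
```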
The corpus itself is cut into blocks (sentences, lines or n-grams, as specified). The algorithm then computes the co-occurrences between each term in the corpus and each of the recurring terms. The number of co-occurrences of two terms is defined as the number of blocks in which both terms appear.
For instance, if we have two blocks containing the words A B C and A C D respectively, and if C has been found to be one of the most recurring terms, then A, B and D will have scores with C equal to 2, 1 and 1 respectively (C itself will also have a score of 2 with C).
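The counting rule and the A B C / A C D example above can be sketched in Python as follows. This is an illustrative sketch only: the name cooccurrence_scores and the nested-dictionary layout are assumptions, and each term is assumed to count at most once per block.

```python
from collections import defaultdict

def cooccurrence_scores(blocks, recurring_terms):
    """Count, for each term, the number of blocks it shares with each recurring term.

    Hypothetical helper: returns scores[term][recurring_term] -> block count.
    """
    recurring = set(recurring_terms)
    scores = defaultdict(lambda: defaultdict(int))
    for block in blocks:
        present = set(block)  # a term counts once per block, even if repeated
        for ref in present & recurring:
            for term in present:
                scores[term][ref] += 1
    return scores

blocks = [["A", "B", "C"], ["A", "C", "D"]]
scores = cooccurrence_scores(blocks, ["C"])
for term in ["A", "B", "C", "D"]:
    print(term, scores[term]["C"])
# A 2
# B 1
# C 2
# D 1
```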
Once all scores have been computed, the vector of each term (containing its scores with each recurring term) is normalized so that its coordinates sum to 1. These normalized vectors can then be used as features of the terms to compute an actual distance between them (for instance Euclidean or Manhattan).
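The normalization and distance step might look like the following Python sketch. The function names are hypothetical, and returning an all-zero vector unchanged (for a term that never co-occurs with any recurring term) is an assumption of this example, not documented behaviour of the library.

```python
import math

def normalize(vector):
    """Scale a score vector so that its coordinates sum to 1."""
    total = sum(vector)
    if total == 0:
        # Assumption: leave an all-zero vector untouched to avoid dividing by zero
        return list(vector)
    return [x / total for x in vector]

def euclidean(u, v):
    """Euclidean (L2) distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Manhattan (L1) distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Scores of two terms against the same two recurring terms
u = normalize([2, 2])  # [0.5, 0.5]
v = normalize([3, 1])  # [0.75, 0.25]
print(manhattan(u, v))  # 0.5
print(euclidean(u, v))
```

Because every normalized vector sums to 1, the distances compare how a term's co-occurrences are distributed across the recurring terms rather than how frequent the term is overall.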
StatSemDistance is available under the GNU LGPL v3.
Contributor: Samuel Gesche
Last release date: 2013-10-23