Now a 64-bit version.
Warning: much slower.
A simple command-line tool for comparing text files using the simhash algorithm and contrasting the result with the Jaccard index.
Almost pure fork of node-simhash, by Scott Horn:
- Patches log4js issue by setting a forced version of log4js
- Cleans French diacritics
If you have just cloned this repository, run the following:
npm install
npm link
Command line tool usage
simhash file1.txt file2.txt
simhash https://file.com/page1.html https://file.com/page2.html
var simhash = ;simhash;
Compare two text strings using both simhash and the Jaccard index, and print a summary.
Compare two text strings using both simhash and the Jaccard index.
Count the binary ones in a number.
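Counting the binary ones (the Hamming weight) is what turns the XOR of two hashes into a distance. The following is a minimal sketch, assuming 32-bit unsigned integers for simplicity; `hammingWeightSketch` and `hammingDistanceSketch` are hypothetical names, not this library's functions.

```javascript
// Hypothetical helper: count the set bits in a 32-bit unsigned integer.
function hammingWeightSketch(n) {
  n = n >>> 0; // coerce to an unsigned 32-bit value
  let count = 0;
  while (n !== 0) {
    n &= n - 1; // clear the lowest set bit
    count++;
  }
  return count;
}

// Hypothetical helper: the Hamming distance between two hashes is the
// weight of their XOR.
function hammingDistanceSketch(a, b) {
  return hammingWeightSketch((a ^ b) >>> 0);
}

console.log(hammingWeightSketch(0b1011)); // 3
console.log(hammingDistanceSketch(0b1100, 0b1010)); // 2
```

A 64-bit hash would need BigInt or two 32-bit halves, since JavaScript bitwise operators work on 32 bits only.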
Convert a string to a set of shingles, using the default of 2 words per shingle and tokenizing with the natural library's default tokenizer.
Compare two strings by tokenizing them and then comparing the intersection of their shingles to the union of their shingles.
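The shingling and Jaccard steps can be sketched as follows. This is an illustration only: it assumes a plain lowercase, whitespace-based tokenizer in place of the natural tokenizer, and `shinglesSketch` and `jaccardIndexSketch` are hypothetical names.

```javascript
// Hypothetical helper: build the set of k-word shingles of a string.
// Assumption: a lowercase/whitespace tokenizer stands in for the
// `natural` tokenizer.
function shinglesSketch(text, k = 2) {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const result = new Set();
  for (let i = 0; i + k <= words.length; i++) {
    result.add(words.slice(i, i + k).join(' '));
  }
  return result;
}

// Hypothetical helper: Jaccard index = |intersection| / |union| of shingles.
function jaccardIndexSketch(a, b, k = 2) {
  const sa = shinglesSketch(a, k);
  const sb = shinglesSketch(b, k);
  let intersection = 0;
  for (const s of sa) {
    if (sb.has(s)) intersection++;
  }
  const union = sa.size + sb.size - intersection;
  return union === 0 ? 0 : intersection / union;
}

// One shared shingle ("the quick") out of five distinct ones.
console.log(jaccardIndexSketch('the quick brown fox', 'the quick red fox')); // 0.2
```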
Print a 32-bit number as a binary string of 32 characters
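One way to produce such a string, shown here as a sketch with the hypothetical name `toBinaryString32`, is to coerce the value to an unsigned 32-bit integer and left-pad its base-2 representation with zeros:

```javascript
// Hypothetical helper: render a number as exactly 32 binary digits.
function toBinaryString32(n) {
  // >>> 0 reinterprets negative values as unsigned 32-bit integers.
  return (n >>> 0).toString(2).padStart(32, '0');
}

console.log(toBinaryString32(5));
console.log(toBinaryString32(-1));
```

Printing two hashes this way, one above the other, makes the differing bit positions easy to spot.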
Convert a set of shingles to a set of CRC-32 hashes.
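As an illustration of this step, here is a self-contained sketch that hashes each shingle with the common table-driven CRC-32 (reflected polynomial 0xEDB88320). Whether this library uses exactly this variant is an assumption, and `crc32Sketch` and `shingleHashesSketch` are hypothetical names.

```javascript
// Precompute the lookup table for the reflected CRC-32 polynomial.
const CRC_TABLE = new Uint32Array(256);
for (let i = 0; i < 256; i++) {
  let c = i;
  for (let bit = 0; bit < 8; bit++) {
    c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1);
  }
  CRC_TABLE[i] = c >>> 0;
}

// Hypothetical helper: CRC-32 of a string's UTF-8 bytes, as an unsigned integer.
function crc32Sketch(str) {
  const bytes = Buffer.from(str, 'utf8');
  let crc = 0xFFFFFFFF;
  for (const byte of bytes) {
    crc = CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >>> 8);
  }
  return (crc ^ 0xFFFFFFFF) >>> 0;
}

// Hypothetical helper: map a set of shingles to a set of CRC-32 hashes.
function shingleHashesSketch(shingleSet) {
  return new Set([...shingleSet].map(crc32Sketch));
}

console.log(shingleHashesSketch(new Set(['the quick', 'quick brown'])));
```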
Often you have a list of strings and want to check how close they are to each other.
getDistanceReport will produce a JSON report containing, for each text, the closest ones.
Parameters are the following:
- an array of textual objects; each object has a text property containing its string and a simhash property with the hash already calculated; feel free to add other properties, typically an ID
- the similarity threshold: if the similarity between two strings is greater than this threshold, the pair is added to the list of the closest ones; for instance, use 0.8 to report only texts that are at least 80% similar
- the maximum number of closest strings to include in the output (only the closest ones are kept)
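To make the three parameters concrete, here is a hypothetical sketch of such a report. It is not the library's getDistanceReport implementation. It assumes that each simhash is a 64-bit BigInt and that similarity is 1 minus the Hamming distance divided by 64. The `similarity` field and all function names are illustrative.

```javascript
const HASH_BITS = 64; // assumption: 64-bit hashes stored as BigInt

// Hypothetical helper: count the set bits of a non-negative BigInt.
function popcountBigInt(x) {
  let count = 0;
  while (x > 0n) {
    count += Number(x & 1n);
    x >>= 1n;
  }
  return count;
}

// Assumed similarity measure: share of hash bits that are identical.
function similaritySketch(a, b) {
  return 1 - popcountBigInt(a ^ b) / HASH_BITS;
}

// For each item, list the other items whose similarity exceeds the
// threshold, most similar first, keeping at most maxClosest of them.
function distanceReportSketch(items, threshold, maxClosest) {
  return items.map((item) => {
    const closestOnes = items
      .filter((other) => other !== item)
      .map((other) => ({
        with: other,
        similarity: similaritySketch(item.simhash, other.simhash),
      }))
      .filter((entry) => entry.similarity > threshold)
      .sort((x, y) => y.similarity - x.similarity)
      .slice(0, maxClosest);
    return { for: item, closestOnes };
  });
}

const items = [
  { id: 'a', text: 'first text', simhash: 0b1111n },
  { id: 'b', text: 'second text', simhash: 0b1110n },
  { id: 'c', text: 'third text', simhash: 0xFFFFFFFFFFFFFFFFn },
];
const report = distanceReportSketch(items, 0.8, 2);
console.log(report[0].closestOnes.map((entry) => entry.with.id)); // [ 'b' ]
```

The pairwise loop is quadratic in the number of texts, which is acceptable for small lists but worth keeping in mind for large ones.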
The output is an array of objects:
for: reference to the textual object
closestOnes: an array with the closest elements; each object points to an element (with property) and gives the distance (