SimHash algorithm, introduced by Charikarand is patented by Google.
Simhash 5 steps: Tokenize, Hash, Weigh Values, Merge, Dimensionality Reduction
-
tokenize
tokenize your data, assign weights to each token, weights and tokenize function are depend on your business
-
hash (md5, SHA1)
calculate token's hash value and convert it to binary (101011 )
-
weigh values
for each hash value, do hash*w, in this way: (101011 ) -> (w,-w,w,-w,w,w)
-
merge
add up tokens' values, to merge to 1 hash, for example, merge (4 -4 -4 4 -4 4) and (5 -5 5 -5 5 5) , results to (4+5 -4+-5 -4+5 4+-5 -4+5 4+5),which is (9 -9 1 -1 1)
-
Dimensionality Reduction
Finally, signs of elements of
V
corresponds to the bits of the final fingerprint, for example (9 -9 1 -1 1) -> (1 0 1 0 1), we get 10101 as the fingerprint.
How to use SimHash fingerprints?
Hamming distance can be used to find the similarity between two given data, calculate the Hamming distance between 2 fingerprints.
Based on my experience, for 64 bit SimHash values, with elaborate weight values, distance of similar data
often differ appreciably in magnitude from those unsimilar data.
how to calculate:XOR, 只有两个位不同时结果是1 ,否则为0,两个二进制value“异或”后得到1的个数 为海明距离 。