algorithm - Detecting duplicate webpages among a large number of URLs
From a quote on the official Google blog:
"In fact, we found even more than 1 trillion individual links, but not all of them lead to unique web pages. Many pages have multiple URLs with the same content, or URLs that are auto-generated copies of each other. After removing those exact duplicates . . ."
How does Google detect exact duplicate web pages or documents? Any idea what algorithm Google uses?
According to http://en.wikipedia.org/wiki/minhash:
"A large-scale evaluation was conducted by Google in 2006 [10] to compare the performance of the MinHash and SimHash [11] algorithms. In 2007 Google reported using SimHash for duplicate detection in web crawling [12] and using MinHash and LSH for Google News personalization. [13]"
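For context on the MinHash side, here is a minimal sketch, not Google's implementation; the md5-based per-seed hash and the 128 signature positions are assumptions for illustration. Each document is treated as a set of tokens, and the fraction of positions where two signatures agree estimates their Jaccard similarity:

    import hashlib

    def minhash_signature(tokens, num_hashes=128):
        """MinHash signature: for each of num_hashes seeded hash functions,
        keep the minimum hash value over all tokens in the set."""
        signature = []
        for seed in range(num_hashes):
            signature.append(min(
                int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
                for t in tokens))
        return signature

    def estimated_jaccard(sig_a, sig_b):
        """Fraction of matching signature positions estimates Jaccard similarity."""
        matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
        return matches / len(sig_a)

    a = set("the quick brown fox jumps over the lazy dog".split())
    b = set("the quick brown fox leaps over the lazy cat".split())
    print(estimated_jaccard(minhash_signature(a), minhash_signature(b)))  # ~0.6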
A search for simhash turns up these pages:
https://liangsun.org/posts/a-python-implementation-of-simhash-algorithm/
https://github.com/leonsim/simhash
which reference a paper written by Google employees: Detecting Near-Duplicates for Web Crawling
Abstract:
Near-duplicate web documents are abundant. Two such documents may differ from each other only in a small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
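The abstract describes Charikar's simhash fingerprints plus a Hamming-distance test (two pages are near-duplicates when their f-bit fingerprints differ in at most k bit positions). A minimal sketch of that idea follows; the regex tokenization, the md5-based token hash, and the toy pages are assumptions for illustration, not the paper's exact scheme:

    import hashlib
    import re

    def simhash(text, f=64):
        """f-bit simhash (Charikar's scheme): hash each token to f bits,
        accumulate +1/-1 per bit position over all tokens, keep the sign."""
        tokens = re.findall(r"\w+", text.lower())
        v = [0] * f
        for token in tokens:
            h = int(hashlib.md5(token.encode()).hexdigest(), 16) & ((1 << f) - 1)
            for i in range(f):
                v[i] += 1 if (h >> i) & 1 else -1
        fingerprint = 0
        for i in range(f):
            if v[i] > 0:
                fingerprint |= 1 << i
        return fingerprint

    def hamming_distance(a, b):
        """Number of bit positions in which two fingerprints differ."""
        return bin(a ^ b).count("1")

    page_a = "breaking news: markets rally as tech shares surge on strong earnings"
    page_b = "breaking news: markets rally as tech shares surge on strong earnings reports"
    page_c = "recipe for a classic sourdough loaf with a long cold fermentation"

    # Near-duplicates differ in only a few bits; unrelated pages differ in about f/2.
    print(hamming_distance(simhash(page_a), simhash(page_b)))
    print(hamming_distance(simhash(page_a), simhash(page_c)))

The paper's second contribution, quickly finding all stored fingerprints within Hamming distance k of a query fingerprint, is handled there with sorted tables of bit-permuted fingerprints rather than the pairwise comparison shown here.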
Another simhash paper:
http://simhash.googlecode.com/svn/trunk/paper/simhashwithbib.pdf