algorithm - Detecting duplicate webpages among large number of URLs


From a quote on the Google blog:

"In fact, we found even more than 1 trillion individual links, but not all of them lead to unique web pages. Many pages have multiple URLs with exactly the same content or URLs that are auto-generated copies of each other. Even after removing those exact duplicates . . . "

How does Google detect those exact duplicate webpages or documents? Any idea on which algorithm Google uses?

According to http://en.wikipedia.org/wiki/minhash:

A large scale evaluation has been conducted by Google in 2006 [10] to compare the performance of MinHash and SimHash[11] algorithms. In 2007 Google reported using SimHash for duplicate detection for web crawling[12] and using MinHash and LSH for Google News personalization.[13]
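To make the MinHash side of that comparison concrete, here is a minimal sketch of how a MinHash signature can estimate the Jaccard similarity of two documents. It is only an illustration, not Google's implementation: the function names, the 3-word shingles, the 64 hash functions, and the use of seeded MD5 as the hash family are all assumptions of mine.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k=3 is an arbitrary choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def seeded_hash(seed, item):
    """64-bit hash of item under a given seed (seeded MD5 stands in for a hash family)."""
    digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash value over the set."""
    return [min(seeded_hash(seed, item) for item in items)
            for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the old mill"
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
print(estimated_jaccard(sig_a, sig_b))
```

Two documents with identical shingle sets always get identical signatures, so exact duplicates score 1.0. More hash functions give a tighter estimate at the cost of a longer signature.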

A search for SimHash turns up this page:

https://liangsun.org/posts/a-python-implementation-of-simhash-algorithm/

https://github.com/leonsim/simhash

which references a paper written by Google employees: Detecting Near-Duplicates for Web Crawling

Abstract:

Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation on real data confirms the practicality of our design.
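The two ideas in the abstract (an f-bit fingerprint, and a test for "differs in at most k bit-positions") can be sketched in a few lines of Python. This is a rough illustration of a Charikar-style SimHash, not the code from the paper or from the linked library: the function names are hypothetical, tokens are unweighted lowercase words, MD5 is an arbitrary choice of token hash, and f=64 with k=3 are assumed defaults.

```python
import hashlib


def simhash(text, f=64):
    """f-bit SimHash fingerprint over unweighted word tokens.

    Assumes f is a multiple of 8 and at most 128 (the width of an MD5 digest).
    """
    counts = [0] * f
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:f // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for bit in range(f):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    # The fingerprint keeps the bits where the positive votes won.
    fingerprint = 0
    for bit in range(f):
        if counts[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(fp_a, fp_b, k=3):
    """True if the fingerprints differ in at most k bit positions."""
    return hamming_distance(fp_a, fp_b) <= k


page_a = "breaking news story about the election results tonight"
page_b = "breaking news story about the election results tonight advertisement"
print(hamming_distance(simhash(page_a), simhash(page_b)))
```

Comparing a new fingerprint against every stored one like this is a linear scan, which would not scale to a multi-billion page repository. The paper's second contribution is precisely a faster way to find all stored fingerprints within k bits of a query, which this sketch does not attempt.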

Another SimHash paper:

http://simhash.googlecode.com/svn/trunk/paper/simhashwithbib.pdf

