Googlers Talk About Dupe Detection and How They Choose the Canonical Page
From the latest episode of the Search Off the Record podcast: http://search-off-the-record.googledevelopers.libsynpro.com
Some highlights:
- Google essentially calculates several kinds of checksums over the textual content of each page and then compares the checksums, because that is much easier and less resource-intensive than comparing the actual text of every duplicate variation. This is set up to catch both duplicate and near-duplicate pages.
- Pages found to be duplicates or near-duplicates are then grouped into a dupe cluster.
- Then they use about twenty signals to decide which page to pick as canonical from a dupe cluster.
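
To make the checksum-and-cluster idea in the highlights more concrete, here is a minimal Python sketch. It is not Google's implementation: the podcast summary does not name the algorithms, so the word-shingle fingerprint, the Jaccard comparison, the 0.7 threshold, and every function name below are assumptions chosen for illustration. Canonical selection is left out because the roughly twenty signals are not listed here.

```python
import hashlib
import re


def normalize(text):
    """Lowercase the text and keep only alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_checksum(text):
    """One checksum over the whole normalized text: catches exact duplicates."""
    joined = " ".join(normalize(text))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def shingle_checksums(text, k=3):
    """A set of checksums, one per k-word shingle: used for near-duplicates.

    Hypothetical fingerprint; comparing these short hashes stands in for
    comparing the text itself.
    """
    words = normalize(text)
    count = max(1, len(words) - k + 1)
    return {
        hashlib.sha1(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()[:16]
        for i in range(count)
    }


def similarity(a, b):
    """Jaccard similarity of two checksum sets (1.0 means identical sets)."""
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.7):
    """Greedily group pages into dupe clusters.

    A page joins the first cluster whose first member has the same exact
    checksum or a similar enough shingle fingerprint; otherwise it starts
    a new cluster. The threshold is an assumed value.
    """
    clusters = []
    for url, text in pages.items():
        checksum = exact_checksum(text)
        fingerprint = shingle_checksums(text)
        for cluster in clusters:
            if (checksum == cluster["checksum"]
                    or similarity(fingerprint, cluster["fingerprint"]) >= threshold):
                cluster["urls"].append(url)
                break
        else:
            clusters.append(
                {"checksum": checksum, "fingerprint": fingerprint, "urls": [url]}
            )
    return [cluster["urls"] for cluster in clusters]


# Made-up example pages, keyed by hypothetical URL paths.
pages = {
    "/a": "The quick brown fox jumps over the lazy dog near the river bank",
    "/b": "the quick brown fox jumps over the lazy dog near the river bank!",
    "/c": "The quick brown fox jumps over the lazy dog near the river shore",
    "/d": "Completely different article about baking sourdough bread at home",
}
print(cluster_pages(pages))  # [['/a', '/b', '/c'], ['/d']]
```

In this sketch "/b" is an exact duplicate of "/a" after normalization, "/c" is a near-duplicate (10 of 12 distinct shingles shared), and "/d" ends up in its own cluster. A production system would more likely use a compact fingerprint that avoids pairwise comparison, but that goes beyond what the highlights state.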