Googlers Talk About Their Indexing System Caffeine
http://search-off-the-record.googledevelopers.libsynpro.com
This was a really cool episode with Martin, John and Jerry. They talk about Google’s indexing system, Caffeine. Here are some key points:
- it normalizes the HTML for processing (PDFs, spreadsheets, Word documents, Lotus files, etc. are also converted to HTML for processing)
- it gets the content of the header tags and looks at how each one is styled to judge their relative importance
- it looks for some meta tags: robots meta tags, etc.
- if it finds body-related HTML tags in the head, such as div, p, span or iframe, it closes the head just before that tag and starts processing the body from there
- collapsor: the system that does error page handling. It tries to detect whether a page is a 404 page, so if you have content that this system might mistake for an error page, chances are the page cannot be indexed by Google. (So if you hate someone, drop some random "404 page not found" messages into their page and watch them pull out their hair trying to get those pages indexed.) This system can also flag an out-of-stock product page on an e-commerce site as a soft 404 page, depending on the words you use on the page.
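
To make the last two points easier to check on your own pages, here is a rough Python sketch. It is not how Caffeine works internally; it is only a small audit helper written under a few assumptions. The `HEAD_TAGS` set is my own shortlist of tags that belong in the head, the `ERROR_PHRASES` list is purely illustrative (Google does not publish the signals it uses), and the names `HeadAudit` and `error_phrases_found` are made up for this example. It only uses `html.parser` from the Python standard library.

```python
from html.parser import HTMLParser

# Assumption: tags that belong in <head>; anything else counts as body content.
HEAD_TAGS = {"title", "meta", "link", "script", "style", "base", "noscript", "template"}

# Assumption: illustrative phrases only; the real soft-404 signals are not public.
ERROR_PHRASES = ("404", "page not found", "no longer available", "out of stock")


class HeadAudit(HTMLParser):
    """Finds the first body-only tag inside <head> and any meta tags after it."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.breaking_tag = None      # first tag in <head> that does not belong there
        self.metas_after_break = []   # meta names that appear after that tag

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif self.in_head and self.breaking_tag is None and tag not in HEAD_TAGS:
            # This is the point where the head would be treated as closed.
            self.breaking_tag = tag
        elif self.in_head and self.breaking_tag is not None and tag == "meta":
            name = dict(attrs).get("name")
            if name:
                self.metas_after_break.append(name)

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def error_phrases_found(text):
    """Returns the illustrative error phrases that occur in the visible text."""
    lowered = text.lower()
    return [phrase for phrase in ERROR_PHRASES if phrase in lowered]


if __name__ == "__main__":
    page = (
        "<html><head><title>Shop</title><div>banner</div>"
        '<meta name="robots" content="noindex"></head>'
        "<body><p>Sorry, this item is out of stock.</p></body></html>"
    )
    audit = HeadAudit()
    audit.feed(page)
    print("head breaks at:", audit.breaking_tag)
    print("meta tags after the break:", audit.metas_after_break)
    print("risky phrases:", error_phrases_found("Sorry, this item is out of stock."))
```

If `breaking_tag` is set, any meta tags listed in `metas_after_break` sit after the point where the head would be closed, so they are worth moving above the stray tag. A non-empty result from `error_phrases_found` does not mean the page will be treated as a soft 404; it only tells you which wording to take a second look at.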