Mind what you say in Facebook comments: Google will soon be indexing them and serving them up as part of the company’s standard search results. Google’s all-seeing search robots still can’t find comments on private pages within Facebook, but now any time you use a Facebook comment form on another site, or on a public page within Facebook, those comments will be indexed by Google.
Typically when Google announces it’s going to expand its search index in some way everyone is happy — sites get more searchable content into Google and users can find more of what they’re looking for — but that’s not the case with the latest changes to Google’s indexing policy.
Developers are upset because Google is no longer the passive crawler it once was, and users will likely become upset once they realize that comments about drunken parties, embarrassing moments, or what they thought were private details are going to start showing up next to their names in Google’s search results.
For now most of the ire seems limited to concerned web developers worried that Google’s new indexing plan ignores the HTML specification and breaks the web’s underlying architecture. To understand what Google is planning to do and why it breaks one of the fundamental gentleman’s agreements of the web, you first have to understand how various web requests work.
There are two primary requests you can initiate on the web — GET and POST. In a nutshell, GET requests are intended for reading data, POST requests for changing or adding data. That’s why search engine robots like Google’s have always stuck to GET crawling. There’s no danger of the Googlebot altering a site’s data with GET; it just reads the page without ever touching the actual data. Now that Google is crawling POST pages, the Googlebot is no longer a passive observer — it’s actually interacting with, and potentially altering, the websites it crawls.
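The distinction can be sketched in a few lines of Python. This is a toy in-process model, not a real web server: the `comments` list and `handle` function are hypothetical stand-ins for a site’s datastore and request handler, chosen to show why a GET-only crawler is harmless while a POST-issuing one is not.

```python
# Toy model of a site's request handling: GET reads state, POST mutates it.
comments = ["first!"]  # the site's stored data

def handle(method, body=None):
    """Hypothetical request handler: GET is read-only, POST writes."""
    if method == "GET":
        return list(comments)       # returns a copy; data untouched
    if method == "POST":
        comments.append(body)       # alters the site's data
        return list(comments)
    raise ValueError("unsupported method")

# A GET-only crawler can fetch the page forever without changing anything.
before = list(comments)
handle("GET")
assert comments == before

# A crawler that performs POSTs is no longer a passive observer.
handle("POST", "spam from a robot")
assert comments != before
```

The worry developers voice above is exactly this asymmetry: once a robot starts issuing POSTs, every request is a potential write.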
While it’s unlikely that the new Googlebot will alter a site’s data — as the Google Webmaster Blog writes, “Googlebot may now perform POST requests when we believe it’s safe and appropriate” — it’s certainly possible now and that’s what worries some developers. As any webmaster knows, mistakes happen, especially when robots are involved, and no one wants to wake up one day to discover that the Googlebot has wreaked havoc across their site.
If you’d like to stop the Googlebot from crawling your site’s forms, Google suggests using the robots.txt file to disallow the Googlebot on any POST URLs your site might have. So long as you’re surfacing your content in other ways — and you should be, provided you want it indexed — there shouldn’t be any harm in blocking the Googlebot from POST requests.
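A minimal robots.txt along the lines Google suggests might look like the following — the paths here are hypothetical examples; you would substitute whatever URLs actually receive POST submissions on your own site:

```txt
# robots.txt — block Googlebot from form-handling URLs
# (example paths only; use your site's actual POST endpoints)
User-agent: Googlebot
Disallow: /comments/submit
Disallow: /search/results
```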
If, on the other hand, you’d like to stop the Googlebot from indexing any embarrassing comments you may have left on the web, well, you’re out of luck.
[Photo by Glen Scott/Flickr/CC]