Matt Cutts, head of Web spam at Google, posted a very useful and informative video on the Webmaster channel on YouTube about how Google can index URLs that are specifically blocked by a site’s robots.txt file.
A robots.txt file tells the search engines what to crawl and what not to crawl. If a page isn’t covered by a “don’t crawl this” rule (a “Disallow” line) in the robots.txt file, search engines are free to crawl it, and it will most likely end up indexed.
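To make the “Disallow” idea concrete, here is a minimal sketch using Python’s standard `urllib.robotparser` module. The robots.txt content, the `example.com` URLs, and the `may_crawl` helper are all made up for illustration. Note that it only answers whether a crawler is allowed to fetch a page; it says nothing about whether the URL can show up in search results.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() takes the file's lines, so no network request is needed here.
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url, user_agent="Googlebot"):
    """Return True if the rules above let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


# The example.com URLs below are placeholders.
print(may_crawl("https://example.com/private/report.html"))  # blocked by Disallow
print(may_crawl("https://example.com/blog/post.html"))       # no rule matches
```

The first URL falls under the `/private/` rule, so a well-behaved crawler will not fetch it. The second matches no rule and is fair game.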
So here’s what Matt said. Google gets messages from angry Webmasters who say that Google violated their robots.txt file because they see their link in the search results. Matt explains that Google does indeed obey the robots.txt file, but if people are linking to that particular page with helpful anchor text, Google will index the URL without actually going to the page, because the page has some value to people. Oftentimes the result will just be a bare link without a description. If the page is listed in the Open Directory Project (DMOZ) or in the Yahoo Directory, then Google might use the description from there and add it to the link.
In both instances Google has not gone to that page. The video is below. It’s worth the 4 1/2 minutes to watch it. It’s very informative, and Matt even gives some suggestions on how to get the URL out of the search index completely. But to be absolutely honest, if people are linking to a page that is blocked by your robots.txt file because it has some value, maybe you should open it up and let Google crawl it. It would most likely add value to your site.