Robots.txt Parser

Parse your robots.txt file the same way Google's crawlers do

Choose a Googlebot, paste your robots.txt file into the text area, and enter the path you'd like to check.

You must ensure that the path you wish to check follows the format specified by RFC 3986, since this library will not perform full normalization of URI parameters. Only if the URI is in this format will the matching be done according to the REP specification. This mirrors the behaviour of Google's open source project.
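For example, a path containing spaces or non-ASCII characters should be percent-encoded before you check it. The sketch below is one minimal way to do that in C++ (the same language as Google's open source parser); the helper name and the exact set of characters left unencoded are illustrative choices, not part of this tool or Google's library.

    #include <cctype>
    #include <cstdio>
    #include <cstring>
    #include <iostream>
    #include <string>

    // Illustrative helper (not part of this tool or Google's library):
    // percent-encode a raw path into an RFC 3986-style form before pasting it
    // into the checker.
    std::string PercentEncodePath(const std::string& raw) {
      std::string out;
      for (unsigned char c : raw) {
        // Keep unreserved characters and common path/query delimiters as-is.
        // Deliberately minimal: a literal '%' is passed through untouched.
        if (std::isalnum(c) || std::strchr("-._~/?&=#%", c) != nullptr) {
          out.push_back(static_cast<char>(c));
        } else {
          char buf[4];
          std::snprintf(buf, sizeof(buf), "%%%02X", c);
          out += buf;
        }
      }
      return out;
    }

    int main() {
      // "/café/page 1" becomes "/caf%C3%A9/page%201".
      std::cout << PercentEncodePath("/café/page 1") << std::endl;
      return 0;
    }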

Why does this exist?

  1. The old Search Console robots.txt tester differs from real Googlebot behaviour, and we expect it to be deprecated at some point.
  2. Google published an open source project containing the code their crawlers use to parse robots.txt, but:
    1. It needs to be compiled, which requires at least a modicum of C++ skills
    2. It doesn't contain the Google-specific logic that googlebot-image and other Google crawlers use

This site's own robots.txt file exposes the differences from each of these. You can copy and paste it here to see that both googlebot and googlebot-image should be DISALLOWED from crawling /bar/. This differs from the Search Console checker (in the googlebot case) and from the open source project (in the googlebot-image case).

How does this tool differ from Google's open source project?

Apart from some minor tweaks to make it available on the web, the only substantive change is that it accepts an ordered tuple of user agents as a comma-separated pair (e.g. googlebot-image,googlebot) in order to mimic how Google's crawlers behave in the wild. You can read the documentation for how they should work here. The key practical differences are that some Google crawlers fall back on googlebot directives if a ruleset for their own user agent is missing, and some obey only directives that target them specifically, ignoring User-agent: * rulesets.
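As a hedged illustration of that difference, consider a hypothetical robots.txt in the same spirit as this site's file (not its actual contents) with a googlebot group but no googlebot-image group. The sketch below assumes the googlebot::RobotsMatcher API from google/robotstxt and shows how the stock library treats googlebot-image in that situation.

    #include <iostream>
    #include <string>

    #include "robots.h"  // google/robotstxt, assumed available on the include path

    int main() {
      // Hypothetical file: a googlebot group exists, but no googlebot-image group.
      const std::string robots_txt =
          "User-agent: googlebot\n"
          "Disallow: /bar/\n"
          "\n"
          "User-agent: *\n"
          "Allow: /\n";
      const std::string url = "https://example.com/bar/photo.jpg";

      googlebot::RobotsMatcher matcher;
      // Stock library: googlebot-image matches no specific group, so only the
      // catch-all group applies and the URL is reported as allowed.
      std::cout << std::boolalpha
                << matcher.OneAgentAllowedByRobots(robots_txt, "googlebot-image", url)
                << std::endl;  // true

      // This tool's googlebot-image,googlebot fallback instead applies the
      // googlebot group, so it reports the URL as DISALLOWED, matching how the
      // image crawler behaves in the wild.
      googlebot::RobotsMatcher fallback;
      std::cout << std::boolalpha
                << fallback.OneAgentAllowedByRobots(robots_txt, "googlebot", url)
                << std::endl;  // false
      return 0;
    }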

We have verified this behaviour against real googlebot-image behaviour in the wild and assume it holds for the other Google crawlers that are described as operating this way.

Who are you anyway?

I'm Will Critchlow, CEO of SearchPilot and SEO Partner at Brainlabs, and I am unhealthily interested in how robots.txt files work.

You can follow me on Twitter here.

What else do we need to know to use this tool?

If you select Googlebot Image, News or Video, the tool will run the underlying parser with the input googlebot-<subtype>,googlebot, which first seeks a robots.txt ruleset targeting the specific crawler. Only if that is not present will it parse the robots.txt file as googlebot. The same happens if you enter googlebot-image or similar in the "other" box. You may also provide a comma-separated tuple of user agents (with no spaces) in the "other" box, which behaves as described above: it seeks a ruleset targeting the first user agent and parses as the second if none is found.
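A hedged sketch of that two-step lookup follows, again assuming the googlebot::RobotsMatcher API from google/robotstxt (including its ever_seen_specific_agent() accessor); the AllowedWithFallback helper is hypothetical, not this tool's actual code.

    #include <string>

    #include "robots.h"  // google/robotstxt, assumed available on the include path

    // Hypothetical helper (not this tool's actual code): mimic the
    // googlebot-<subtype>,googlebot tuple by using the specific crawler's
    // ruleset when one exists and otherwise re-parsing as the fallback agent.
    bool AllowedWithFallback(const std::string& robots_txt,
                             const std::string& specific_agent,  // e.g. "googlebot-image"
                             const std::string& fallback_agent,  // e.g. "googlebot"
                             const std::string& url) {
      googlebot::RobotsMatcher matcher;
      const bool specific_allowed =
          matcher.OneAgentAllowedByRobots(robots_txt, specific_agent, url);
      // ever_seen_specific_agent() reports whether any group explicitly targeted
      // the agent we asked about (the catch-all * group does not count).
      if (matcher.ever_seen_specific_agent()) {
        return specific_allowed;
      }
      // No ruleset targets the specific crawler; parse again as the fallback agent.
      googlebot::RobotsMatcher fallback_matcher;
      return fallback_matcher.OneAgentAllowedByRobots(robots_txt, fallback_agent, url);
    }

Called as AllowedWithFallback(robots_txt, "googlebot-image", "googlebot", url), this reproduces the behaviour described above for the Image, News and Video options.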

If you select an AdsBot or the AdSense option (user agent mediapartners-google), the tool will only respect rulesets that specifically target that user agent and will ignore User-agent: * blocks.
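For example, under a file that contains only a catch-all group, the stock parser blocks mediapartners-google, whereas this tool's handling treats it as allowed because no ruleset names it explicitly. A hedged sketch, again assuming the googlebot::RobotsMatcher API from google/robotstxt:

    #include <iostream>
    #include <string>

    #include "robots.h"  // google/robotstxt, assumed available on the include path

    int main() {
      // Hypothetical file with only a catch-all group.
      const std::string robots_txt =
          "User-agent: *\n"
          "Disallow: /\n";
      const std::string url = "https://example.com/page.html";

      googlebot::RobotsMatcher matcher;
      // Stock library: mediapartners-google falls under the * group, so the URL
      // is reported as disallowed.
      std::cout << std::boolalpha
                << matcher.OneAgentAllowedByRobots(robots_txt, "mediapartners-google", url)
                << std::endl;  // false
      // This tool's AdsBot/AdSense handling instead reports the URL as allowed,
      // because no ruleset targets mediapartners-google specifically.
      return 0;
    }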

More Information