Robots.txt Parser

Parse your robots.txt file the same way Google's crawlers do

Choose a Googlebot, enter your robots.txt file in the text area and enter the path you'd like to check.

You must ensure that the path you wish to check follows the format specified by RFC 3986, because this library will not perform full normalization of URI parameters. Only if the URI is in this format will the matching be done according to the REP specification. This is exactly as per Google's open source project.
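As an illustration of what that means in practice, here is a minimal sketch that calls the open source matcher directly (the RobotsMatcher class and OneAgentAllowedByRobots method are from Google's google/robotstxt C++ library; the robots.txt body and URL are made-up examples). The path is matched exactly as supplied, already percent-encoded, with no normalization done for you:

    #include <iostream>
    #include <string>

    #include "robots.h"  // from github.com/google/robotstxt

    int main() {
      // Hypothetical robots.txt body with an already percent-encoded path.
      const std::string robots_body =
          "User-agent: *\n"
          "Disallow: /caf%C3%A9/\n";

      googlebot::RobotsMatcher matcher;

      // The URL must already be in RFC 3986 form; the library will not turn
      // "/café/menu" into "/caf%C3%A9/menu" for you.
      bool allowed = matcher.OneAgentAllowedByRobots(
          robots_body, "googlebot", "https://example.com/caf%C3%A9/menu");
      std::cout << (allowed ? "allowed" : "disallowed") << "\n";  // disallowed
      return 0;
    }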

Why does this exist?

  1. It used to be possible to test specific robots.txt configurations in Search Console, but the tester did not match real googlebot behaviour, and eventually Google chose to deprecate that tool.
  2. Google published an open source project containing the code their crawlers use to parse robots.txt, but:
    1. It needs to be compiled, which requires at least a modicum of C++ skills
    2. It doesn't contain the Google-specific logic that googlebot-image and other Google crawlers use

This site's own robots.txt file exposes the problem of using the open source checker without a wrapper. You can copy-and-paste it here to see that both googlebot and googlebot-image should be DISALLOWED from crawling /bar/. Without handling this kind of case specifically, the open source project will not get this right.
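The shape of that case can be reproduced with the unmodified open source library. The robots.txt body below is a hypothetical file of the same kind, not a copy of this site's actual file:

    #include <iostream>
    #include <string>

    #include "robots.h"  // from github.com/google/robotstxt

    int main() {
      // Hypothetical file: /bar/ is closed to googlebot, and there is no
      // googlebot-image group of its own.
      const std::string robots_body =
          "User-agent: *\n"
          "Allow: /\n"
          "\n"
          "User-agent: googlebot\n"
          "Disallow: /bar/\n";
      const std::string url = "https://example.com/bar/";

      googlebot::RobotsMatcher googlebot_matcher;
      googlebot::RobotsMatcher image_matcher;

      // googlebot has its own group, so it is disallowed: prints "disallowed".
      std::cout << (googlebot_matcher.OneAgentAllowedByRobots(robots_body, "googlebot", url)
                        ? "allowed" : "disallowed") << "\n";

      // googlebot-image has no group of its own, so the raw library falls back to
      // the "User-agent: *" group and prints "allowed", even though the real
      // crawler falls back to the googlebot group and stays out of /bar/.
      std::cout << (image_matcher.OneAgentAllowedByRobots(robots_body, "googlebot-image", url)
                        ? "allowed" : "disallowed") << "\n";
      return 0;
    }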

How does this tool differ from Google's open source project?

Apart from some minor tweaks to make it available on the web, the only substantive change is that it can take an ordered tuple of user agents as a comma-separated pair (e.g. googlebot-image,googlebot) in order to mimic how Google's crawlers behave in the wild. You can read the documentation for how they should work here. The key practical differences this makes are that some Google crawlers fall back on googlebot directives if a ruleset for their own user agent is missing, and some obey only directives that target them specifically, ignoring User-agent: * rulesets.
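As a rough sketch of what the ordered tuple does, here is a wrapper around the unmodified library rather than the internal change the tool actually makes. HasGroupFor and AllowedWithFallback are hypothetical names, and the group detection is a deliberately naive line scan:

    #include <string>

    #include "absl/strings/ascii.h"
    #include "absl/strings/match.h"
    #include "absl/strings/str_split.h"
    #include "absl/strings/string_view.h"
    #include "robots.h"  // from github.com/google/robotstxt

    // Rough check: does any "User-agent:" line name this product token?
    // (The real tool hooks into the parser itself; this is only an approximation.)
    bool HasGroupFor(const std::string& robots_body, const std::string& agent) {
      for (absl::string_view line : absl::StrSplit(robots_body, '\n')) {
        line = absl::StripAsciiWhitespace(line);
        if (!absl::StartsWithIgnoreCase(line, "user-agent:")) continue;
        absl::string_view token =
            absl::StripAsciiWhitespace(line.substr(sizeof("user-agent:") - 1));
        if (absl::EqualsIgnoreCase(token, agent)) return true;
      }
      return false;
    }

    // Ordered-tuple semantics: obey a ruleset for `specific` if one exists,
    // otherwise evaluate the whole file as `fallback`
    // (e.g. specific = "googlebot-image", fallback = "googlebot").
    bool AllowedWithFallback(const std::string& robots_body,
                             const std::string& specific,
                             const std::string& fallback,
                             const std::string& url) {
      const std::string& agent =
          HasGroupFor(robots_body, specific) ? specific : fallback;
      googlebot::RobotsMatcher matcher;
      return matcher.OneAgentAllowedByRobots(robots_body, agent, url);
    }

With the hypothetical /bar/ file shown earlier, AllowedWithFallback(robots_body, "googlebot-image", "googlebot", "https://example.com/bar/") reports the URL as disallowed, which is the behaviour described in this section.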

We have verified this against real googlebot-image behaviour in the wild, and assume it holds for the other Google crawlers that are documented as operating this way.

Who are you anyway?

I'm Will Critchlow. I am CEO of SearchPilot and I am unhealthily interested in how robots.txt files work.

You can follow me on Twitter here or email me: [email protected].

What else do we need to know to use this tool?

If you select Googlebot Image, News or Video, the tool runs the underlying parser with the input googlebot-<subtype>,googlebot, which first seeks a robots.txt ruleset targeting the specific crawler. Only if that is not present will it parse the robots.txt file as googlebot. The same happens if you enter googlebot-image or similar in the "other" box. You may also provide a comma-separated tuple of user agents (with no spaces) in the "other" box, which behaves as described above: it seeks a ruleset targeting the first user agent and parses as the second user agent if none is found.

If you select an AdsBot or the AdSense option (user-agent mediapartners-google), the parser will only respect rulesets that specifically target that user agent and will ignore User-agent: * blocks.
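To make the difference concrete, here is a hypothetical blanket-disallow file queried with the unmodified open source library, which does not implement this Google-specific behaviour:

    #include <iostream>
    #include <string>

    #include "robots.h"  // from github.com/google/robotstxt

    int main() {
      // Hypothetical file: everything is closed off, but only via the wildcard group.
      const std::string robots_body =
          "User-agent: *\n"
          "Disallow: /\n";
      const std::string url = "https://example.com/landing-page";

      // The raw library, asked naively about Mediapartners-Google, applies the
      // "User-agent: *" group and prints "disallowed".
      googlebot::RobotsMatcher matcher;
      std::cout << (matcher.OneAgentAllowedByRobots(robots_body, "mediapartners-google", url)
                        ? "allowed" : "disallowed") << "\n";

      // Because the real AdSense and AdsBot crawlers ignore wildcard groups, this
      // tool instead reports the URL as allowed for them.
      return 0;
    }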

More Information