Parse your robots.txt file the same way Google's crawlers do
Choose a Googlebot, enter your robots.txt file in the text area and enter the path you'd like to check.
The path you check must follow the format specified by RFC 3986, because this library does not fully normalize URIs. Matching follows the REP specification only when the URI is in that format. This is exactly the behaviour of Google's open source project.
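As a rough illustration of what getting a path into RFC 3986 form can involve, the sketch below percent-encodes a path with Python's standard urllib.parse.quote. The helper name percent_encode_path is hypothetical and is not part of this tool or of Google's project. It handles only the path component (a "?" would be escaped too), and it does not attempt full URI normalization.

```python
from urllib.parse import quote

def percent_encode_path(path):
    """Percent-encode characters that RFC 3986 does not allow raw in a path.

    Hypothetical helper, for illustration only.
    """
    # '/' separates segments; '%' is kept so existing escapes such as %20
    # are not double-encoded; the rest are RFC 3986 sub-delims plus ':' and '@'.
    return quote(path, safe="/%:@!$&'()*+,;=")

print(percent_encode_path("/café/a b"))  # /caf%C3%A9/a%20b
print(percent_encode_path("/a%20b"))     # /a%20b
```

Keeping "%" in the safe set is a deliberate choice here: it avoids double-encoding paths that are already escaped, at the cost of passing through a stray literal "%" unchanged.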
Why does this exist?
- It used to be possible to test specific robots.txt configurations in Search Console, but the tester did not match real googlebot behaviour, and eventually Google chose to deprecate that tool.
- Google published an open source project containing the code their crawlers use to parse robots.txt but:
  - It needs to be compiled, which requires at least a modicum of C++ skills
  - It doesn't contain the Google-specific logic that googlebot-image and other Google crawlers use
This site's own robots.txt file exposes the problem of using the open source checker without a wrapper. You can copy-and-paste it here to see that googlebot-image should be DISALLOWED from crawling /bar/. Without handling this kind of case specifically, the open source project will not get this right.
How does this tool differ from Google's open source project?
Apart from some minor tweaks to make it available on the web, the only substantive change is to allow it to take an ordered tuple of user agents as a comma-separated pair (e.g. googlebot-image,googlebot) in order to enable functionality that mimics how Google crawlers behave in the wild. You can read the documentation for how they should work here. The key practical differences this makes are that some Google crawlers fall back on googlebot directives if their own user agent is missing, and some only obey specific directives and ignore User-agent: * rulesets.
Who are you anyway?
What else do we need to know to use this tool?
If you select Googlebot Image, News or Video, it will run against the underlying parser with the input googlebot-<sub>,googlebot, which first seeks a robots.txt ruleset targeting the specific crawler. Only if no such ruleset is present will it parse the robots.txt file as googlebot. The same will happen if you input googlebot-image or similar in the "other" box. You may also provide, in the "other" box, a comma-separated tuple of user agents (with no spaces). This behaves as described above: it seeks a ruleset targeting the first user agent and parses as the second user agent if none is found.
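The fallback described above can be sketched in a few lines of Python. This is a simplified illustration and not the tool's actual code: the function names declared_agents and effective_agent are hypothetical, and user agents are matched by exact, case-insensitive comparison rather than by the open source parser's own matching rules.

```python
def declared_agents(robots_txt):
    """Return the lower-cased user-agent values named in a robots.txt body."""
    agents = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents

def effective_agent(robots_txt, agents):
    """Choose which user agent to parse as, e.g. for 'googlebot-image,googlebot'.

    Hypothetical helper: returns the first agent if a ruleset names it,
    otherwise the second agent of the pair.
    """
    first, _, second = agents.lower().partition(",")
    if not second:
        return first
    return first if first in declared_agents(robots_txt) else second

robots = """\
User-agent: googlebot
Disallow: /bar/

User-agent: *
Disallow: /
"""
# No ruleset names googlebot-image, so the file is parsed as googlebot.
print(effective_agent(robots, "googlebot-image,googlebot"))
```

The chosen agent would then be handed to the underlying parser, which applies its usual rules (including any User-agent: * fallback) for that single agent.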
If you select an AdsBot or the AdSense option (user-agent mediapartners-google) then it will only respect rulesets that specifically target that user agent and will ignore User-agent: * blocks.
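A minimal sketch of that rule, under stated assumptions: the names ignores_wildcard and applicable_group are hypothetical, the set of wildcard-ignoring crawlers is approximated as any agent starting with "adsbot" plus mediapartners-google, and matching is exact and case-insensitive rather than the parser's real logic.

```python
def declared_agents(robots_txt):
    """Return the lower-cased user-agent values named in a robots.txt body."""
    agents = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents

def ignores_wildcard(agent):
    # Assumption for illustration: AdsBot crawlers and the AdSense crawler
    # are the ones that skip "User-agent: *" rulesets.
    agent = agent.lower()
    return agent.startswith("adsbot") or agent == "mediapartners-google"

def applicable_group(robots_txt, agent):
    """Return the user-agent ruleset that applies to agent, or None if none does."""
    declared = declared_agents(robots_txt)
    agent = agent.lower()
    if agent in declared:
        return agent
    if ignores_wildcard(agent):
        return None  # no specific ruleset, and the wildcard is ignored
    return "*" if "*" in declared else None

robots = "User-agent: *\nDisallow: /\n"
print(applicable_group(robots, "googlebot"))             # *
print(applicable_group(robots, "mediapartners-google"))  # None
```

A result of None here means no ruleset restricts the crawler, so every path would be treated as allowed.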
- Google's open source robots.txt parser
- My speculation of how Google crawlers like googlebot-image parse robots.txt files (this tool uses a version of the open source parser built from a branch that includes these changes)
- In order to be able to call it from Python, I modified the open source project to output information in a structured way. You can view this branch of my fork here