Parse your robots.txt file the same way Google's crawlers do
Choose a Googlebot, paste your robots.txt file into the text area, and enter the path you'd like to check.
You must ensure that the path you wish to check follows the format specified by RFC 3986, since this library will not perform full normalization of those URI parameters. Only if the URI is in this format will the matching be done according to the REP specification. This is exactly as per Google's open source project.
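In practice this means a path containing non-ASCII characters should be percent-encoded before you check it. A minimal Python sketch using only the standard library (the helper name `rfc3986_path` is my own, not part of any library):

```python
from urllib.parse import quote, urlsplit

def rfc3986_path(url: str) -> str:
    """Pull the path (and query) out of a URL and percent-encode it
    per RFC 3986, since the parser performs no normalization itself.
    Hypothetical helper for illustration only."""
    parts = urlsplit(url)
    path = quote(parts.path or "/", safe="/%")  # leave '/' and existing escapes alone
    if parts.query:
        path += "?" + quote(parts.query, safe="=&%")
    return path

print(rfc3986_path("https://example.com/caf\u00e9/bar"))  # /caf%C3%A9/bar
print(rfc3986_path("https://example.com"))                # /
```

The encoded result (e.g. `/caf%C3%A9/bar`) is what you would then paste into the path box.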
Why does this exist?
- The old Search Console robots.txt tester differs from real Googlebot behaviour, and we expect it to be deprecated at some point.
- Google published an open source project containing the code their crawlers use to parse robots.txt, but:
  - It needs to be compiled, which requires at least a modicum of C++ skills
  - It doesn't contain the Google-specific logic that googlebot-image and other Google crawlers use
This site's own robots.txt file exposes the differences compared to each. You can copy-and-paste it here to see that both googlebot and googlebot-image should be DISALLOWED from crawling /bar/. This differs from the Search Console checker (in the googlebot case) and from the open source project (in the googlebot-image case).
How does this tool differ from Google's open source project?
Apart from some minor tweaks to make it available on the web, the only substantive change is to allow it to take an ordered pair of user agents, comma-separated (e.g. googlebot-image,googlebot), in order to mimic how Google crawlers behave in the wild. You can read the documentation for how they should work here. The key practical differences are that some Google crawlers fall back on googlebot directives if their own user agent is missing, and some obey only directives that specifically target them, ignoring User-agent: * rulesets.
We have verified this behaviour against real googlebot-image behaviour in the wild, and assume it holds for the other Google crawlers that are documented as operating this way.
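The fallback behaviour can be sketched like this (a simplified Python illustration, not the real C++ parser; the function name is my own):

```python
def pick_user_agent(robots_txt: str, ua_pair: str) -> str:
    """Given an ordered pair like 'googlebot-image,googlebot', return the
    user agent the file will actually be parsed as: the first agent if a
    ruleset targets it, otherwise the fallback. Simplified sketch only;
    the real matching lives in Google's C++ parser."""
    groups = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("user-agent:"):
            groups.add(line.split(":", 1)[1].strip().lower())
    agents = [ua.strip() for ua in ua_pair.split(",")]
    for ua in agents:
        if ua.lower() in groups:
            return ua
    return agents[-1]  # fall back to the last agent in the pair

robots = "User-agent: googlebot\nDisallow: /bar/\n"
print(pick_user_agent(robots, "googlebot-image,googlebot"))  # googlebot
```

With a file that only targets googlebot, the image crawler is treated as googlebot; if the file also had a `User-agent: googlebot-image` group, that group would win.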
Who are you anyway?
I'm Will Critchlow. I am CEO of SearchPilot and SEO Partner at Brainlabs and I am unhealthily interested in how robots.txt files work.
You can follow me on Twitter here.
What else do we need to know to use this tool?
If you select Googlebot Image, News or Video, it will run against the underlying parser with the input googlebot-<sub>,googlebot, which first seeks a robots.txt ruleset targeting the specific crawler. Only if that is not present will it parse the robots.txt file as googlebot. The same will happen if you input googlebot-image or similar in the "other" box. You may also provide in the "other" box a comma-separated pair of user agents (with no spaces), which will behave as described above: seeking a ruleset targeting the first user agent and parsing as the second if none is found.
If you select an AdsBot or the AdSense option (user-agent mediapartners-google), then it will respect only rulesets that specifically target that user agent and will ignore User-agent: * blocks.
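That "ignore the wildcard" behaviour can be sketched as follows (an illustrative Python approximation, not Google's implementation; the `strict` flag and function name are my own):

```python
def matching_groups(robots_txt: str, ua: str, strict: bool) -> list:
    """Return the user-agent group names in robots_txt that apply to `ua`.
    When `strict` is True (AdsBot / mediapartners-google behaviour), the
    'User-agent: *' group is ignored entirely. Illustrative sketch only."""
    names = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line.lower().startswith("user-agent:"):
            names.append(line.split(":", 1)[1].strip().lower())
    matched = [n for n in names if n == ua.lower()]
    if not matched and not strict and "*" in names:
        matched = ["*"]  # generic crawlers fall back to the wildcard group
    return matched

robots = "User-agent: *\nDisallow: /\n"
print(matching_groups(robots, "mediapartners-google", strict=True))   # []
print(matching_groups(robots, "mediapartners-google", strict=False))  # ['*']
```

A file that disallows everything for `User-agent: *` therefore does not block AdsBot or Mediapartners at all; only a group naming them directly does.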
- Google's open source robots.txt parser
- My pull request implementing the change @methode specified
- My original blog post showing differences between the documentation, the Search Console tester, and the open source parser
- My speculation about how Google crawlers like googlebot-image parse robots.txt files (this tool uses a version of the open source parser built from a branch that includes these changes)
- In order to be able to call it from Python, I modified the open source project to output information in a structured way. You can view this branch of my fork here