
Monday, 15 August 2022

A robots.txt Problem

To prevent part of our web application from being scanned by search engines and other web crawlers, we add a robots.txt file like this:

    User-agent: *
    Disallow: /path


It's so simple; what could go wrong?

Here is a real story that happened to me.

It turns out that my cloud platform, Google App Engine, has a caching and compression layer between the application and the Internet. It can gzip content for one client, cache it, and then return the same gzipped response to other clients, even if they haven't sent the Accept-Encoding: gzip header, or have even explicitly requested uncompressed content.

This behaviour, unwise in my opinion, is documented here: https://cloud.google.com/appengine/docs/legacy/standard/java/how-requests-are-handled#response_caching

Example:

# Force a gzipped response

$ curl -v -H 'Accept-Encoding: gzip' -H 'User-Agent: gzip' https://yourapp.appspot.com/robots.txt
...
content-encoding: gzip
...
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.


# Now explicitly request uncompressed robots.txt

$ curl -v -H 'Accept-Encoding: identity' https://yourapp.appspot.com/robots.txt
...
content-encoding: gzip
...
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.


(By the way, although the documentation says the default caching duration is 10 minutes, I observed Google App Engine returning the gzipped response for at least 30 minutes.)

A web crawler (Dotbot from moz.com) encountered such a gzipped robots.txt response and was unable to parse it, so it treated all URLs in the app's domain as allowed for crawling. Moreover, the crawler cached that gzipped response. All its subsequent requests for robots.txt were conditional (ETag-based, I think) and resulted in 304 Not Modified, so the crawler kept relying on the gzipped version it could not parse, and kept visiting the unwanted URLs regularly.
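
For illustration, such a conditional re-request probably looks roughly like this (the ETag placeholder stands for whatever value the crawler cached; a matching ETag yields 304 and no body to re-parse):

# Re-validate a cached robots.txt (ETag value is a placeholder)

$ curl -v -H 'If-None-Match: "<cached-etag>"' https://yourapp.appspot.com/robots.txt
...
HTTP/2 304
...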

Luckily, Dotbot clearly identifies itself in the User-Agent header, and moz.com has a working support email, so after five months of correspondence in a support ticket I finally discovered the reason.

I fixed the Google App Engine behaviour by adding an explicit configuration to appengine-web.xml:

  <static-files>
    <include path="/**">
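      <!-- Vary on Accept-Encoding so the cache does not serve a gzipped copy to clients that did not ask for gzip -->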
      <http-header name="Vary" value="Accept-Encoding"/>
    </include>
    <exclude path="/**.jsp"/>
  </static-files>

I also made a small modification to robots.txt to make sure its ETag changes, so that conditional requests carrying the old ETag stop returning 304 Not Modified and crawlers fetch a fresh copy.
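
After redeploying, both changes can be sanity-checked with curl; the output below is what I'd expect rather than a captured transcript (same placeholder hostname as above):

# Explicitly request uncompressed robots.txt again

$ curl -v -H 'Accept-Encoding: identity' https://yourapp.appspot.com/robots.txt
...
vary: Accept-Encoding
...
User-agent: *
Disallow: /path


# A conditional request with the old, stale ETag should now return the new file

$ curl -v -H 'If-None-Match: "<old-cached-etag>"' https://yourapp.appspot.com/robots.txt
...
HTTP/2 200
...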

