
Revisiting robots.txt

Published: at 03:34 PM

Search engines gather information about domains by using spiders, which crawl through web pages looking for connecting links, and crawlers, which look for new content on websites.

It’s so commonplace now that an entire industry has grown around ensuring that anything those crawlers touch is optimized to rank as high as possible in a given search engine.

👋 Hello, Search Engine Optimization (SEO), which is an entire topic unto itself, so I won’t be digging into it here.

Yet it helps to understand this concept in order to understand robots.txt.

It is worth mentioning that a lot of websites are apps now, not just HTML and CSS, so I tend to use the terms website and app interchangeably even though there is nuance between them in web development.

Necessity is the mother of invention

There was a time when spiders went where they wanted, gathering as much information as possible. The problem is that a poorly written one could cause an accidental Denial of Service (DoS). One individual, Charles Stross, wrote a crawler that did exactly that. This, understandably, angered the sysadmin, Martijn Koster, who proposed a simple robots.txt file listing which directories could be crawled and which should be ignored.

This turned into a de facto standard that was later formalized as the Robots Exclusion Protocol, championed by Google and published as RFC 9309 in 2022.

Even today, some robots.txt files acknowledge this problem and give a stern warning. A quick look at the file on Wikipedia reveals:

# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.

Voluntary Compliance

The robots.txt file is a request from the domain to any crawler traversing a site’s paths to honor the conditions set out in the file. This means it is up to the crawler to respect those stated desires.
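
To make this concrete, below is a minimal sketch of what a well-behaved crawler does before fetching a page, using Python’s standard urllib.robotparser module. The rules are adapted from the example later in this post; the crawler name, crawl delay, and URLs are made up for illustration.

import urllib.robotparser

# Rules adapted from the example later in this post.
rules = """
User-agent: *
Crawl-delay: 10
Disallow: /cgi-bin/
Disallow: /~rubberducky/
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# A compliant crawler asks before every fetch; nothing forces it to.
for url in ("https://www.sudoversity.fyi/about/",
            "https://www.sudoversity.fyi/cgi-bin/script.pl"):
    verdict = "allowed" if parser.can_fetch("my-crawler", url) else "blocked"
    print(verdict, url)

# It also waits out any Crawl-delay between requests.
print("crawl delay:", parser.crawl_delay("my-crawler"))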

Considering all I know about bad actors on the web, this is important to keep in mind. Realistically speaking, no one should put anything on the internet they do not want there in perpetuity.

This file can contain a treasure trove of information for reconnaissance.
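
The flip side is that anyone else can read it just as easily. Here is a quick sketch that fetches a site’s robots.txt with Python’s urllib; the user-agent string is a made-up identifier.

import urllib.request

# Wikipedia's robots.txt, quoted earlier, makes a handy target.
request = urllib.request.Request(
    "https://en.wikipedia.org/robots.txt",
    headers={"User-Agent": "example-recon-script/0.1"},  # made-up identifier
)
with urllib.request.urlopen(request) as response:
    robots = response.read().decode("utf-8")

# Print the first few lines, which include the warning quoted above.
print("\n".join(robots.splitlines()[:10]))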

Contents

Syntax

A combination of plain text, regex-style wildcards, and robot-specific directives is used in this file.

# msnbot is greedy, slow it down.
User-agent: msnbot
Crawl-delay: 120
Disallow: /cgi-bin/
Disallow: /chicken/*.xml$
Disallow: /~rubberducky/

# All other bots follow this process
User-agent: *
Disallow: /cgi-bin/
Disallow: /~rubberducky/

# Sitemap
Sitemap: https://www.sudoversity.fyi/sitemap.xml

Some robots.txt files can be insanely long, especially if the app offers translations. It’s helpful to break up sections with comments. To write a comment in a robots.txt file, use an octothorpe (#), also known as a hash or number sign.

Currently, two regex-style special characters are recognized by the larger search engine crawlers:

  1. * - The wildcard
  2. $ - Matches end of URL string

In the example above, the * after User-agent: matches any bot that did not already match the msnbot group. The /chicken/*.xml$ rule means any URL under the chicken directory that ends in .xml is off-limits to msnbot. A path like https://www.sudoversity.fyi/chicken/eggs.xml?bob does not end in .xml, so that file can still be accessed.
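
To get a feel for how that matching plays out, here is a rough sketch that translates a Disallow pattern into a Python regex, assuming the matching rules described in RFC 9309; the helper function names are mine.

import re
from urllib.parse import urlparse

def pattern_to_regex(pattern: str) -> re.Pattern:
    # A trailing $ anchors the pattern to the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn the robots.txt * back into .*
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))

def is_disallowed(url: str, pattern: str) -> bool:
    parts = urlparse(url)
    target = parts.path + (f"?{parts.query}" if parts.query else "")
    return pattern_to_regex(pattern).match(target) is not None

rule = "/chicken/*.xml$"
for url in ("https://www.sudoversity.fyi/chicken/eggs.xml",
            "https://www.sudoversity.fyi/chicken/eggs.xml?bob"):
    print("blocked" if is_disallowed(url, rule) else "allowed", url)

The first URL ends in .xml and is blocked; the second carries a query string, no longer ends in .xml, and slips through.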

Other Considerations

I’ve seen a ? used at the end of some paths in a robots.txt file. Since only * and $ carry special meaning, the ? is matched literally, so a rule like Disallow: /search? would block URLs that have a query string after that path.

Finally, as stated above, the robots.txt file should be respected by crawlers. Just remember that’s not something to be taken for granted.