Search engines gather information about domains using spiders, which crawl through web pages looking for connecting links, and crawlers, which look for new content on websites.
It’s so commonplace now that an entire industry exists around ensuring that anything those crawlers touch is optimized to rank as high as possible in a given search engine.
👋 Hello, Search Engine Optimization (SEO). That is an entire topic unto itself, so I won’t be digging into it here.
Yet, it helps to understand this concept in order to understand why the `robots.txt` file exists.
It is worth mentioning that a lot of websites are apps now, not just HTML and CSS, so I tend to use the terms website and app interchangeably, even though there is nuance to them in web development.
Necessity is the mother of invention
There was a time when spiders went wherever they wanted, gathering as much information as possible. The problem is that a poorly written one could cause a Denial of Service (DoS) attack. One individual, Charles Stross, wrote a crawler which performed such an attack by accident. This, understandably, angered the sysadmin, Martijn Koster, who proposed a simple `robots.txt` file to list which directories could be crawled and which should be ignored.
This turned into a de facto standard, which was later formalized as the Robots Exclusion Protocol, championed by Google and published as RFC 9309 in 2022.
Even today, some `robots.txt` files acknowledge this problem and give a stern warning. A quick look at the file on Wikipedia reveals:
```
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
```
The `robots.txt` file is a request by the domain to any crawler that traverses a site’s paths to respect the conditions set out in the file. This means that it is up to the crawler to respect those stated desires.
Considering all I know about bad actors on the web, this is important to keep in mind. Realistically speaking, no one should put anything on the internet they do not want there in perpetuity.
This file can contain a treasure trove of information for reconnaissance.
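To make that concrete, here is a quick Python sketch (my own, not any standard tool) that pulls every `Disallow` path out of a `robots.txt` file. Each of those paths is something the site owner would rather keep quiet, which is exactly what makes the file interesting during recon:

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Collect every path listed in a Disallow directive."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:  # an empty Disallow means "everything is allowed"
                paths.append(path)
    return paths

sample = """\
User-agent: *
Disallow: /admin/      # keep bots out of the backend
Disallow: /staging/
"""
print(disallowed_paths(sample))  # → ['/admin/', '/staging/']
```

Point the same function at a live file fetched with `urllib.request.urlopen` and you have a one-minute inventory of the paths a site would prefer you didn’t visit.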
A combination of plain text, pattern matching, and robot-specific syntax is used in this file. Here is a small example (the crawl-delay value is arbitrary):

```
# msnbot is greedy, slow it down.
User-agent: msnbot
Crawl-delay: 120
Disallow: /chicken/*.xml$

# All other bots follow this process
User-agent: *
Disallow:
```
`robots.txt` files can be insanely long, especially if the app offers translations. It’s helpful to break up sections with comments. To write a comment in a `robots.txt` file, use an octothorpe (`#`), also known as a hash or number sign.
Currently, two pattern-matching characters are recognized by the larger search engine crawlers:

`*` - The wildcard, which matches any sequence of characters

`$` - Matches the end of the URL string
In the example above, `User-agent: *` will match any bot that is not `msnbot`. `Disallow: /chicken/*.xml$` means any file in the `chicken` directory with the extension `.xml` is not accessible to `msnbot` when the URL ends in `.xml`. If there is a path like `https://www.sudoversity.fyi/chicken/eggs.xml?bob`, that file can be accessed, because the URL no longer ends in `.xml`.
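The matching rules are easy to verify with a few lines of Python. This is my own sketch of the semantics, not any crawler’s actual implementation: `*` is translated into a regex that matches any run of characters, and a trailing `$` anchors the pattern to the end of the URL:

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Translate a robots.txt path pattern into a regex and test a path.

    '*' matches any run of characters; a trailing '$' anchors the match
    to the end of the URL. Without '$', the pattern only needs to match
    a prefix of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(rule_matches("/chicken/*.xml$", "/chicken/eggs.xml"))      # True: URL ends in .xml
print(rule_matches("/chicken/*.xml$", "/chicken/eggs.xml?bob"))  # False: '$' forces the end
```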
I’ve seen the `?` used at the end of some paths in a `robots.txt` file as well.
Finally, as stated above, the `robots.txt` file should be respected by crawlers. Just remember that’s not something to be taken for granted.
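For what it’s worth, Python ships a parser for this in its standard library, so being a polite crawler takes only a couple of lines (the user agent and URLs here are placeholders):

```python
from urllib.robotparser import RobotFileParser

robots = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots.splitlines())

# A polite crawler checks every URL before requesting it.
print(parser.can_fetch("mybot", "https://example.com/index.html"))  # True
print(parser.can_fetch("mybot", "https://example.com/private/x"))   # False
```

Note that, as of current Python versions, `urllib.robotparser` implements the original exclusion rules rather than the `*` and `$` extensions discussed above.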