Date First Published: 25th September 2022
Topic: Web Design & Development
Subtopic: SEO
Article Type: Computer Terms & Definitions
Difficulty: Medium
Difficulty Level: 6/10
Learn more about what a robots.txt file is in this article.
A robots.txt file is a text file used by webmasters to instruct search engine bots on which URLs, URL patterns, and directories they can crawl and index. It prevents pages that aren't meant to be indexed, such as error pages and checkout pages, from showing up in search engines, and it helps stop the server hosting the website from becoming overloaded. The robots.txt file is located in the root directory of a site and can be viewed by appending '/robots.txt' to the domain name, such as 'https://example.com/robots.txt'.
Not all websites have a robots.txt file, and the file is not usually linked from any other page of the site, but search engine bots will always look for it before indexing pages. A good bot will visit the robots.txt file and follow its instructions before viewing any other pages on the domain name, whilst a bad bot will either ignore the file or disregard its instructions. A domain name can only contain one robots.txt file; multiple files for the same domain name are considered spam by search engines.
Google Search Console can test robots.txt files and detect any syntax errors. However, it can only be used to test files on a webmaster's own sites.
All subdomains need their own separate robots.txt file. For example, if example.com had its own file, blog.example.com and support.example.com would also need their own. The robots.txt file for example.com would not apply to any subdomains hosted under that domain name.
It is not absolutely necessary for a website to have a robots.txt file. Search engines, such as Google and Bing, can usually find and index all the important pages of a site by following links between pages. They will automatically recognise pages that show up under multiple URLs and will either skip the duplicate versions or index only the version marked as canonical. If a webmaster does not have any areas of their site that they want to control user-agent access to, then they don't need a robots.txt file. Individual pages can be blocked from being indexed by adding the meta tag below to the head of the HTML document.
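<meta name="robots" content="noindex">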
However, if there are parts of a website, such as particular directories or URLs, that webmasters do not want search engines to index, then a robots.txt file is recommended. Without one, the webmaster would have to manually add the noindex tag to every page in those directories, which is time-consuming. The purpose of a robots.txt file is to keep web crawlers out of certain parts of a website.
A robots.txt file is simply a plaintext file that can be edited using a text editor, such as Notepad. When uploading the file to a web server, it must be named 'robots.txt', not something different, such as 'robots-file.txt' or 'search-engine-crawling.txt', or else search engine bots will not recognise it.
Examples of robots.txt rules can be seen below. The 'disallow' command is the most common; it instructs bots not to crawl the webpage or set of webpages specified after it.
Be careful when editing the robots.txt file. Even a small syntax error can cause the file to malfunction and all the rules to not take effect.
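The simplest rule blocks every bot from crawling an entire site:
User-agent: *
Disallow: /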
Here, the '/' represents the root website directory, which includes the homepage and all other directories of a domain name. With this short rule, web crawlers are blocked from crawling an entire site. The 'user-agent' represents the bot, and each search engine identifies itself with a different user agent. Well-known bots include Googlebot (Google), Bingbot (Bing), Slurp (Yahoo), DuckDuckBot (DuckDuckGo), Baiduspider (Baidu), and YandexBot (Yandex).
Bot names are not case-sensitive. For example, 'Googlebot' is the same as 'googlebot' as long as it is spelt correctly.
The 'user-agent: *' means that the rules apply to every bot. However, if a webmaster wanted to block a specific bot, they could do it this way:
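User-agent: Googlebot
Disallow: /
Here, Googlebot is used as an example; this rule blocks only Googlebot from the entire site, whilst all other bots can still crawl it.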
Multiple user agents can be added by specifying them on a new line. In the example below, both Googlebot and Bingbot are blocked from accessing a whole website.
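User-agent: Googlebot
User-agent: Bingbot
Disallow: /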
Entire directories can also be blocked, as shown in the example below. Here, search engine bots cannot access any URLs under the directory '/blog/', so 'example.com/blog/blog-post.html' and 'example.com/blog/blog-1.html' would be blocked, but 'example.com/blog-posts/blog-1.html' wouldn't.
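User-agent: *
Disallow: /blog/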
Specific pages can also be disallowed, as shown in the rule below, where 'example.com/home.html' is disallowed but 'example.com/blog/home.html' isn't.
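User-agent: *
Disallow: /home.html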
It is recommended to add the noindex tag to prevent specific URLs from being indexed, but they can also be blocked in the robots.txt file as shown in the example below.
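User-agent: *
Disallow: /blog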
Removing the forward slash at the end changes the URLs that are blocked. In the example above, the rule matches any path that begins with '/blog', so 'example.com/blog.html', 'example.com/blog.php', and 'example.com/blog/blog.html' would be disallowed; because matching is based on the start of the path, 'example.com/blogger.html' would also be blocked. Adding a * at the end ('/blog*') would be equivalent, as the trailing wildcard is ignored.
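Wildcards can also be used to match file extensions. In the rule below, the '*' matches any sequence of characters and the '$' marks the end of the URL:
User-agent: *
Disallow: /*.php$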
The rule above would match any path that ends in the extension '.php', so 'example.com/index.php' and 'example.com/blog/home.php' would be disallowed, but it would not match 'example.com/php.html'.
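A wildcard can also be combined with a path, as in the rule below:
User-agent: *
Disallow: /blog*.php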
The rule above would match any path that contains /blog and .php, in that specific order, so 'example.com/blog.php' would be disallowed, but it would not match 'example.com/blog.html'.
It is recommended to specify the location of any sitemaps in the robots.txt file so that web crawlers can find them more easily. An example of that can be seen below:
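Sitemap: https://example.com/sitemap.xml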
Crawl-delay is an unofficial addition to the robots.txt file that specifies the number of seconds a web crawler should wait before loading and crawling page content, in order to prevent web servers from becoming overloaded. Google ignores this directive; its crawl rate can instead be managed in Google Search Console. Some other search engines, such as Bing, do take notice of it. In the example below, the crawl delay is set to 10 seconds, meaning that a crawler can access at most 8,640 pages a day. For a small site, this is plenty, but for a large site, it may not be enough.
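User-agent: *
Crawl-delay: 10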
The 'allow' command gives permission for specific URLs to be crawled. Although all URLs are allowed by default, it can be used to add an exception to a disallow rule. In the example below, all URLs in the '/articles/' directory are disallowed except 'first-article.html'.
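User-agent: *
Disallow: /articles/
Allow: /articles/first-article.html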
Sometimes, rules can conflict with each other. As shown below, access to the '/articles/' directory is both allowed and disallowed. Google and Bing resolve conflicts by following the rule with the most characters in its path; if the rules are equal in length, the least restrictive rule, which is the allow rule, wins. In the example below, the disallow rule applies because '/articles/' has more characters than '/articles'. Not all search engines work in the same way, and some may only take notice of the disallow rule when there is a conflict.
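User-agent: *
Allow: /articles
Disallow: /articles/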