What Is A Robots.txt File?

Date First Published: 25th September 2022

Topic: Web Design & Development

Subtopic: SEO

Article Type: Computer Terms & Definitions

Difficulty: Medium

Difficulty Level: 6/10

Learn more about what a robots.txt file is in this article.

A robots.txt file is a text file used by webmasters to instruct search engine bots which URLs, URL patterns, and directories they can crawl and index. It is used to stop pages that aren't meant to be indexed, such as error pages and checkout pages, from showing up in search engines, and to prevent the server hosting the website from becoming overloaded. The robots.txt file is located in the root directory and can be viewed by appending 'robots.txt' to the homepage URL, such as 'https://example.com/robots.txt'.

Not all websites have a robots.txt file, and this file is not usually linked from any other pages of the site, but search engine bots will always look for it before indexing pages. A good bot will visit the robots.txt file and follow its instructions before viewing any other pages on the domain name, whilst a bad bot will ignore the robots.txt file and take no notice of the instructions. A domain name can only contain one robots.txt file; multiple files for the same domain name are considered spam by search engines.
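
As an illustration of how a well-behaved crawler uses this file, the short Python sketch below fetches a site's robots.txt with the standard library's urllib.robotparser module and checks whether a page may be fetched before requesting it. The domain, bot name, and path are placeholders.

from urllib import robotparser

# Download and parse the site's robots.txt file (placeholder domain)
rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

# A good bot checks the rules before requesting any other page
if rp.can_fetch("ExampleBot", "https://example.com/blog/blog-post.html"):
    print("Allowed to crawl this page")
else:
    print("Blocked by robots.txt")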

Google Search Console can test robots.txt files and detect any syntax errors. However, it can only be used to test files on a webmaster's own sites.

Note:

All subdomains need their own separate robots.txt file. For example, if example.com had its own file, blog.example.com and support.example.com would also need their own. The robots.txt file for example.com would not apply to any subdomains hosted under that domain name.

Does A Website Need A Robots.txt File?

It is not absolutely necessary for a website to have a robots.txt file. Search engines, such as Google and Bing, can usually find and index all the important pages of a site by following links between pages. They will automatically recognise pages that show up under multiple URLs and either skip the duplicate versions or index only the page marked as canonical. If a webmaster does not have any areas of their site that they want to control user-agent access to, then they don't need a robots.txt file. Individual pages can be blocked from being indexed by adding the meta tag below to the head of the HTML document.

<meta name="robots" content="noindex">

However, if there are certain parts of a website, such as particular directories or URLs, that webmasters do not want search engines to index, then a robots.txt file is recommended. Without one, the webmaster would have to manually add the noindex tag to every page in the directory, which is time-consuming. The purpose of a robots.txt file is to keep web crawlers out of certain parts of a website.

How To Create A Robots.txt File?

A robots.txt file is simply a plaintext file that can be created and edited using a text editor, such as Notepad. When uploading the file to a web server, it must be named 'robots.txt', not something different, such as 'robots-file.txt' or 'search-engine-crawling.txt', or else search engine bots will not recognise it.

Examples of robots.txt rules can be seen below. The 'disallow' directive is the most common command; it instructs bots not to index the webpage or set of webpages specified after it.

Warning:

Be careful when editing the robots.txt file. Even a small syntax error can cause the file to malfunction and prevent the rules from taking effect.

Block an entire site from being indexed

User-agent: *
Disallow: /

Here, the '/' represents the root website directory, which includes the homepage and all other directories of a domain name. With this short command, web crawlers are blocked from crawling an entire site. The 'user-agent' represents the bot. Each search engine identifies itself with a different user agent. Different bots include:

  • Google: Googlebot
  • Google Images: Googlebot-Image
  • Bing: Bingbot
  • DuckDuckGo: DuckDuckBot
  • Yandex: YandexBot
Note:

Bot names are not case-sensitive. For example, 'Googlebot' is the same as 'googlebot' as long as it is spelt correctly.

The 'user-agent: *' means that the rules apply to every bot. However, if a webmaster wanted to block a specific bot, they could do it this way:

User-agent: Googlebot
Disallow: /

Multiple user agents can be added by specifying them on a new line. In the example below, both Googlebot and Bingbot are blocked from accessing a whole website.

User-agent: Googlebot
User-agent: Bingbot
Disallow: /

Disallow all URLs under a directory from being indexed

User-agent: *
Disallow: /blog/

In the example above, search engine bots cannot access any URLs under the directory '/blog/', so 'example.com/blog/blog-post.html' and 'example.com/blog/blog-1.html' would be blocked, but 'example.com/blog-posts/blog-1.html' wouldn't.
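
A rule like this can be sanity-checked locally before uploading anything. The sketch below is a minimal example using Python's urllib.robotparser, which performs this kind of prefix matching (although it does not interpret the wildcard syntax shown later); the URLs are placeholders.

from urllib import robotparser

rules = [
    "User-agent: *",
    "Disallow: /blog/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)   # parse the rules directly instead of fetching them

print(rp.can_fetch("*", "https://example.com/blog/blog-post.html"))     # False - under /blog/
print(rp.can_fetch("*", "https://example.com/blog-posts/blog-1.html"))  # True - not under /blog/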

Disallow specific pages

User-agent: *
Disallow: /home.html

Specific pages can be disallowed with a rule like the one above. In this example, 'example.com/home.html' is disallowed but 'example.com/blog/home.html' isn't.

Disallow specific URL patterns

It is recommended to add the noindex tag to prevent specific URLs from being indexed, but they can also be blocked in the robots.txt file as shown in the example below.

User-agent: *
Disallow: /blog

Removing the forward slash at the end changes which URLs are blocked from being indexed. In the example above, the rule matches any path that begins with '/blog', so 'example.com/blog.html', 'example.com/blog.php', 'example.com/blog/blog.html', and even 'example.com/blogger.html' would all be disallowed. Adding a * at the end ('/blog*') would be equivalent, as the trailing wildcard is ignored.

User-agent: *
Disallow: /*.php$

The rule above would match any file that ends in the extension '.php', so 'example.com/index.php', and 'example.com/blog/home.php' would be disallowed, but it would not match 'example.com/php.html'.

User-agent: *
Disallow: /blog*.php

The rule above would match any path that contains '/blog' followed by '.php', in that order, so 'example.com/blog.php' would be disallowed, but it would not match 'example.com/blog.html'.
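
Python's built-in urllib.robotparser does not interpret the * and $ wildcards, so as a rough illustration of the matching behaviour described above, the hypothetical helper below translates a robots.txt path pattern into a regular expression and tests it against a few placeholder paths.

import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, optional
    trailing '$' anchor) into a prefix-matching regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

php_only = robots_pattern_to_regex("/*.php$")
print(bool(php_only.match("/index.php")))     # True  - disallowed
print(bool(php_only.match("/php.html")))      # False - not matched

blog_php = robots_pattern_to_regex("/blog*.php")
print(bool(blog_php.match("/blog.php")))      # True  - disallowed
print(bool(blog_php.match("/blog.html")))     # False - not matched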

Specifying a sitemap in the robots.txt file

It is recommended to specify the location of any sitemaps in the robots.txt file so that web crawlers can find them more easily. An example can be seen below:

Sitemap: https://example.com/sitemap.xml
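
On Python 3.8 or later, any Sitemap lines can also be read programmatically with the standard library; a minimal sketch with a placeholder domain:

from urllib import robotparser

rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

# Returns the sitemap URLs declared in robots.txt, or None if there are none
print(rp.site_maps())   # e.g. ['https://example.com/sitemap.xml']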

Crawl delay

Crawl-delay is an unofficial addition to the robots.txt file that specifies the number of seconds a web crawler should wait before loading and crawling page content, in order to prevent web servers from becoming overloaded. Google ignores this directive; for Google, the crawl rate can be adjusted in Google Search Console instead, although some other search engines do respect it. In the example below, the crawl delay is set to 10 seconds, meaning a crawler can access at most 8,640 pages a day. For a small site, this is plenty, but for a large site, it may not be enough.

User-agent: *
Crawl-delay: 10
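
A polite crawler can read this value and pause between requests. The sketch below is a minimal example using Python's urllib.robotparser, assuming a placeholder domain and falling back to a one-second delay when no Crawl-delay is set.

import time
from urllib import robotparser

rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

# crawl_delay() returns the Crawl-delay value, or None if it is not specified
delay = rp.crawl_delay("*") or 1

for url in ["https://example.com/", "https://example.com/blog/"]:
    if rp.can_fetch("*", url):
        # ... fetch and process the page here ...
        time.sleep(delay)   # wait before making the next request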

Adding exceptions to disallowed rules

The 'allow' command gives permission for specific URLs to be crawled. Even though all URLs are allowed by default, it can be used to add an exception to a disallow rule. In the example below, all URLs in the '/articles/' directory are disallowed except 'first-article.html'.

User-agent: *
Disallow: /articles/
Allow: /articles/first-article.html

Sometimes, rules can conflict with each other. As shown below, access to the '/articles/' directory is both allowed and disallowed. Google and Bing follow the rule whose path has the most characters; if the paths are equal in length, as they are here, the least restrictive instruction wins, so the allow rule applies. Not all search engines work in the same way, and some may simply follow the disallow rule whenever there is a conflict.

User-agent: *
Disallow: /articles/
Allow: /articles/
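
As a rough illustration of the 'longest matching rule wins, allow wins ties' behaviour described above, the hypothetical function below compares the matching allow and disallow paths by length; wildcards are ignored for simplicity.

def is_allowed(path, allow_rules, disallow_rules):
    """Pick the longest matching rule; on a tie, prefer allow."""
    best_allow = max((r for r in allow_rules if path.startswith(r)), key=len, default="")
    best_disallow = max((r for r in disallow_rules if path.startswith(r)), key=len, default="")
    return len(best_allow) >= len(best_disallow)

# Tie between the '/articles/' rules: the least restrictive (allow) wins
print(is_allowed("/articles/some-article.html", ["/articles/"], ["/articles/"]))  # True

# A more specific allow rule overrides a shorter disallow rule
print(is_allowed("/articles/first-article.html",
                 ["/articles/first-article.html"], ["/articles/"]))               # True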

Robots.txt File Tips

  • Do not use the robots.txt file to protect sensitive data or pages that you don't want other visitors to see. The robots.txt file is publicly viewable, so anyone can see the list of allowed and disallowed pages. Instead, password protect the page or remove it from the website.
  • Do not rely on robots.txt to keep a page out of a search engine. Instead, use the noindex meta tag. A page that is disallowed in robots.txt can still appear in the index if it is linked to by other websites, but the search result will not have a description.
  • Understand that the robots.txt file has some limitations. The instructions in the robots.txt file cannot force web crawlers not to index a page. Whilst popular search engines, such as Google, obey the rules, some less common ones may not.

