Date First Published: 5th August 2022
Topic: Web Design & Development
Subtopic: SEO
Article Type: Computer Terms & Definitions
Difficulty: Medium
Difficulty Level: 5/10
Learn more about what a web crawler is in this article.
A web crawler, also known as a web spider or simply a crawler, is a bot operated by a search engine that travels across the World Wide Web, crawling and indexing content such as webpages, images, and videos. This is how index entries are produced for search engines. Without web crawlers, search engines would not be able to discover and index a website, so it would receive no organic traffic and be much more difficult to find. Web crawlers are essential to the functioning of search engines: before a search engine can deliver the pages, images, and videos that match users' keywords, its bots have to crawl and index them. Web crawling is the process of using an automated bot to index data on webpages.
Web crawlers are also known as spiders because of the way they move across the web: they crawl many sites at the same time, following links from page to page and spanning a large area of the web, much like a spider moving across its web of interconnected strands.
Web crawlers work by discovering and scanning new URLs, then reviewing, categorising, and ranking pages using their algorithms. Once a page has been successfully indexed, users can find it by typing relevant queries into search engines such as Google, Bing, and Yandex. Web crawlers rank pages based on hundreds of factors, including the number of backlinks a page has, website speed, mobile-friendliness, SSL/TLS, and content quality.
Web crawlers determine a page's crawl priority based on how often it is modified and on the priority specified in the site's sitemap.xml file, and they use this information to decide how often to recrawl it. Pages that are new or recently updated are recrawled sooner, while pages that rarely change are crawled less often. Most pages are crawled every 4 to 30 days.
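As an illustration (the URL and values below are placeholders, not taken from any real site), a sitemap.xml entry can declare when a page last changed, how often it changes, and how important it is relative to the rest of the site:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.com/blog/</loc>    <!-- the page's address -->
        <lastmod>2022-08-01</lastmod>               <!-- when it was last modified -->
        <changefreq>weekly</changefreq>             <!-- hint: how often it changes -->
        <priority>0.8</priority>                    <!-- hint: relative importance, 0.0 to 1.0 -->
      </url>
    </urlset>

Search engines treat values such as <changefreq> and <priority> as hints rather than strict instructions.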
As they crawl, web crawlers find hyperlinks to other pages and add them to their crawl queue. Links pointing to external sites (backlinks, from the linked site's perspective) are how web crawlers discover and crawl new websites.
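As a minimal sketch of this crawl loop, using only Python's standard library and a placeholder seed URL (this is not how any particular search engine's crawler is implemented), the process of fetching a page, extracting its hyperlinks, and adding unseen URLs to a crawl queue looks roughly like this:

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collects the href value of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_url, max_pages=5):
        frontier = deque([seed_url])   # crawl queue: URLs waiting to be fetched
        seen = {seed_url}              # URLs already discovered, to avoid re-queueing
        crawled = 0
        while frontier and crawled < max_pages:
            url = frontier.popleft()
            try:
                with urlopen(url, timeout=10) as response:
                    html = response.read().decode("utf-8", errors="replace")
            except OSError:
                continue                        # skip pages that fail to load
            crawled += 1
            # A real crawler would index the page content at this point.
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:
                absolute = urljoin(url, href)   # resolve relative links against the page URL
                if urlparse(absolute).scheme in ("http", "https") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)   # newly discovered URL joins the crawl queue
            print(f"Crawled {url}; {len(frontier)} URLs now waiting in the queue")

    if __name__ == "__main__":
        crawl("https://example.com/")           # placeholder seed URL

A real crawler would also check the site's robots.txt rules, described later in this article, before fetching each URL.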
In addition, website owners can request that a search engine crawl a page. After the request is made, the web crawler performs checks to ensure the page can be crawled, for example that it is not blocked from crawling or indexing (see the robots.txt and noindex methods described later in this article).
If the page passes these checks, it will be added to the crawl queue. Requesting a page to be crawled does not guarantee that it will be indexed. Most web crawlers prioritise pages in the queue based on the order in which they were added and on how important the pages appear to be.
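As a rough sketch, assuming made-up importance scores and URLs, such a queue could be modelled in Python with a priority queue that falls back to arrival order for ties:

    import heapq
    import itertools

    counter = itertools.count()        # preserves the order in which URLs were added
    queue = []

    def enqueue(url, importance):
        # heapq pops the smallest item first, so the importance is negated to pop
        # the most important URL first; the counter breaks ties by arrival order.
        heapq.heappush(queue, (-importance, next(counter), url))

    enqueue("https://example.com/old-page", importance=0.2)
    enqueue("https://example.com/homepage", importance=0.9)
    enqueue("https://example.com/new-post", importance=0.9)

    while queue:
        neg_importance, order, url = heapq.heappop(queue)
        print(f"Crawl next: {url}")    # homepage, then new-post, then old-page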
When crawling a page, web crawlers read the metadata in the head of the HTML document, which specifies the title and description of the page that will show up in search engines. Including these tags is one of the key tips for SEO: the meta title and description, rather than the page content visible to users, are what appear on the search engine results page.
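For illustration, the head of an HTML document with these tags might look like the following (the title and description text are placeholders):

    <head>
      <title>Example Widgets - Buy Widgets Online</title>
      <meta name="description" content="A short summary of the page that search engines can show as the snippet in their results.">
    </head>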
In most cases, not all pages of a website are meant to show up in search engines. Webmasters can block pages they do not want indexed, such as error pages, from appearing in search results. There are two main ways of blocking web crawlers from indexing these pages: the robots.txt file and the robots meta tag.
The robots.txt file is a text file that instructs search engines on what directories or pages of a website they can and can't index. Since this file is readable by anyone, it should not be used for hiding pages with confidential information; it should only be used for controlling how web crawlers index a website. Entries can be added that allow or disallow certain URLs, URL patterns, or whole directories from appearing in the search engine results.
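A simplified, illustrative robots.txt for an online retailer might look like this (the paths are placeholders rather than entries copied from any real site):

    User-agent: *
    Disallow: /basket
    Disallow: /checkout
    Disallow: /register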
Sainsburys.co.uk's robots.txt contains entries like the example above: it disallows a list of URLs, mostly shopping cart, checkout, and registration pages, which should not come up in search engines, from being crawled. The 'User-agent: *' line near the top applies the rules to all crawlers, so every search engine is blocked from crawling those URLs.
The <meta name="robots" content="noindex"> tag is placed in the head of the HTML document and prevents that URL from being included in the search results. It controls indexing on a page-by-page basis and only applies to the one URL it is placed on. If many URLs need to be blocked from being indexed, using the robots.txt file is recommended as it is a much quicker method. Note that web crawlers have to be able to crawl a page in order to see the tag.
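For illustration, a page blocked this way carries the tag in its head (the title here is a placeholder):

    <head>
      <title>Order Confirmation</title>
      <meta name="robots" content="noindex">    <!-- tells crawlers not to index this page -->
    </head>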
Search engines have their own web crawler bots. The bots from widely used search engines include Googlebot (Google), Bingbot (Bing), and YandexBot (Yandex).