What Is Duplicate Content?

Date First Published: 6th October 2022

Topic: Web Design & Development

Subtopic: SEO

Computer Terms & Definitions

Difficulty: Easy

Difficulty Level: 3/10

CONTENTS

Causes Of Duplicate Content

Duplicate content is content that is exactly the same as content that already exists on the World Wide Web. For example, two pages with identical content that are accessible at two different URLs are duplicates of each other. Duplicate content very rarely leads to penalties from search engines (unless a website is intentionally copying content from other websites), but it can cause SEO issues because search engines cannot tell which of the identical pages should appear at the top of the search results page. Usually, when search engine bots detect the same content at more than one URL, they treat one page as the original, index it, and ignore the others.

Search engines rarely show multiple versions of the same content, so they have to choose which version is the best result based on factors such as HTTPS and page quality. If the same content is accessible from more than one URL, webmasters can add a link tag with rel="canonical" to the head of the HTML page. This tag is not displayed on the page itself and can only be seen by viewing the source code of the HTML page. Search engines will then treat the specified URL as canonical, and all other URLs will be considered duplicates and crawled less frequently. Alternatively, webmasters can set up a 301 redirect, which permanently redirects the duplicates to the original page.
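Because the canonical tag is only visible in the page source, it can help to check it programmatically. The following is a minimal sketch using Python's standard html.parser module; the class name, the function name, and the sample markup are made up for illustration and do not describe how any search engine reads the tag.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel can hold several space-separated tokens, so split before checking.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attributes.get("href")


def find_canonical(html_text):
    """Return the canonical URL declared in html_text, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    return finder.canonical


# Hypothetical page markup, used only for illustration.
SAMPLE_PAGE = """
<html>
  <head>
    <title>Blue widgets</title>
    <link rel="canonical" href="https://example.com/widgets/blue">
  </head>
  <body><p>Blue widgets.</p></body>
</html>
"""

print(find_canonical(SAMPLE_PAGE))  # prints https://example.com/widgets/blue
```

Running the same check against every URL that serves the same content should return one and the same canonical URL.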

Causes Of Duplicate Content

Common causes of duplicate content are:

HTTP and HTTPS pages

If both the HTTP (non-secure) and HTTPS (secure) versions of pages are accessible, this can cause duplicate content issues. Users should not be able to access both the HTTP and HTTPS versions of a website. Rather than relying on a canonical URL that points to the HTTPS versions, implement a 301 redirect from HTTP to HTTPS.
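The redirect rule described above can be sketched as a small function that decides what a server should answer for a requested URL. This is only an illustration: the function name is hypothetical, and it assumes the site uses the default ports, so an explicit port in the URL is not rewritten.

```python
from urllib.parse import urlsplit, urlunsplit


def https_redirect(url):
    """Return (301, target) for an http:// URL, or None if no redirect is needed."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        return None
    # Keep the host, path and query string; change only the scheme.
    return 301, urlunsplit(parts._replace(scheme="https"))


print(https_redirect("http://example.com/page.html?id=7"))
print(https_redirect("https://example.com/page.html?id=7"))
```

The first call returns the status code 301 together with the HTTPS address, while the second returns None because the URL is already secure.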

WWW and non-WWW pages

Some websites are accessible both with the 'www' prefix and without it (e.g. 'www.mysite.com' and 'mysite.com'). Removing the prefix makes the URL shorter and easier to remember, and it is already clear that the website is part of the World Wide Web, although some websites still use it. Keeping or removing the prefix is a personal choice, but it is best to stick to one version and implement a 301 redirect from the other version to it.
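As a rough sketch of that rule, the function below redirects whichever host variant was not chosen to the preferred one. The preferred host 'mysite.com' and the function names are assumptions for illustration, and any port or login details in the URL are dropped for simplicity.

```python
from urllib.parse import urlsplit, urlunsplit

PREFERRED_HOST = "mysite.com"  # assumption: the site chose the version without 'www'


def strip_www(host):
    """Return the host name without a leading 'www.' prefix."""
    return host[4:] if host.startswith("www.") else host


def host_redirect(url, preferred_host=PREFERRED_HOST):
    """Return (301, target) when url uses the non-preferred host variant."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host == preferred_host:
        return None  # already on the chosen version
    if strip_www(host) != strip_www(preferred_host):
        return None  # a different website altogether
    return 301, urlunsplit(parts._replace(netloc=preferred_host))


print(host_redirect("https://www.mysite.com/about"))
print(host_redirect("https://mysite.com/about"))
```

A site that prefers the 'www' version would pass 'www.mysite.com' as preferred_host instead, and the redirect would run in the opposite direction.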

URLs with and without trailing slashes

Some URLs are accessible both with and without a trailing slash (e.g. 'example.com/page.html/' and 'example.com/page.html'). Both versions are acceptable, but it is highly recommended to stick to one and implement a 301 redirect from the other version to it, as the two versions of the URL could be treated as duplicate content by search engines.
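One possible way to enforce a single trailing-slash style is sketched below. The function name and the keep_trailing_slash parameter are hypothetical; the site root is left alone because 'example.com' and 'example.com/' refer to the same resource.

```python
from urllib.parse import urlsplit, urlunsplit


def slash_redirect(url, keep_trailing_slash=False):
    """Return (301, target) when the path does not match the chosen slash style."""
    parts = urlsplit(url)
    path = parts.path
    if path in ("", "/"):
        return None  # the site root needs no redirect
    if keep_trailing_slash:
        wanted = path if path.endswith("/") else path + "/"
    else:
        wanted = path.rstrip("/")
    if wanted == path:
        return None  # already in the chosen style
    return 301, urlunsplit(parts._replace(path=wanted))


print(slash_redirect("https://example.com/page.html/"))
print(slash_redirect("https://example.com/blog", keep_trailing_slash=True))
```

Whichever style is chosen, the important part is that only one of the two URLs answers with the page and the other answers with the redirect.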

URLs with and without 'index.html'

The homepage of a website can usually be visited by typing the domain name and nothing else. For example, 'mysite.com' would be the same page as 'mysite.com/index.html'. Hiding 'index.html' from the homepage URL is useful for SEO, as the two versions of the homepage, with and without 'index.html', could be treated as duplicate content by search engines. It also makes the homepage URL shorter and easier to memorise, as it is just the domain name.
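Hiding 'index.html' usually comes down to redirecting any URL that ends in that file name to the directory that contains it. The sketch below illustrates the idea; the function name is hypothetical, and it assumes the index file is always called 'index.html'.

```python
from urllib.parse import urlsplit, urlunsplit

INDEX_NAME = "index.html"  # assumption: the server's default document name


def index_redirect(url):
    """Return (301, target) for URLs that end in index.html, or None otherwise."""
    parts = urlsplit(url)
    directory, _, filename = parts.path.rpartition("/")
    if filename != INDEX_NAME:
        return None
    # Point at the containing directory, keeping any query string.
    return 301, urlunsplit(parts._replace(path=directory + "/"))


print(index_redirect("https://mysite.com/index.html"))  # (301, 'https://mysite.com/')
print(index_redirect("https://mysite.com/"))  # None
```

The same rule also covers index pages in subdirectories, so 'mysite.com/blog/index.html' would be redirected to 'mysite.com/blog/'.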

Intentionally copied content

Some websites intentionally copy content from other websites and present it as their own. This is a form of plagiarism, done to manipulate the search results and gain more organic traffic. Search engines can usually detect websites that share exactly the same content, and they may choose to index the original page and ignore the others. This only applies to content that is copied word for word from other websites or is very similar. Search engines may also view intentional copying as a form of spam and penalise or deindex the websites responsible. Google states that it 'tries hard to index and show pages with distinct information'.
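To give a feel for how word-for-word copies can be spotted, the sketch below compares two texts by hashing them after normalising whitespace and letter case. It is a simplified illustration with hypothetical function names, not a description of how any search engine works, and it does not catch content that is merely very similar.

```python
import hashlib
import re


def content_fingerprint(text):
    """Return a SHA-256 digest of text after collapsing whitespace and case."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def is_exact_copy(first, second):
    """Return True when both texts match word for word after normalisation."""
    return content_fingerprint(first) == content_fingerprint(second)


print(is_exact_copy("Blue  widgets\nare great.", "blue widgets are great."))  # True
print(is_exact_copy("Blue widgets are great.", "Red widgets are great."))  # False
```

Changing a single word changes the fingerprint completely, which is why detecting near-duplicates requires different techniques from this one.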