An in-depth guide to duplicate content
Unmanaged duplicate content is, in my opinion, one of the most detrimental search engine optimisation issues for a website, with the potential to significantly impact your SERP rankings and organic performance.
If you’ve been involved in Digital Marketing for a while you’ve most likely heard of “Duplicate Content”, perhaps from internal SEO teams, content marketers or partner agencies. You may have also listened to an explanation and feel you have a basic grasp of what it involves.
Over the past few years I have read, watched and heard a plethora of different explanations of duplicate content, from SEO forums to social media posts, and even blog articles from professional agencies. There are many cases, particularly since 2013, where sites have been launched with issues which have never been identified, and so have never met their potential. As a result, I can’t help feeling that a lot of people (including professional optimisers) don’t quite understand what duplicate content is and how it can impact your online presence.
Given the potential impact it’s surprising that there’s so much misinformation about what it is and how to resolve it. In this post I’ll be explaining:
What duplicate content is (and isn’t)
“Your website seems to contain large amounts of duplicate content”
“But we wrote all the content ourselves!?”
The first hurdle to get past is language; more often than not people associate duplicate content with plagiarism. This is not the case.
There are two categories of duplicate content; on-site and off-site. Parallels can be drawn between off-site duplicate content issues and plagiarism, although this isn’t typically a technical issue you can control.
The associated causes, impacts, and solutions for each type are entirely different, and believe me, on-site duplicate content is far worse. It’s this category which I will be addressing in this guide.
According to my definition (I may have read it somewhere, or made it up), “on-site duplicate content” is a technical SEO problem, caused by the way a website is engineered. It occurs when a specific webpage renders at multiple different URLs. It is not content which has been stolen, reused, or taken from elsewhere on the web or your website.
So you know, almost every CMS driven website produces duplicate content – the question is whether or not it’s being managed properly.
The simplest example is your homepage. A homepage might show up when you type example.com or www.example.com. In this case the same content is being rendered at two different URLs, meaning that one of them is a duplicate.
Now, it’s only a problem if search engines are able to crawl the duplicates. That said, never underestimate a Googlebot’s ability to find stuff. They usually have a helping hand like an incorrectly configured sitemap or CMS link. When Google is sending you over 50% of your online customers it’s worth taking precautions.
So why worry about it?
Don’t worry, but do be aware of it. Google’s index is based entirely on URLs. When the exact same page renders at two different URLs there’s no clear indication as to which is the correct page. As a result, neither page ranks as well as it should.
In addition, back in May 2012, amongst a raft of other updates, Google included harsher penalties for duplicate content as part of their Panda 3.4 update. I was fortunate enough to work on a site at the time that was heavily penalised following the update, and quickly learned how to deal with duplicate content penalties.
It’s worth mentioning at this point that, unlike Penguin’s link penalties, duplicate content penalties can be removed very quickly by taking the right steps. In my experience you do not need to wait for a Panda refresh.
Signs of duplicate content
There are a number of instances in which duplicate content can crop up, but it most commonly occurs around the time of a Panda update, following the launch of a new website, or during development changes to a site where management of duplicate content has been implemented incorrectly (or not at all). You’ll see rankings and traffic start to slide, although the impact will depend on the severity of the problem.
If you’ve got a solid grasp of duplicate content you’ll be able to find it by carrying out manual checks on a site, but for a quick spot-check you can carry out a site search in Google (site:yourdomain.com). If the last page of the search results shows Google’s notice that it has omitted some entries very similar to those already displayed, there’s a chance that duplicate content is afoot. You’ll need to investigate further to be certain.
How duplicate content occurs
As I mentioned at the start, one of the most common instances of duplicate content on every website is duplication between the www subdomain and the non-www root domain.
Depending on your server, you’ll find that the homepage could also render at:
- example.com/index.php (linux servers)
- www.example.com/index.php (linux servers)
- example.com/home.aspx (windows servers)
- www.example.com/home.aspx (windows servers)
This is the simplest, most noticeable instance of duplicate content, and for the most part people are aware of it.
This type of duplication usually occurs throughout a website, so if your site renders at www.example.com and example.com, it probably renders at www.example.com/category and example.com/category too. This means that the duplicates are sitewide, and will have a significant impact on organic performance.
- 301 (permanent) redirect
- Canonical link element
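As a rough illustration of how those homepage variants collapse onto a single address, here is a minimal Python sketch. The preferred host and the list of default documents are assumptions made for the example; your own site may prefer the non-www version or use different file names.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for illustration only: which host is preferred and which
# default documents exist will differ from site to site.
PREFERRED_HOST = "www.example.com"
SITE_HOSTS = ("example.com", "www.example.com")
DEFAULT_DOCUMENTS = ("/index.php", "/home.aspx")


def preferred_url(url):
    """Map www/non-www and default-document variants onto one URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Treat the root domain and the www subdomain as the same site.
    if host in SITE_HOSTS:
        host = PREFERRED_HOST
    path = parts.path or "/"
    # Collapse default documents such as /index.php onto the root.
    if path.lower() in DEFAULT_DOCUMENTS:
        path = "/"
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


variants = [
    "http://example.com",
    "http://www.example.com/",
    "http://example.com/index.php",
    "http://www.example.com/home.aspx",
]
# Every variant resolves to the same preferred URL.
targets = {preferred_url(u) for u in variants}
```

The same mapping can feed either solution: it gives you the target of a 301 redirect, or the href of a canonical link element.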
Sub-folders, sub-categories, and child pages
Most websites use some form of categories and sub-categories to help users find information. Categories are often the most important areas of an ecommerce site, as they intuitively target refined, specific search terms. For example, if I sell widgets at Widgets.com, and a potential customer wants to buy “Blue Widgets”, more often than not it will be a category page for “Blue Widgets” returned as a result. The same applies to any site which categorises content into sub-folders and child pages.
Let’s say I have the category structure as follows:
Here the user has probably navigated to the first category, and then into one of its sub-categories. Many systems will allow this sub-category to render at example.com/sub-category without the parent category included in the URL. This sub-category now renders the same content at multiple URLs: one which includes the parent category, and one which doesn’t.
The same applies to child pages which could render at example.com/category/product and example.com/product. This might occur on a non-ecommerce site as example.com/services/service-name and example.com/service-name.
- 301 (permanent) redirect
- Canonical link element
In some cases the contents of a category page may be broken into several pages; 1, 2, and 3, for example. We refer to this as a ‘paginated series’.
Using the previous example, here’s what page 1 will normally look like:
Page 2 might then be accessed at:
Precisely how the pagination is reflected in the URL will depend on the setup of the site. In this instance we’re still in the same category, but on the second page. Search engines may well interpret the subsequent pages as duplicates of page 1.
- rel="next" and rel="prev" link elements
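To make the pagination markup concrete, here is a small Python sketch that builds the link elements for one page of a series. Note that the attribute value is `prev`, not `previous`. The `?page=N` URL pattern and the function name are assumptions for the example, since how pagination appears in the URL depends on your site.

```python
from html import escape


def pagination_links(base_url, page, last_page):
    """Return rel="prev"/rel="next" link elements for one page of a series."""

    def page_url(n):
        # Assumption: page 1 lives at the bare category URL and later
        # pages at ?page=N. Adjust to match your own URL structure.
        return base_url if n == 1 else f"{base_url}?page={n}"

    links = []
    if page > 1:
        links.append(f'<link rel="prev" href="{escape(page_url(page - 1))}">')
    if page < last_page:
        links.append(f'<link rel="next" href="{escape(page_url(page + 1))}">')
    return links


# Page 2 of a 3-page category points back to page 1 and forward to page 3.
for tag in pagination_links("http://example.com/category", 2, 3):
    print(tag)
```

The first page of a series only gets a "next" element and the last page only gets a "prev" element.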
Most websites affix a parameter to a URL based on certain conditions, such as the use of a filter, a ‘sort by’ function, or a variety of other purposes. A common cause is the use of “breadcrumbs” which help users navigate a site. Breadcrumbs represent the path the user has taken to a specific page, and are usually clickable for navigation purposes.
Breadcrumbs are specific to the user, and are driven by session parameters which are sometimes visible in the page URL.
Here “Path” refers to the route the user took, and the numbers represent specific categories. In this example the user has accessed category 312, followed by category 214. This might generate breadcrumbs that look like this:
home -> category -> sub-category -> product
Now we’re still on the same product page as identified in the URL, except with URL parameters that create the breadcrumbs.
The exact same content renders on this page, but it can be accessed using a variety of different URLs. This problem is exacerbated by the number of different routes a user could take, increasing the number of duplicates considerably.
- Canonical link element
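One way to work out the canonical URL for a parameterised page is to strip the parameters that only affect navigation or presentation. The Python below is a sketch under assumptions: the parameter names in `NON_CONTENT_PARAMS` are hypothetical, and you would need to replace them with whatever your own platform appends.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names: ones that change breadcrumbs, sorting or
# sessions but not the content of the page itself.
NON_CONTENT_PARAMS = {"path", "sort", "sessionid"}


def canonical_url(url):
    """Drop non-content parameters and keep everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NON_CONTENT_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


# Two different routes to the same product resolve to one canonical URL.
print(canonical_url("http://example.com/product?path=312_214"))
print(canonical_url("http://example.com/product?path=214"))
```

The result is what you would place in the canonical link element on every parameterised variant, so users keep their breadcrumbs while search engines see a single page.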
Capitalisation & trailing slashes
Some platforms tend to ignore letter cases in URLs, allowing a page to render irrespective of capitalisation. If the page is accessible at URLs that contain upper case letters as well as ones using only lower case letters you’re probably going to have some problems. For example:
The same applies to trailing slashes (/) in URLs:
- 301 (permanent) redirect
- Canonical link element
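Case and trailing-slash variants can be normalised in a few lines. This Python sketch assumes a convention of lower-case paths with no trailing slash (except on the root), and it is only safe where the platform really does serve the same page regardless of case. The function names are made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise_case_and_slash(url):
    """Lower-case the host and path and drop any trailing slash."""
    parts = urlsplit(url)
    path = parts.path.lower()
    # Assumed convention: no trailing slash, except on the root itself.
    if len(path) > 1:
        path = path.rstrip("/")
    path = path or "/"
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), path, parts.query, parts.fragment)
    )


def redirect_target(url):
    """Return the URL to 301 to, or None if the URL is already normalised."""
    target = normalise_case_and_slash(url)
    return None if target == url else target


print(redirect_target("http://example.com/Category/"))
print(redirect_target("http://example.com/category"))
```

If your site’s convention is to keep the trailing slash, flip the rule; what matters is that only one form is treated as correct.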
Random CMS Junk
Obviously this is not a technical term. Not all websites operate on the latest, most up to date CMS platform. Many are outdated, bespoke, and quite frankly not in a good condition for SEO purposes.
The quality of a bespoke CMS, for example, is directly related to the knowledge and ability of the development team that built it. A slight lack in technical SEO knowledge can result in a site that outputs a large amount of dynamic duplicate content.
Looking for this is quite simple: conduct a site search in Google using “site:example.com”. Look for indexed URLs containing “?” characters, path parameters, or “index.php/?”. Assuming you have SEO-friendly URLs, these are most likely to be unmanaged duplicates of canonical pages.
- Canonical link element
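If you have a list of indexed URLs to hand (copied from a site search or exported from a crawler), the same check can be automated. This is a deliberately crude Python sketch; the function name and the sample URLs are invented for illustration, and anything it flags still needs a manual look.

```python
from urllib.parse import urlsplit


def looks_like_unmanaged_duplicate(url):
    """Flag URLs containing a "?" or an index.php segment."""
    path = urlsplit(url).path.lower()
    return "?" in url or "index.php" in path


indexed_urls = [
    "http://example.com/category/product",
    "http://example.com/index.php/?",
    "http://example.com/category/product?path=312_214",
]
suspects = [u for u in indexed_urls if looks_like_unmanaged_duplicate(u)]
```

On a site without SEO-friendly URLs this will flag almost everything, so it is only useful once your canonical pages have clean addresses.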
Localisation & Translation
There are two ways to tailor content for an audience. Localisation is when content is provided in the same language, but information is tweaked for each audience to account for linguistic differences. These variants might exist on a subdomain (us.example.com) or a subfolder (example.com/us).
Where equivalent pages exist for another locale (such as uk.example.com or example.com/uk), content should be localised for two reasons:
- to ensure the right content ranks for the right audience
- to ensure that similar content is not considered a duplicate
The same applies to translation, except the difference is in the language. For example, en.example.com or example.com/en.
What’s important is that search engines don’t perceive these pages as unmanaged duplicates, or as different pages; they are the same page, tailored for a different audience.
- I’ll be covering this in a later post 🙂
Other instances of duplicate content
Duplicate content can arise in a number of other ways. Once you understand what it is, you can identify and resolve duplicate issues. Remember “duplicate content occurs when the same page renders at multiple URLs”.
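That definition translates directly into a check: group URLs by the content they return and look for groups with more than one member. The Python sketch below works on HTML you have already fetched, so it makes no network calls. The function name and sample pages are assumptions, and hashing only catches byte-for-byte identical pages, so treat it as a starting point.

```python
import hashlib
from collections import defaultdict


def find_duplicates(pages):
    """Group URLs whose HTML is identical; pages maps URL -> HTML string."""
    groups = defaultdict(list)
    for url, html in pages.items():
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    # Only groups with more than one URL are duplicates.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


pages = {
    "http://example.com/": "<h1>Home</h1>",
    "http://www.example.com/": "<h1>Home</h1>",
    "http://www.example.com/category": "<h1>Category</h1>",
}
duplicates = find_duplicates(pages)
```

Pages that differ only by a timestamp or a session token will slip past an exact hash, so a real audit would compare the main content rather than the whole document.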
How to manage duplicate content
First of all, duplicate content is not a bad thing; almost every website outputs duplicate content. The problem is when this duplicate content is not managed using 301 redirects, robot directives, canonical link elements, or alternate link elements.
301 (permanent) redirects
Until the introduction of the canonical link element, 301 redirects were the best way to manage duplicate content. However, redirects and link elements work differently.
Once a 301 redirect is applied to a duplicate a user will no longer be able to access it, and will be redirected to (all being well) the (correct) canonical version. The problem is that often duplicates exist precisely for users. To use the example of path parameters; breadcrumbs provide great usability for visitors. If the URLs including path parameters are redirected breadcrumbs will no longer work correctly, detracting from the website’s navigation.
A 301 should only be applied to pages which offer no extra value to a user, such as the root domain and subdomain (www.example.com and example.com). In doing so, roughly 90% of the authority of the donor page is passed to the target page, provided the redirect is maintained, consolidating your link equity.
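For anyone who wants to see what a 301 looks like at the HTTP level, here is a minimal WSGI application in Python. It is a sketch, not a recommendation for how to implement redirects on your stack (most sites would do this in the web server configuration). The choice of www.example.com as the canonical host is an assumption, as is the idea that this application serves only that one site.

```python
# Assumption: the www subdomain is the version we want indexed.
CANONICAL_HOST = "www.example.com"


def app(environ, start_response):
    """Redirect any other host to the canonical host with a 301."""
    host = environ.get("HTTP_HOST", "")
    if host != CANONICAL_HOST:
        scheme = environ.get("wsgi.url_scheme", "http")
        location = f"{scheme}://{CANONICAL_HOST}{environ.get('PATH_INFO', '/')}"
        # Keep the query string so the user lands on the equivalent page.
        query = environ.get("QUERY_STRING", "")
        if query:
            location += "?" + query
        start_response("301 Moved Permanently", [("Location", location)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<h1>Page content</h1>"]
```

A request for example.com/category receives a 301 status with a Location header pointing at www.example.com/category, while requests to the canonical host are served as normal.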
Canonical link elements
The canonical link element deals with duplicate content in the same way as a 301 redirect, with one exception; users can still access the page. Therefore this is the most effective way to manage duplicates without running the risk of detracting from the user experience.
A canonical link element looks like this:
<link rel="canonical" href="http://example.com">
It points to the canonical (correct) version of the web page on which it is found. The beauty of the canonical link element is that it can be applied sitewide, ensuring protection against duplicate content issues, irrespective of whether there’s a problem or not.
The canonical version of the page should have a self-referring canonical link element: one that points to itself. Any duplicates of this page will then have a canonical link element pointing to the canonical version.
Like a 301 redirect, the canonical link element passes roughly 90-95% of link equity to the target page. Canonical link elements work across domains too. So, if for some reason your site is rendering on a second domain, the canonical link elements will still point back to the original, preventing duplicate issues.
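Here is a short Python sketch of how a template might output the element sitewide. The lookup table `CANONICAL_BY_URL` and the function names are hypothetical; in practice the mapping would come from your CMS or from URL normalisation rules.

```python
from html import escape

# Hypothetical mapping from known duplicate URLs to their canonical version.
CANONICAL_BY_URL = {
    "http://example.com/black-shoes": "http://example.com/mens-shoes/black-shoes",
}


def canonical_link(requested_url):
    """Build the canonical link element for the URL being rendered."""
    # A URL with no entry is treated as canonical and points to itself.
    target = CANONICAL_BY_URL.get(requested_url, requested_url)
    return f'<link rel="canonical" href="{escape(target)}">'


# The duplicate points at the canonical version...
print(canonical_link("http://example.com/black-shoes"))
# ...and the canonical version is self-referring.
print(canonical_link("http://example.com/mens-shoes/black-shoes"))
```

Both calls produce the same element, which is exactly the point: every version of the page names the same URL as the correct one.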
A Final Tip
There are some nuances to getting the most out of a canonical link element, and choosing the canonical version. The version set as the canonical will rank in search engines, therefore we want to use the one with the best possible chance of ranking well.
For example, I might have a product page which renders at example.com/mens-shoes/black-shoes and also at example.com/black-shoes. If someone was to search for “men’s black shoes” which do you think has the best chance of ranking? Where the category or subcategory contain valuable search terms, it may be worth setting the canonical version as the one which includes them in the URL.
You may have noticed the appearance of “structured breadcrumbs” some time in 2013, or maybe not. Traditionally, when a webpage appears in the SERPs, the URL of the page is displayed below the page title.
With the right code in place, it’s now possible to show the actual site architecture, based on breadcrumbs.
Referring to my previous example of categories, sub-categories, and child pages, in order for these beautifully structured elements to show, the canonical version of each sub-category MUST include the parent categories in its URL, so that the canonical version carries the correct breadcrumbs.
Neither duplicate content nor indexation should be managed using the robots.txt file. A disallow entry in robots.txt provides directives at the root domain level, and as such it’s very common for pages disallowed in robots.txt to continue to be indexed when they are accessed directly by Googlebot or another crawler. Once a disallowed page is indexed it will remain in the index irrespective of the content of your robots.txt file, and the disallow entry will also prevent crawlers from picking up canonical link elements on the pages in question.
If you insist on trying to manage duplicate content by controlling indexation, you’re better off using the “noindex” meta directive at the page level, which is a much more reliable solution. However, this will not pass link authority to canonical pages in the same way a canonical link element or 301 redirect would.
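To see why a disallow entry hides the canonical link element, here is a small Python sketch using the standard library’s robots.txt parser. The robots.txt content and the URLs are made up for the example. A well-behaved crawler makes the same decision before it ever requests the page, so it never reads the HTML in which the canonical link element sits.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that tries to "hide" a duplicate URL.
ROBOTS_TXT = """\
User-agent: *
Disallow: /black-shoes
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def crawlable(url):
    """True if a crawler obeying robots.txt may fetch this URL."""
    return parser.can_fetch("Googlebot", url)


duplicate = "http://example.com/black-shoes"
canonical = "http://example.com/mens-shoes/black-shoes"

print(crawlable(duplicate))  # False: the page, and its canonical element, is never read
print(crawlable(canonical))  # True
```

The disallow entry only controls fetching. It says nothing about whether the URL may appear in the index.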
At 2,400 words there’s an awful lot more that I’d like to write on the subject, and perhaps I will. If after reading this you still don’t know what duplicate content is, feel free to ask for help in the comments below.