How does Site Auditor detect duplicate content?


Site Auditor uses semantic analysis of the pages that it crawls on your website to identify pages that share content. Duplicate content errors are listed in the Content section of Site Auditor.

How does Site Auditor discover duplicate content?

Site Auditor scans each page of your website for what could be considered "readable" content, attempting to exclude headers, footers and navigation. The readable content of each page is then compared against that of the other pages to determine which pages could be duplicates. The more two pages overlap, the greater the chance that they'll be flagged as duplicates.
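To make the idea of "overlap" concrete, here is a minimal sketch in Python. It is not Site Auditor's actual algorithm, which isn't documented here. It swaps in a simple, well-known technique (Jaccard similarity over three-word "shingles"), and the function names and the 0.8 threshold are assumptions chosen for illustration.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word groups ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Very short text: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap(text_a, text_b, size=3):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    set_a, set_b = shingles(text_a, size), shingles(text_b, size)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Assumed cutoff for illustration only; not Site Auditor's real value.
HYPOTHETICAL_THRESHOLD = 0.8


def looks_duplicate(text_a, text_b, threshold=HYPOTHETICAL_THRESHOLD):
    """Flag two pages when their readable content overlaps heavily."""
    return overlap(text_a, text_b) >= threshold
```

With this sketch, "the quick brown fox jumps" and "the quick brown fox sleeps" share two of their four distinct shingles, so overlap returns 0.5. Short pages produce very few shingles, so a small amount of shared text can dominate the score.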

Can I see which pages are duplicates of each other?

Yes! To discover which pages are duplicates of each other, follow these steps:

  1. Navigate to SEO Research > Site Auditor.
  2. Click the Duplicate Content error in the summary, or find duplicate content in the Content section.
  3. Next to the entry marked as duplicate, click the red Yes link.

These pages aren't duplicates. Why is Site Auditor saying otherwise?

There are a few common reasons for false reporting of duplicate content in Site Auditor:

  • If your pages don't have a lot of unique content — which is to say, a low word count — then Site Auditor may have trouble determining differences between two pages.
  • Many content management systems present the same content in a variety of different ways, through author pages, archive pages, search pages and more. It's good practice to use your robots.txt file to prevent archive pages from being crawled.
  • You purposefully have different ways of accessing the same content on your website, but you aren't using the Canonical tag to inform crawlers to ignore these pages. Site Auditor supports the rel="canonical" tag and will take it into consideration when making determinations about duplicate content.
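If you use robots.txt to keep archive-style pages out of the crawl, it can help to check your rules before the next audit. The sketch below uses Python's standard urllib.robotparser module. The rules, paths and example.com URLs are hypothetical placeholders, not values taken from Site Auditor.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; adjust the paths to match your own CMS.
ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/
Disallow: /author/
Disallow: /search
"""


def is_crawlable(robots_txt, url, user_agent="*"):
    """Return True if the given rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/archive/2015/03/",
        "https://example.com/blog/my-post",
    ):
        print(url, is_crawlable(ROBOTS_TXT, url))
```

Under these example rules, the archive URL is blocked and the blog post remains crawlable.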

How does rel="canonical" work?

As of March 2015, Raven supports canonical tags when crawling websites. You'll see a Canonical column in Site Auditor that displays the canonical URL associated with a page, and that page will not be included in the duplicate content issue total.

A canonical URL tells a crawler that several URLs lead to the same content and names the preferred one, so the alternate pages are ignored when checking for duplicate content.
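A canonical URL is normally declared with a link element in the page's head, for example <link rel="canonical" href="https://example.com/widgets">. As a rough illustration of how a crawler might read it, the sketch below uses Python's standard html.parser module. The class and function names and the example.com URL are assumptions, and this is not Site Auditor's own parser.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> element seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, in any letter case.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in html, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical page markup used only to exercise the sketch.
PAGE_HTML = """
<html><head>
  <title>Widgets</title>
  <link rel="canonical" href="https://example.com/widgets">
</head><body><p>All about widgets.</p></body></html>
"""

if __name__ == "__main__":
    print(find_canonical(PAGE_HTML))
```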

Let's say that we have three pages (Page A, Page B, and Page C) that we know all serve the same content. To keep them from being flagged as duplicate content, Page A and Page B can each have a canonical URL that points to Page C. This excludes all three pages from duplicate content checks. However, if Page A has a canonical URL pointing to Page B, and Page B has a canonical URL pointing to Page C, then it's still possible for Page A and Page C to be flagged as duplicates of each other. Keeping canonical URLs consistent is critical to implementing the canonical tag successfully.

To check for duplicate content between two pages ("Page A" and "Page B" in this example), Site Auditor first compares the canonical URLs of the pages as described above. If Page B has a canonical URL pointing to Page A, or vice versa, then they are not flagged as duplicates, because that tells the crawler, "I know these two pages are different ways of reaching the same content." Likewise, if Page A and Page B have the same canonical URL and it points to a different page (one that isn't Page A or Page B), then they are not flagged as duplicates, because that also tells the crawler, "I know these pages are different ways of getting to this other piece of content."
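The canonical comparison described above can be sketched in a few lines of Python. This is an illustration of the rules as stated in this article, not Site Auditor's source code. The function name, the page names and the exact-string URL comparison are all assumptions.

```python
def excluded_by_canonical(url_a, canonical_a, url_b, canonical_b):
    """Return True if canonical URLs exempt this pair from duplicate checks.

    A canonical value of None means the page declares no canonical URL.
    URLs are compared as exact strings (an assumption of this sketch).
    """
    # Rule 1: one page names the other as its canonical URL.
    if canonical_a == url_b or canonical_b == url_a:
        return True
    # Rule 2: both pages name the same canonical URL.
    if canonical_a is not None and canonical_a == canonical_b:
        return True
    return False


# Hypothetical setups, each mapping a page to its canonical URL.
consistent = {"A": "C", "B": "C", "C": None}  # A and B both point to C
chained = {"A": "B", "B": "C", "C": None}     # A points to B, B points to C


def pair_excluded(pages, first, second):
    """Apply the rules to two pages from one of the setups above."""
    return excluded_by_canonical(first, pages[first], second, pages[second])


if __name__ == "__main__":
    for name, pages in (("consistent", consistent), ("chained", chained)):
        for first, second in (("A", "B"), ("A", "C"), ("B", "C")):
            print(name, first, second, pair_excluded(pages, first, second))
```

In the consistent setup every pair is excluded. In the chained setup Page A and Page C are not excluded, because Page A's canonical URL names Page B rather than Page C, so the pair can still be compared and flagged.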