Site Auditor uses semantic analysis of the pages it crawls on your website to identify pages that share content. You can find the Duplicate Content error in the Content section of Site Auditor.
How does Site Auditor discover duplicate content?
Site Auditor scans each page of your website for what could be considered "readable" content, attempting to exclude headers, footers and navigation. The readable content is then compared against the other pages to determine which pages could be duplicates. The more these pages overlap, the greater the chance that they'll be flagged as duplicates.
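To make the idea of "overlap" concrete, here is a minimal sketch in Python. It is not Site Auditor's actual algorithm, which isn't documented here. It assumes a common technique (Jaccard similarity over three-word "shingles"), and the function names and sample text are made up for illustration.

```python
import re

def shingles(text, size=3):
    """Return the set of overlapping word sequences ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Very short text: treat the whole thing as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def overlap(text_a, text_b, size=3):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical "readable" content from three pages.
page_a = "Our blue widgets ship free anywhere in the United States."
page_b = "Our blue widgets ship free anywhere in the United States today."
page_c = "Contact our support team by phone or email for help."

print(overlap(page_a, page_b))  # high score: nearly identical text
print(overlap(page_a, page_c))  # low score: unrelated text
```

A higher score means more shared wording. In a sketch like this, very short pages produce only a handful of shingles, so a small amount of shared text can swing the score considerably.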
Can I see which pages are duplicates of each other?
Yes! To discover which pages are duplicates of each other, follow these steps:
- Navigate to SEO > Site Auditor.
- Click the Duplicate Content error in the summary, or find duplicate content in the Content section.
- Next to the entry marked as duplicate, click the red Yes link.
These pages aren't duplicates. Why is Site Auditor saying otherwise?
There are a few common reasons for false reporting of duplicate content in Site Auditor:
- If your pages don't have much unique content (that is, they have a low word count), Site Auditor may have trouble telling two pages apart.
- Many content management systems present the same content in a variety of different ways, through author pages, archive pages, search pages and more. It's good practice to use your robots.txt file to prevent archive pages from being crawled.
- You intentionally provide different ways of accessing the same content on your website, but you aren't using the rel="canonical" tag to tell crawlers which URL is the preferred version. Site Auditor supports the rel="canonical" tag and takes it into consideration when making determinations about duplicate content.
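For the robots.txt suggestion above, here is a small sketch that checks a set of rules with Python's standard urllib.robotparser module before you publish them. The robots.txt rules and the example.com URLs are hypothetical placeholders, so substitute the archive, author and search paths that your own content management system uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; these paths are placeholders,
# not the defaults of any particular content management system.
ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/
Disallow: /author/
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def is_crawlable(url, user_agent="*"):
    """Return True if the rules above allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)

print(is_crawlable("https://www.example.com/archive/2015/03/"))  # False
print(is_crawlable("https://www.example.com/blog/my-post/"))     # True
```

Testing the rules this way helps confirm that only the archive-style pages are blocked and that your regular content remains crawlable.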
How does rel="canonical" work?
As of March 2015, Raven supports canonical tags when crawling websites. You'll see a Canonical column in Site Auditor that displays the canonical URL associated with a page, and a page with a canonical URL will not be included in the duplicate content issue total.
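If you want to confirm which canonical URL a page declares, the following sketch reads it from the page's HTML using Python's standard html.parser module. It only illustrates what a crawler looks for and is not how Site Auditor itself is implemented. The CanonicalFinder class, the find_canonical function and the sample page are assumptions made up for this example.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, in any letter case.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")

def find_canonical(html):
    """Return the canonical URL declared in html, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical

# Hypothetical page markup for illustration.
PAGE = """
<html><head>
<title>Blue widgets</title>
<link rel="canonical" href="https://www.example.com/widgets/blue/">
</head><body>...</body></html>
"""

print(find_canonical(PAGE))  # https://www.example.com/widgets/blue/
```

If the function returns None for a page that is reachable at more than one URL, that page is a likely candidate for a rel="canonical" tag.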