This document contains frequently asked questions that arise when using Site Auditor.
You can quickly navigate to a particular FAQ by using the table of contents below.
Are there two versions of Site Auditor?
How many pages can Site Auditor crawl?
What issues will Site Auditor find on my website?
Why won't Site Auditor crawl my website?
Can I set my next crawl date?
How does Site Auditor discover duplicate content?
Can I see which pages are duplicates of each other?
These pages aren't duplicates, why is Site Auditor saying otherwise?
How does rel="canonical" work?
I use WordPress, why do I have so many missing image title tags?
How long does Site Auditor take to crawl? Is it stuck?
How do I slow down Site Auditor's crawl of my website?
Are there two versions of Site Auditor?
Yes. There is Site Auditor Classic and Site Auditor Studio. Classic is the legacy version and is no longer being updated; Studio is the current version.
How many pages can Site Auditor crawl?
Site Auditor can crawl up to 10,000 pages per website. The total number of pages crawled per billing cycle depends on your plan: 400,000 for Start users, two million for Grow users, five million for Thrive users, and seven million for Lead users. If you hit your monthly page limit, your allowance resets when your next billing period begins.
What issues will Site Auditor find on my website?
In total, Site Auditor collects data on nine different areas of your website. You can see an overview of these issues in the Summary section, which includes a tab for each area as well as a graphic display of where your issues are most prominent. Those areas are as follows:
Site Auditor Sections

| Section | Description |
| --- | --- |
| Visibility | Issues dealing with the ability for your website to be accessed by search engine robots. |
| Meta | Issues dealing with the metadata that is used to inform robots and users about the content of your page. |
| Content | Issues dealing with the actual written content on your pages, including duplicate content and low word counts. |
| Links | Issues dealing with your links, specifically about whether or not you're using the nofollow attribute and if any links are dead on your pages. |
| Images | Issues dealing with broken images and images missing attributes that enhance usability on your website. |
| Semantics | Issues dealing with the heading structure of your page. You should only have one H1 heading per page, with successive headings in order from there. |
| Desktop Page Speed | How long it takes for your individual pages to load on a desktop or laptop, with metrics on actual load time and Google Page Speed scores included. |
| Mobile Page Speed | How long it takes for your individual pages to load on a smartphone, tablet, or mobile device, with metrics on actual load time and Google Page Speed scores included. This also includes useful information on mobile best practices, separate from load times. |
| Crawl Comparison | Comparisons between the last crawl and a previous crawl. |
More information on each specific error can be found in the Summary section, which lists each error along with a description.
Why won't Site Auditor crawl my website?
Site Auditor will crawl most websites, but there are some situations where Site Auditor's page crawler is stopped in its tracks before it can accumulate data on your pages:
- Links on your website use JavaScript. Site Auditor only follows "a href" links and does not currently follow JavaScript links. That includes websites built using Wix, which are not supported in Site Auditor.
- You’re blocking IP addresses. Raven uses Amazon Web Services (AWS), so if you're blocking their range of IP addresses, Site Auditor won’t be able to crawl your site. You can write an exception to allow access to our specific user agent in this instance, which is as follows:
Mozilla/5.0 (compatible; RavenCrawler/2.0; +https://raventools.com/seo-website-auditor/)
This should also work if your site is in development and you're only giving access to specific user agents.
- You're using security software to prevent unwanted traffic. Tools like Cloudflare and Incapsula are valuable for preventing DDoS attacks against your website, but they also prevent tools like Site Auditor from accessing and crawling your website. You'll need to whitelist our user agent to crawl your website.
- Your robots.txt file is blocking search engines. If your robots.txt file is set to disallow page crawls (see robotstxt.org for details), our crawler will not be able to access your site. To allow Raven to crawl your site, add the following code to your robots.txt file:
User-agent: RavenCrawler
Allow: /
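If your site is in development and you want to keep other crawlers out while still letting Raven through, rules along these lines should work, since crawlers follow the most specific user-agent group that matches them. This is a sketch; test it against your own setup:
User-agent: RavenCrawler
Allow: /

User-agent: *
Disallow: /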
Because security is different from website to website, we aren't able to assist with creating security exceptions – we can only provide information about how the crawler is accessing your website. Additionally, please note that Site Auditor will only crawl public pages. Anything behind a login screen or password will not be crawled.
Can I set my next crawl date?
Yes. Once you've set your crawl frequency to either weekly or monthly, you'll be given the option of choosing when your next crawl will occur. Future crawls are then scheduled from that date, so it sets your schedule until you change it.
How does Site Auditor discover duplicate content?
Site Auditor scans each page of your website for what could be considered "readable" content, attempting to exclude headers, footers, and navigation. The readable content is then compared against the other pages to make a determination about which pages could be duplicates. The more that these pages overlap, the greater the chance that they'll be flagged as duplicates.
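As a rough illustration, the comparison works along these lines. This is a minimal sketch in Python of the general approach, not Raven's actual implementation, and the similarity threshold is an assumption:
from difflib import SequenceMatcher

# Hypothetical "readable" content extracted from each page, keyed by URL
pages = {
    "/about": "We are a small team that builds web tools for marketers.",
    "/about-us": "We are a small team that builds web tools for marketers.",
    "/contact": "Reach us by email or phone during business hours.",
}

THRESHOLD = 0.9  # assumed cutoff; the real threshold isn't documented

urls = list(pages)
for i, a in enumerate(urls):
    for b in urls[i + 1:]:
        # Ratio of matching text between the two pages, from 0.0 to 1.0
        overlap = SequenceMatcher(None, pages[a], pages[b]).ratio()
        if overlap >= THRESHOLD:
            print(f"Possible duplicates: {a} and {b} ({overlap:.0%} overlap)")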
Can I see which pages are duplicates of each other?
Yes! To discover which pages are duplicates of each other, follow these steps:
- Navigate to SEO Research > Site Auditor Classic.
- Click the Content Issues tab in the Summary section, followed by Pages having duplicate content issues.
- Next to the entry marked as duplicate, click the red Yes link.
These pages aren't duplicates, why is Site Auditor saying otherwise?
There are a few common reasons for false reporting of duplicate content in Site Auditor:
- If your pages don't have a lot of unique content — which is to say, a low word count — then Site Auditor may have trouble determining differences between two pages.
- Many content management systems present the same content in a variety of different ways, through author pages, archive pages, search pages, and more. It's good practice to use your robots.txt file to prevent archive pages from being crawled (see the example after this list).
- You purposefully have different ways of accessing the same content on your website, but you aren't using the Canonical tag to inform crawlers to ignore these pages. Site Auditor supports the rel="canonical" tag and will take it into consideration when making determinations about duplicate content.
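As an example of that robots.txt approach, a WordPress-style site might keep crawlers out of its author, tag, and category archives like this. The paths are placeholders; match them to your own CMS:
User-agent: *
Disallow: /author/
Disallow: /tag/
Disallow: /category/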
How does rel="canonical" work?
As of March 2015, Raven supports canonical tags when crawling websites. You'll see a Canonical column in Site Auditor that displays the canonical URL associated with a page, and that page will not be included in the duplicate content issue total.
A canonical URL is a way of telling a crawler that you have different ways of accessing the same content, and to ignore those pages when checking for duplicate content.
Let's say that we have three pages — page A, page B, and page C — which we know all serve the same content. To have them not flagged as duplicate content, page A and page B can have a canonical URL that points to page C. This will exclude all three pages from duplicate content checks. However, if page A has a canonical URL pointing to page B, and page B has a canonical URL pointing to page C, then it's still possible for page A and page C to be flagged as having duplicate content of each other. Keeping this consistent is critical for successfully implementing the canonical tag.
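In HTML, the tag lives in each page's head. For the example above, page A and page B would each include a tag like this (placeholder URL):
<link rel="canonical" href="https://example.com/page-c" />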
To check for duplicate content between two pages ("page A" and "page B" in this example), Site Auditor first compares the canonical URLs of the pages as described above. If page B has a canonical URL pointing to page A, or vice versa, then they are not flagged as duplicates, because that's telling the crawler "I know these two pages are different ways of reaching the same content." Or, if page A and page B have the same canonical URL that points to a different page (that isn't page A or page B), then they are not flagged as duplicates, because that's also telling the crawler, "I know these pages are different ways of getting to this other piece of content."
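Expressed as pseudocode, the rule described above looks roughly like this. It's a sketch of the logic as described, not Raven's implementation:
def canonical_exempts(url_a, url_b, canonical_a, canonical_b):
    # One page declares the other as its canonical version
    if canonical_a == url_b or canonical_b == url_a:
        return True
    # Both pages declare the same third page as canonical
    if canonical_a and canonical_a == canonical_b:
        return True
    # Otherwise, the pages can still be flagged as duplicates
    return False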
I use WordPress, why do I have so many missing image title tags?
If you run Site Auditor on your WordPress blog or website, you may discover that many more of your images are missing title tags than you might expect — especially if you've diligently added title information to each image as you uploaded it to WordPress's Media Library.
The reason for this is that WordPress, as of version 3.5, doesn't pass title information into embedded images. Instead, it uses this detail internally, to help with searching and sorting your uploaded images in the Media Library itself. To restore that functionality, you may want to consider installing a plugin like Restore Image Title.
Another thing to keep in mind is that not all of your images may be managed in the Media Library. In particular, images that have been hardcoded into your WordPress theme can be missing this information. If updating the Title field on your images and applying it via a plugin doesn't solve all of your title tag issues, be sure to check your template files to ensure that all of your theme's images have this information.
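For reference, a hardcoded theme image with this information filled in would look something like this (placeholder values):
<img src="/wp-content/themes/yourtheme/images/team.jpg" alt="Our team" title="Our team at the company retreat" />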
How long does Site Auditor take to crawl? Is it stuck?
Crawling a website with Site Auditor can take up to 24 hours to complete: a maximum of 20 hours for the crawl itself, plus any time spent in the queue beforehand and in post-crawl processing. How long a crawl actually takes depends on a few different variables:
Number of other queued websites
The first consideration is the number of other websites that are currently queued to be crawled. Site Auditor crawls roughly 70 websites at a time, which means that we move through queued websites fairly quickly, but there are times when there's a backlog of websites for our systems to work through.
This is particularly the case on high-volume days, like the first of the month. If your crawls are set on a weekly or monthly schedule, you can change the Next Scheduled Crawl date in the Settings page to avoid these higher volume days.
Number of pages to be crawled
Site Auditor will crawl a maximum of 10,000 pages, and a large website will take longer to crawl than a tiny website with just a few pages.
You can speed up crawls of large websites by lowering the Maximum Number of Pages to Crawl setting on the Settings page to 50, 100, 250, 500, or 1,000. That caps the size of the crawl, so it completes faster.
How long it takes to crawl each page
The realities and settings of your web server and website will also affect Site Auditor's ability to efficiently crawl your website. A website with poor bandwidth, for example, may result in Site Auditor crawling each page more slowly.
Additionally, if you've implemented a crawl-delay setting in your robots.txt file, this will insert a delay (measured in seconds) between each page request. On a large website, this can significantly inflate the amount of time it takes for Site Auditor to crawl your website.
How do I slow down Site Auditor's crawl of my website?
By default, Site Auditor delays itself for two seconds between crawling pages. In some cases, you may need to slow Site Auditor down so that it can fully crawl your website. This is usually necessary when your website has flood protection turned on or has minimal bandwidth capabilities. For most websites, this will not be a factor — but here's how to slow down Raven's crawler.
Add this text to your robots.txt file:
User-agent: RavenCrawler
Crawl-delay: 5
This will tell RavenCrawler, the user-agent for Site Auditor, to wait five seconds between each page. You can increase or decrease this number as necessary, but keep in mind that Site Auditor's crawl time maxes out at 20 hours, as noted above. At a Crawl-delay of 5, for example, a 10,000-page website spends 50,000 seconds (roughly 14 hours) just waiting between pages, so a long Crawl-delay can prevent Site Auditor from crawling your entire website before it hits that limit.