How to Prevent Site Auditor Classic From Crawling Pages on a Website


There are two ways to prevent Site Auditor Classic from crawling pages on a website:

  1. Add path exclusions to the Site Auditor
  2. Add disallow rules to the robots.txt file

The path exclusion option is recommended because it offers more flexibility.

Adding a Path Exclusion

There are two ways to add a path exclusion to your crawl from within Site Auditor Classic. The first is through the Tool Options menu and the second is within the Content page.

Using the Tool Options menu

1] In the upper right of the Site Auditor Classic page, click the Tool Options link.

Tool Options Link.png

2] From the menu that appears, choose Settings.

Choose Settings.png

3] On the Settings page, enter the URL you would like excluded in the Block Pages From Crawler textbox. You can exclude folders or specific files, and you can use wildcards to exclude query strings and other, more complicated URLs.

Add Blocked Pages Button.png

4] Click the Add button to add the excluded pages. The excluded pages will appear below.

5] Repeat steps 3 and 4 to add more pages to exclude. Click Close when you're finished.

6] You can delete an exclusion by clicking the Remove button to the right.

Remove Path Exclusion.png
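The wildcard matching described in step 3 can be sketched in Python. Site Auditor Classic's exact wildcard semantics aren't documented here, so this assumes shell-style globs (where * matches any run of characters), and the patterns and URLs are hypothetical examples:

```python
# Sketch of how wildcard path exclusions might match URLs.
# Assumes shell-style glob matching; Site Auditor's actual rules may differ.
from fnmatch import fnmatchcase

# Hypothetical exclusion patterns: a folder, a specific file, a query string.
patterns = ["/print/*", "/old-page.html", "*sort=*"]

urls = [
    "/print/article-1",      # matches /print/*
    "/old-page.html",        # matches the exact path
    "/blog/post",            # matches nothing, so it is crawled
    "/blog/post?sort=date",  # matches *sort=*
]

excluded = [u for u in urls if any(fnmatchcase(u, p) for p in patterns)]
print(excluded)
```

A pattern like *sort=* is the kind of wildcard you would use to exclude query-string variants of a page rather than the page itself.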

Using the Content page

1] From within Site Auditor Classic, click Content on the left side menu.

Content on Left Side Menu.png

2] On the Content page, locate the URL you want to exclude by scrolling through the list or using the search box at the top.

3] When you locate the URL, click the settings (gear) icon on the far right.

Gear Icon Far Right.png

4] From the resulting popup menu, choose Exclude URL from Future Crawls. You will see a message at the top of your screen indicating your URL has been excluded.

Exclude from Future Crawls.png


Adding a Disallow Rule

You can also block Site Auditor from crawling pages by adding disallow rules to your website's robots.txt file. Add a rule for RavenCrawler, the user-agent for Site Auditor. For example, to block everything under a folder (here, a hypothetical /example-folder/), add this text to your robots.txt file:

User-agent: RavenCrawler
Disallow: /example-folder/

This tells RavenCrawler not to crawl any URL whose path begins with /example-folder/. You can add as many Disallow lines as you need under the same User-agent line.

Slowing the crawler with a Crawl-delay

By default, Site Auditor waits two seconds between crawling pages. In some cases, such as when your website has flood protection turned on or has minimal bandwidth, you may need to slow Site Auditor down so that it can fully crawl your website. For most websites this will not be a factor, but if you need to slow down Raven's crawler, add this text to your robots.txt file:

User-agent: RavenCrawler
Crawl-delay: 5

This tells RavenCrawler to wait five seconds between each page. You can increase or decrease this number as necessary, but keep in mind that Site Auditor's crawl time maxes out at three hours, so a long Crawl-delay can prevent Site Auditor from crawling your entire website.
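To sanity-check robots.txt rules before relying on them, you can simulate how a well-behaved crawler reads them with Python's standard urllib.robotparser module. The paths below are hypothetical examples, and Site Auditor's own parser may interpret edge cases differently:

```python
# Parse a robots.txt block for RavenCrawler and check what it permits,
# using Python's stdlib parser. The /private/ path is a hypothetical example.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: RavenCrawler
Disallow: /private/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("RavenCrawler", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("RavenCrawler", "https://example.com/index.html"))         # True
print(rp.crawl_delay("RavenCrawler"))                                         # 5
```

This confirms that the Disallow rule blocks only URLs under the listed path, and that the Crawl-delay value is readable by crawlers that honor it.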
