Every week in our OnPage’s Eleven series, we present a feature from the new OnPage.org version. In today’s article, we will show you how to customize the crawl settings in OnPage.org V3.
In the previous article, Eva explained the numerous benefits that come with the new Robots.txt Monitoring and how you can use the alerts to improve your website. This week, we will explain how you can customize the crawl settings to meet your crawl requirements and will also point out the various factors that you should pay attention to.
A comprehensive website analysis is key if you want to derive the necessary optimization measures. That is why you need the correct crawl settings for your website. OnPage.org offers extensive setting options to ensure you get the information you need from the website analyses. After all, you want your optimization measures to yield optimal results.
Experience has shown that different target groups often have different crawl expectations. Below are three common use cases that illustrate how important it is to adapt the crawl settings to your requirements.
Use case 1: Optimization of an entire website
You operate a small website and wish to analyze the complete website, including all the sub-domains.
Crawl requirements: The entire website and all sub-domains should be included in the analysis.
Use case 2: Optimization of a specific section of the website
As an SEO manager who works for a big company, you are responsible for the optimization of a specific directory and would like to check if all the pages in this directory are listed in the respective Sitemap.xml.
Crawl requirements: Only one directory should be crawled. Analyzing the entire website would be unnecessary and would not help identify the parts of your directory that need optimization.
Use case 3: Relaunch
You are planning to relaunch your website and would like to analyze the website’s performance and check for errors before its launch.
Crawl requirements: Website analysis and performance review under time pressure, despite .htaccess password protection.
With the correct settings, you can give the crawler precise instructions for any use case. This article describes the various configuration options and shows you how to adapt them to your needs.
To change the settings, click on the ‘Crawl settings’ button in the Zoom Module.
Figure 1: Crawl settings in the Zoom Module
Number of crawled URLs
Under "Number of crawled URLs", you define how many URLs should be analyzed per crawl. A crawl does not "use up" your URL budget: the full limit is available for every crawl, as often as you like.
The number of parallel requests determines how many requests the crawler sends to the website at the same time and corresponds to the number of simultaneous visitors being simulated. If your website is very large, for example, you should set a higher number of parallel requests in order to reduce the crawl duration.
Figure 2: Set the number of parallel requests
Tip: We recommend arranging server load tests with the IT department in order to avoid unwanted server failures.
.htaccess login file
Analyzing a website that is still in a testing environment is not a problem for our OnPage.org crawler. You can save the login information for the crawler in the "How to crawl" tab. This enables you to test the website before the final launch and solve any problems that might be identified.
Figure 3: Save login information for password-protected websites
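For orientation, password protection of a test environment is usually implemented with HTTP Basic Authentication. A minimal sketch of the corresponding Apache configuration, with a placeholder realm name and file path that are not taken from this article:

```apache
# .htaccess – protect the staging environment with HTTP Basic Auth
AuthType Basic
AuthName "Staging"
# Path to the password file (placeholder)
AuthUserFile /var/www/.htpasswd
Require valid-user
```

The username and password stored in the corresponding .htpasswd file are exactly the credentials you save in the "How to crawl" tab so that the crawler can authenticate itself.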
The website’s robots.txt file is used to control search engine bots. Click on "What to crawl" and, under "Robots.txt Behavior", specify how the OnPage.org crawler should handle your website’s robots.txt file.
Figure 4: Robots.txt behavior in crawl settings
The following options are available:
Follow robots.txt: The OnPage.org crawler only crawls the pages that are allowed in the robots.txt file.
Do not follow robots.txt: The OnPage.org crawler crawls all pages but also checks if the pages are blocked in the robots.txt file.
Users with a business, agency, or enterprise account also have the option to add a customized robots.txt file. The customized file only works for the OnPage.org crawler and is used instead of the website’s robots.txt file. This feature is usually advantageous if editing the robots.txt is not possible at that particular time or if you want to test new instructions.
Figure 5: Options for handling the robots.txt file
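As a reminder of what such a file looks like, here is a purely illustrative robots.txt with placeholder paths (not taken from this article):

```txt
# robots.txt – illustrative example
User-agent: *
Disallow: /internal/

Sitemap: https://www.domain.com/sitemap.xml
```

With "Follow robots.txt", pages under /internal/ would be skipped; with "Do not follow robots.txt", they would be crawled but reported as blocked. A customized robots.txt replaces this file for the OnPage.org crawler only, while search engines continue to see the original.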
The homepage URL determines the starting point of the OnPage.org crawler. By default, the bot starts with the homepage and follows your website’s internal linking. You can set any URL on your website as the homepage.
For example, you should set https://www.domain.com/ as the homepage URL if the website was moved from http to https. Changing the homepage URL can also be necessary if the crawl is restricted to a certain section of the website.
Figure 6: Enter the homepage URL
This feature is very helpful if you want to optimize a specific directory on the website. In such a case, crawling the entire website would mean additional effort, since you would have to process a lot of irrelevant data to get the information you need. Most of this extra data would relate to other directories and would therefore be of little relevance to the directory you wish to optimize. To make your optimization measures as efficient as possible, click on "What to crawl" and choose the directory you want (e.g., /category) in the "Subfolder-Mode".
Another advantage: Limiting the crawl to the relevant directory reduces the crawl duration and allows you to start the analysis sooner.
Figure 7: Activate the Subfolder-Mode
When limiting the crawl to a specific directory, it is important to remember to set an appropriate homepage URL.
Example: If you activate the Subfolder-Mode for the "Insurance" directory, make sure that the specified homepage URL lies within this directory, for example https://www.domain.com/insurance/.
When analyzing a domain, the subdomains usually play a minor role; nevertheless, you should keep an eye on the entire website structure. For small websites in particular, you can have the entire website crawled, including the subdomains. If you want to analyze a specific subdomain, you can also set it up as a separate project. In that case, you should activate the subdomain crawling mode to limit the crawl to the desired subdomain.
By default, the OnPage.org crawler analyzes the website’s Sitemap.xml file saved under the default path www.domain.com/sitemap.xml. However, many website operators save the Sitemap.xml file under a different URL in order to prevent unwanted access by competitors. In the crawl settings, you can set any URL as the Sitemap.xml path: simply click on "What to crawl" -> "Sitemap URLs" and enter the sitemap URLs you want.
Figure 8: Enter Sitemap.xml URLs
You can also disable sitemap crawling. However, reviewing the sitemap is advisable, since it helps you quickly and easily identify URLs that are not listed in the sitemap.
Under "Evaluate Sitemaps?", you can also specify whether the Sitemap.xml file should be downloaded and its content crawled. This approach is very helpful as it enables you to find out if your website has any pages that cannot be accessed via the internal link structure even though they are listed in the sitemap.
This function works like a blacklist and allows you to exclude URLs from the crawl: entire directories, parameters, or individual pages. At the same time, you can also use this feature to limit the crawl to specific pages. Regular expressions (Regex) can be applied here as well. This feature enables you to compile individualized analysis reports.
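A few illustrative exclusion patterns, written as generic regular expressions (the exact rule syntax in OnPage.org may differ, and these examples are not taken from the article):

```txt
# Exclude an entire directory
^/archive/
# Exclude URLs that carry a session parameter
[?&]sessionid=
# Exclude individual pages, e.g. all PDF files
\.pdf$
```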
You must set an appropriate homepage URL once you limit the crawl to specific parts of the website.
The crawler should be able to crawl and analyze the specified homepage URL.
Note: Any changes in the crawl settings always take effect in the next crawl.
You should always test the crawl settings before starting a new crawl.
Nothing is more disappointing than waiting for crawl results only to find that they are from the wrong pages.
To test your settings, simply click on "Test settings" and check whether the specified homepage URL is crawled using the desired settings.
The following are some of the criteria to consider:
Status code of the website: Expected result: 200.
If you get a different status code, you should check the specified homepage URL and change it if necessary. You must ensure that none of your configurations exclude the homepage URL from the crawl.
Part of the project ("local file"): Expected result: green check mark.
If a red X mark is displayed, the homepage URL you specified does not match the project’s domain. Check the URL and change it if necessary. In some cases, you might also have to revise the project, which could mean deleting the existing project and creating a new one with the desired domain.
Local links: A list with the website’s internal links should be displayed.
If you get an empty list, your page either has no internal links or one of the aforementioned criteria is not met. In that case, you should either revise the homepage URL or modify your website’s internal links.
If all three criteria are met, you can rest assured that the crawl will be successful and that you will get the data you want.
Figure 9: Test settings – Important criteria for a successful crawl
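If you want to double-check the first and the third criterion outside the tool, a small script can fetch the homepage URL, print its status code, and list the internal links it finds. A minimal sketch in Python, assuming the requests and beautifulsoup4 packages are installed; the URL is a placeholder:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

HOMEPAGE = "https://www.domain.com/"  # placeholder: your homepage URL

# Criterion 1: status code of the website – expected result: 200
response = requests.get(HOMEPAGE, timeout=10)
print("Status code:", response.status_code)

# Criterion 3: local links – the list should not be empty
soup = BeautifulSoup(response.text, "html.parser")
host = urlparse(HOMEPAGE).netloc
internal_links = {
    urljoin(HOMEPAGE, a["href"])
    for a in soup.find_all("a", href=True)
    if urlparse(urljoin(HOMEPAGE, a["href"])).netloc == host
}
print(len(internal_links), "internal links found")
```

The second criterion, whether the URL is part of the project, can only be checked in OnPage.org itself, since it depends on the project’s configured domain.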
Set crawl time
Regular analysis of the website is the key to all optimization measures. Under "Automatic crawls", you can set a fixed interval for automatic crawls. You will be automatically notified of crawl results via email.
Figure 10: Set regular intervals for automatic crawls
Correct crawl settings help save time and effort. Through targeted crawl configurations, you can reduce the crawl duration significantly and make sure that only the desired sections will be crawled and analyzed. You should also remember to test the settings and analyze the website regularly.
The following articles have been published in this series so far: