
How to Customize Your Crawl to Get the Best Results for Your Business

A comprehensive website analysis tailored to your needs is vital for sustainable website quality management. With Ryte, you can easily customize your analysis to get the best results for your business and derive appropriate optimization measures. This article describes the various configuration options and shows you how to adapt them to your needs.

Contents

1. Introduction
2. Use Cases
3. Basic project settings
4. Advanced analysis
5. Previous analysis
6. Test settings
7. Conclusion

Introduction

Ryte uses its own crawler to analyze your website, helping you to identify issues that could harm your user experience. The crawling technology closely resembles that of Google’s crawler: it starts on the homepage of your project and makes its way from page to page by following the internal link paths. Just like the Google crawler, the Ryte bot can be controlled. For example, you can instruct the crawler to exclude certain directories, pages, or subdomains from the crawl, and therefore from your analysis. All Basic and Business Suite account owners benefit from the full functionality of the project settings.

For a quick tutorial about how to set up your project, check out this video.

You can find the crawler settings by clicking on your project settings in the top right-hand corner anywhere in the Ryte Suite. The settings are divided into "Project setup" and "Advanced analysis", and can be modified individually for every project.

Figure 1: Access your project settings within the tool

Use cases

Different businesses have different requirements for a website analysis. The following use cases demonstrate how you can customize the crawler to suit your particular needs:

1: Analyze an entire website

You want to analyze the entire website, including all subdomains. This is recommended for smaller websites.

Crawl requirements: The entire website and all subdomains should be included in the analysis.

2: Analyze a specific section of a website

As an SEO manager who works for a big company, you are responsible for the optimization of a specific directory and would like to check if all the pages in this directory are listed in the respective sitemap.xml.

Crawl requirements: Only one directory should be crawled. Analyzing the entire website would be unnecessary and would not help identify the parts of your directory that need optimization.

3: Prepare for a website migration

You are planning to relaunch your website and would like to analyze the website’s performance and check for errors before its launch. For more advice about preparing for a relaunch, check out this article.

Crawl requirements: Website analysis and performance review despite .htaccess password protection.

Basic project settings

You can make several basic adaptations to the analysis in the "project setup", or you can use the default settings. Let’s go through them step-by-step.

How many URLs should be analyzed?

Set the maximum number of URLs you want to crawl. Ideally, this should equal the number of URLs your website has. If you don’t know how many indexable URLs your website has, try a site query in Google, e.g. site:en.ryte.com. The number shown at the top of the results tells you roughly how many of the domain’s pages are listed in the Google index, which you can use as a guide. The Ryte crawler can crawl from 100 up to 21 million URLs.

Figure 2: How many URLs should be analyzed

How fast should the analysis be?

Here you can decide how fast your analysis should be. The number of parallel requests determines how many requests the crawler sends to your website at the same time. The more parallel requests you use, the faster your site will be analyzed, so large websites should set a high number of parallel requests to reduce the crawl duration. However, using more than 10 could cause your server to slow down, so check with your administrator or our support team for advice before going above that. The sketch after the figure illustrates the principle.

Figure 3: Set the number of parallel requests
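To illustrate the mechanics, here is a minimal Python sketch (not Ryte’s actual crawler) in which a thread pool’s worker count plays the role of the parallel-requests setting; the URLs and the use of the requests library are assumptions for the example.

```python
# Minimal sketch: the worker count plays the role of Ryte's
# "parallel requests" setting. Placeholder URLs, not Ryte's code.
from concurrent.futures import ThreadPoolExecutor

import requests

URLS = [f"https://example.com/page/{i}" for i in range(50)]  # placeholders
PARALLEL_REQUESTS = 10  # going higher may strain the server


def fetch(url):
    """Request one URL and return its status code."""
    return url, requests.get(url, timeout=10).status_code


# At most PARALLEL_REQUESTS requests are in flight at any moment, so a
# higher worker count finishes the list sooner but loads the server more.
with ThreadPoolExecutor(max_workers=PARALLEL_REQUESTS) as pool:
    for url, status in pool.map(fetch, URLS):
        print(status, url)
```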

What should be analyzed?

We analyze certain aspects by default with our recommended crawl setting. However, you can untick or tick boxes as you require to ensure you analyze the data you need.

Figure 4: Recommended analysis settings

Accept cookies

Here you can instruct the crawler to accept cookies if your website uses them. This option is disabled by default, because crawling without cookies reveals issues that affect users (or crawlers) who do not accept cookies, for example session IDs and cloaking. These errors are often overlooked, because browsers enable cookies by default.
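As a rough illustration of the kind of issue this surfaces, the sketch below flags links that carry session-ID query parameters, which servers often append for visitors who refuse cookies; the parameter names are common examples chosen for the sketch, not a list Ryte uses.

```python
# Sketch: detect session-ID parameters in discovered links, an issue a
# cookie-refusing crawl tends to surface. Parameter names are examples.
from urllib.parse import parse_qs, urlparse

SESSION_PARAMS = {"sessionid", "phpsessid", "jsessionid", "sid"}


def has_session_id(url):
    """Return True if the URL carries a session-ID query parameter."""
    return any(name.lower() in SESSION_PARAMS
               for name in parse_qs(urlparse(url).query))


print(has_session_id("https://example.com/cart?PHPSESSID=abc123"))  # True
print(has_session_id("https://example.com/cart"))                   # False
```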

Analyze images

The Ryte crawler treats images as independent resources and crawls them by default. If you only want to analyze your HTML content, you should untick this option. Note, however, that broken or deleted images will then not appear in the reports. For a thorough website analysis, we recommend analyzing images.
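Conceptually, treating images as independent resources just means requesting each image URL and checking its status code. A minimal sketch, assuming the requests library and a placeholder page URL:

```python
# Sketch: collect <img> sources from a page and check each one, so
# broken or deleted images surface. Placeholder URL; not Ryte's code.
from html.parser import HTMLParser
from urllib.parse import urljoin

import requests


class ImageCollector(HTMLParser):
    """Gathers the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


page_url = "https://example.com/"  # placeholder
collector = ImageCollector()
collector.feed(requests.get(page_url, timeout=10).text)

for src in collector.sources:
    image_url = urljoin(page_url, src)  # resolve relative paths
    status = requests.head(image_url, timeout=10).status_code
    if status >= 400:
        print("Broken image:", image_url, status)
```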

Crawl subdomains

If your website has a lot of subdomains, you can crawl all subdomains by ticking this box. This option is activated by default. However, if you only want to analyze the specific subdomain of your project, you should untick it.

Obey robots.txt

You can instruct the Ryte crawler to obey or ignore your robots.txt file. If you deliberately exclude content from Google via robots.txt, you can exclude it from the analysis as well. The advantage is that you will only see issues that affect your users and Google, rather than, for example, pages in an admin area that are blocked anyway. If your site has a lot of these pages, obeying robots.txt also helps save crawl resources. However, for a thorough analysis of all areas of your site, you should untick this box.
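For a sense of what obeying robots.txt means in practice, here is a sketch using Python’s standard urllib.robotparser; the /admin/ path is an invented example, and Ryte’s crawler naturally implements its own logic.

```python
# Sketch: skip URLs a site's robots.txt disallows, using the standard
# library's parser. The /admin/ path is an invented example.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://en.ryte.com/robots.txt")
parser.read()  # download and parse the live robots.txt

for url in ["https://en.ryte.com/", "https://en.ryte.com/admin/"]:
    if parser.can_fetch("*", url):
        print("Crawl:", url)
    else:
        print("Skip (disallowed by robots.txt):", url)
```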

Analyze sitemaps

Would you like the crawler to download and analyze your sitemap.xml file(s)? This is important if you want to use the "sitemap.xml" report in Website Success, and it means the crawler can identify pages that are missing from the sitemap. You can list the sitemap URLs in the advanced settings (see below).
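The comparison behind that report can be pictured as in the following sketch: download the sitemap, collect its URLs, and diff them against the crawled set. The URLs here are placeholders, and this is only an illustration of the idea.

```python
# Sketch: diff the sitemap's URLs against the crawled URLs to find
# pages missing on either side. Placeholder URLs; illustration only.
import xml.etree.ElementTree as ET

import requests

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

xml_text = requests.get("https://example.com/sitemap.xml", timeout=10).text
root = ET.fromstring(xml_text)
sitemap_urls = {loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS)}

crawled_urls = {"https://example.com/", "https://example.com/blog/"}  # placeholder

print("Crawled but missing from the sitemap:", crawled_urls - sitemap_urls)
print("In the sitemap but never crawled:", sitemap_urls - crawled_urls)
```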

Schedule regular crawls

Regular analysis of the website is vital for the sustainable quality management of your site. Click on "schedule analysis" to set a fixed interval for automatic crawls. You will be automatically notified of crawl results via email.

Figure 5: Schedule analysis

Advanced analysis

How to analyze

In this section, you can specify more advanced settings. It is not necessary to change anything here if you want to use the default settings.

Login data

Analyzing a website that is still in a testing environment is not a problem for the Ryte crawler. Under "How to analyze" - "Login data", you can save login information, enabling the crawler to access areas of your website that are password protected (a short sketch of the mechanism follows the figure below). This is particularly useful for website migrations or launches, because you can test the website before going live to find and fix any problems.

Figure 6: Save login information for password-protected websites
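.htaccess protection is usually plain HTTP Basic Auth, so conceptually the crawler only has to send credentials with each request. A minimal sketch with placeholder credentials and URL; in Ryte itself, you simply enter them under "Login data".

```python
# Sketch: fetch a staging page behind .htaccess (HTTP Basic Auth).
# Credentials and URL are placeholders.
import requests
from requests.auth import HTTPBasicAuth

auth = HTTPBasicAuth("staging-user", "staging-password")
response = requests.get("https://staging.example.com/", auth=auth, timeout=10)

# 200 means the credentials work; 401 means they were rejected.
print(response.status_code)
```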

Robots.txt behavior

The website’s robots.txt file is used to control search engine bots. Under "How to analyze" - "Robots.txt behavior", you can specify how the Ryte crawler should handle your website’s robots.txt file. You can also upload a custom robots.txt file for a more customized analysis. The difference between the "statistics only" and "only crawl allowed pages" behaviors is sketched after the figure below. The following options are available:

  • Analyze everything but create disallow statistics based on the robots.txt file

  • Only crawl pages that aren’t blocked by robots.txt

  • Analyze everything but create disallow statistics based on the custom robots.txt for your site

  • Only crawl pages that aren’t blocked by your custom robots.txt

Figure 7: Robots.txt behavior in crawl settings
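The two behaviors can be pictured with this small sketch (placeholder URLs and the standard-library parser; not Ryte’s implementation): either analyze everything and merely record what the rules would block, or actually skip the blocked URLs.

```python
# Sketch: "statistics only" vs. "only crawl allowed pages".
# Placeholder URLs; illustration of the two modes only.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://example.com/robots.txt")
parser.read()

urls = ["https://example.com/", "https://example.com/private/page"]
OBEY = False  # False = "statistics only", True = "only crawl allowed pages"

disallowed = []
for url in urls:
    if not parser.can_fetch("*", url):
        disallowed.append(url)  # the "disallow statistics"
        if OBEY:
            continue            # skip blocked URLs entirely
    print("Analyzing:", url)

print("Blocked by robots.txt:", disallowed)
```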

What to analyze

Homepage URL

The homepage URL determines the starting point of the Ryte crawler. By default, the bot starts with the homepage and follows your website’s internal linking. You can set any URL on your website as the homepage. This can be found under "What to analyze" - "homepage URL".

For example, you should set https://www.domain.com/ as the homepage URL if the website was moved from http to https. Changing the homepage URL can be necessary if crawling is restricted for a certain section of the website.

Analyze subfolder

This feature is useful if you want to optimize a specific directory on your website. In this case, crawling the entire website would be unnecessary, as you would have to process irrelevant data from other directories. To make your analysis as efficient as possible, click on "What to analyze" - "Analyze subfolder" and choose the directory you want (e.g., /category). A short sketch of the underlying filtering idea follows at the end of this subsection.

Another advantage: Limiting the crawl to the relevant directory reduces the time taken to crawl your website, meaning you can start your analysis sooner.

The important thing when limiting the crawl to a specific directory is that you remember to set the appropriate homepage URL.

Example: If you activate subfolder mode for the "Insurance" directory, you have to make sure the specified homepage URL is in this directory.

Possible variants would be:

https://www.domain.com/Insurance/

https://www.domain.com/Insurance/Private_customers/
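Under the hood, subfolder mode amounts to a prefix filter on every discovered URL, roughly like this sketch (reusing the Insurance example from above; illustration only, not Ryte’s code):

```python
# Sketch: subfolder mode as a prefix filter on discovered URLs,
# using the Insurance example from the text.
from urllib.parse import urlparse

SUBFOLDER = "/Insurance/"


def in_scope(url):
    """Keep only URLs whose path starts with the chosen directory."""
    return urlparse(url).path.startswith(SUBFOLDER)


for url in [
    "https://www.domain.com/Insurance/Private_customers/",
    "https://www.domain.com/Banking/",
]:
    print(url, "->", "crawl" if in_scope(url) else "ignore")
```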

Analyze subdomains

For smaller websites, we would recommend crawling the entire website, including all subdomains. For bigger websites, if you want to analyze a specific subdomain, you should create a separate project and then use this advanced setting to limit the crawl to the desired subdomain.

Sitemap URLs

By default, the Ryte crawler analyzes the website’s sitemap.xml file that is saved under the default path www.domain.com/sitemap.xml. However, some website operators save the sitemap.xml file under a different URL to prevent unwanted access by competitors. Other website operators might have multiple sitemaps for various subfolders. In the advanced settings, you can add any number of URLs as the sitemap.xml path. Simply click on "What to analyze" - "Sitemap URLs" and add the sitemap URLs you want to be analyzed.

Figure 8: Enter Sitemap.xml URLs

You can also disable sitemap crawling. However, reviewing the sitemap is advisable, as this helps you to quickly and easily identify URLs that are not listed in it.

Ignore/include URLs

This function works like a blacklist, allowing you to exclude entire directories, parameters, or individual pages from the crawl. At the same time, you can use it to limit the crawl to specific pages. Regex rules can also be applied here. This feature enables you to easily compile individualized analysis reports.
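As an illustration, ignore rules behave roughly like the regex filter below; the patterns (a WordPress admin directory and UTM tracking parameters) are invented examples, and in Ryte you simply enter your rules in the project settings.

```python
# Sketch: ignore rules as regular expressions. The patterns are
# invented examples, not defaults used by Ryte.
import re

IGNORE_PATTERNS = [
    re.compile(r"/wp-admin/"),       # exclude a directory
    re.compile(r"[?&]utm_[a-z]+="),  # exclude tracking parameters
]


def should_crawl(url):
    """Return False if any ignore pattern matches the URL."""
    return not any(p.search(url) for p in IGNORE_PATTERNS)


print(should_crawl("https://example.com/blog/post-1"))             # True
print(should_crawl("https://example.com/?utm_source=newsletter"))  # False
```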

If you limit the crawl to specific parts of the website, you must set an appropriate homepage URL: the crawler must be able to crawl and analyze the URL you specify.

Note: Any changes in the crawl settings always take effect in the next crawl.

Previous analysis

All recently performed crawls within a project are listed in the "Previous analysis" tab. In this table, you can see when each crawl started and finished, along with the URL limit, the number of URLs found, analyzed URLs, and ignored URLs.

Figure 9: Previous analysis

Tip: If the number of found URLs is significantly higher than the number of analyzed URLs, it is a good idea to increase the crawl limit. If the number of ignored URLs is very high, this could indicate incorrect robots.txt settings.

Test settings

You can test your crawler settings live in the "Test settings" tab. We always recommend testing the crawl settings before starting a new crawl to make sure you get the data you require.

To test your settings, simply click on "Test settings" and check whether the specified homepage URL is crawled using the desired settings.

Check the following criteria:

Status code of the website: Expected result: 200.

If you get a different status code, you should check the specified homepage URL and change it if necessary. You must ensure that none of your configurations exclude the homepage URL from the crawl.

Part of the project ("local file"): Expected result: green check mark.

If a red X is displayed, the homepage URL you specified does not match the project’s domain. Check the URL and change it if necessary. In some cases, you might also have to revise the project, which could mean deleting the existing project and creating a new one with the desired domain.

Local links: A list with the website’s internal links should be displayed.

If you get an empty list, your page either has no internal links or one of the aforementioned criteria is not met. In that case, you should either revise the homepage URL or modify your website’s internal links.

If all three criteria are met, you can rest assured that the crawl will be successful and that you will get the data you want.
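If you want to reproduce the three checks by hand, a rough sketch could look like the following; the homepage URL and project domain are placeholders, the href extraction is deliberately crude, and Ryte runs these checks for you under "Test settings".

```python
# Sketch: reproduce the three checks: status code, project membership,
# and local links. Placeholder URL/domain; crude href extraction.
import re
from urllib.parse import urljoin, urlparse

import requests

HOMEPAGE = "https://en.ryte.com/"  # the configured homepage URL
PROJECT_DOMAIN = "en.ryte.com"     # the domain the project was created for

response = requests.get(HOMEPAGE, timeout=10)
print("Status code:", response.status_code)  # expected: 200

# Check 2: the homepage URL must belong to the project's domain.
print("Part of the project:", urlparse(HOMEPAGE).netloc == PROJECT_DOMAIN)

# Check 3: count links that stay on the project's domain ("local links").
hrefs = re.findall(r'href="([^"#]+)"', response.text)
local = [urljoin(HOMEPAGE, h) for h in hrefs
         if urlparse(urljoin(HOMEPAGE, h)).netloc == PROJECT_DOMAIN]
print("Local links found:", len(local))  # expected: more than zero
```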

Figure 10: Test settings

Conclusion

Correct crawl settings help save time and effort. The various crawler settings make it easier for you to customize your analysis in the best way for your business. With customized crawl configurations, you can carry out your website analysis more efficiently and ensure that you are analyzing the data you need. You should test the settings before starting a crawl, and schedule crawls to have regular data coming in. If you're not sure how best to set up your crawler, use the recommended settings in the project setup, or get in touch with our support team for extra help.


Published on May 8, 2019 by Olivia Willson