Troubleshooting Pages Not Crawled
What’s covered?
In this guide you’ll learn how to investigate when pages you’re expecting to see in your Moz Pro Campaign’s Site Crawl aren’t being crawled. If you’re receiving an error that your crawl couldn’t be completed, please see our guide for troubleshooting when Moz can’t crawl your site.
Overview of How We Crawl
Our Site Crawl bot, Rogerbot, finds pages by crawling all of the HTML links on the homepage of your site. It then moves on to crawl all of those pages and their HTML links, and so on. Rogerbot continues this way until it has crawled all of the pages it can find for the site, subdomain, or subfolder entered when you created your Campaign, or until it reaches the crawl limit set for your Campaign - whichever comes first.
Usually, if a page can follow a link path back to the homepage, it should end up getting crawled. If it doesn't, it may be a sign that those pages aren't as accessible as they could be to search engines or that Rogerbot is being blocked in some way.
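To make the idea of link-path crawling more concrete, here is a minimal sketch of breadth-first link discovery with a page limit, written in Python. This is only an illustration of the general approach described above - it is not Rogerbot’s actual implementation, and the scoping and limit logic are simplified assumptions.

```python
# A rough, simplified sketch of breadth-first link discovery with a page limit.
# This is NOT Rogerbot's actual implementation - just an illustration of the idea.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, urldefrag
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collects the href values of <a> tags found in a page's HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(seed_url, page_limit=100):
    scope = urlparse(seed_url).netloc           # stay within the site/subdomain that was entered
    seen = {seed_url}
    queue = deque([seed_url])
    while queue:
        url = queue.popleft()
        try:
            html = urlopen(url).read().decode("utf-8", errors="ignore")
        except Exception:
            continue                             # an error here ends this branch of the crawl
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            link = urldefrag(urljoin(url, href))[0]
            if urlparse(link).netloc != scope or link in seen:
                continue
            if len(seen) >= page_limit:          # stop discovering pages at the crawl limit
                return seen
            seen.add(link)
            queue.append(link)
    return seen
```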
Below we’ll talk about how to investigate a few of the reasons certain pages on your site may not end up getting captured in your Site Crawl.
Please note: This guide is intended to help you investigate why pages aren’t being captured in your Campaign Site Crawl. If you’re having trouble with links being crawled for the Moz Link Index and Link Explorer, please see our Moz Isn’t Finding Your Links guide.
How to Review Your Crawled Pages
If there are pages you’re expecting to see in your Site Crawl data but they’re being missed, there are a variety of reasons why this could be happening.
There are a few places in your Campaign where you can monitor and review the pages included in your Site Crawl.
First, the Site Crawl Overview and the All Crawled Pages sections of your Campaign will provide a count of the total number of pages crawled with each of your weekly site crawls. If you have a rough idea of the size of your site and you’re seeing a lower number than anticipated, there may be pages on your site which aren’t being crawled.
Within the All Crawled Pages section of your Campaign you can filter by a partial or complete URL to see if a page, or set of pages, you’re expecting to be included was captured in the crawl. Additionally, you can export a full list of your All Crawled Pages to CSV where you can filter and sort as needed.
Campaign Page Crawl Limit
When investigating why some pages may not have been captured in your Site Crawl, one of the easiest things to check is the Page Crawl Limit for your Campaign. If the Page Crawl Limit for your Campaign is lower than the total number of pages on your site, some pages may not end up being crawled.
To verify your Campaign’s Page Crawl Limit:
Head to Campaign Settings in the left hand navigation
Choose the Site Crawl settings option
Verify the Page Crawl Limit for this Campaign
Adjust your Page Crawl Limit if necessary
If you determine that you need to adjust your Page Crawl Limit, this change would be reflected in your next Site Crawl. If the Page Crawl Limit was the reason your pages were not being captured in the Site Crawl, you should then see those pages noted in the All Crawled Pages section of your Campaign along with an increase in the total number of pages crawled.
The Page Does Not Link Back to the Homepage
In order for a page to be captured in your Site Crawl, it needs to be able to be linked back to the homepage (or seed URL) for your Campaign. But what does that mean, exactly? Due to the way Rogerbot crawls your site, any page captured in your Site Crawl must have a path of links Rogerbot can follow from the homepage to reach it.
For example, a page like mywebsite.com/wishlist should end up being crawled if Rogerbot is able to follow a path of links to it from the homepage of mywebsite.com. However, a page like mywebsite.com/archives that isn’t linked to from the homepage or from any crawled page is out there on its own and may end up being missed by our crawler, since there’s no path it can follow to find it.
If there is a page, or section of pages, on your site you’re expecting to be crawled but isn’t being captured, be sure that the link path can be traced back to the homepage. In addition, if there are any broken links, critical crawler errors, or nofollow tags along that path, those may impact our ability to crawl those pages as well. We’ll discuss those issues further down in this article if you need help investigating them.
Within the All Crawled Pages section of your Campaign, you can search by partial or complete URL to see if each page within a link path was able to be crawled. This can help you to identify where the crawler got blocked or stopped, keeping the next page or link from being crawled.
Alternatively, you can export your All Crawled Pages to CSV to see all the pages crawled in your Site Crawl and their referring URLs in one document, where you can filter and sort to investigate further.
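If you’d rather work through that export programmatically, here is a hypothetical sketch of how you might walk a page’s referring URLs back toward the homepage. The file name and the column headers used here ("URL" and "Referring URL") are assumptions - check them against the headers in your own export before using anything like this.

```python
# Hypothetical sketch: trace a page's referring URLs back toward the homepage
# using the All Crawled Pages CSV export. The file name and column names are
# assumptions - adjust them to match your own export.
import csv

def trace_to_homepage(csv_path, target_url, homepage_url):
    referrer_of = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            referrer_of[row["URL"]] = row.get("Referring URL", "")

    path, current = [target_url], target_url
    while current and current != homepage_url:
        current = referrer_of.get(current, "")
        if not current or current in path:   # dead end or a loop - stop here
            break
        path.append(current)
    return path  # ends at the homepage if an unbroken, crawled link path exists

# Example with placeholder values:
# trace_to_homepage("all_crawled_pages.csv",
#                   "https://mywebsite.com/wishlist",
#                   "https://mywebsite.com/")
```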
Broken or Lost Internal Links
If you’re not seeing as many pages crawled as you’re expecting, it is a good idea to check in on your broken and/or lost internal links.
Within the Site Crawl section of your Campaign, you can find links that are redirecting to 4xx errors in the Critical Crawler Issues section.
If an internal link is redirecting to a 4xx error, our crawler won’t be able to move past that 4xx to find more links and pages.
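One quick way to see where a redirecting internal link actually ends up is to request it from the command line and follow each hop. The example below is a sketch with a placeholder URL; the exact output will vary by server, and some servers respond to HEAD requests differently than to normal page loads.

```
# Follow the redirect chain for a link and show the headers of each hop.
# /old-page is a placeholder - substitute the redirecting link you're investigating.
curl -sIL https://mywebsite.com/old-page
# If the final response is a 4xx status, our crawler stops there and can't
# discover anything linked beyond that page.
```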
Meta Tags Banning Rogerbot
Within the Site Crawl section of your Campaign, you can find pages that are marked as nofollow in the Crawler Warnings section.
If a page on your site is marked as nofollow, this tells our crawler not to follow and crawl any links on, or beyond, that page. For example, if you have a page with 10 new pages linked on it but that page is marked as nofollow in a meta robots tag or an X-Robots-Tag header, those 10 new pages will not be crawled and therefore won’t be added to your Site Crawl data.
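For reference, a page-level nofollow directive typically looks like one of the examples below. The markup here is illustrative only - check your own pages’ <head> markup and response headers for anything similar.

```html
<!-- A meta robots nofollow directive in the page's <head>:
     links on this page will not be followed by crawlers that honor it. -->
<meta name="robots" content="nofollow">

<!-- The same directive can also be sent as an HTTP response header:
     X-Robots-Tag: nofollow -->
```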
Robots.txt File Banning Rogerbot
If there are pages you’re expecting to be in the crawl which aren’t, it’s recommended that you check your robots.txt file to make sure that our crawler isn’t being blocked from accessing those.
If your robots.txt file is blocking crawlers from a subfolder or from your whole site - whether through a wildcard directive or a user-agent specific directive for rogerbot - our crawler will not be able to access and crawl the pages within that subfolder or any pages beyond it.
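For example, rules along these lines in a robots.txt file would keep our crawler out. The paths shown are placeholders, so compare them against the directives in your own file.

```
# Blocks all crawlers (wildcard) from everything in the /archives/ subfolder
User-agent: *
Disallow: /archives/

# Blocks only rogerbot, in this case from the entire site
User-agent: rogerbot
Disallow: /
```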
4xx or 5xx Errors Limiting the Crawl
Within the Site Crawl section of your Campaign, you can find pages that returned a 5xx or 4xx error to our crawler in the Critical Crawler Issues section.
5xx and 4xx errors returned in your Site Crawl can be a sign that something is amiss with your site or server. Additionally, if our crawler encounters one of these errors, it’s not able to crawl any further. This means that if pages are normally linked to from a page which returns an error to our crawler, our crawler will not find any links or pages beyond that error.
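If you want to double-check what a specific URL is returning, you can request it from the command line and print just the status code. The example below is a sketch with a placeholder URL, and the user-agent value is a simplified stand-in for our crawler’s full user-agent string - some servers respond differently depending on the user agent they see.

```
# Print only the HTTP status code a URL returns (placeholder URL, simplified user agent).
curl -s -o /dev/null -w "%{http_code}\n" -A "rogerbot" https://mywebsite.com/some-page
# A 200 here but a 4xx/5xx in your crawl data can suggest the server treats crawlers differently.
```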
JavaScript Impacting Your Crawl
Our crawler can’t parse JavaScript very well, so if your site is built with a lot of JavaScript - for example, with Wix or another site builder platform - we may not be able to find HTML links on your site to follow and crawl.
If links to your site’s pages are coded in JavaScript or hidden in blocks of JavaScript, our crawler may not be able to find them. If this is the case, those pages may not end up included in the Site Crawl for your Campaign.
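As a rough illustration, the snippet below contrasts a plain HTML link, which our crawler can follow, with links that only exist or work once JavaScript runs, which it may never see. The URLs and markup are placeholders, not taken from any particular site.

```html
<!-- A plain HTML link: discoverable in the raw HTML our crawler fetches -->
<a href="/wishlist">Wishlist</a>

<!-- A "link" that only works via JavaScript: there is no href for a crawler to follow -->
<span onclick="window.location='/wishlist'">Wishlist</span>

<script>
  // A link injected after the page loads: it isn't present in the raw HTML,
  // so a crawler that doesn't execute JavaScript will never see it.
  document.body.insertAdjacentHTML("beforeend", '<a href="/archives">Archives</a>');
</script>
```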