Troubleshooting Pages Not Crawled
What’s covered?
In this guide you’ll learn how to investigate when pages you’re expecting to see in your Moz Pro Campaign’s Site Crawl aren’t being crawled. If you’re receiving an error that your crawl couldn’t be completed, please see our guide for troubleshooting when Moz can’t crawl your site.
Overview of How We Crawl
Our Site Crawl bot, Rogerbot, finds pages by crawling all of the HTML links on the homepage of your site. It then moves on to crawl all of those pages and their HTML links, and so on. Rogerbot continues this way until it has crawled all of the pages it can find for the site, subdomain, or subfolder entered when you created your Campaign, or until it reaches the crawl limit set for your Campaign - whichever comes first.
Usually, if a page can follow a link path back to the homepage, it should end up getting crawled. If it doesn't, it may be a sign that those pages aren't as accessible as they could be to search engines or that Rogerbot is being blocked in some way.
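To make the idea of link-path crawling more concrete, here is a minimal sketch of breadth-first link discovery with a page limit, written in Python. This is only an illustration of the general approach described above - it is not Rogerbot’s actual implementation, and the scoping and limit logic are simplified assumptions.

```python
# A rough, simplified sketch of breadth-first link discovery with a page limit.
# This is NOT Rogerbot's actual implementation - just an illustration of the idea.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, urldefrag
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collects the href values of <a> tags found in a page's HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(seed_url, page_limit=100):
    scope = urlparse(seed_url).netloc           # stay within the site/subdomain that was entered
    seen = {seed_url}
    queue = deque([seed_url])
    while queue:
        url = queue.popleft()
        try:
            html = urlopen(url).read().decode("utf-8", errors="ignore")
        except Exception:
            continue                             # an error here ends this branch of the crawl
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            link = urldefrag(urljoin(url, href))[0]
            if urlparse(link).netloc != scope or link in seen:
                continue
            if len(seen) >= page_limit:          # stop discovering pages at the crawl limit
                return seen
            seen.add(link)
            queue.append(link)
    return seen
```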
Below we’ll talk about how to investigate a few of the reasons certain pages on your site may not end up getting captured in your Site Crawl.
Please note: This guide is intended to help you investigate why pages aren’t being captured in your Campaign Site Crawl. If you’re having trouble with links being crawled for the Moz Link Index and Link Explorer, please see our Moz Isn’t Finding Your Links guide.
How to Review Your Crawled Pages
If there are pages you’re expecting to see in your Site Crawl data but they’re being missed, there are a variety of reasons why this could be happening.
There are a few places in your Campaign where you can monitor and review the pages included in your Site Crawl.
First, the Site Crawl Overview and the All Crawled Pages sections of your Campaign will provide a count of the total number of pages crawled with each of your weekly site crawls. If you have a rough idea of the size of your site and you’re seeing a lower number than anticipated, there may be pages on your site which aren’t being crawled.
Within the All Crawled Pages section of your Campaign you can filter by a partial or complete URL to see if a page, or set of pages, you’re expecting to be included was captured in the crawl. Additionally, you can export a full list of your All Crawled Pages to CSV where you can filter and sort as needed.
Campaign Page Crawl Limit
When investigating why some pages may not have been captured in your Site Crawl, one of the easiest things to check is the Page Crawl Limit for your Campaign. If the Page Crawl Limit for your Campaign is lower than the total number of pages on your site, some pages may not end up being crawled.
To verify your Campaign’s Page Crawl Limit:
Head to Campaign Settings in the left hand navigation
Choose the Site Crawl settings option
Verify the Page Crawl Limit for this Campaign
Adjust your Page Crawl Limit if necessary
If you determine that you need to adjust your Page Crawl Limit, this change would be reflected in your next Site Crawl. If the Page Crawl Limit was the reason your pages were not being captured in the Site Crawl, you should then see those pages noted in the All Crawled Pages section of your Campaign along with an increase in the total number of pages crawled.
The Page Does Not Link Back to the Homepage
In order for a page to be captured in your Site Crawl, it needs to be able to be linked back to the homepage (or seed URL) for your Campaign. But what does that mean, exactly? Due to the way Rogerbot crawls your site, any page captured in your Site Crawl must have a path of links Rogerbot can follow from the homepage to reach it.
For example, a page like mywebsite.com/wishlist should end up being crawled if Rogerbot is able to follow a path of links to it from the homepage of mywebsite.com. However, a page like mywebsite.com/archives that isn’t linked to from the homepage or from any crawled page is out there on its own and may end up being missed by our crawler, since there’s no path it can follow to find it.
If there is a page, or section of pages, on your site you’re expecting to be crawled but isn’t being captured, be sure that the link path can be traced back to the homepage. In addition, if there are any broken links, critical crawler errors, or nofollow tags along that path, those may impact our ability to crawl those pages as well. We’ll discuss those issues further down in this article if you need help investigating them.
Within the All Crawled Pages section of your Campaign, you can search by partial or complete URL to see if each page within a link path was able to be crawled. This can help you to identify where the crawler got blocked or stopped, keeping the next page or link from being crawled.
Alternatively, you can export your All Crawled Pages to CSV to see all the pages crawled in your Site Crawl and their referring URLs in one document, where you can filter and sort to investigate further.
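If you’d rather work through that export programmatically, here is a hypothetical sketch of how you might walk a page’s referring URLs back toward the homepage. The file name and the column headers used here ("URL" and "Referring URL") are assumptions - check them against the headers in your own export before using anything like this.

```python
# Hypothetical sketch: trace a page's referring URLs back toward the homepage
# using the All Crawled Pages CSV export. The file name and column names are
# assumptions - adjust them to match your own export.
import csv

def trace_to_homepage(csv_path, target_url, homepage_url):
    referrer_of = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            referrer_of[row["URL"]] = row.get("Referring URL", "")

    path, current = [target_url], target_url
    while current and current != homepage_url:
        current = referrer_of.get(current, "")
        if not current or current in path:   # dead end or a loop - stop here
            break
        path.append(current)
    return path  # ends at the homepage if an unbroken, crawled link path exists

# Example with placeholder values:
# trace_to_homepage("all_crawled_pages.csv",
#                   "https://mywebsite.com/wishlist",
#                   "https://mywebsite.com/")
```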
Broken or Lost Internal Links
If you’re not seeing as many pages crawled as you’re expecting, it is a good idea to check in on your broken and/or lost internal links.
Within the Site Crawl section of your Campaign, you can find links that are redirecting to 4xx errors in the Critical Crawler Issues section.
If an internal link is redirecting to a 4xx error, our crawler won’t be able to move past that 4xx to find more links and pages.
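One quick way to see where a redirecting internal link actually ends up is to request it from the command line and follow each hop. The example below is a sketch with a placeholder URL; the exact output will vary by server, and some servers respond to HEAD requests differently than to normal page loads.

```
# Follow the redirect chain for a link and show the headers of each hop.
# /old-page is a placeholder - substitute the redirecting link you're investigating.
curl -sIL https://mywebsite.com/old-page
# If the final response is a 4xx status, our crawler stops there and can't
# discover anything linked beyond that page.
```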
Meta Tags Banning Rogerbot
Within the Site Crawl section of your Campaign, you can find pages that are marked as nofollow in the Crawler Warnings section.
If a page on your site is marked as nofollow, this tells our crawler not to follow and crawl any links on, or beyond, that page. For example, if you have a page with 10 new pages linked on it but that page is marked as nofollow in a meta robots tag or an X-Robots-Tag header, those 10 new pages will not be crawled and therefore won’t be added to your Site Crawl data.
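For reference, a page-level nofollow directive typically looks like one of the examples below. The markup here is illustrative only - check your own pages’ <head> markup and response headers for anything similar.

```html
<!-- A meta robots nofollow directive in the page's <head>:
     links on this page will not be followed by crawlers that honor it. -->
<meta name="robots" content="nofollow">

<!-- The same directive can also be sent as an HTTP response header:
     X-Robots-Tag: nofollow -->
```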
Robots.txt File Banning Rogerbot
If there are pages you’re expecting to be in the crawl which aren’t, it’s recommended that you check your robots.txt file to make sure that our crawler isn’t being blocked from accessing those.
If your robots.txt file is blocking crawlers from a subfolder or from your whole site - whether through a wildcard directive or a user-agent specific directive for rogerbot - our crawler will not be able to access and crawl the pages within that subfolder or any pages beyond it.
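For example, rules along these lines in a robots.txt file would keep our crawler out. The paths shown are placeholders, so compare them against the directives in your own file.

```
# Blocks all crawlers (wildcard) from everything in the /archives/ subfolder
User-agent: *
Disallow: /archives/

# Blocks only rogerbot, in this case from the entire site
User-agent: rogerbot
Disallow: /
```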
4xx or 5xx Errors Limiting the Crawl
Within the Site Crawl section of your Campaign, you can find pages that returned a 5xx or 4xx error to our crawler in the Critical Crawler Issues section.
5xx and 4xx errors returned in your Site Crawl can be a sign that something is amiss with your site or server. Additionally, if our crawler encounters one of these errors, it’s not able to crawl any further. This means that if pages are normally linked to from a page which returns an error to our crawler, our crawler will not find any links or pages beyond that error.
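If you want to double-check what a specific URL is returning, you can request it from the command line and print just the status code. The example below is a sketch with a placeholder URL, and the user-agent value is a simplified stand-in for our crawler’s full user-agent string - some servers respond differently depending on the user agent they see.

```
# Print only the HTTP status code a URL returns (placeholder URL, simplified user agent).
curl -s -o /dev/null -w "%{http_code}\n" -A "rogerbot" https://mywebsite.com/some-page
# A 200 here but a 4xx/5xx in your crawl data can suggest the server treats crawlers differently.
```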
JavaScript Impacting Your Crawl
Our crawler can’t parse JavaScript very well, so if your site is built with a lot of JavaScript - for example, with Wix or another site builder platform - we may not be able to find HTML links on your site to follow and crawl.
If links to your site’s pages are coded in JavaScript or hidden in blocks of JavaScript, our crawler may not be able to find them. If this is the case, those pages may not end up included in the Site Crawl for your Campaign.
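As a rough illustration, the snippet below contrasts a plain HTML link, which our crawler can follow, with links that only exist or work once JavaScript runs, which it may never see. The URLs and markup are placeholders, not taken from any particular site.

```html
<!-- A plain HTML link: discoverable in the raw HTML our crawler fetches -->
<a href="/wishlist">Wishlist</a>

<!-- A "link" that only works via JavaScript: there is no href for a crawler to follow -->
<span onclick="window.location='/wishlist'">Wishlist</span>

<script>
  // A link injected after the page loads: it isn't present in the raw HTML,
  // so a crawler that doesn't execute JavaScript will never see it.
  document.body.insertAdjacentHTML("beforeend", '<a href="/archives">Archives</a>');
</script>
```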