March 10, 2023

The Fundamentals of Crawling for SEO – Whiteboard Friday


The author's views are entirely their own (excluding the unlikely event of hypnosis) and may not always reflect the views of Moz.

In this week’s episode of Whiteboard Friday, host Jes Scholz digs into the foundations of search engine crawling. She’ll show you why no indexing issues doesn’t necessarily mean no issues at all, and how — when it comes to crawling — quality is more important than quantity.

[Image: infographic outlining the fundamentals of SEO crawling]


Video Transcription

Good day, Moz fans, and welcome to another edition of Whiteboard Friday. My name is Jes Scholz, and today we're going to be talking about all things crawling. What's important to understand is that crawling is essential for every single website, because if your content is not being crawled, then you have no chance to get any real visibility within Google Search.

So when you really think about it, crawling is fundamental, and it's all based on Googlebot's somewhat fickle attentions. A lot of the time people say it's really easy to understand if you have a crawling issue. You log in to Google Search Console, you go to the Exclusions Report, and you check whether you have the status "Discovered - currently not indexed."

If you do, you have a crawling problem, and if you don't, you don't. To some extent, this is true, but it's not quite that simple because what that's telling you is if you have a crawling issue with your new content. But it's not only about having your new content crawled. You also want to ensure that your content is crawled as it is significantly updated, and this is not something that you're ever going to see within Google Search Console.

Say that you have refreshed an article or you've done a significant technical SEO update. You are only going to see the benefits of those optimizations after Google has crawled and processed the page. Or, on the flip side, if you've done a big technical optimization that has actually harmed your site, you're not going to see the harm until Google crawls your site.

So, essentially, you can't fail fast if Googlebot is crawling slow. So now we need to talk about measuring crawling in a really meaningful manner because, again, when you're logging in to Google Search Console, you now go into the Crawl Stats Report. You see the total number of crawls.

I take big issue with anybody that says you need to maximize the amount of crawling, because the total number of crawls is absolutely nothing but a vanity metric. If I have 10 times the amount of crawling, that does not necessarily mean that I have 10 times more indexing of content that I care about.

All it correlates with is more weight on my server and that costs you more money. So it's not about the amount of crawling. It's about the quality of crawling. This is how we need to start measuring crawling because what we need to do is look at the time between when a piece of content is created or updated and how long it takes for Googlebot to go and crawl that piece of content.

The time difference between the creation or the update and that first Googlebot crawl, I call this the crawl efficacy. So measuring crawl efficacy should be relatively simple. You go to your database and you export the created-at time or the updated-at time, and then you go into your log files and you get the next Googlebot crawl, and you calculate the time differential.
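The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: it assumes an access log in the common "combined" format and a dictionary of update times exported from your database, and the names `LOG_PATTERN`, `googlebot_hits`, and `crawl_efficacy` are hypothetical. It also trusts the user agent string, which can be spoofed, so a real setup would want to verify Googlebot separately.

```python
from datetime import datetime
import re

# Assumed layout: combined log format. Real log layouts vary by server.
LOG_PATTERN = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?:GET|HEAD) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def googlebot_hits(log_lines):
    """Yield (path, crawl_time) for lines whose user agent mentions Googlebot."""
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match and "Googlebot" in match.group("ua"):
            crawled_at = datetime.strptime(
                match.group("ts"), "%d/%b/%Y:%H:%M:%S %z"
            )
            yield match.group("path"), crawled_at


def crawl_efficacy(updated_at, log_lines):
    """Map each path to the delay between its update and the next Googlebot hit.

    updated_at: {path: timezone-aware datetime of creation or last update}.
    Paths with no Googlebot hit after the update are left out of the result.
    """
    delays = {}
    for path, crawled_at in googlebot_hits(log_lines):
        changed = updated_at.get(path)
        if changed is None or crawled_at < changed:
            continue  # unknown path, or a crawl that predates the update
        delay = crawled_at - changed
        if path not in delays or delay < delays[path]:
            delays[path] = delay  # keep the earliest crawl after the update
    return delays
```

Each value in the returned dictionary is a `timedelta`, so it can be reported in minutes, hours, or days depending on what you see for your site.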

But let's be real. Getting access to log files and databases is not really the easiest thing for a lot of us to do. So you can use a proxy. What you can do is look at the last modified date time from your XML sitemaps for the URLs that you care about from an SEO perspective, which are the only ones that should be in your XML sitemaps, and then look at the last crawl time from the URL Inspection API.
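As a rough sketch of that proxy, the snippet below reads `<lastmod>` values from a sitemap and compares them with last crawl times. It assumes you have already called the URL Inspection API yourself and collected each URL's last crawl timestamp (the `lastCrawlTime` field of the index status result) into a dictionary. The function names are hypothetical, and it assumes both sources give full ISO 8601 timestamps with a timezone.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def sitemap_lastmod(sitemap_xml):
    """Return {url: lastmod datetime} for sitemap entries that carry <lastmod>."""
    root = ET.fromstring(sitemap_xml)
    entries = {}
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc and lastmod:
            entries[loc.strip()] = parse_timestamp(lastmod.strip())
    return entries


def proxy_crawl_efficacy(lastmod_by_url, last_crawl_by_url):
    """Return {url: delay}, or None where no crawl has followed the last update.

    last_crawl_by_url: {url: last crawl timestamp string}, assumed to have
    been collected from the URL Inspection API beforehand.
    """
    delays = {}
    for url, lastmod in lastmod_by_url.items():
        raw = last_crawl_by_url.get(url)
        crawled = parse_timestamp(raw) if raw else None
        if crawled is not None and crawled >= lastmod:
            delays[url] = crawled - lastmod
        else:
            delays[url] = None  # not crawled since the reported update
    return delays
```

A `None` value is the interesting case here: the sitemap says the page changed, but the last known crawl is older than that change.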

What I really like about the URL Inspection API is that, for the URLs that you're actively querying, you can also get the indexing status when it changes. So with that information, you can actually start calculating an indexing efficacy score as well.

So looking at when you've done that republishing or when you've done the first publication, how long does it take until Google then indexes that page? Because, really, crawling without corresponding indexing is not really valuable. So when we start looking at this and we've calculated real times, you might see it's within minutes, it might be hours, it might be days, it might be weeks from when you create or update a URL to when Googlebot is crawling it.
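One way to put a number on indexing efficacy for a first publication is to poll the indexing status on a schedule and record when it first flips to indexed. The sketch below assumes you store each poll as a `(checked_at, is_indexed)` pair; how you derive `is_indexed` from the API response is left to you, and the function name is hypothetical. The result can only be as precise as your polling interval.

```python
from datetime import datetime, timezone


def indexing_efficacy(published_at, snapshots):
    """Return the delay until the first poll that saw the URL indexed.

    snapshots: iterable of (checked_at, is_indexed) pairs from your own
    polling. Returns None if no poll at or after publication saw it indexed.
    """
    for checked_at, is_indexed in sorted(snapshots):
        if is_indexed and checked_at >= published_at:
            return checked_at - published_at
    return None


# Example: published at 08:00 UTC, polled hourly, first seen indexed at 10:00.
published = datetime(2023, 3, 10, 8, 0, tzinfo=timezone.utc)
polls = [
    (datetime(2023, 3, 10, 9, 0, tzinfo=timezone.utc), False),
    (datetime(2023, 3, 10, 10, 0, tzinfo=timezone.utc), True),
    (datetime(2023, 3, 10, 11, 0, tzinfo=timezone.utc), True),
]
delay = indexing_efficacy(published, polls)
```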

If this is a long time period, what can we actually do about it? Well, search engines and their partners have been talking a lot in the last few years about how they're helping us as SEOs to crawl the web more efficiently. After all, this is in their best interests. From a search engine point of view, when they crawl us more effectively, they get our valuable content faster and they're able to show that to their audiences, the searchers.

It's also something where they can have a nice story because crawling puts a lot of weight on us and our environment. It causes a lot of greenhouse gases. So by making crawling more efficient, they're also actually helping the planet. This is another reason why you should care about this as well. So they've spent a lot of effort in releasing APIs.

We've got two APIs. We've got the Google Indexing API and IndexNow. The Google Indexing API, Google said multiple times, "You can actually only use this if you have job posting or broadcast structured data on your website." Many, many people have tested this, and many, many people have proved that to be false.

You can use the Google Indexing API to get any type of content crawled. But this is where this idea of crawl budget and maximizing the amount of crawling proves itself to be problematic because although you can get these URLs crawled with the Google Indexing API, if they do not have that structured data on the pages, it has no impact on indexing.

So all of that crawling weight that you're putting on the server and all of that time you invested to integrate with the Google Indexing API is wasted. That is SEO effort you could have put somewhere else. So long story short, Google Indexing API, job postings, live videos, very good.

Everything else, not worth your time. Good. Let's move on to IndexNow. The biggest challenge with IndexNow is that Google doesn't use this API. Obviously, they've got their own. That doesn't mean you should disregard it, though.

Bing uses it, Yandex uses it, and a whole lot of SEO tools and CRMs and CDNs also utilize it. So, generally, if you're in one of these platforms and you see, oh, there's an indexing API, chances are that it is powered by and feeding into IndexNow. The good thing about all of these integrations is it can be as simple as just toggling on a switch and you're integrated.

This might seem very tempting, very exciting, nice, easy SEO win, but caution, for three reasons. The first reason is your target audience. If you just toggle on that switch, you're going to be telling a search engine like Yandex, big Russian search engine, about all of your URLs.

Now, if your site is based in Russia, excellent thing to do. If your site is based somewhere else, maybe not a very good thing to do. You're going to be paying for all of that Yandex bot crawling on your server and not really reaching your target audience. Our job as SEOs is not to maximize the amount of crawling and weight on the server.

Our job is to reach, engage, and convert our target audiences. So if your target audiences aren't using Bing, they aren't using Yandex, really consider if this is something that's a good fit for your business. The second reason is implementation, particularly if you're using a tool. You're relying on that tool to have done a correct implementation with the indexing API.

So, for example, one of the CDNs that has done this integration does not send events when something has been created or updated or deleted. They rather send events every single time a URL is requested. What this means is that they're pinging to the IndexNow API a whole lot of URLs which are specifically blocked by robots.txt.

Or maybe they're pinging to the indexing API a whole bunch of URLs that are not SEO relevant, that you don't want search engines to know about, and they can't find through crawling links on your website, but all of a sudden, because you've just toggled it on, they now know these URLs exist, they're going to go and index them, and that can start impacting things like your Domain Authority.
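If you control the integration yourself, a guard against both problems is to submit only on real content changes and to drop anything robots.txt disallows. The sketch below is one way to do that under stated assumptions: the event names in `CONTENT_EVENTS` and the function names are hypothetical, and the payload is only built, not sent. The `host`, `key`, and `urlList` fields follow the JSON body described in the IndexNow documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical event names; use whatever your CMS or CDN actually emits.
CONTENT_EVENTS = {"created", "updated", "deleted"}


def urls_worth_submitting(events, robots_txt, user_agent="*"):
    """Keep URLs from content-change events that robots.txt allows.

    events: iterable of (event_type, url) pairs.
    robots_txt: the text of your robots.txt file.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    kept = []
    for event_type, url in events:
        if event_type not in CONTENT_EVENTS:
            continue  # e.g. a plain page request, not a content change
        if parser.can_fetch(user_agent, url) and url not in kept:
            kept.append(url)
    return kept


def indexnow_payload(host, key, urls):
    """Build (but do not send) a JSON-serializable IndexNow request body."""
    return {"host": host, "key": key, "urlList": list(urls)}
```

Filtering before submission means a misfiring event source cannot advertise URLs you have deliberately kept away from crawlers.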

That's going to be putting that unnecessary weight on your server. The last reason is whether it actually improves efficacy, and this is something you must test for your own website if you feel that this is a good fit for your target audience. But from my own testing on my websites, what I learned is that when I toggle this on and when I measure the impact with KPIs that matter, crawl efficacy, indexing efficacy, it didn't actually help me to get URLs crawled which would not have been crawled and indexed naturally.

So while it does trigger crawling, that crawling would have happened at the same rate whether IndexNow triggered it or not. So all of that effort that goes into integrating that API or testing if it's actually working the way that you want it to work with those tools was, again, wasted effort with a real opportunity cost. The last area where search engines will actually support us with crawling is in Google Search Console with manual submission.

This is actually one tool that is truly useful. It will trigger a crawl generally within around an hour, and that crawl does positively impact indexing in most cases, not all, but most. But of course, there is a challenge, and the challenge when it comes to manual submission is you're limited to 10 URLs within 24 hours.

Now, don't disregard it just because of that reason. If you've got 10 very highly valuable URLs and you're struggling to get those crawled, it's definitely worthwhile going in and doing that submission. You can also write a simple script where you can just click one button and it'll go and submit 10 URLs in Search Console every single day for you.

But it does have its limitations. So, really, search engines are trying their best, but they're not going to solve this issue for us. So we really have to help ourselves. What are three things that you can do which will truly have a meaningful impact on your crawl efficacy and your indexing efficacy?

The first area where you should be focusing your attention is on XML sitemaps, making sure they're optimized. When I talk about optimized XML sitemaps, I'm talking about sitemaps which have a last modified date time, which updates as close as possible to the create or update time in the database. What a lot of your development teams will do naturally, because it makes sense for them, is to run this with a cron job, and they'll run that cron once a day.

So maybe you republish your article at 8:00 a.m. and they run the cron job at 11:00 p.m., and so you've got all of that time in between where Google or other search engine bots don't actually know you've updated that content because you haven't told them with the XML sitemap. So getting that actual event and the reported event in the XML sitemaps close together is really, really important.
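One way to close that gap is to regenerate the sitemap from the database's own update timestamps whenever a page is published, rather than on a nightly schedule. The sketch below only shows the generation step. It assumes you can supply `(url, updated_at)` pairs with timezone-aware datetimes, the function name `build_sitemap` is hypothetical, and wiring it to your publish event is left to your stack.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML whose <lastmod> mirrors each page's update time.

    pages: iterable of (url, updated_at) pairs, where updated_at is a
    timezone-aware datetime taken straight from the database.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, updated_at in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # Normalize to UTC and emit a full timestamp, not just a date.
        stamp = updated_at.astimezone(timezone.utc).isoformat(timespec="seconds")
        ET.SubElement(url, "lastmod").text = stamp
    return ET.tostring(urlset, encoding="unicode")


# Example: an article republished at 8:00 a.m. UTC.
xml_text = build_sitemap(
    [("https://example.com/article-1",
      datetime(2023, 3, 10, 8, 0, tzinfo=timezone.utc))]
)
```

Writing a full timestamp instead of a bare date matters here, because a date alone cannot tell a crawler whether the change happened before or after its last visit that day.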

The second thing you can do is your internal links. So here I'm talking about all of your SEO-relevant internal links. Review your sitewide links. Have breadcrumbs on your mobile devices. It's not just for desktop. Make sure your SEO-relevant filters are crawlable. Make sure you've got related content links to be building up those silos.

For this, you have to go onto your phone, turn your JavaScript off, and then make sure that you can actually navigate those links without that JavaScript, because if you can't, Googlebot can't on the first wave of indexing, and if Googlebot can't on the first wave of indexing, that will negatively impact your indexing efficacy scores.
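A scripted version of that check is to look at the raw HTML your server returns, before any JavaScript runs, and confirm the links you care about are present as plain `<a href>` elements. This is a minimal sketch using Python's standard HTML parser. It assumes you have already fetched the raw HTML, the class and function names are hypothetical, and it compares `href` values exactly as written, so relative and absolute forms of the same URL will not match each other.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags present in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors that only work through scripting or point nowhere.
        if href and not href.startswith(("javascript:", "#")):
            self.hrefs.append(href)


def links_without_javascript(raw_html):
    """Return the hrefs that are navigable without executing any script."""
    collector = AnchorCollector()
    collector.feed(raw_html)
    return collector.hrefs


def missing_links(raw_html, expected_hrefs):
    """Return the expected hrefs that do not appear as plain links."""
    found = set(links_without_javascript(raw_html))
    return [href for href in expected_hrefs if href not in found]
```

Anything returned by `missing_links` is a link that depends on JavaScript to exist, such as a click handler on a `<span>`.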

Then the last thing you want to do is reduce the number of parameters, particularly tracking parameters. Now, I very much understand that you need something like UTM tag parameters so you can see where your email traffic is coming from, you can see where your social traffic is coming from, you can see where your push notification traffic is coming from, but there is no reason that those tracking URLs need to be crawlable by Googlebot.

They're actually going to harm you if Googlebot does crawl them, especially if you don't have the right indexing directives on them. So the first thing you can do is just make them not crawlable. Instead of using a question mark to start your string of UTM parameters, use a hash. It still tracks perfectly in Google Analytics, but it's not crawlable for Google or any other search engine.
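The question mark to hash swap can be done mechanically when you generate campaign links. The sketch below moves `utm_` parameters from the query string into the URL fragment and leaves every other parameter in place. The function name is hypothetical, and it is worth confirming in your own analytics setup that fragment-based parameters are picked up as expected.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def utm_to_fragment(url):
    """Move utm_* query parameters into the fragment (after '#')."""
    parts = urlsplit(url)
    if parts.fragment:
        return url  # already has a fragment; leave it alone in this sketch
    params = parse_qsl(parts.query, keep_blank_values=True)
    tracking = [(k, v) for k, v in params if k.startswith("utm_")]
    other = [(k, v) for k, v in params if not k.startswith("utm_")]
    if not tracking:
        return url  # nothing to move
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path,
         urlencode(other), urlencode(tracking))
    )


tagged = utm_to_fragment(
    "https://example.com/article?id=7&utm_source=newsletter&utm_medium=email"
)
```

Because the fragment is not part of the URL that a crawler requests from the server, every tagged variant resolves to the same crawlable address.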

If you want to geek out and keep learning more about crawling, please hit me up on Twitter. My handle is @jes_scholz. And I wish you a lovely rest of your day.

Video transcription by Speechpad.com