A guide to website crawling and indexing: How to make sure your website is being crawled properly
Search engine crawlers are always at work. Whether they’re discovering brand-new content or revisiting pages that have been updated, they are constantly looking for websites to index. If your website is not properly crawled or indexed by search engines, it can cause major headaches for your SEO, meaning your target audience can’t see or access it!
This article provides an overview of what website crawling and indexing are and how to tackle any problems that may arise with them.
How do web crawling and indexing work?
Crawling and indexing are the main processes search engines like Google use to find web pages. Before going into how to combat crawling or indexation issues, it is important to understand how a search engine uses these processes to help rank your webpage.
When a user enters a query into a search engine, the results they are presented with come from the search engine’s own web index, which it ranks based on its algorithm. Google’s web index and Bing’s web index may not be completely identical, and they will rank the pages in their indexes differently. In some situations, you may not want your webpage to be crawled or indexed at all; this is discussed later.
To appear in the search engine results pages (SERPs), your web page must be in the web index. Web pages are added to the index once they have been analysed by web crawlers, so it’s useful to know how crawlers behave. This blog will provide an overview of how crawling works and how to identify and solve any crawl issues you may encounter.
The knock-on effect of this is that all the important pages on your site will be shown in the SERPs and your organic traffic and visitors will increase! Easy, right?
How do web crawlers work?
Web crawlers are powerful tools that are utilised by search engines to access and read web pages. They are continually running and their goal is to find new content to update the index with; this could be a new webpage or previously crawled pages that have been updated with new content. Web crawler tools have a variety of names and are referred to as ‘bots’, ‘spiders’ or ‘crawlers.’
If the goal of crawlers is to update the index with new content, then it is crucial to know:
- How do crawlers find new web pages?
- How frequently are web pages crawled and how do crawlers prioritise which web pages to crawl?
Crawlers discover new pages to crawl by following links on the pages they read, and they repeat this process constantly. Read a page. Follow its links. Repeat. Any new or updated web pages they find are added to the web index.
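The read-follow-repeat loop can be sketched as a toy breadth-first crawler. This is purely illustrative: it reads pages from an in-memory dictionary rather than over HTTP, and `LinkExtractor`, `crawl` and the example pages are invented names for this sketch, not part of any real search engine’s crawler:

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch):
    """Breadth-first crawl: read a page, follow its links, repeat.
    `fetch` maps a URL to its HTML (a real crawler would use HTTP)."""
    seen = {start_url}
    queue = deque([start_url])
    index = {}  # toy "web index": url -> links found on that page
    while queue:
        url = queue.popleft()
        parser = LinkExtractor()
        parser.feed(fetch(url) or "")
        index[url] = parser.links
        for link in parser.links:
            if link not in seen:  # only queue pages not yet read
                seen.add(link)
                queue.append(link)
    return index

# A tiny fake web standing in for real pages:
pages = {
    "/home": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "/about": '<a href="/home">Home</a>',
    "/blog": '<a href="/blog/post-1">Post 1</a>',
    "/blog/post-1": "",
}
print(sorted(crawl("/home", pages.get)))
```

Note that every page ends up in the index even though only `/home` was given as a starting point; that is exactly why internal links matter for discovery.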
Crawling and recrawling billions of web pages is a monumental task, so there are policies in place that help crawlers prioritise which web pages to crawl and how frequently to crawl webpages.
Crawlers will prioritise crawling web pages that are perceived as valuable. Signals such as the number of links pointing to a page and its traffic volume suggest that it contains valuable information. Popular pages, alongside pages with fresh content that have recently been updated, will be crawled more frequently.
One other policy that crawlers have is referred to as a crawl budget. This is the number of pages crawled on a website within a certain timeframe. Any page not read before this budget is used up will not be added to the web index, and its links will not be followed; this can cause issues in a number of scenarios. For example, if a page the crawlers failed to read contained the only internal link to another page on your site, neither page would be indexed.
The crawl budget is typically a concern for websites with 10,000+ pages. If you focus on providing visitors with keyword-focussed content that satisfies their search query, you will be crawled naturally. However, it is still good practice to ensure that pages with high conversions or valuable content are crawled and indexed properly so you don’t miss out on any potential visitors.
You can use Google Search Console to get a comprehensive overview of Google’s crawling behaviour. Its crawl stats report gives you information on:
- How frequently your website has been crawled
- Your average page availability
- Crawl errors
- Successful and unsuccessful crawl requests
- Page response time for crawl requests
How to make sure your pages are crawled
There are certain things you can do to help make sure that your necessary web pages are being crawled correctly.
- Create a sitemap and upload it to Google Search Console
A sitemap lists links to all the pages on your site and helps make sure they’re discovered and crawled by Google. When crawlers begin the crawling process, they start with websites that have produced good crawl results in the past and websites that have uploaded a sitemap. Therefore, to make sure Google crawls the right pages on your site, it’s a good idea to upload one.
Note that the sitemap you uploaded is static: any web pages updated or created afterwards are not automatically added to it. You’ll need to include these in a fresh one to ensure they’re crawled and indexed.
- How often should you update your sitemap?
It’s good practice to update your sitemap once a month. Your pages could still be crawled naturally without a sitemap, but creating one is a good way to ensure your necessary pages are crawled and added to the web index.
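As a sketch, a sitemap is simply an XML file listing the URLs you want crawled; the domain, paths and dates below are placeholders, not recommendations. Once saved (typically as sitemap.xml at your site root), it can be submitted through the Sitemaps report in Google Search Console:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2022-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/services</loc>
    <lastmod>2022-01-10</lastmod>
  </url>
</urlset>
```

The optional `<lastmod>` date helps crawlers spot which pages have fresh content and so deserve a recrawl.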
- Solid internal linking structure and site architecture
A good internal linking structure helps crawlers navigate your website and provides a clear path between all your pages, so make sure every page is linked to internally. Additionally, identify and fix any broken links, as these stop crawlers from finding new pages and may have a further impact on your ranking placements.
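One simple check follows from this: a page that no other page links to (an “orphan” page) can never be reached by a crawler following links. A minimal sketch, assuming you already have a map of each page’s internal links; `find_orphan_pages` and the example site structure are hypothetical:

```python
def find_orphan_pages(links, entry):
    """Return pages with no inbound internal links. Crawlers following
    links from the `entry` page (e.g. the homepage) can never reach
    them. `links` maps each page to the pages it links out to."""
    linked_to = {target for targets in links.values() for target in targets}
    # The entry page needs no inbound link, so exclude it.
    return sorted(set(links) - linked_to - {entry})

# Hypothetical site structure: nothing links to "/old-offer".
site = {
    "/home": ["/services", "/blog"],
    "/services": ["/home"],
    "/blog": ["/home", "/services"],
    "/old-offer": ["/home"],
}
print(find_orphan_pages(site, "/home"))  # → ['/old-offer']
```

In practice a site crawler tool (or your CMS) would produce the link map; the point is that orphan pages need an internal link, or a sitemap entry, before search engines can find them.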
- Page load speed
You need to make sure your pages load quickly so that crawlers can read and index them properly. As mentioned with the crawl budget, a crawler has limited time to spend on your site before it stops and moves on. Optimising your page speed saves crawl budget: faster websites can handle more crawl requests and so have more pages crawled.
- Don’t waste crawl budget on unnecessary pages
Make sure unnecessary pages are removed or noindexed to save crawl budget, leaving more of it for the pages you want indexed. Pages such as thank-you pages or terms and conditions pages do not need to appear in search engine results. You can prevent these pages from being crawled, freeing up crawl budget for the more important pages you want indexed.
In order to restrict crawl access, you need to create and edit a robots.txt file for your website. The robots.txt file is used to manage crawlers, directing how you want them to read your website; in this particular case, telling them which URLs they can or cannot crawl. In the robots.txt file, you state the type of bot you want to block and the URLs you want to prevent it from crawling.
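As a sketch, a robots.txt file placed at the root of your domain might look like the following; the paths here are illustrative examples, not recommendations for your site:

```txt
# Block all crawlers from low-value pages (example paths)
User-agent: *
Disallow: /thank-you/
Disallow: /terms-and-conditions/

# Block only Google's crawler from a specific directory
User-agent: Googlebot
Disallow: /internal-reports/

# Point crawlers at your sitemap
Sitemap: https://www.example.com/sitemap.xml
```

One caveat worth knowing: robots.txt stops pages being crawled, not necessarily being indexed, so a blocked page can still appear in results if other sites link to it.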
Additionally, if you do not want a page to be indexed by Google, adding a ‘noindex’ meta tag (optionally alongside ‘nofollow’) in the HTML head section tells search engines not to index the page; note that they will still crawl it, since they have to read the tag. You will want to prevent indexing in cases where you do not want a webpage to appear in organic search results, for example, a landing page from a paid advertising campaign that contains a special offer. You can also use the ‘Remove URLs Tool’ in Google Search Console to request removal of already-indexed pages from the results.
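For example, placing the following standard robots meta tag inside a page’s head section asks search engines not to index the page or follow its links (drop ‘nofollow’ if you still want its links followed):

```html
<head>
  <!-- Tell search engines not to index this page or follow its links -->
  <meta name="robots" content="noindex, nofollow">
</head>
```

Because crawlers must fetch the page to see this tag, don’t also block the page in robots.txt, or the tag will never be read.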
- Get rid of duplicate content
Check that you don’t have any duplicate pages on your website, meaning pages that are exactly the same as, or very similar to, one another. Duplicate content eats up your crawl budget, as crawlers end up reading the same page twice, and it also has a negative impact on how the algorithm ranks your pages.
- Crawl errors
Crawl errors happen when crawlers have problems accessing your web pages. You can use the Crawl Errors report in Google Search Console to identify any problems crawlers encountered when trying to crawl your site. Crawl errors are split into two categories, site errors and URL errors, and include problems such as DNS errors or 404 errors. Moz has a good guide on how to identify and fix many of the common crawl issues you may encounter.
When do I not want a webpage to be crawled/indexed?
There may be some cases where you do not want or need a page to be crawled; preventing it frees up your crawl budget and helps make sure the correct pages are crawled. These are the types of pages you usually want to keep from being crawled:
- Privacy and policy pages
- Pages that duplicate, or contain extremely similar content to, another page
- Landing pages for adverts
- Low-value pages. These may contain outdated content, poor-quality or thin content, or have a low E-A-T rating.
- Pages you do not want the public to access from the SERPs. For example, if a page is part of a marketing campaign, you may only want users with a specific link to access it in order to protect the quality of the data.
Before you take steps to restrict pages from being crawled or remove pages from the web index, it is important to do a full site content audit to understand which pages you want crawled and which you do not.
Ensuring your website is crawled and indexed properly is necessary for it to appear in the SERPs and for users to find it when using a search engine. Although these are automatic processes carried out by search engines, the tips in this article should help you make sure your pages are being crawled correctly and none of your crawl budget is wasted.
If you have any questions about crawling or indexing or want to enquire about our SEO services that can help you optimise your crawl budget and more, don’t hesitate to get in touch with our team!