What is an XML Sitemap?
Search engines are a curious thing. A sprinkle of robot, a fistful of mind-boggling mathematics and a dash of fairy dust to top it all off.
Or so some people would have you believe.
In reality they’re quite straightforward. You provide them with high-quality, well-structured and discoverable content, and they help people to find it.
This article is going to discuss the discoverability of your content. In particular, how you can help search engine “bots” to crawl your pages and ensure their indexes stay up to date.
What is a Search Engine Bot?
Bots are simply computer programs that search engines employ to crawl the web, discovering and indexing content. This is why they are sometimes referred to as spiders.
But that’s nowhere near as cute.
If you check your logs you’re likely to find a number of different bots that visit your site periodically. They don’t stay long. Just whizzing through the content, taking a snapshot of the text (and sometimes more) before following a link off to another page.
Nothing sinister. If I hadn’t told you then you’d never know they were there.
This is the most rudimentary way that search engines discover content. An army of computer programs following link after link trying to map the internet and beyond.
But discovering content in this way is inefficient. It relies on other pages linking to your content. There’s no telling how long it might take for a bot to discover your new or updated content and report it back.
They’re great at what they do, but there are ways in which you can help. Arguably, the most efficient tool that website owners can use to enhance discoverability is the XML sitemap.
Introducing the XML Sitemap
XML is an acronym for Extensible Markup Language. This is a language used to represent the structure of your site within the sitemap.
Want to see an example of an XML sitemap? Take a look at the sitemap.xml file for our own site and view the source of the page for best results.
Seems daunting, right? Don’t panic. It’s quite straightforward once you understand what’s going on. In fact, the HTML used to mark up the pages on your website is incredibly similar to the format of an XML file.
Let’s make some sense of the Startup Heroes sitemap.
```xml
<?xml version="1.0" encoding="UTF-8"?>
```
The first line is a straightforward XML declaration. It tells interpreters that the document uses XML specification 1.0 and is encoded in UTF-8. All sitemaps are required to adopt UTF-8 encoding, so this declaration needn’t be changed.
Read more about XML declarations.
```xml
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
        xmlns:mobile="http://www.google.com/schemas/sitemap-mobile/1.0"
        xmlns:pagemap="http://www.google.com/schemas/sitemap-pagemap/1.0"
        xmlns:xhtml="http://www.w3.org/1999/xhtml"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
                            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  ...
</urlset>
```
The next tag defines a `urlset` and declares a number of specifications which the nested items will follow. These are industry-standard specifications and don’t necessarily need to be customised.
```xml
<url>
  <loc>https://www.startupheroes.co.uk</loc>
  <lastmod>2018-01-15T15:23:54+00:00</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.5</priority>
</url>
<url>
  <loc>https://www.startupheroes.co.uk/blog/</loc>
  <lastmod>2018-02-13T14:56:22+00:00</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.5</priority>
</url>
```
Within the `urlset` we define a nested set of `url` records. Each record defines a page within your site and allows you to customise certain properties.
XML Sitemap Tags
- `loc` required
- This specifies the location of the page as a URL. It should include the protocol – http or https – and a trailing slash if required by the webserver. URLs must be shorter than 2,048 characters.
- `lastmod` optional
- A timestamp representing the last date/time that the page was modified. The timestamp should be formatted in [W3C Datetime format](https://www.w3.org/TR/NOTE-datetime) and may exclude the time if necessary.
- `changefreq` optional
- Allows the website owner to specify how often the page is expected to change. Possible values can be found on the sitemaps.org website. However, search engines have largely ignored this field since 2015 in favour of `lastmod`.
- `priority` optional
- By assigning a value from 0.0 to 1.0, website owners may specify which pages on their site they deem more important than others. Once again, this value is largely ignored by modern search engines in favour of other factors.
You can read more about the structure of sitemaps and the sitemap protocol on sitemaps.org.
How to Construct a Sitemap
The simplest way to create a sitemap is by hand, with nothing more than a text editor. Writing a sitemap yourself is straightforward once you understand the required structure.
This can become tedious once you consider a site with more than a handful of pages. Additionally, a site with regular updates to pages would require constant attention to its sitemap.
There are many tools available to help website owners create and maintain an up-to-date sitemap. These range from online generators through to plugins for individual languages and frameworks.
A small selection of popular plugins include:
- Google XML Sitemaps for WordPress
- SitemapGenerator Ruby Gem
- laravel-sitemap for Laravel PHP Framework
- Our very own Middleman SEO Sitemap
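If you’d rather script the process yourself, the structure is simple enough to generate from a list of pages. Here’s a minimal sketch using Python’s standard library; the URLs, timestamps and `build_sitemap` helper are illustrative placeholders, not part of any plugin above:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Build a sitemap document from (url, lastmod) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

# Example pages - replace with your own site's URLs
pages = [
    ("https://www.example.com/", "2018-02-13"),
    ("https://www.example.com/blog/", "2018-02-13"),
]
print(build_sitemap(pages))
```

A script like this can be wired into your build or publish step, so the sitemap regenerates whenever content changes.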
Convention dictates that your sitemap should be hosted at `/sitemap.xml`. However, this is not a requirement. Best practice is to use the `robots.txt` file to identify your sitemap’s location.
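A `robots.txt` file at the root of your domain can point crawlers at the sitemap with a single `Sitemap` directive; the domain below is a placeholder:

```txt
User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```

The `Sitemap` directive takes a full absolute URL, so the sitemap doesn’t have to live at the conventional `/sitemap.xml` path.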
Larger sitemaps can eat away at bandwidth. One solution is to use gzip to compress the file. If you choose this method, you should name your file `sitemap.xml.gz`. Search engines will uncompress the file before parsing it.
It’s worth noting that there are no direct benefits to SEO from compressing your sitemap.
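As a sketch of the compression step, Python’s standard `gzip` module can produce the compressed copy; the filenames and `compress_sitemap` helper here are assumptions for illustration:

```python
import gzip
import shutil

def compress_sitemap(path="sitemap.xml"):
    """Write a gzip-compressed copy alongside the original, e.g. sitemap.xml.gz."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        # Stream the file through in chunks rather than loading it into memory
        shutil.copyfileobj(src, dst)
    return gz_path
```

The same result can of course be achieved with the `gzip` command-line tool as part of a build script.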
The alternative is to create multiple sitemap files to reduce the size of each individual file. This method utilises an index file to map each sitemap within the structure.
This method proves useful on sites where there is a clear structure present within the URL. If the structure clearly defines sections within the site (e.g. `/articles`) then it should be considered.
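A sitemap index is itself a small XML file that lists each child sitemap; the URLs below are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-pages.xml</loc>
    <lastmod>2018-02-13</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-articles.xml</loc>
    <lastmod>2018-02-13</lastmod>
  </sitemap>
</sitemapindex>
```

Each `sitemap` entry follows the same pattern as a `url` record, but its `loc` points at another sitemap file rather than a page.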
Whilst a little dated, Multiple XML Sitemaps: Increased Indexation and Traffic is still relevant and worth a read.
Submitting Your Sitemap to Search Engines
Having created your sitemap and uploaded it to your webserver, you need to ensure that search engines are able to find it.
Remember those bots we learned about earlier? They’d eventually find your site whilst naturally crawling the internet. And sure enough, they’d check for a sitemap automatically.
Once discovered, they’ll use this sitemap to learn about your site. Taking the URLs listed, they will traverse these pages and report back with their findings. Needless to say, it’s far more efficient than following links within each page (although they’ll still do this too).
The major advantage gained by providing a sitemap is that you no longer rely on internal/external links to pages for discovery. Instead, you are telling the search engines which pages you want them to index and where they can find them.
But why wait for the search bots to find you?
The easiest way to tell search engines about your sitemap is to submit it via their webmaster control panel. Both Google and Bing provide a control panel where you can upload sitemaps. As an added bonus, they’ll also analyse your sitemap and report back with any issues.
Once submitted, search engines will send their bots over periodically to check for new and updated content. In short, the more often you update your content and sitemap, the more often the bots will return and update their index. But it will take time for them to learn that your content changes regularly.
This issue can be addressed in two ways. Firstly, you could re-submit your sitemap as above each time it changes.
Alternatively you can “ping” the search engines to let them know you’ve updated your sitemap. As a result, the search bots will be dispatched to re-index your updated content.
Sending a “ping” is as simple as making a web request to the search engine’s ping address. You should include the path to your sitemap within the request URL.
You may submit the ping via your browser. Many people choose to automate the ping process as part of their publish/build phase.
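The ping itself is just a GET request with the URL-encoded sitemap location as a query parameter. Here’s a sketch of building one in Python; the endpoint and sitemap URL are illustrative, and note that search engines have since begun retiring these ping endpoints in favour of other mechanisms, so check current guidance before relying on them:

```python
from urllib.parse import quote

def build_ping_url(engine_ping, sitemap_url):
    """Construct a ping request URL for a given search engine endpoint."""
    # safe="" forces ':' and '/' in the sitemap URL to be percent-encoded
    return engine_ping + "?sitemap=" + quote(sitemap_url, safe="")

ping = build_ping_url("https://www.google.com/ping",
                      "https://www.example.com/sitemap.xml")
# The request itself could then be sent with urllib.request.urlopen(ping)
```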
For example, our Middleman SEO Sitemap plugin automatically pings both Google and Bing whenever the site is built. By automating the process, we take away the hassle of having to remember to ping the search engines every time we add content.
It makes publishing new content less time consuming.
The XML sitemap is a crucial, yet often overlooked, tool in the SEO process. An up-to-date sitemap helps search engines to discover and index your content quickly and efficiently.
You needn’t look further than the search engines themselves to see how much they rely on sitemaps. Webmaster control panels and ping URLs are provided to website owners to allow them to guide search bots towards new content.
When it comes to getting your website indexed by the search engines, many people see it as a battle. We see the search engines as our friends. At the end of the day, they want to index the best content on the internet just as much as we want them to index our pages.
By working together and making the most of the tools which have been provided, we can ensure that SEO becomes less of a chore.
If you have any questions about XML sitemaps or SEO in general, please don’t hesitate to get in touch with our team.
Let’s build something together!