Robots.txt Best Practice - Meaning, Example and How to Add it to Blogger and WordPress
Setting up robots.txt correctly is crucial, especially for bloggers using Blogspot as their primary blogging platform.
I have come across many questions from passionate amateur bloggers looking for the robots.txt settings that will work best for their sites. Because of that, I have compiled this guide, which I believe will go a long way toward solving this common technical issue.
What is robots.txt?
Robots.txt is a plain text file, served from the root of a site, that tells web crawlers (bots) which parts of the site they may crawl and which they should avoid.
A custom robots.txt is important for any site aiming to get its content across to a specific audience.
A robots.txt file determines which of a site's pages search engines can crawl and show in results. It is not a direct ranking factor, but a misconfigured file can keep pages out of search entirely. You might also use it for pages you don't want surfaced publicly, though note that robots.txt only asks crawlers to stay away; it does not actually password-protect those pages.
Blogger is a free blogging platform that allows writers, content creators, and publishers to create a site without paying for a domain name or hosting. Blogger provides a free subdomain, which is normally http://www.yourblog.blogspot.com, though bloggers using the platform can also buy a custom domain if they prefer not to use the free subdomain.
A common mistake bloggers make when creating personal blogs (blogging for themselves rather than for the public) on Blogger/Blogspot is failing to configure a robots.txt file, leaving crawlers' access to their sites at the platform's defaults.
New Blogspot/Blogger sites do not come with a custom robots.txt file. To set up a standard robots.txt file, you will need to create one similar to the example below:
Steps to Add a Robots.txt Text File
- Login to your Blogger dashboard.
- Go to Settings > Search Preferences > Crawlers and indexing > Custom robots.txt > Edit > Yes.
- Paste your preferred robots.txt text file.
- Click Save Changes after customizing your robots.txt to apply it to your blog settings.
If you are unsure whether your site has a robots.txt file, or you want to confirm that the robots.txt file you added is live, open the following URL in your browser:
http://www.yourblog.blogspot.com/robots.txt
for those using Google Blogger/Blogspot free domain extension and,
https://www.yourblog.com/robots.txt
for those with custom domains.
This is how a good and complete robots.txt file looks:
User-agent: Mediapartners-Google
Disallow:
User-agent: *
Allow: /
Disallow: /cgi-bin/
Sitemap: https://yourblogaddress.blogspot.com/sitemap.xml
Don’t forget to replace “yourblogaddress” in the sitemap URL above with your own domain name.
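If you want to sanity-check a robots.txt file before deploying it, Python's standard-library urllib.robotparser can parse the rules locally. The sketch below uses rules equivalent to the example above; one caveat is that Python's parser applies rules in order of appearance (first match wins), unlike Google's longest-match behaviour, so the Disallow line is listed before the blanket Allow here.

```python
import urllib.robotparser

# Rules equivalent to the example above. Disallow comes before the
# blanket Allow because Python's parser uses first-match semantics.
RULES = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /cgi-bin/
Allow: /
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(RULES)

print(rp.can_fetch("*", "/2024/01/my-post.html"))               # True: posts are crawlable
print(rp.can_fetch("*", "/cgi-bin/script"))                     # False: /cgi-bin/ is blocked
print(rp.can_fetch("Mediapartners-Google", "/cgi-bin/script"))  # True: the AdSense bot has its own empty Disallow
```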
Explanation of Robots.txt Text File Directives
To have the best user experience on your site and to use the robots.txt file effectively, you need to understand the meaning of the directives recognized by bots/spider crawlers.
USER-AGENT: This specifies which crawlers the rules that follow apply to. User-agent: * targets all crawlers, while a named agent (such as Mediapartners-Google, the AdSense crawler) gets its own block of rules.
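To see how per-agent blocks behave, here is a small sketch using Python's urllib.robotparser with hypothetical rules that block only Googlebot from a /private/ path while leaving every other crawler unrestricted:

```python
import urllib.robotparser

# Hypothetical rules: block only Googlebot from /private/, allow everyone else.
rp = urllib.robotparser.RobotFileParser()
rp.parse("""\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow:
""".splitlines())

print(rp.can_fetch("Googlebot", "/private/page.html"))  # False: the named block applies
print(rp.can_fetch("Bingbot", "/private/page.html"))    # True: falls through to the * block
```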
WILDCARDS: Wildcards play an important role in writing a robots.txt file. Below are the common symbols and their significance:
Asterisk (*) matches everything: in a User-agent line it means all crawlers, and in a path it matches any sequence of characters (for crawlers that support path wildcards).
Slash (/) refers to the root directory; a path in an Allow or Disallow rule is matched as a prefix starting from the root.
DISALLOW: Use this with caution; a misplaced rule can hide pages from search engines and hurt your site. This is where you tell robots or spiders not to crawl certain pages. If you want web crawlers to crawl all the content on your site, leave the “Disallow” directive blank.
Disallow: / means you are telling all web crawlers not to crawl any part of your content.
Disallow: /dir/myfile.html means you are telling web crawlers not to crawl a specific page on your site, but to crawl the others.
Disallow: /dir/* means you are directing web crawlers not to crawl any file under the /dir/ directory. (For most crawlers the trailing * is redundant, since Disallow: /dir/ already matches every path with that prefix.)
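The Disallow examples above can be checked locally with urllib.robotparser. Note that Python's parser treats Disallow paths as simple prefixes and does not expand * wildcards, so the prefix form /dir/ is used here instead of /dir/*:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /dir/
""".splitlines())

print(rp.can_fetch("*", "/dir/myfile.html"))  # False: under the blocked /dir/ prefix
print(rp.can_fetch("*", "/about.html"))       # True: outside /dir/, so crawling is allowed
```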
ALLOW: This robots.txt directive asks web crawlers to crawl a particular file on your site. Example:
Allow: /dir/myfile.html means web crawlers may fetch /dir/myfile.html even if the rest of /dir/ is disallowed; Allow is typically used to carve out exceptions to a broader Disallow rule.
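This carve-out pattern can also be verified with urllib.robotparser. The Allow line is listed before the Disallow line because Python's parser evaluates rules first-match; Google's longest-match rule reaches the same result either way:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Allow: /dir/myfile.html
Disallow: /dir/
""".splitlines())

print(rp.can_fetch("*", "/dir/myfile.html"))  # True: the Allow exception applies
print(rp.can_fetch("*", "/dir/secret.html"))  # False: everything else in /dir/ is blocked
```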
SITEMAP: The sitemap directive is where you point crawlers to your site's sitemap.xml file, which lists the URLs on your site. Including the sitemap in your robots.txt file is not compulsory, but it helps web crawlers discover and index your pages more reliably before displaying them in search results.
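As a final check, urllib.robotparser can also read back the Sitemap lines declared in a file, via the site_maps() method (available since Python 3.8):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow:

Sitemap: https://yourblogaddress.blogspot.com/sitemap.xml
""".splitlines())

# site_maps() (Python 3.8+) returns the Sitemap URLs declared in the file.
print(rp.site_maps())  # ['https://yourblogaddress.blogspot.com/sitemap.xml']
```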