Lesser-known facts about the robots.txt file
When managing a website, the robots.txt file is essential yet often overlooked. Missteps in configuring it can lead to unwanted results.
Below are some lesser-known details to consider when working with the robots.txt file to ensure it functions as intended.
- Case Sensitivity: The robots.txt file name should always be in lowercase. Crawlers request exactly /robots.txt and URL paths are case-sensitive, so naming the file Robots.txt or ROBOTS.TXT can lead to it being ignored by search engine crawlers. Always use robots.txt. The paths in your rules are case-sensitive as well.
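- Allow Blindness: The Allow directive was not part of the original robots.txt convention. Googlebot and Bingbot support it, and it is now part of the standardized Robots Exclusion Protocol (RFC 9309), but some other robots still don’t pay attention to it. Those robots only recognize Disallow rules, and even those they won’t necessarily follow.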
- Crawl Delay: The Crawl-delay directive in robots.txt, which aims to slow down the rate at which a bot accesses your site, is ignored by Google, while Bing respects it. If Google’s bot causes crawl issues, you can still file a special request to reduce the crawl rate.
- Limited Scope: The robots.txt file can only control access for well-behaved crawlers (e.g., Googlebot, Bingbot) that follow the Robots Exclusion Protocol. Malicious bots or scrapers often ignore it, so it is not a security measure. (BTW check out security.txt.)
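- Not Retroactive: Once a URL has been crawled and indexed, adding a Disallow directive in robots.txt won’t remove it from search results. You need to use other methods, like the noindex meta tag or the URL removal tool in Google Search Console, to remove already indexed pages. Keep in mind that a crawler can only see a noindex tag on a page it is allowed to crawl, so don’t disallow the page while you wait for it to drop out of the index.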
- UTF-8 Encoding: The robots.txt file must be encoded in UTF-8. Non-UTF-8 characters can lead to unpredictable behavior by search engine crawlers. This is especially important for sites with non-English content or special characters in URLs.
- Length Limitation: Some search engines stop reading robots.txt files that exceed a certain size. Google, for example, enforces a limit of 500 KiB and ignores any content after that point. If your file is too large, not all directives will be considered. Therefore, it’s a good idea to keep the file concise.
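- No Indexing Control: The robots.txt file is not meant to control whether a page gets indexed or who can access it; it only controls crawling. A disallowed URL can still appear in search results if other pages link to it. If you want to prevent a page from being indexed, use the noindex meta tag on the page itself. If you want to prevent it from being publicly accessible, protect it with authentication.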
- Robots.txt for Subdomains: Each subdomain needs its own robots.txt file. For example, if you have www.example.com and blog.example.com, each one needs a separate robots.txt file at its own root; the rules on one subdomain do not apply to the other.
- Interaction with Canonical Tags: The robots.txt file can influence how canonical tags are interpreted. If a canonical URL is disallowed in robots.txt, search engines might not honor the canonical directive since they cannot crawl the canonical page.
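Several of the points above can be checked locally before you deploy a robots.txt file. The following is a minimal Python sketch, not a definitive validator. It uses the standard library’s urllib.robotparser; the sample rules, the ExampleBot user agent, and the helper names (robots_url, lint, is_allowed) are made up for illustration. One caveat: Python’s parser applies the first matching rule, while Google applies the most specific one, so the Allow line is deliberately placed before the broader Disallow line.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Google's documented size limit; other crawlers may use different limits.
MAX_BYTES = 500 * 1024

# Hypothetical rules, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/press-kit.html
Disallow: /private/
Crawl-delay: 10
"""


def robots_url(page_url):
    """Return the robots.txt location for the host serving page_url.

    Every scheme + host combination (including each subdomain) has its
    own file, and the file name is always lowercase.
    """
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def lint(raw):
    """Return a list of problems found in the raw bytes of a robots.txt file."""
    problems = []
    if len(raw) > MAX_BYTES:
        problems.append("larger than 500 KiB")
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")
    return problems


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(path, agent="ExampleBot"):
    """Check whether the parsed rules let `agent` crawl `path`."""
    return parser.can_fetch(agent, "https://www.example.com" + path)


print(robots_url("https://blog.example.com/post/1"))  # https://blog.example.com/robots.txt
print(lint(ROBOTS_TXT.encode("utf-8")))               # []
print(is_allowed("/private/press-kit.html"))          # True
print(is_allowed("/private/report.html"))             # False
print(parser.crawl_delay("ExampleBot"))               # 10
```

A result of True here only means that a well-behaved crawler using this parser would be permitted to fetch the URL. It says nothing about whether the page is indexed, and crawlers that ignore Allow or Crawl-delay will behave differently.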
Hopefully, you’ve learned something new today that will help you enhance your site’s SEO and overall performance. Looking for more interesting info on robots.txt? See John Mueller’s list of the neatest robots.txt files (with artsy ASCII, job listings, or funny comments).