While a robots.txt file can help guide crawlers in many ways, it also has certain limitations, according to Google:
Not all search engines support robots.txt rules
While robots.txt is a standard protocol for guiding web crawlers, not all spiders fully adhere to its directives. Some search engines may choose to ignore or only partially follow the instructions provided in a robots.txt file, potentially leading to unintended indexing of restricted content.
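For example, a robots.txt rule can ask a specific crawler to stay away from the entire site, but honouring that request is entirely voluntary. A minimal sketch (the user-agent name "BadBot" is hypothetical):

    # Ask one specific crawler to avoid the whole site.
    # Well-behaved crawlers respect this; a non-compliant bot can simply ignore it.
    User-agent: BadBot
    Disallow: /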
Crawlers vary in their interpretation of syntax
Different web crawlers may interpret the syntax and directives in a robots.txt file differently. Depending on the crawler, this can lead to inconsistent handling of the rules and, in turn, to content being indexed or excluded in ways you didn't expect, as illustrated in the sketch below.
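As an illustration (the paths are hypothetical), directives such as wildcards and Crawl-delay are handled differently from crawler to crawler: Google supports the * and $ pattern-matching syntax but ignores Crawl-delay, while some other crawlers honour Crawl-delay and may not support wildcards at all.

    User-agent: *
    Disallow: /*?sessionid=    # "*" wildcard: supported by Google and Bing, not by every crawler
    Disallow: /drafts/*.pdf$   # "$" end-of-URL anchor: also not universally supported
    Crawl-delay: 10            # honoured by some crawlers (e.g. Bingbot), ignored by Googlebot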
A disallowed page in robots.txt can still be indexed via external links
While robots.txt can prevent search engine bots from crawling a particular page directly, it doesn't stop external websites from linking to that page. If other sites link to a disallowed URL, search engines can still discover and index that URL through those links, without ever crawling the page itself, and it may then appear in search results, often showing only the URL or anchor text rather than a proper description.
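To make this concrete, the hypothetical rule below blocks crawling of /internal-report.html, yet the URL can still end up in the index if other sites link to it:

    User-agent: *
    Disallow: /internal-report.html
    # Crawlers that obey this rule won't fetch the page, but the URL itself
    # can still be indexed and shown in results if other sites link to it.

This is why Google recommends other mechanisms, such as a noindex directive or password protection, when a page must be kept out of search results entirely.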