How does Google perceive the robots.txt file and its role in SEO?

It is very important to keep in mind that, in order to appear in search results, the first thing we must achieve is for Google (or any other search engine) to be able to interpret our website effectively.

Therefore, there are several elements to consider when it comes to web indexing. The first, and one of the most important, is the robots.txt file, which is responsible for telling robots which pages and files they can access on our website.

What is a robots.txt file?

The robots.txt file is in charge of telling robots (bots, crawlers…) which pages or files of a website they may or may not request. By means of this file we can “communicate” directly with the crawlers.

What is the robots.txt file used for?

Principally, the robots.txt file is used to avoid overloading the server with requests and thus to manage crawler traffic on the website, since in this file we indicate the content that should be crawled and the content that should not be crawled.

*It is important to note that blocking or not blocking pages has a different purpose than the “no-index” tag, which we will explain below.

How to view the robots.txt file?

The robots.txt file is located in the root of the domain, for example: www.nombreweb.com/robots.txt. It is here that we include the different elements that tell bots, crawlers, etc. which pages should be crawled and which should not. The file can be created with almost any text editor, as long as it can create standard UTF-8 text files.

How to implement/modify the robots.txt file in WordPress?

By default, WordPress implements a generic robots.txt file:
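By way of illustration, the generic file WordPress serves usually looks something like this (the exact content may vary between installations and plugins):

  User-agent: *
  Disallow: /wp-admin/
  Allow: /wp-admin/admin-ajax.php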

How to attract the attention of Google’s robot to visit my website?

As we have previously mentioned, the sitemap is the gateway to a website, and combining it with a good internal linking strategy is essential for correct web positioning. In addition, Google's robot tends to visit pages with fresh, regularly updated content, so having a web content strategy is essential.


But how can you modify it? Find out below how to do it in WordPress:

You can edit it either from the hosting FTP or through plugins that can be installed in WordPress, such as Yoast SEO or Rank Math. It is important to keep in mind that editing the file incorrectly can significantly affect SEO results. Therefore, it is very important to know what each parameter means and how each of them affects our website.

If you use Rank Math in WordPress, you will have to go to General Settings > Edit robots.txt file.


What to consider for a correct implementation of the robots.txt file?

It is very important to take into account different aspects that Google highlights for a correct implementation:

  • There can only be one file per website, and it must be named exactly robots.txt.
  • It can be implemented individually on each subdomain of a web page.
  • The robots.txt file consists of one or more groups of specific directives (always one per line), including:
    • which crawlers they apply to (user-agent)
    • which directories or pages that user-agent can access and which it cannot.
  • A user-agent may crawl every page that is not blocked by a disallow rule. Groups are processed in the order they appear in the file, and each crawler follows only one group: the first, most specific group that matches its user-agent.
  • If two rules conflict, for Google and Bing the directive with the longer (more specific) path always “wins”. For example, Disallow: /page/ prevails over Allow: /page because its path is more specific. If both paths have the same length, the less restrictive directive (Allow) prevails (see the example after this list).
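As an illustrative sketch of this precedence (the directory names are only examples), the following file blocks /page/ for Googlebot while the longer, more specific allow rule keeps /page/public/ crawlable:

  User-agent: Googlebot
  Disallow: /page/
  Allow: /page/public/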

Know the main parameters of the robots.txt file!

Now that you know what the robots.txt file is, what it is for, how to implement it and what to keep in mind to do it correctly, discover below the main elements you need to know in order to interpret and implement the file:

  • User-agent: this is how the crawlers are identified and it defines which directives they will follow; it must always be included in each group. It is very important to know the names of the different crawlers, such as Google's “Googlebot”, Bing's “Bingbot” and Baidu's “Baiduspider”. Using the wildcard character (*) applies the directives to all crawlers.
  • Allow and disallow directives: these directives tell the user-agent specifically which pages it should crawl (allow) and which pages or files it should not crawl (disallow). Each group must contain at least one of these directives.
    • Allow: is ideal for telling crawlers that they can crawl a particular section of a directory blocked by the disallow directive.
    • Disallow: to block a directory with this directive, you must specify its full name, including the slash (/) at the end.

Allow and disallow directives: how to give or deny customized access to robots

When configuring the different “allow” and “disallow” directives it is important to take into account different aspects:

*An incorrect implementation can affect the results of the page in search engines.

If we leave the robots.txt file as follows, it will not block any directory:

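An illustrative version with the disallow value left empty, so nothing is blocked:

  User-agent: *
  Disallow: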

However, if for some reason the slash (/) is added to the disallow directive, it would block crawling of the entire website, so it would not appear in search engines. This is not recommended unless there is a sound reason for it.

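An illustrative version with the slash added, which blocks the entire site:

  User-agent: *
  Disallow: /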

Defining a directory and writing it between slashes will block only that directory from being crawled, for example /wp-admin/. It is very important to note that if you do not include the final /, robots will be blocked from crawling any page whose path starts with /wp-admin.

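For example, an illustrative file that blocks only the /wp-admin/ directory:

  User-agent: *
  Disallow: /wp-admin/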

If, within a blocked directory, there are subdirectories that you do want to be crawled, they should be included with an Allow directive.

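A sketch of this pattern (the directory names are hypothetical):

  User-agent: *
  Disallow: /private/
  Allow: /private/public-docs/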

Other parameters to be taken into account for the robots.txt file

Previously we explained how, through the user-agent, the directories or URLs, and the allow and disallow directives, you can tell robots which parts of a website they can or cannot crawl. Below we detail other parameters that you may come across and that can be very useful. Keep in mind that every website is different, and depending on your objectives you should analyze carefully whether any of these parameters are of interest to you and why.

The (*): indicates “any”.

The asterisk acts as a wildcard meaning “any”. Using “User-agent: *” indicates that the directives apply to all robots, and by combining it with the disallow parameter you can specify the directories that you do not want the bots to access.

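For example (the blocked directories are only illustrative):

  User-agent: *
  Disallow: /wp-admin/
  Disallow: /checkout/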

You can also use (*) within URLs, at the beginning or in the middle, with the same meaning: “all/any”. This allows you to block any URL such as www.miweb.com/retail/red/jumper or www.miweb.com/retail/small/jumper.
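A sketch of such a rule for the paths above (the pattern itself is illustrative):

  User-agent: *
  Disallow: /retail/*/jumper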

The ($): indicates the end of a URL

With the $ symbol you indicate to robots the end of a URL. For example, if you add “/*.php$” to the disallow parameter, you will block every URL that ends in .php. However, a URL such as “.php/anyterm” does not end that way, so it can still be crawled.

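For instance, an illustrative rule that blocks every URL ending in .php:

  User-agent: *
  Disallow: /*.php$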

Blocking specific crawlers' access to the website

If we want a specific robot not to crawl the website, whether for strategic reasons or simply because we are not interested in it, we indicate it as follows:

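For example, to keep one specific crawler away from the entire site (Baiduspider is used here purely as an example):

  User-agent: Baiduspider
  Disallow: /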

The (#): allows you to add comments

If you want to add a note about some aspect without addressing the robots, you must do so using the # symbol. Bots ignore everything that comes after the #.

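An illustrative snippet with a comment:

  # Block the admin area for all crawlers (bots ignore this line)
  User-agent: *
  Disallow: /wp-admin/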

What is the difference between Disallow and the “no-index” tag?

In the robots.txt file you can use other parameters that help you block robots from certain URLs, such as URLs with parameters, which are created when users use a website's internal search engine or filter products with specific criteria. The same applies when we do not want a page to be indexed in search engines because it is not relevant, such as the legal notice or privacy policy pages. Even so, before adding a disallow directive to the robots.txt file, it is important to analyze whether its inclusion benefits the website's strategy, depending on the specific objectives of each page.

Indexing control with the “robots” meta tag

The “robots” meta tag allows you to specify, at page level, how content should be treated in search results, especially when you do not want a page to appear. However, for a robot to apply the directive correctly, it must be able to read it. Therefore, blocking in the robots.txt file a URL that carries the “no-index” tag would be a mistake, since it prevents access to that page and therefore prevents the directive from being read.

Disallow vs “No-index”: Which is the best option for URLs with parameters?

It is important to ask yourself the following questions, as the best option will vary depending on the objectives of each website:

  • Is it relevant for robots to crawl the URLs with parameters that are created when a user uses the website’s search engine?
  • Is it relevant for robots to spend time crawling the URLs generated when a user applies product filters?

Based on the answers, it is time to start designing the strategy:

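A sketch of such a strategy, assuming the internal search uses the “s” parameter (as WordPress does) and a hypothetical “filter” parameter for product filters:

  User-agent: *
  # Block internal search result URLs (parameter name assumed)
  Disallow: /*?s=
  # Block product-filter URLs (parameter name is hypothetical)
  Disallow: /*?filter=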

Finally, even if you decide to block the URLs generated by users' searches on the website, you can make specific exceptions for terms of interest, which will help you increase visibility. Here is an example:

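For instance, a sketch that blocks search URLs in general but allows one specific term of interest (the term “jumper” is only illustrative):

  User-agent: *
  Disallow: /*?s=
  Allow: /*?s=jumper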

How to block URLs with the canonical tag?

The canonical tag is ideal for avoiding duplicate content on a website. Often, this tag is applied to URLs with parameters whose content is very similar to the main page of a product or category, precisely to avoid duplicate content problems. However, if URLs with parameters are blocked in the robots.txt file, robots will be prevented from accessing the information and therefore will not be able to identify the main (canonical) page. John Mueller of Google stated that this is especially relevant when using product filters.


Finally, another option we can consider is to block URLs with specific parameters, using the Google Search Console tool.

Is it mandatory to include the sitemap in the robots.txt file?

Including the sitemap in the robots.txt file is not mandatory. However, it provides information about the structure of the website, so it is recommended to include it in order to indicate to Google the content we are interested in having crawled.
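The sitemap is referenced with the Sitemap directive, for example (the URL is illustrative, reusing the domain from the example above):

  Sitemap: https://www.nombreweb.com/sitemap.xml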

How to verify a correct implementation of the robots.txt file?

Finally, once all the aspects mentioned above have been reviewed and adjusted, it only remains to make sure that the robots.txt file is implemented correctly on the website. This can be done with the “robots.txt Tester” tool in Google Search Console, or manually, URL by URL.


Learn more about the robots.txt file and how Google interprets it.

For more information about how Google interprets the robots.txt file of your website, please contact us.