How to disable indexing so that Windows does not slow down


Sometimes you want to download a free music album from 2007 released by an artist that three and a half people know about. You find a torrent file, launch it, the download reaches 14.7% and... that's it. Days and weeks pass, and the download stays put. You start searching for the album on Google, scour forums and finally find links to some file hosting services, but they stopped working long ago.

This is happening more and more often - copyright holders are constantly closing useful resources. And while finding popular content is still not a problem, finding a television series from seven years ago in Spanish can be extremely difficult.

Whatever you need on the Internet, there are a number of ways to find it. We offer all of the following options solely for viewing the content, but in no case for theft.

Usenet

Usenet is a distributed network of servers between which data is synchronized. Usenet's structure resembles a hybrid of a forum and e-mail. Users can connect to special groups (Newsgroups) and read or write something in them. As with mail, messages have a subject line, which helps define the topic of the group. Today, Usenet is used primarily for file sharing.

Until 2008, large Usenet providers only stored files for 100–150 days, but then files began to be stored forever. Smaller providers leave content for 1,000 days or more, which is often sufficient.

Around mid-2001, Usenet began to be noticed by copyright holders, forcing providers to remove copyrighted content. But enthusiasts quickly found a workaround: they began to give the files confusing names, protect the archives with passwords, and add them to special sites that can only be accessed by invitation.

In Russia, almost no one knows about Usenet, which cannot be said of countries where the authorities diligently fight piracy. Unlike with the BitTorrent protocol, a Usenet user's IP address cannot be determined without the help of the Usenet provider or the ISP.

How to connect to Usenet

In most cases, you won't be able to connect for free. Free access means putting up with either short file retention, low speed, or access only to text groups.

Providers offer two types of paid access: a monthly subscription with an unlimited amount of downloaded data, or time-unlimited plans with a traffic cap. The second option is for those who only occasionally need to download something. The largest providers of such services are Altopia, Giganews, Eweka, NewsHosting and Astraweb.

Now you need to figure out where to get NZB files - files with metadata, something like torrent files. For this purpose, special search engines called indexers are used.

Indexers

Public indexers are full of spam, but they are still good for finding files uploaded five or more years ago.

Free indexers that require registration are better suited for finding new files. They are well structured, and entries include not only titles but also descriptions with pictures.

There are also indexers for specific types of content: for example, anizb caters to anime fans, and albumsindex to those looking for music.

Download from Usenet

As an example, let's take Fraser Park (The FP), a little-known film from 2011 whose 1080p version is almost impossible to find. You need to find its NZB file and feed it to a program like NZBGet or SABnzbd.
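If you prefer the command line, here is a rough sketch of queueing the file with NZBGet (the NZB file name is made up for illustration, and the web interfaces of NZBGet and SABnzbd work just as well; -D and -A are the daemon and append switches of the NZBGet client, so double-check them against your version):

nzbget -D                        # start NZBGet in daemon mode (once)
nzbget -A The.FP.2011.1080p.nzb  # append the NZB file you found to the download queue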

How to download via IRC

You will need an IRC client; almost any will do, since the vast majority support DCC. Connect to the server you are interested in and start downloading.

Largest servers with books:

  • irc.undernet.org, room #bookz;
  • irc.irchighway.net, room #ebooks.

Movies:

  • irc.abjects.net, room #moviegods;
  • irc.abjects.net, room #beast-xdcc.

Western and Japanese animation:

  • irc.rizon.net, room #news;
  • irc.xertion.org, room #cartoon-world.

You can use the !find or @find commands to search for files. The bot will send the results as a private message. If possible, prefer the @search command: it queries a special bot that returns the search results as a single file rather than a huge stream of text.
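A hypothetical exchange (the bot name and file name are invented for illustration; the exact syntax is usually announced in the channel topic):

@search how music got free                        <- the search bot replies with a results file over DCC
!SomeBot Stephen Witt - How Music Got Free.epub   <- request a specific file from the bot that offers it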

Let's try downloading How Music Got Free, a book about the music industry written by Stephen Witt.



The bot responded to the @search request and sent the results as a ZIP file via DCC.


We send a download request.


And we accept the file.



If you found a file using the indexer, then you do not need to search for it on the channel. Simply send a download request to the bot using the command from the indexer site.

DC++

In a DC network, all communication goes through a server called a hub. On the hub you can search for specific types of files: audio, video, archives, documents, disk images.

Sharing files in DC++ is very easy: just check the box next to the folder you want to share. Due to this, you can find something completely unimaginable - something that you yourself have long forgotten about, but that may suddenly be useful to someone.

How to download via DC++

Any client will do. For Windows, the best option is FlylinkDC++. Linux users can choose AirDC++ Web, among other clients.

Searching and downloading are convenient: enter a query, select a content type, click "Search" and double-click a result to download the file. You can also view the full list of files a user shares and download everything from a selected folder. To do this, right-click on the search result and select the appropriate item.



If you don't find something, try again later. Often people turn on the DC client only when they themselves need to download something.

Indexers

The built-in search only finds files in the lists of online users. To find rare content, you need an indexer.

The only known option is spacelib.dlinkddns.com and its mirror dcpoisk.no-ip.org. The results are presented as magnet links; when clicked, the files immediately start downloading through the DC client. Keep in mind that the indexer is sometimes unavailable for long stretches, occasionally up to two months.

eDonkey2000 (ed2k), Kad

Like DC++, ed2k is a decentralized data transfer protocol with centralized hubs for searching and connecting users to each other. In eDonkey2000 you can find almost the same things as in DC++: old TV series with various dubs, music, software, new and old games, as well as books on mathematics and biology. New releases show up here too.

Recently I needed to install a search engine for indexing HTML pages. I settled on mnoGoSearch. While reading the documentation, I wrote down some points that might be useful later, so that I wouldn’t have to delve into the manuals again. The result is something like a small cheat sheet. In case it's useful to anyone, I'm posting it here.

indexer -E create - creates all the necessary tables in the database (assuming that the database itself has already been created).

indexer -E blob - builds the index over all collected information. It must be run every time after indexer finishes if the blob storage mode is used; otherwise the search will only cover the old data in the database for which indexer -E blob was executed previously.

indexer -E wordstat - builds an index of all detected words. search.cgi uses it when the Suggest option is enabled. With this option turned on, if a search returns no results, search.cgi will suggest corrected spellings of the query in case the user made a typo.
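Putting these commands together, a typical (re)indexing run might look like the following sketch, assuming the blob storage mode and the Suggest option are used:

indexer -E create      # one-time: create the tables in an already existing database
indexer                # crawl and index the documents that are due for reindexing
indexer -E blob        # rebuild the blob index so the new data becomes searchable
indexer -E wordstat    # rebuild the word list used by the Suggest option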

Documents are only reindexed when they are considered out of date. The expiration period is set by the Period option, which can be specified in the config several times, before each definition of the URLs to be indexed. If you need to reindex all documents regardless of this setting, run indexer -a.
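For illustration, a fragment of indexer.conf might set different reindexing periods for different URL sets (the addresses are made up, and the exact Period/Server spelling should be checked against the mnoGoSearch documentation for your version):

Period 7d
Server http://docs.example.com/

Period 1d
Server http://news.example.com/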

Indexer has the -t, -g, -u, -s, -y switches for limiting a run to only part of the URL database: -t restricts by tag, -g by category, -u by part of the URL (SQL LIKE patterns with the % and _ characters are supported), -s by HTTP document status, -y by Content-Type. All restrictions given with the same key are combined with OR, while groups of different keys are combined with AND.
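For example, to limit a run to documents matching either of two URL masks that also returned status 200 (the masks are illustrative):

indexer -u %/docs/% -u %/manual/% -s 200
# the two -u restrictions are ORed together, and the -s restriction is ANDed with them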

To clear the entire database, you should use the indexer -C command. You can also delete only part of the database using the subsection keys -t, -g, -u, -s, -y.
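For example (the URL mask is hypothetical):

indexer -C -u http://site.ru/forum/%   # clear only the forum section of the database
indexer -C -s 404                      # clear only documents that returned "Not found"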

Database statistics for SQL servers

If you run indexer -S, it will display database statistics including the total number of documents and the number of stale documents for each status. The subsection keys also apply to this command.

Status code meanings:

  • 0 - new (never indexed) document
  • If the status is not 0, it is equal to the HTTP response code. Some of the codes are:
  • 200 - "OK" (url successfully indexed)
  • 301 - "Moved Permanently" (redirected to another URL)
  • 302 - "Moved Temporarily" (redirected to another URL)
  • 303 - "See Other" (redirected to another URL)
  • 304 - "Not modified" (url not modified since previous indexing)
  • 401 - "Authorization required" (login/password required for this document)
  • 403 - "Forbidden" (no access to this document)
  • 404 - "Not found" (the specified document does not exist)
  • 500 - "Internal Server Error" (error in cgi, etc.)
  • 503 - "Service Unavailable" (Host unavailable, connection timeout)
  • 504 - "Gateway Timeout" (timeout when receiving a document)
HTTP response code 401 indicates that the document is password protected. You can use the AuthBasic command in indexer.conf to specify login:password for the URL.

Checking links (only for SQL servers)

When run with the -I switch, indexer shows pairs of a URL and the page that links to it. This is useful for finding broken links on pages. You can also use the subsection restriction keys in this mode. For example, indexer -I -s 404 will show the addresses of all documents that were not found, along with the addresses of the pages containing links to them.

Parallel indexing (SQL servers only)

MySQL and PostgreSQL users can run multiple indexers simultaneously with the same indexer.conf configuration file. Indexer uses the MySQL and PostgreSQL locking mechanism to avoid double indexing of the same documents by different simultaneously running indexers. Parallel indexing may not work correctly with other supported SQL servers. You can also use the multi-threaded version of indexer with any SQL server that supports parallel connections to the database. The multi-threaded version uses its own locking mechanism.
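As a sketch, on MySQL or PostgreSQL this can be as simple as launching several processes against the same indexer.conf; the -N thread-count switch of the multi-threaded build is shown as an assumption, so verify it against your build's help output:

# several crawler processes sharing one configuration and one database
indexer & indexer & indexer &
wait

# or, with the multi-threaded build, one process with several threads
indexer -N 3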

It is not recommended to use the same database with different indexer.conf configuration files! One process may add documents to the database while another deletes the same documents, and both could keep running indefinitely.

On the other hand, you can run multiple indexers with different configuration files and different databases for any supported SQL server.

Reaction to HTTP response codes

The reaction to each code is described below in pseudocode:

  • 200 OK
    1. If the -m ("force reindex") switch is given, go to step 4.
    2. Compare the new document checksum with the old one stored in the database.
    3. If the checksums are equal: next_index_time = Now() + Period; go to step 7.
    4. Parse the document, build the list of words, and add newly found hypertext links to the database.
    5. Remove the old list of words and sections from the database.
    6. Insert the new list of words and sections.
    7. Done.
  • 304 Not Modified
    1. next_index_time = Now() + Period
    2. Done.
  • 301 Moved Permanently
  • 302 Moved Temporarily
  • 303 See Other
    1. Remove this document's words from the database.
    2. next_index_time = Now() + Period
    3. Add the URL from the Location header to the database.
    4. Done.
  • 300 Multiple Choices
  • 305 Use Proxy (proxy redirect)
  • 400 Bad Request
  • 401 Unauthorized
  • 402 Payment Required
  • 403 Forbidden
  • 404 Not found
  • 405 Method Not Allowed
  • 406 Not Acceptable
  • 407 Proxy Authentication Required
  • 408 Request Timeout
  • 409 Conflict
  • 410 Gone
  • 411 Length Required
  • 412 Precondition Failed
  • 413 Request Entity Too Large
  • 414 Request-URI Too Long
  • 415 Unsupported Media Type
  • 500 Internal Server Error
  • 501 Not Implemented
  • 502 Bad Gateway
  • 505 Protocol Version Not Supported
    1. Remove this document's words from the database.
    2. next_index_time = Now() + Period
    3. Done.
  • 503 Service Unavailable
  • 504 Gateway Timeout
    1. next_index_time = Now() + Period
    2. Done.

Content-Encoding support

The mnoGoSearch engine supports HTTP content compression (Content-Encoding). Compressing server responses can significantly speed up the processing of HTTP requests by reducing the amount of data transmitted.

Using HTTP compression can cut traffic by a factor of two or more.

The HTTP 1.1 specification (RFC 2616) defines four methods for encoding the content of server responses: gzip, deflate, compress, and identity.

If Content-Encoding support is enabled, indexer sends the header Accept-Encoding: gzip,deflate,compress to the HTTP server.

If the http server supports any of the gzip, deflate or compress encoding methods, it will send a response encoded with that method.

To build mnoGoSearch with support for HTTP request compression, you must have the zlib library.

To enable Content encoding support, you must configure mnoGoSearch with the following key:
./configure --with-zlib

Boolean search

To build complex queries, you can use Boolean search. To do this, select the bool search mode in the search form.

MnoGoSearch understands the following Boolean operators:

& - logical AND. For example, mysql & odbc. mnoGoSearch will search for URLs containing both the words "mysql" and "odbc". You can also use the + sign for this operator.

| - logical OR. For example, mysql | odbc. mnoGoSearch will search for URLs containing either the word "mysql" or the word "odbc".

~ - logical NOT. For example, mysql & ~odbc. mnoGoSearch will search for URLs that contain the word "mysql" and at the same time do not contain the word "odbc". Attention! ~ just excludes some documents from the search result. The query "~odbc" will not find anything!

() - grouping operator for creating more complex search queries. For example, (mysql | msql) & ~postgres.

" - phrase selection operator. For example, "russian apache" & "web server". You can also use the " sign for this operator.

It's common for Microsoft to come up with a cool feature designed to make working on a computer much more comfortable, and yet, as always, the end result is a noticeable deterioration in working conditions :) That is what happened with the disk content indexing feature, which Microsoft invented to speed up searching for information.

This service runs in the background and gradually scans files. Collecting all this information takes a significant amount of time, but we are not supposed to notice it. We SHOULD NOT, but in practice, especially with large volumes of data or external drives connected, the whole system starts slowing down with no end in sight. The SearchFilterHost process can start 5-10 minutes after the system boots and load the computer to its limit; for laptop owners this problem can be especially acute.

How the Indexing Service Works in Windows

It works as follows: the file system is scanned and all the information is entered into a special database (the index), and searches are then performed against this database. The database includes file names and paths, creation times, key phrases from the content (if it is a document or an HTML page), document property values and other data. Thus, when you search using the standard means, for example from the START menu, the operating system does not walk through all the files but simply queries the database.

Time passes, we install new programs, download new files, new file types subject to content indexing appear in the system, and the operating system sometimes gets too carried away with indexing, slowing everything down badly. This is easy to notice: you are doing nothing, yet the hard drive churns away incessantly while the searchfilterhost.exe process sits in Task Manager eating 30-50% of the CPU.

You could, of course, wait until the process finishes, but what if you have to wait 30-40 minutes? It is better to deal with the problem right away. We have three ways to resolve the issue.

Terminate the SearchFilterHost process and turn off the indexing service completely

You can terminate the process in Task Manager. In principle, this is a good option: it adds stability to the system, frees up space on the system drive, and the slowdowns associated with indexing disappear. Personally, I use the search built into the Total Commander file manager and find it much more convenient than the standard Windows 7/10 search. If you also use a third-party program and do not need search by document contents, then indexing is simply unnecessary. And if you run a virtual machine, disabling indexing is even recommended. This is done very simply:
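The GUI route goes through the services console (services.msc, the "Windows Search" service). As a command-line sketch, run from an administrator prompt and assuming the standard service name WSearch:

rem stop the Windows Search service (this also ends SearchIndexer/SearchFilterHost)
net stop WSearch
rem prevent the service from starting again at boot (note the space after "start=")
sc config WSearch start= disabled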


Pause the indexing service

Windows XP had dedicated settings for the indexing system that let you lower the service's priority in favor of running programs. Windows 7-10 has nothing of the kind, and all we can do is pause indexing. This is useful when the SearchFilterHost process is getting badly in the way but you do not want to turn the service off completely. To do this, type "indexing options" in the Start menu search bar and select "Indexing Options" from the results.

In the Indexing Options window, click "Pause" and enjoy comfortable work :)

Disable indexing of individual drives

Instead of turning the service off entirely, you can disable indexing on individual drives. To do this, open "My Computer", right-click the drive in question - for example, one holding a huge number of files - and select "Properties". In the properties window, uncheck "Allow indexing of this volume".

I hope this article was interesting and useful. We looked at possible problems with the indexing service in Windows 7/8/10 and figured out how to defeat the insatiable SearchFilterHost process. You can also simplify your life even more, and in new articles I will return to the issue of optimization more than once, so I advise you to subscribe to blog updates and be the first to know the news.


Fill in all the required fields one by one. As you enter them, you will see your robots.txt being filled with directives. All directives of the robots.txt file are described in detail below.

When you are done, copy the resulting text and paste it into a text editor. Save the file as "robots.txt" in the root directory of your site.

Description of the robots.txt file format

The robots.txt file consists of entries, each of which consists of two fields: a line with the name of the client application (user-agent), and one or more lines starting with the Disallow directive:

Directive ":" meaning

Robots.txt must be created in Unix text format. Most good text editors already know how to convert Windows line feeds to Unix. Or your FTP client should be able to do this. For editing, do not try to use an HTML editor, especially one that does not have a text mode for displaying code.

Directive User-agent:

For Rambler: User-agent: StackRambler
For Yandex: User-agent: Yandex
For Google: User-Agent: googlebot

You can create instructions for all robots:

User-agent: *

Directive Disallow:

The second part of an entry consists of Disallow lines. These lines are directives (instructions, commands) for the given robot. Each group introduced by a User-agent line must contain at least one Disallow instruction. The number of Disallow instructions is unlimited. They tell the robot which files and/or directories it is not allowed to index. You can prohibit indexing of a file or of a whole directory.

The following directive disables indexing of the /cgi-bin/ directory:

Disallow: /cgi-bin/

Note the / at the end of the directory name! To prohibit visits to the directory "/dir" specifically, the instruction should look like "Disallow: /dir/". The line "Disallow: /dir" prohibits visits to all server pages whose full path (from the server root) begins with "/dir", for example "/dir.html", "/dir/index.html", "/directory.html".

The directive written as follows prohibits indexing of the index.htm file located in the root:

Disallow: /index.htm

Directive Allow:

Only Yandex understands this directive.

User-agent: Yandex
Allow: /cgi-bin
Disallow: / # prohibits downloading everything except pages starting with "/cgi-bin"

For other search engines you will have to list all the closed documents explicitly. Plan the structure of the site so that documents closed from indexing are, if possible, collected in one place.

If the Disallow directive is empty, this means that the robot can index ALL files. At least one Disallow directive must be present for each User-agent field for robots.txt to be considered valid. A completely empty robots.txt means the same thing as if it didn’t exist at all.

The Rambler robot understands * as any symbol, so the Disallow: * instruction means prohibiting indexing of the entire site.

Allow and Disallow directives without parameters. The absence of parameters for the Allow and Disallow directives is interpreted as follows:

User-agent: Yandex
Disallow: # same as Allow: /

User-agent: Yandex
Allow: # same as Disallow: /

Using special characters "*" and "$".
When specifying the paths of the Allow-Disallow directives, you can use the special characters "*" and "$", thus specifying certain regular expressions. The special character "*" means any (including empty) sequence of characters. Examples:

User-agent: Yandex
Disallow: /cgi-bin/*.aspx # prohibits "/cgi-bin/example.aspx" and "/cgi-bin/private/test.aspx"
Disallow: /*private # prohibits not only "/private" but also "/cgi-bin/private"

The special character "$".

By default, a "*" is appended to the end of each rule described in robots.txt, for example:

User-agent: Yandex
Disallow: /cgi-bin* # blocks access to pages starting with "/cgi-bin"
Disallow: /cgi-bin # the same thing

To cancel the implicit "*" at the end of a rule, you can use the special character "$", for example:

User-agent: Yandex
Disallow: /example$ # prohibits "/example" but does not prohibit "/example.html"

User-agent: Yandex
Disallow: /example # prohibits both "/example" and "/example.html"

User-agent: Yandex
Disallow: /example$ # prohibits only "/example"
Disallow: /example*$ # the same as "Disallow: /example": prohibits both /example.html and /example

Directive Host.

If your site has mirrors, a special mirror robot will identify them and form a group of mirrors for your site. Only the main mirror will participate in the search. You can specify it in robots.txt using the "Host" directive, giving the name of the main mirror as its parameter. The "Host" directive does not guarantee the selection of the specified main mirror, but the algorithm takes it into account with high priority when making the decision. Example:

# If www.glavnoye-zerkalo.ru is the main mirror of the site, then robots.txt for
# www.neglavnoye-zerkalo.ru looks like this:
User-Agent: *
Disallow: /forum
Disallow: /cgi-bin
Host: www.glavnoye-zerkalo.ru

For compatibility with robots that do not fully follow the standard when processing robots.txt, the "Host" directive must be added in the group starting with the "User-Agent" entry, immediately after the "Disallow" ("Allow") directives. The argument of the "Host" directive is a domain name, optionally followed by a colon and a port number (80 by default). The Host parameter must consist of one valid host name (one that complies with RFC 952 and is not an IP address) and a valid port number. Incorrectly composed "Host:" lines are ignored.

Examples of ignored Host directives:

Host: www.myhost-.ru
Host: www.-myhost.ru
Host: www.myhost.ru:100000
Host: www.my_host.ru
Host: .my-host.ru:8000
Host: my-host.ru.
Host: my..host.ru
Host: www.myhost.ru/
Host: www.myhost.ru:8080/
Host: 213.180.194.129
Host: www.firsthost.ru,www.secondhost.ru # one line - one domain!
Host: www.firsthost.ru www.secondhost.ru # one line - one domain!
Host: crew-communication.rf # punycode must be used

Directive Crawl-delay

Sets the timeout in seconds with which the search robot downloads pages from your server (Crawl-delay).

If the server is heavily loaded and does not have time to process download requests, use the "Crawl-delay" directive. It allows you to set the search robot a minimum period of time (in seconds) between the end of downloading one page and the start of downloading the next. For compatibility with robots that do not fully follow the standard when processing robots.txt, the "Crawl-delay" directive must be added in the group starting with the "User-Agent" entry, immediately after the "Disallow" ("Allow") directives.

The Yandex search robot supports fractional Crawl-Delay values, for example, 0.5. This does not guarantee that the search robot will visit your site every half second, but it gives the robot more freedom and allows it to crawl the site faster.

User-agent: Yandex
Crawl-delay: 2 # sets the timeout to 2 seconds

User-agent: *
Disallow: /search
Crawl-delay: 4.5 # sets the timeout to 4.5 seconds

Directive Clean-param

A directive for excluding URL parameters from consideration: requests that contain such a parameter and requests that do not will be treated as identical.
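For example (the parameter name and path are illustrative): if forum pages differ only in a session identifier sid, the following tells the robot to treat them as one page:

Clean-param: sid /forum/showthread.php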

Blank lines and comments

Blank lines are allowed between groups of instructions entered by the User-agent.

The Disallow statement is only taken into account if it is subordinate to any User-agent line - that is, if there is a User-agent line above it.

Any text from the hash sign "#" to the end of the line is considered a comment and is ignored.

Example:

The following simple robots.txt file prohibits all robots from indexing any pages of the site, except the Rambler robot, which, on the contrary, is allowed to index all pages of the site.

# Instructions for all robots
User-agent: *
Disallow: /

# Instructions for the Rambler robot
User-agent: StackRambler
Disallow:

Common mistakes:

Inverted syntax:

User-agent: /
Disallow: StackRambler

It should be like this:

User-agent: StackRambler
Disallow: /

Several Disallow directives in one line:

Disallow: /css/ /cgi-bin/ /images/

Correct:

Disallow: /css/
Disallow: /cgi-bin/
Disallow: /images/
    Notes:
  1. It is unacceptable to have empty line breaks between the "User-agent" and "Disallow" ("Allow") directives, as well as between the "Disallow" ("Allow") directives themselves.
  2. According to the standard, it is recommended to insert an empty line feed before each "User-agent" directive.

The robots.txt file is one of the most important parts of optimizing any website. Its absence can lead to a high load on the site from search robots and to slow indexing and reindexing, while incorrect settings can cause the site to disappear from search entirely or simply never be indexed. In that case it will not be findable in Yandex, Google or other search engines. Let's look at all the nuances of setting up robots.txt properly.

First, a short video that will give you a general idea of what a robots.txt file is.

How does robots.txt affect site indexing?

Search robots will index your site regardless of the presence of a robots.txt file. If such a file exists, then robots can be guided by the rules that are written in this file. At the same time, some robots may ignore certain rules, or some rules may be specific only to some bots. In particular, GoogleBot does not use the Host and Crawl-Delay directives, YandexNews has recently begun to ignore the Crawl-Delay directive, and YandexDirect and YandexVideoParser ignore more general directives in robots (but are guided by those specified specifically for them).

More about exceptions:
Yandex exceptions
Robot Exception Standard (Wikipedia)

The maximum load on a site is created by robots that download content from it. Therefore, by indicating what exactly to index and what to ignore, as well as at what time intervals to download, you can, on the one hand, significantly reduce the load created by robots and, on the other hand, speed up crawling by prohibiting the crawling of unnecessary pages.

Such unnecessary pages include AJAX and JSON scripts responsible for pop-up forms, banners, captcha output and the like; order forms and the shopping cart with all the steps of making a purchase; search functionality; the personal account area; and the admin panel.

For most robots, it is also advisable to block indexing of all JS and CSS files. But for GoogleBot and Yandex such files must be left open for indexing, since the search engines use them to analyze site usability and ranking (Google proof, Yandex proof).
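A minimal sketch of that idea, using a hypothetical /plugins/ directory (a fuller example appears at the end of the article):

User-agent: *
Disallow: /plugins/

User-agent: GoogleBot
Disallow: /plugins/
Allow: /plugins/*.css
Allow: /plugins/*.js

User-agent: Yandex
Disallow: /plugins/
Allow: /plugins/*.css
Allow: /plugins/*.js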

Robots.txt directives

Directives are rules for robots. There is a W3C specification from January 30, 1994, and an extended standard from 1996. However, not all search engines and robots support certain directives. In this regard, it will be more useful for us to know not the standard, but how the main robots are guided by certain directives.

Let's look at them in order.

User-agent

This is the most important directive; it determines which robots the rules that follow it apply to.

For all robots:
User-agent: *

For a specific bot:
User-agent: GoogleBot

Please note that the user agent name in robots.txt is case-insensitive, i.e. the user agent for Google can just as well be written as follows:
user-agent: googlebot

Below is a table of the main user agents of various search engines.

Bot - Function

Google
Googlebot - Google's main indexing robot
Googlebot-News - Google News
Googlebot-Image - Google Images
Googlebot-Video - video
Mediapartners-Google, Mediapartners - Google AdSense, Google Mobile AdSense
AdsBot-Google - landing page quality check
AdsBot-Google-Mobile-Apps - Googlebot for apps

Yandex
YandexBot - Yandex's main indexing robot
YandexImages - Yandex.Images
YandexVideo - Yandex.Video
YandexMedia - multimedia data
YandexBlogs - blog search robot
YandexAddurl - robot that visits a page when it is added through the "Add URL" form
YandexFavicons - robot that indexes site icons (favicons)
YandexDirect - Yandex.Direct
YandexMetrika - Yandex.Metrica
YandexCatalog - Yandex.Catalog
YandexNews - Yandex.News
YandexImageResizer - mobile services robot

Bing
Bingbot - Bing's main indexing robot

Yahoo!
Slurp - Yahoo!'s main indexing robot

Mail.Ru
Mail.Ru - Mail.Ru's main indexing robot

Rambler
StackRambler - formerly Rambler's main indexing robot. On June 23, 2011, Rambler stopped supporting its own search engine and now uses Yandex technology in its services. No longer relevant.

Disallow and Allow

Disallow blocks pages and sections of the site from indexing.
Allow forces pages and sections of the site to be indexed.

But it's not that simple.

First, you need to know the additional operators and understand how they are used - these are *, $ and #.

* is any number of characters, including their absence. In this case, you don’t have to put an asterisk at the end of the line; it is assumed that it is there by default.
$ - indicates that the character before it should be the last one.
# is a comment; everything after this character in the line is not taken into account by the robot.

Examples of using:

Disallow: *?s=
Disallow: /category/$

Secondly, you need to understand how nested rules are executed.
Remember that the order in which the directives are written does not matter. Which pages end up open or closed for indexing is determined by which directories are specified (the more specific rule wins). Let's look at an example.

Allow: *.css
Disallow: /template/

http://site.ru/template/ - closed from indexing
http://site.ru/template/style.css - closed from indexing
http://site.ru/style.css - open for indexing
http://site.ru/theme/style.css - open for indexing

If you need all .css files to be open for indexing, you will have to explicitly allow this for each of the closed folders. In our case:

Allow: *.css
Allow: /template/*.css
Disallow: /template/

Again, the order of the directives is not important.

Sitemap

Directive for specifying the path to the XML Sitemap file. The URL is written in the same way as in the address bar.

For example,

Sitemap: http://site.ru/sitemap.xml

The Sitemap directive is specified anywhere in the robots.txt file without being tied to a specific user-agent. You can specify multiple Sitemap rules.
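For example (the file names are illustrative):

Sitemap: http://site.ru/sitemap-pages.xml
Sitemap: http://site.ru/sitemap-posts.xml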

Host

Directive for specifying the main mirror of the site (in most cases: with www or without www). Note that the main mirror is specified WITHOUT the http:// prefix but WITH https:// if the site runs over HTTPS. If necessary, the port is also indicated.
The directive is supported only by the Yandex and Mail.Ru bots; other robots, in particular GoogleBot, will not take the command into account. Host is specified only once!

Example 1:
Host: site.ru

Example 2:
Host: https://site.ru

Crawl-delay

Directive for setting the time interval between the robot downloading website pages. Supported by Yandex robots, Mail.Ru, Bing, Yahoo. The value can be set in integer or fractional units (separator is a dot), time in seconds.

Example 1:
Crawl-delay: 3

Example 2:
Crawl-delay: 0.5

If the load on the site is small, there is no need to set such a rule. However, if indexing by a robot causes the site to exceed its hosting limits or puts it under significant load, up to server outages, this directive will help reduce the load.

The higher the value, the fewer pages the robot will download in one session. The optimal value is determined individually for each site. It is better to start with fairly small values - 0.1, 0.2, 0.5 - and gradually increase them. For search engine robots that matter less for promotion results, such as Mail.Ru, Bing and Yahoo, you can initially set higher values than for Yandex robots.

Clean-param

This rule tells the crawler that URLs with the specified parameters should not be indexed. The rule specifies two arguments: a parameter and the section URL. The directive is supported by Yandex.

Clean-param: author_id http://site.ru/articles/

Clean-param: author_id&sid http://site.ru/articles/

Clean-Param: utm_source&utm_medium&utm_campaign

Other options

In the extended robots.txt specification you can also find the Request-rate and Visit-time parameters. However, they are not currently supported by major search engines.

The meaning of the directives:
Request-rate: 1/5 — load no more than one page in five seconds
Visit-time: 0600-0845 - load pages only between 6 a.m. and 8:45 a.m. GMT.

Closing robots.txt

If you need to configure your site to NOT be indexed by search robots, then you need to specify the following directives:

User-agent: *
Disallow: /

Make sure these directives are written in the robots.txt of your site's test environments so that test copies do not end up in the search index.

Correct setting of robots.txt

For Russia and the CIS countries, where Yandex's share is significant, directives should be written for all robots as well as separately for Yandex and Google.

To properly configure robots.txt, use the following algorithm:

  1. Close the site admin panel from indexing
  2. Close your personal account, authorization, and registration from indexing
  3. Block your shopping cart, order forms, delivery and order data from indexing
  4. Close ajax and json scripts from indexing
  5. Close the cgi folder from indexing
  6. Block plugins, themes, js, css from indexing for all robots except Yandex and Google
  7. Disable search functionality from indexing
  8. Close from indexing service sections that do not provide any value for the site in search (404 error, list of authors)
  9. Block technical duplicate pages from indexing, as well as pages on which all content in one form or another is duplicated from other pages (calendars, archives, RSS)
  10. Block pages with filter, sorting, comparison parameters from indexing
  11. Block pages with UTM tags and session parameters from indexing
  12. Check what is indexed by Yandex and Google using the “site:” parameter (type “site:site.ru” in the search bar). If the search contains pages that also need to be closed from indexing, add them to robots.txt
  13. Specify Sitemap and Host
  14. If necessary, enter Crawl-Delay and Clean-Param
  15. Check the correctness of robots.txt using Google and Yandex tools (described below)
  16. After 2 weeks, check again to see if new pages have appeared in the search results that should not be indexed. If necessary, repeat the above steps.

Example robots.txt

# An example of a robots.txt file for setting up a hypothetical site https://site.ru

User-agent: *
Disallow: /admin/
Disallow: /plugins/
Disallow: /search/
Disallow: /cart/
Disallow: */?s=
Disallow: *sort=
Disallow: *view=
Disallow: *utm=
Crawl-Delay: 5

User-agent: GoogleBot
Disallow: /admin/
Disallow: /plugins/
Disallow: /search/
Disallow: /cart/
Disallow: */?s=
Disallow: *sort=
Disallow: *view=
Disallow: *utm=
Allow: /plugins/*.css
Allow: /plugins/*.js
Allow: /plugins/*.png
Allow: /plugins/*.jpg
Allow: /plugins/*.gif

User-agent: Yandex
Disallow: /admin/
Disallow: /plugins/
Disallow: /search/
Disallow: /cart/
Disallow: */?s=
Disallow: *sort=
Disallow: *view=
Allow: /plugins/*.css
Allow: /plugins/*.js
Allow: /plugins/*.png
Allow: /plugins/*.jpg
Allow: /plugins/*.gif
Clean-Param: utm_source&utm_medium&utm_campaign
Crawl-Delay: 0.5

Sitemap: https://site.ru/sitemap.xml
Host: https://site.ru

How to add and where is robots.txt located

After you have created the robots.txt file, it must be placed on your website at site.ru/robots.txt, i.e. in the root directory. The search robot always requests the file at the URL /robots.txt.
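A quick way to verify the placement, assuming curl is available (substitute your own domain):

curl -I https://site.ru/robots.txt
# a 200 response means the file is in place; a 404 means it is not in the root directory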

How to check robots.txt

Robots.txt is checked using the following links:

  • In Yandex.Webmaster - on the Tools>Robots.txt Analysis tab
  • In Google Search Console - on the Scanning tab > Robots.txt file inspection tool

Typical errors in robots.txt

To conclude, here are a few typical errors found in robots.txt files:

  • robots.txt is missing
  • in robots.txt the site is closed from indexing (Disallow: /)
  • the file contains only the most basic directives, there is no detailed elaboration of the file
  • in the file, pages with UTM tags and session identifiers are not blocked from indexing
  • the file contains only directives
    Allow: *.css
    Allow: *.js
    Allow: *.png
    Allow: *.jpg
    Allow: *.gif
    while the css, js, png, jpg, gif files are closed by other directives in a number of directories
  • the Host directive is specified several times
  • the HTTP protocol is not specified in Host
  • the path to the Sitemap is incorrect, or the wrong protocol or site mirror is specified


Useful video from Yandex (Attention! Some recommendations are only suitable for Yandex).






