Search engines


Searching for information is one of the most common activities on the Internet. Internet visitors often need to find documents on a particular topic. If you have the exact address of a document, there is no problem: type the known address into the browser's address bar, and if the connection succeeds, the browser will display the desired page.

If the exact address of the document is not available, you can use the services of a search engine. A search engine is “a specialized server on the Internet that offers a variety of document search facilities.” An example of a search server is the Rambler server, located at http://rambler.ru. Its main page is shown in the figure.

Fig. 1.

Search servers usually build their own directories of Internet resources. These catalogs are regularly updated with information about newly created resources, which comes from search robots. Search robots, or spiders, are special network programs that visit the Internet servers currently available, analyze the documents they find, and update the tables of their search engine. Search robots perform this work of finding and systematizing resources in the background, around the clock.

Another source of information about existing sites for search servers is the explicit registration of resources by the owners of web pages. The server provides forms that resource owners fill out, specifying the resource address, a brief description, keywords, the target audience, etc. This information is analyzed and added to the server directories either automatically by special programs or “manually” by experts, specialists who monitor the formation of resource directories.

Understanding the mechanisms of information retrieval on the Internet allows web page developers to prepare their documents so that search engines can find them later and place them in the appropriate sections of resource catalogs.

Search by keywords on the Internet

One of the popular ways to search for documents on the WWW is to search using keywords. When you specify keywords in the search form, the search engine will search for documents containing the specified keywords. Of course, to fulfill a query, a search engine will not search the content of thousands of computers operating on the Internet - you would have to wait many days for the result of such a search. The search is carried out among those resources (catalogs, tables) of the search engine that were previously collected and systematized with the help of robots and experts.

Since the volume of network resources is becoming truly limitless, a request to search for documents by a keyword may turn up several thousand documents containing that keyword. With so many documents, it is clearly difficult to find the one that best matches a given topic. However, search engines usually provide the means to formulate a more detailed query.

A query can have a complex form, composed of keywords and the logical operators AND, OR and NOT. A query can also be formed using special characters that match (or exclude) particular word forms of the keywords. Such mechanisms help formulate the requirements for selecting documents more precisely. Every search engine has a help system that helps visitors compose a search query.
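
As an illustration, here is a minimal sketch in Python of how such a query could be evaluated (the document collection and the query are invented for the example; real engines work against a prebuilt index, not raw text):

    # Toy boolean keyword search over an invented document collection.
    docs = {
        1: "cats and dogs live together",
        2: "dogs chase cats and fish",
        3: "fish swim in the sea",
    }

    def matching(word):
        """Return the set of document ids whose text contains the word."""
        return {doc_id for doc_id, text in docs.items() if word in text.split()}

    # The query "cats AND dogs NOT fish" expressed as set operations:
    result = (matching("cats") & matching("dogs")) - matching("fish")
    print(result)  # {1}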

Introduction

1 Search engines: composition, functions, principle of operation

1.1 Composition of search engines

1.2 Features of search engines

1.3 How search engines work

2 Overview of the functioning of search engines

2.1 Foreign search engines: composition and principles of operation

2.2 Russian-language search engines: composition and operating principles

Conclusion

List of references

Introduction

Search engines have long been an integral part of the Russian Internet. Because they independently provide, albeit by various means, all stages of information processing, from retrieving it from the primary source nodes to giving the user the ability to search it, they are often called autonomous search systems.

Search engines are now huge and complex mechanisms that represent not only an information search tool but also a tempting area for business. These systems can differ in the principle of information selection, which is present to one degree or another in the algorithm of the automatic index-scanning program and in the rules followed by the catalog employees responsible for registration. Typically, two main indicators are compared:

The spatial scale at which the information retrieval system operates;

Its specialization.

Most users of search engines have never thought (or have thought, but found no answer) about how search engines work, how user requests are processed, or what these systems consist of and how they function. A search engine can be compared to a help desk whose agents go around enterprises collecting information into a database. When you contact the service, information is retrieved from this database. The data in the database becomes outdated, so the agents update it periodically. Some enterprises send information about themselves, and the agents do not have to visit them. In other words, a help desk has two functions: creating and constantly updating the data in the database, and searching the database at the client's request.

1 Search engines: composition, functions, principle of operation

1.1 Composition of search engines

A search system is a software and hardware complex designed to search the Internet and to respond to a user request, specified as a text phrase (a search query), by issuing a list of links to sources of information, ranked by relevance to the request. The largest international search engines are Google, Yahoo and MSN. On the Russian Internet these are Yandex, Rambler and Aport.

By analogy, a search engine consists of two parts: the so-called robot (or spider), which crawls Web servers and builds the search engine's database, and the search mechanism, which answers user queries from that database.

The robot's database is formed mainly by the robot itself (it finds links to new resources on its own) and, to a much lesser extent, by resource owners who register their sites with the search engine. In addition to the robot (network agent, spider, worm) that builds the database, there is a program that determines the rating of the links found.

The principle of operation of a search engine is that it queries its internal catalog (database) for the keywords that the user specifies in the query field and produces a list of links ranked by relevance.

It should be noted that, when processing a specific user request, the search engine operates precisely on internal resources (and does not embark on a journey across the Web, as inexperienced users often believe), and internal resources are, naturally, limited. Despite the fact that the search engine database is constantly updated, the search engine cannot index all Web documents: their number is too large. Therefore, there is always a possibility that the resource you are looking for is simply unknown to a specific search engine.

1.2 Features of search engines

The search process is often presented as four stages: formulation (occurs before the search begins); action (running the search); review of results (what the user sees after searching); and refinement (after reviewing the results and before returning to the search with a different formulation of the same need). A more convenient, nonlinear information search scheme consists of the following steps:

Recording the information need in natural language;

Selection of the necessary network search services and precise formalization of the information need in specific information retrieval languages (IRLs);

Execution of created queries;

Pre-processing and selection of received lists of links to documents;

Contacting selected addresses for the required documents;

Previewing the contents of the found documents;

Saving relevant documents for later study;

Extracting links from relevant documents to expand the query;

Studying the entire array of saved documents;

Returning to the first stage if the information need is not fully satisfied.

1.3 How search engines work

The goal of any search engine is to deliver to people the information they are looking for. It is impossible to teach people to make “correct” queries, that is, queries that conform to the operating principles of search engines. Therefore, developers create algorithms and operating principles that allow users to find exactly the information they are looking for. This means the search engine must “think” the way the user thinks when searching for information.

Most search engines work on the principle of pre-indexing, and the databases of most search engines are built on the same principle: an inverted index is prepared in advance, and queries are answered from it.

There is another principle of construction: direct search, in which the pages are scanned for the keyword the way you would leaf through a book page by page. This method is, of course, much less efficient.

With an inverted index, search engines face the problem of file size: as a rule, the index files are very large. This problem is usually solved in two ways. The first is to remove everything unnecessary from the files, leaving only what is really needed for the search. The second is to store for each position not an absolute address but a relative one, i.e., the difference between the current and the previous position.
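
A minimal sketch of the second method (the numbers are invented): storing the gaps between neighbouring positions instead of the positions themselves yields much smaller numbers, which take up less space:

    # Positions of one keyword inside a document (invented numbers).
    positions = [3, 17, 120, 121, 407]

    # Relative (gap) encoding: each entry is the difference from the previous one.
    gaps = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
    print(gaps)  # [3, 14, 103, 1, 286]

    # Decoding restores the absolute positions with a running sum.
    decoded, total = [], 0
    for g in gaps:
        total += g
        decoded.append(total)
    assert decoded == positions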

Thus, the two main processes a search engine performs are indexing sites and pages, and searching. Indexing, as a rule, does not cause problems for search engines. The problem is processing a million requests per day. It involves large volumes of information processed by large computer complexes. The main factor determining the number of servers participating in a search is the search load. This explains some of the oddities that arise when searching for information.

Search engines consist of five separate software components:

spider: a browser-like program that downloads web pages.

crawler: a “traveling” spider that automatically follows all links found on a page.

indexer: a “blind” program that analyzes web pages downloaded by spiders.

the database: storage of downloaded and processed pages.

search engine results engine (results delivery system): retrieves search results from the database.

Spider: A spider is a program that downloads web pages. It works just like your browser when you connect to a website and load a page. The spider has no visual components. You can observe the same action (downloading) when you view a certain page and when you select “view HTML code” in your browser.

Crawler: While the spider downloads pages, the crawler extracts all the links found on each page. Its job is to determine where the spider should go next, based on those links or on a predetermined list of addresses.

Indexer: The indexer parses a page into its component parts and analyzes them. Elements such as page titles, headings, links, text, structural elements, BOLD and ITALIC elements, and other stylistic parts of the page are isolated and analyzed.

Database: The database is the repository of all the data that the search engine downloads and analyzes. This often requires enormous resources.

Search Engine Results: The results system is responsible for ranking pages. It decides which pages satisfy the user's request and in what order they should be sorted, according to the search engine's ranking algorithms. This information is the most valuable and interesting for us: it is with this component of the search engine that the optimizer interacts when trying to improve a site's position in the search results, so later we will consider in detail all the factors that influence the ranking of results.
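
As a rough illustration only (real ranking algorithms are far more elaborate and are kept secret), here is a toy scorer in Python that combines keyword frequency with a bonus for the term appearing in the page title; all page data and weights are invented:

    # Toy ranking sketch: frequency in the body plus an invented title bonus.
    pages = [
        {"url": "a.example", "title": "search engines", "text": "how search engines rank pages"},
        {"url": "b.example", "title": "cooking", "text": "search for recipes"},
    ]

    def score(page, term):
        tf = page["text"].split().count(term)     # frequency of the term in the body
        in_title = term in page["title"].split()  # location also matters
        return tf + (3 if in_title else 0)        # 3 is an arbitrary title weight

    results = sorted(pages, key=lambda p: score(p, "search"), reverse=True)
    print([p["url"] for p in results])  # ['a.example', 'b.example']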

The search index works in three stages, of which the first two are preparatory and invisible to the user. First, the search index collects information from the World Wide Web. For this, special programs similar to browsers are used. They can copy a given Web page to the search index server, scan it, find all the hyperlinks on it, fetch the resources those links point to, look again for the hyperlinks they contain, and so on. Such programs are called worms, spiders, caterpillars, crawlers and other similar names. Each search index uses its own unique program for this purpose, which it often develops itself. Many modern search engines were born from experimental projects related to the development and testing of automatic programs that monitor the Network. Theoretically, with a successful start, a spider can comb the entire Web space in one dive, but this takes a lot of time, and it still needs to return periodically to previously visited resources in order to track the changes occurring there and identify “dead” links, that is, links that have lost their relevance.
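
The traversal itself can be sketched as a breadth-first walk over the link graph (the graph below is invented; a real spider fetches pages over HTTP and parses the HTML for links):

    from collections import deque

    # Hypothetical link graph: page -> pages it links to.
    links = {
        "start.example": ["a.example", "b.example"],
        "a.example": ["b.example", "c.example"],
        "b.example": [],
        "c.example": ["start.example"],  # cycles are common on the real Web
    }

    def crawl(seed):
        """Visit a page, queue its links, repeat - skipping pages already seen."""
        seen, queue = {seed}, deque([seed])
        while queue:
            page = queue.popleft()
            print("indexing", page)
            for url in links.get(page, []):
                if url not in seen:
                    seen.add(url)
                    queue.append(url)

    crawl("start.example")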

The main element of the modern Internet is search engines: Yandex, Rambler, Google and others. The Internet holds a sea of varied information, and it is search engines that help the user quickly find what they need.

Textbooks and scientific books have a list of important terms: an alphabetical index. The index lists the most important terms of the book (keywords) and the page numbers on which they appear.

The work of search engines is based on a similar principle. Essentially, when a user enters a search term (keyword), they are referred to a subject index of the Internet: a list of keywords found on the Internet, together with the pages on which they appear.

A search engine is a program that compiles and stores a subject index of the Internet (the index) and finds the specified keywords in it.

Stages of compiling an index and searching it:

Collecting web page addresses on the Internet

An initial list of website page addresses is loaded into the search engine. Then the search engine, or rather its component, the search robot, collects all the hypertext links from each of the given pages to other pages and adds the addresses found in those links to its original list of addresses. Thus, the initial list quickly grows.

Downloading pages

A search robot or spider crawls pages, downloads text material from them and stores it on the disks of its computers, then transfers it to the indexing robot for indexing.

Index compilation

First, the text of the indexed page is cleared of all non-text elements (graphics, HTML markup, etc.). Next, the words selected from the text are reduced to their stems or to the nominative case. The collected word stems are arranged in alphabetical order, each accompanied by the numbers of the pages from which it was taken and the positions where it occurred on each page.

Search

When a user enters a word into the query string, the search engine consults the index, finds all the page numbers associated with that word, and shows the user the search result (a list of pages).
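
The last two stages can be sketched together in a few lines of Python (the page texts are invented, and the crude suffix-stripping here merely stands in for real stemming):

    import re
    from collections import defaultdict

    # Page texts after non-text elements have been stripped (invented data).
    pages = {
        1: "Search engines index pages",
        2: "The engine searches the index",
    }

    def normalize(word):
        # Crude stand-in for stemming: lowercase, then strip a common ending.
        return re.sub(r"(ing|ed|es|e|s)$", "", word.lower())

    index = defaultdict(set)  # word stem -> set of page numbers
    for number, text in pages.items():
        for word in text.split():
            index[normalize(word)].add(number)

    # Search: normalize the query term and look it up in the index.
    print(sorted(index[normalize("engines")]))  # [1, 2]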

Search engine quality

A synonym for search quality is relevance. For search engines, the word relevant (pertinent to the matter) is almost the central term. The relevance of a search engine's results means that those results contain pages that correspond to the meaning of the search query. Relevance, or search quality, is quite a complex matter.

Another important criterion for the quality of a search engine’s work is accuracy.

Accuracy is a measure of the quality of the results returned; it is calculated as the share of relevant pages in the total volume of pages returned in the search results.
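
For example (the figures are invented for illustration), if a query returns 200 pages and 50 of them actually relate to the query, the accuracy of that result is 50 / 200 = 0.25.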

However, not only the accuracy of the search matters, but also the ranking of the results. Ranking is the arrangement of search results by relevance.

It is impossible to say which search engine is best. For the user, the better search engine is the one that delivers the most relevant and accurate results. For the site owner, a good engine is one in which the site is clearly visible and which brings the greatest number of target visitors.

Operating principle, advantages and disadvantages of search engines

Along with catalogs (and even much more often), search engines are used. This is a more modern and convenient way of navigating and searching the Internet. Unlike directories, a search engine is a fully automated structure.

The advantages of search engines include: a small number of outdated links in search results; a much larger number of Web sites searched; higher search speed; high search relevance; and additional service functions that make the user's work easier, for example the ability to translate a document's text into a foreign language, to select all documents from a specific site, to narrow the criteria during a search, or to find documents “by example,” and so on.

The operation of search engines is based on completely different technological principles. Their task is to provide detailed search for information in the electronic universe, which can only be achieved by taking into account (indexing) the entire content of the maximum possible number of web pages. Unlike directories, search engines operate in an automated mode and share the same principle of operation. They have two basic components. The first component is a robot program whose task is to travel from server to server, find new or changed documents there, and download them to the main computer of the system. At the same time, the robot, while viewing the contents of a document, finds new links, both to other documents on the same server and to external sites. The program independently follows these links, finds new documents and links within them, and then the process repeats, reminiscent of the “snowball method” well known in bibliography. The identified documents are processed (indexed) by the second component of the search engine. As a rule, the entire content of the page is taken into account: text, illustrations, audio and video files. All words in a document are indexed, which makes it possible to use search engines for detailed searches on the narrowest topics. The resulting giant index files store information about which word occurs, how many times, in which document, and on which server; they form a database that is accessed by users who enter combinations of keywords into the query string. (Brown, Marcus. Methods for Searching Information on the Internet. Moscow: New Publishing House, 2005. 136 pp.)

The results are delivered by a special module that performs intelligent ranking. It takes into account the location of a term in the document (title, heading, body text), the frequency of its repetition, the proportion of the searched term relative to the rest of the page text, and also the number and authority of external links to the page from other sites.

However, search engines have some disadvantages. The search area is limited: if a site has not been entered into the search engine's database, it does not “exist” for that engine, and its documents cannot appear in the results. They are relatively difficult to use: for a search query to correspond exactly to what needs to be found, you need at least a basic understanding of how a search engine works and the ability to use the simplest logical operators; search catalogs are simpler and more familiar in this sense. The presentation of query results is less visual: a directory displays the site name with a brief summary and other useful information, whereas search engine results are less clear. Finally, since the search engine database is populated by robot programs, dishonest owners of advertising sites can “deceive” them, which is why the relevance of a search can be significantly reduced.

Search engines are more common than catalogs, and their number, now several dozen, continues to grow steadily. Professional work with them requires special skills, since simply entering the desired term into the search bar will most likely return a list of hundreds of thousands of documents containing that term, which is almost equivalent to zero results.

Google (http://www.google.com/)

This search engine was launched in 1998. At present, in all significant respects, it is the sole leader among global search engines. Google is one of the most popular search engines. It got its name from the word “googol,” which denotes the number written as a one followed by 100 zeros. Google has subdomains for a large number of countries; for Russia, for example, it is www.google.com.ru.

The Google search engine finds not only hypertext documents but also files in doc, pdf, mp3 and other formats. Google prides itself on the high quality of the “engine” that searches the Internet according to user queries. Relevance, the degree to which the results found correspond to the query, is often higher with Google than with Russian search engines such as Yandex. It is for this reason that more and more Internet users are making Google their main search engine. Google uses the PageRank link ranking algorithm, which determines the authority of a site when the list of search results is generated. PageRank is similar to the Yandex citation index and depends on the quality and quantity of links to a site. PageRank helps users find exactly what they are looking for on the Internet.
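
Google does not publish the exact formula, but the idea behind PageRank can be sketched as a simple power iteration over the link graph (the graph below is invented; the damping factor 0.85 is the value given in the original PageRank paper):

    # Minimal PageRank sketch over an invented link graph.
    links = {
        "a": ["b", "c"],  # page "a" links to "b" and "c"
        "b": ["c"],
        "c": ["a"],
    }
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    damping = 0.85

    for _ in range(50):  # iterate until the ranks roughly stabilize
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            share = rank[page] / len(outgoing)  # a page's rank flows along its links
            for target in outgoing:
                new_rank[target] += damping * share
        rank = new_rank

    print(rank)  # "c" comes out highest: it receives links from both "a" and "b"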

Google copies all pages into its database (a cache), so the user can view a page by opening it from the Google cache rather than from the original source, which can significantly reduce search time. A special feature of Google is that it indexes pages completely. Also worth noting is Google's ability to search for images of various qualities, sizes and formats. By entering an arithmetic expression into the search bar, you can get the correct answer from Google. To use Google search, it is not necessary to go to www.google.com.ru: you can install Google Toolbar, which adds a toolbar with a search bar to the browser, where you can enter your query.

In addition to the global search engines listed above, outdated search services are sometimes used, rather out of inertia; the most notable among them are HotBot (http://www.hotbot.com/) and Excite (http://www.excite.com/). The small size of their index files today does not allow one to rely on the information they provide. A “young” search engine like Ask (http://www.ask.com/), despite the impressive volume of indexed documents, is not yet of particular interest. Ask, for example, cannot search for documents in Russian.

One of the main ways to find information on the Internet is through search engines. Search engines crawl the Internet every day: they visit web pages and enter them into giant databases. This allows the user to type in some keywords, hit submit, and see which pages match their query.

Understanding how search engines work is essential for webmasters. For them, it is vital that documents and the entire server or website be structured correctly from the point of view of search engines. Without this, documents will not appear often enough in response to user queries to the search engine, and may not even be indexed at all.

Webmasters want to increase the ranking of their pages, and this is understandable: any query to a search engine can return hundreds or thousands of matching links to documents. In most cases, only the first 10 links are sufficiently relevant to the query.

Naturally, you want your document to be among the top ten, since most users rarely look at the links beyond them. In other words, if the link to a document is eleventh, it is as bad as if it did not exist at all.

Major search engines

Which of the hundreds of search engines really matter to a webmaster? The widely known and frequently used ones, of course. But you should also consider the audience your server is designed for. For example, if your server contains highly specialized information about the latest methods of milking cows, then you probably should not rely on general-purpose search engines. In this case, I would advise exchanging links with colleagues who deal with similar issues :) So, first, let's define the terminology.

There are two types of information databases about web pages: search engines and directories.

Search engines (spiders, crawlers) constantly explore the Internet in order to replenish their databases of documents. Usually this requires no effort on a person's part. An example is the Altavista search engine.

The design of each document matters a great deal to search engines. The title, meta tags, and page content are especially important.

Catalogs: unlike search engines, information is entered into a catalog at the initiative of a person. The added page must be strictly tied to the categories accepted in the catalog. An example of a directory is Yahoo. Page design does not matter here. Below we will mainly talk about search engines.

Altavista

The system opened in December 1995. It is owned by DEC and has been collaborating with Yahoo since 1996.

Excite Search

Launched at the end of 1995, the system developed rapidly. Magellan was purchased in July 1996, and WebCrawler was acquired in September 1996. However, both are still used separately from one another. Perhaps in the future they will work together.

This system also has a directory, Excite Reviews. Getting into it is a matter of luck, since not all sites are included. Information from this directory is not used by the search engine by default, but it can be consulted after viewing the search results.

HotBot

Launched in May 1996 and owned by Wired, it is based on the Inktomi search engine technology from Berkeley.

InfoSeek

Launched slightly before 1995, it is widely known, searches well, and is easily accessible. Currently, “Ultrasmart/Ultraseek” contains about 50 million URLs.

The default search option is Ultrasmart, in which case the search is performed in both directories. With the Ultraseek option, query results are returned without additional information. A genuinely new search technology also makes searching easier, along with many other features you can read about at InfoSeek. There is a directory separate from the search engine: InfoSeek Select.

Lycos

One of the oldest search engines, Lycos has been operating since approximately May 1994. It is widely known and often used. It includes the Point search engine (operating since 1995) and the A2Z catalog (operating since February 1996).

OpenText

The OpenText system appeared slightly before 1995 and began partnering with Yahoo in June 1996. It is gradually losing its position and will soon cease to be among the main search engines.

WebCrawler

Opened on April 20, 1994, as a research project of the University of Washington. In March 1995 it was acquired by America Online. There is a WebCrawler Select directory.

Yahoo

Yahoo, the oldest directory, was launched in early 1994. It is widely known, frequently used, and the most respected. In March 1996, another Yahoo catalog was launched: Yahooligans, for children. More and more regional and top-level Yahoo directories keep appearing.

Because Yahoo relies on manual submissions, some sites may not be included. If a Yahoo search does not produce suitable results, users can turn to a search engine. This works very simply: when a query is made to Yahoo, the directory forwards it to one of the major search engines. The first links in the resulting list of addresses are those from the directory, followed by addresses obtained from the search engines, in particular from Altavista.

Features of search engines

Each search engine has a number of features, and these should be taken into account when creating your pages.

Search engine type

“Full text” search engines index every word on a web page, excluding only some stop words. “Abstract” search engines create a kind of extract of each page.

For webmasters, full-text engines are more useful, because any word found on a web page is analyzed when its relevance to user queries is determined. With abstract engines, however, pages may sometimes be indexed better than with full-text ones. This can depend on the extraction algorithm, for example on the frequency with which the same words are used on the page.

Size

The size of a search engine is determined by the number of pages it indexes. A large search engine may index almost all of your pages, a medium-sized one may index your server only partially, and a small one may skip your pages altogether.

Update period

  • Some search engines immediately index a page at the user's request, and then continue indexing pages that have not yet been indexed
  • others may “crawl” the most popular pages of the network more often than the rest

Document index date

Some search engines show the date when a particular document was indexed. This helps the user understand how “fresh” the link returned by the search engine is. Others leave users to simply guess.

Submitted pages

Ideally, search engines should find any page on any server as a result of following links. The real picture looks different. Server pages appear in search engine indexes much earlier if they are directly specified (Add URL).

Non-submitted pages

If at least one page of a server is submitted, search engines will definitely find the rest of its pages by following links from it. However, this takes more time. Some engines index the entire server at once, but most, after recording the submitted page in the index, leave the rest of the server for later.

Indexing depth

This setting only applies to unspecified pages. It shows how many pages after the specified one the search engine will index.

Most large machines have no restrictions on indexing depth. In practice, this is not entirely true. Here are a few reasons why not all pages may be indexed:

  • not very careful use of frame structures (without duplicating links in the control (frameset) file)
  • using imagemaps without duplicating them with regular links

Frame support

If a search robot does not know how to work with frame structures, many pages built with frames will be missed during indexing.

ImageMap support

This is much the same problem as with frame structures.

Password-protected directories and servers

Some search engines can index such servers if you provide them with a username and password. Why is this necessary? So that users can see what is on your server: they will at least learn that such information exists, and perhaps they will then subscribe to it.

Link frequency

Major search engines can determine a document's popularity by how often it is linked to from other places on the Web. Based on such data, some engines “decide” whether it is worth spending time indexing the document.

Learning ability

If the server is updated frequently, the search engine will re-index it more often; if it is updated rarely, it will be re-indexed less often.

Indexing control

Shows what tools can be used to control a particular search engine. All major search engines obey the instructions in the robots.txt file. Some also support control via META tags in the indexed documents themselves.
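
For example, a minimal robots.txt placed at the root of a server, and the corresponding per-page META tag, might look like this (the disallowed path is hypothetical):

    # robots.txt - tells robots which parts of the server not to index
    User-agent: *
    Disallow: /private/

    <!-- per-page control inside an HTML document -->
    <meta name="robots" content="noindex, nofollow">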

Redirect

Some sites redirect visitors from one server to another, and this parameter indicates which URL will be associated with your documents. This is important because if the search engine does not handle the redirection, problems with non-existent files may arise.

Stop words

Some search engines do not include certain words in their indexes, or may exclude them from user queries. These are usually prepositions or other very frequently used words, which are omitted to save space on storage media. For example, Altavista ignores the word web, so for a query like web developer, links will be returned only for the second word. There are ways to get around this.
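
A toy sketch of such filtering (the stop list is invented; every engine maintains its own):

    # Toy stop-word filtering, as an engine might apply it to a query.
    stop_words = {"the", "a", "of", "web"}  # includes "web", as in the Altavista example

    def effective_terms(query):
        """Drop stop words before the index lookup."""
        return [w for w in query.lower().split() if w not in stop_words]

    print(effective_terms("web developer"))  # ['developer']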

Impact on the relevance determination algorithm

Search engines invariably use the location and frequency of keyword repetitions in a document. However, the additional mechanisms for boosting relevance differ from engine to engine. This parameter shows exactly what mechanisms a particular engine uses.

Spam penalties

All major search engines dislike it when a site tries to inflate its ranking by, for example, submitting itself multiple times through Add URL or repeating the same keyword many times. In most cases, such actions (spamming, stacking) are punished, and the site's rating falls instead.






