Step-by-step development of a thesaurus. The meaning of the word sample in the Russian language thesaurus


A thesaurus (from the Greek thesauros, "treasure") in modern linguistics is a special type of dictionary of general or specialized vocabulary that indicates semantic relationships (synonyms, antonyms, paronyms, hyponyms, hypernyms, etc.) between lexical units. Thesauri, especially in electronic format, are therefore one of the most effective tools for describing individual subject areas.

Unlike an explanatory dictionary, a thesaurus allows you to identify the meaning not only with the help of a definition, but also by correlating a word with other concepts and their groups, due to which it can be used in artificial intelligence systems.

In the past, the term thesaurus primarily denoted dictionaries that presented the vocabulary of a language with maximum completeness with examples of its use in texts.

Paronymy is the partial sound similarity of words that differ in meaning (completely or partially). Paronyms are often a source of speech errors.

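Examples of paronyms with the same root (given here in English translation of the Russian pairs, so the sound similarity is not visible): dress - put on, human - humane, pay - pay - pay.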

Examples of completely unrelated paronyms: biology - bryology, broth - brouillon, compote - complot, texture - fraktura.

However, a thesaurus is more than an information retrieval tool. A thesaurus can be considered as a universal model of a terminological system, and therefore as a formal system of knowledge contained in the language of a specific scientific field.

General purpose thesaurus

In the most general definition, a thesaurus is a dictionary with semantic connections between vocabulary units. Since the late 1950s, thesauri have been used in machine translation systems and information retrieval systems (IRS).

Unlike semantic dictionaries, which are designed to describe general vocabulary in detail, thesauri are created to store and classify highly specialized words and phrases. For example, the word substance is in the ROSS dictionary (Russian General Semantic Dictionary), whereas the names of all chemical compounds belong in a thesaurus.

What connections are described in the thesaurus? Usually:

    genus-species (AKO)

    part-whole (POF)

    synonymy/antonymy

    associative.
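
The relation types listed above can be stored as labeled links between terms. The following is a minimal Python sketch, not a description of any particular thesaurus: the class names and the sample terms are assumptions made for illustration, and only the labels AKO ("a kind of") and POF ("part of") come from the list.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    AKO = "genus-species (a kind of)"
    POF = "part-whole (part of)"
    SYN = "synonymy"
    ANT = "antonymy"
    ASSOC = "associative"


@dataclass(frozen=True)
class Link:
    source: str
    relation: Relation
    target: str


# Hypothetical sample links, not taken from a real thesaurus.
links = [
    Link("birch", Relation.AKO, "tree"),
    Link("branch", Relation.POF, "tree"),
    Link("big", Relation.ANT, "small"),
]


def related(term: str, relation: Relation) -> list[str]:
    """Return the targets linked to `term` by the given relation."""
    return [l.target for l in links if l.source == term and l.relation is relation]
```

For instance, `related("birch", Relation.AKO)` returns the genus of "birch" in this toy data.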

An example of a genus-species relationship

Example of semantic parsing

These are paradigmatic connections (stable connections that exist between words in a language), and the list is not exhaustive.

Syntagmatic (textual) connections are not represented in the thesaurus.

Example: WORDNET - intelligent computer thesaurus

http://wordnet.princeton.edu/perl/webwn

Created at Princeton University and freely distributed.

Key Features.

The words in it are grouped into sets of synonyms (synsets). They are divided into 4 dictionaries: nouns, adjectives, verbs and adverbs.

Synsets are linked both by hierarchical relations (hyponyms and hypernyms) and by the relations of antonymy and meronymy (being part of something or consisting of parts).

The problem of morphology has also been solved: a word submitted to WordNet is returned in its base form.

Information retrieval thesaurus

In the field of information retrieval, the benefit of using thesauri comes from moving from text to descriptors that describe real-world objects. Moving to descriptors makes extended (redundant) indexing possible.

In an information retrieval thesaurus, PARADIGMATIC relationships between descriptors are explicitly expressed (not all, but those that are most often important for increasing the completeness of information retrieval). It has been experimentally determined that the most important paradigmatic relationships are

    subordination

    similarity

    species-genus (genus-species)

    cause-effect

    part-whole.

Example dictionary entry:

Agricultural machines. Agricultural equipment

Syn. agricultural machinery

Species: potato harvester, seeder, etc.

Redundant indexing example

Request "Agricultural machines. Agricultural equipment"
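
Redundant indexing can be illustrated with a toy descriptor entry: a text that mentions any synonym or species term is also indexed under the descriptor, so a request for the descriptor finds it. This is a hypothetical Python sketch; the data layout and the function name are assumptions, and only the terms come from the entry above.

```python
# Toy descriptor entry; the structure is an assumption for illustration.
THESAURUS = {
    "agricultural machines": {
        "synonyms": ["agricultural equipment", "agricultural machinery"],
        "species": ["potato harvester", "seeder"],
    }
}


def index_terms(text: str) -> set[str]:
    """Return descriptors whose name, synonym or species term occurs in the text."""
    lowered = text.lower()
    found = set()
    for descriptor, entry in THESAURUS.items():
        variants = [descriptor, *entry["synonyms"], *entry["species"]]
        # Naive substring matching; a real system would lemmatize first.
        if any(variant in lowered for variant in variants):
            found.add(descriptor)
    return found
```

A document that only mentions a seeder is thus retrievable by the request for agricultural machines, even though the descriptor itself never appears in it.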

Example: Socio-political thesaurus of the Russian language of the University Information System RUSSIA

http://www.cir.ru/index.jsp

Developed by the Autonomous Non-Profit Organization “Information Research Center” (ANO ITSI)

A thesaurus is a terminological resource implemented as a dictionary of concepts and terms with connections between them. The main purpose of a thesaurus is to help with information retrieval: the query is expanded on the basis of thesaurus links, and navigation through these links helps to formulate the query itself more clearly.

A feature of the hierarchy of the Thesaurus of the UIS "Russia" is multiple classification: for most concepts, rather than looking for a single classifying concept (the ABOVE - BELOW relation), different points of view on the concept are described. For example, the concept STORE can be considered both as a BUILDING and as a RETAIL ORGANIZATION.

The thesaurus on socio-political topics includes more than 26,000 concepts, 62,000 terms, 100,000 direct and 700,000 inherited relationships between concepts. The current version of the Thesaurus describes the terminology used in the socio-political field, including the economic, political, military, legislative, social and international relations areas, among others.

The full name of the Thesaurus is Information and retrieval thesaurus on socio-political topics for automatic indexing. All definitions are important here:

    “information retrieval” – as it is designed specifically for use in information retrieval to assist the user in forming (clarifying) a query and for automatically expanding the query conditions during search;

    “on socio-political topics” - since it covers 95-99% of the vocabulary and terminology of the Russian-language text on socio-political topics;

    “for automatic indexing” - since it is the basis for the process of automatically determining the subject of documents - grouping terms close in the thesaurus hierarchy into thematic nodes, automatic categorization and automatic annotation.

Thesauri - conclusion

For many well-known thesauri (WordNet, Roget, EuroWordNet), automatic inference over thesaurus connections remains a major problem: expansion to the nearest neighborhood is correct but incomplete, while attempts to widen the neighborhood lead to errors.

N. V. Lukashevich

B. V. Dobrov

Research Computing Center of Lomonosov Moscow State University;

ANO Center for Information Research

Keywords: thesaurus, information retrieval, automatic text processing,

The vast majority of technologies working with large collections of texts are based on statistical and probabilistic methods. The reason is that lexical resources suitable for processing text collections with linguistic methods must contain tens of thousands of dictionary entries and have a number of important properties that need to be specifically monitored during development. In this report we examine the basic principles of developing lexical resources for automatic processing of large text collections, using the example of RuTez, a Russian language thesaurus for computer text processing that has been under development since 1997 and is currently a hierarchical network of more than 42 thousand concepts. We describe the current state of the thesaurus based on a comparison of its lexical composition with the text corpus of the University Information System RUSSIA (www.cir.ru), which contains 400 thousand documents. Examples of the use of the thesaurus in various automatic text processing applications are discussed.

  1. Introduction

Millions of documents are now available in electronic form, and thousands of information systems and electronic libraries have been created. At the same time, information systems that use lexical and terminological resources for searching account for only fractions of a percent of the total. This is due to the serious challenges of creating such linguistic resources for automatic processing of modern collections of electronic documents.

First, these collections are usually very large, so the resource must include descriptions of thousands of words and terms. Second, a collection is a set of documents of different formats with varied syntactic structures, which makes it difficult to process the sentences of a text automatically. In addition, important information is often spread across different sentences of the text.

All this acutely raises the question of what a linguistic resource should be, which, on the one hand, would be useful for automatic processing and searching in electronic collections, on the other hand, could be created in a foreseeable time and maintained with relatively little effort.

In this article we look at the basic principles of developing lexical resources for automatic processing of large text collections. These principles are examined using the example of RuTez, a Russian language thesaurus for computer text processing that the ANO Center for Information Research has been developing since 1997. RuTez is currently a hierarchical network of more than 42 thousand concepts, which includes more than 95 thousand Russian words, expressions, and terms. We describe the current state of the thesaurus based on a comparison of its lexical composition with the vocabulary of the text corpus of the University Information System RUSSIA, which is maintained by the Research Computing Center of Lomonosov Moscow State University and ANO TSII. UIS RUSSIA (www.cir.ru) contains 400 thousand documents on socio-political topics (about 3 GB of text, 200 million words). The article also discusses examples of using the thesaurus in various automatic text processing applications.

  2. Principles for developing a linguistic resource for information retrieval tasks

To ensure effective automatic processing of electronic documents (automatic indexing, categorization, comparison of documents), it is necessary to build a basis for their comparison - a list of what was mentioned in the document. For such an index to be more effective than a word-by-word index, it is necessary to overcome the lexical diversity of the text: synonyms, polysemy, parts of speech, stylistics, and reduce it to an invariant - a concept that becomes the basis for comparing different texts. Thus, concepts should become the basis of a linguistic resource, and linguistic expressions: words, terms - become only text inputs that initialize the corresponding concept.

In order to compare different but similar concepts, relationships must be established between them. Traditionally, linguistic resources for automatic natural language processing have used certain sets of semantic relations, such as part, source, cause and so on. However, when working with large and heterogeneous text collections, we must understand that, given the current state of text processing technology, a computer system will not be able to detect these relations in text reliably enough to carry out the procedures that we have associated with particular relations. Therefore, the relations between concepts must first of all describe invariant properties that do not depend, or depend only weakly, on the topic of the specific text in which the concept is mentioned.

The main function of such a relation is to answer the following question:

if it is known that a text is devoted to discussing C1, and C2 is linked to C1 by relation R, can we say that the text is also related to C2? (*)

When creating a linguistic resource for automatic processing, it is important to determine which properties of the concepts C1 and C2 allow us to establish correct (*) relationships between them.

For example, whatever texts are written about birches, we can always say that these texts are about trees. But although the relation of tree as part of forest is popular and frequently discussed, very few texts about trees are texts about forests. Note that the problem is not tied to the name of the relation: a clearing is also part of a forest, and texts about clearings are texts about forests.

The invariance of relations across the range of possible text topics in a subject area is largely determined by properties deeper than those reflected in the names of relations, namely their quantifier and existential properties. The quantifier properties of a relation describe whether all examples of a concept have this relation and whether the relation persists throughout the entire life cycle of an example. The problem with the relation tree - forest arises precisely because not every specific tree is located in a forest, whereas a clearing cannot exist outside a forest.

An example of the existential properties of relations is whether the existence of concept C1 requires the existence of concept C2 (for example, the existence of the concept GARAGE requires the existence of the concept AUTOMOBILE), or whether the existence of examples of C1 depends on the existence of examples of C2 (a specific FLOOD is inseparable from a specific example of RIVER). When a text discusses the dependent concept, especially one that depends on a specific example, this suggests that the text is also related to the concept on which it depends.
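
Rule (*) combined with the quantifier property can be sketched in a few lines of Python. This is only an illustration under assumptions: the tuple layout and the flag `holds_for_all` are hypothetical, and the three relations are the birch, clearing and tree examples from the text.

```python
# (C1, relation name, C2, holds_for_all): the flag records whether every
# example of C1 stands in this relation to C2 (the quantifier property).
RELATIONS = [
    ("BIRCH", "BELOW", "TREE", True),
    ("CLEARING", "PART", "FOREST", True),
    ("TREE", "PART", "FOREST", False),  # not every tree stands in a forest
]


def implied_topics(concept: str) -> set[str]:
    """Concepts C2 that a text about `concept` can also be assigned to under rule (*)."""
    return {
        target
        for source, _name, target, holds_for_all in RELATIONS
        if source == concept and holds_for_all
    }
```

Both CLEARING and TREE carry a relation named PART to FOREST, but only the first one propagates the topic, which mirrors the point that the name of a relation does not decide the matter.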

Let us consider the relationship between the concepts FOREST and TREE in detail. In fact, the part of the concept FOREST is TREE IN THE FOREST, while there are also FREE-STANDING TREE, TREE IN THE GARDEN, etc. In any case, the subordination of the concept TREE to the concept FOREST has to be broken.

On the other hand, FOREST is a kind of COLLECTION OF TREES and does not exist without trees (the same is true of GARDEN). Thus, the concept FOREST must be described as dependent on the concept TREE. Starting from an analysis of the needs of specific applications, we came to the conclusion that it is important to describe the deep properties of relations, which have so far been reflected very little in linguistic resources but are of paramount importance for automatic processing of large text collections and, possibly, for many other tasks.

We now model the quantifier and existential properties of concepts with a set of traditional thesaurus relations, ABOVE-BELOW (66% of all relations), PART-WHOLE (30% of relations) and ASSOCIATION (4%), combined with a set of additional modifiers (20% of relations carry a modifier). Note that the PART-WHOLE and ASSOCIATION relations are interpreted taking rule (*) into account. In total, about 160 thousand direct connections between concepts are described, which, taking into account the transitivity of relations, gives more than 1350 thousand distinct connections; that is, on average each concept is connected with about 30 others.
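
The step from direct to inherited connections can be sketched as a transitive closure over the relation graph. The Python sketch below uses a tiny hypothetical hierarchy (the concept names appear in this section, but the links between them are assumptions) and ignores relation types and modifiers, which the real thesaurus takes into account.

```python
from collections import defaultdict

# Hypothetical direct links (lower concept, upper concept).
DIRECT = [
    ("BIRCH", "TREE"),
    ("TREE", "PLANT"),
    ("CONIFEROUS FOREST", "FOREST"),
    ("FOREST", "COLLECTION OF TREES"),
]


def closure(direct):
    """Return every (lower, upper) pair reachable through chains of direct links."""
    upper = defaultdict(set)
    for low, up in direct:
        upper[low].add(up)
    pairs = set()
    for start in list(upper):
        stack = list(upper[start])
        while stack:
            node = stack.pop()
            if (start, node) not in pairs:
                pairs.add((start, node))
                stack.extend(upper.get(node, ()))
    return pairs


all_links = closure(DIRECT)
```

Here four direct links yield six links in total. With the figures quoted above, 1350 thousand connections over roughly 42 thousand concepts come to about 32 per concept, consistent with the stated average of about 30.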

  3. RuTez Thesaurus: general structure

The RuTez thesaurus is a hierarchical network of concepts corresponding to the meanings of individual words, text expressions or synonymous series. Thus, the main elements of a thesaurus are concepts, linguistic expressions, relationships between linguistic expressions and concepts, and relationships between concepts.

The thesaurus brings together in a single system both linguistic knowledge (descriptions of lexemes, idioms and their connections, traditionally treated as lexical, semantic knowledge) and knowledge about terms and relations within subject areas, which is traditionally the domain of terminologists and is described in information retrieval thesauri. The thesaurus describes subject areas such as economics, legislation, finance and international relations, which are so important for a person's everyday life that they have significant lexical representation in traditional explanatory dictionaries. In these areas the lexical and the terminological are closely intertwined and interact strongly with each other.

Linguistic expressions are individual lexemes (nouns, adjectives and verbs), nominal and verbal groups. Thus, the thesaurus does not currently include adverbs and function words as linguistic expressions. Multiword groups may include terms, idioms, lexical functions ( influence e).

For each linguistic expression the following is described:

Its polysemy: a connection with one or more concepts, which means that the linguistic expression can serve as a textual expression of each of these concepts. Attribution of a linguistic expression to several concepts is thus an implicit indication of its polysemy;

Its morphological composition (part of speech, number, case);

Writing features (for example, with capital letters) and so on.

Each thesaurus concept has a unique name, a list of linguistic expressions with which this concept can be expressed in the text, and a list of relationships with other concepts.
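
The two kinds of records described above can be sketched as simple data classes. This is a hypothetical Python layout, not the actual RuTez storage format; the field names are assumptions, and the sample values reuse concept names that appear elsewhere in this article.

```python
from dataclasses import dataclass, field


@dataclass
class Expression:
    text: str
    part_of_speech: str
    concepts: list[str] = field(default_factory=list)  # names of linked concepts

    @property
    def is_ambiguous(self) -> bool:
        # An expression linked to several concepts is polysemous.
        return len(self.concepts) > 1


@dataclass
class Concept:
    name: str  # unique name
    expressions: list[str] = field(default_factory=list)
    relations: list[tuple[str, str]] = field(default_factory=list)  # (relation, concept)


church = Expression("church", "noun", ["CHURCH (ORGANIZATION)", "CHURCH (BUILDING)"])
forest = Concept(
    "FOREST",
    ["forest", "woodland"],
    [("BELOW", "JUNGLE"), ("PART", "EDGE OF THE FOREST")],
)
```

Polysemy is not stored as a separate attribute in this sketch: it follows from the number of concepts an expression is attached to, as the description above suggests.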

One of a concept's unambiguous text expressions is usually chosen as its unique name. However, the name can also be formed from a pair of ambiguous text expressions that are synonyms, written separated by a comma, which together define the concept unambiguously (for example, the concept THICK). An ambiguous text expression used as the name of a concept can also be supplied with a mark or a shortened fragment of its definition, for example, the concept CROWD (GROUP OF PEOPLE).

  4. Example dictionary entry

We chose as an example the dictionary entry for the concept FOREST, corresponding to one of the meanings of the word forest. This dictionary entry is interesting because it includes different types of knowledge, traditionally classified as lexical (semantic) knowledge and encyclopedic knowledge (knowledge about the subject area, terminology).

Synonyms for the concept FOREST (13 in total):

forest (M), forest zone, forest environment, forest, forest quarter, forest landscape, forest area, woodland, wooded area, forest area, little forest, array of forests.

Concepts BELOW, with synonyms:

JUNGLE (jungle);
FOREST PARK (city garden, green area, green area, forest park, forest management, forest park belt, park (M), park area);
FORESTRY;
LEAVED FOREST (soft-leaved forest, hard-leaved forest);
GROVE (oak grove);
CONIFEROUS FOREST (coniferous forest, dark coniferous forest).

Concepts that are PARTS, with synonyms:

WINDBREAK (windfall, windfall);
CUTTING (cutting area);
FOREST CULTURE (forest species, forestry culture);
FOREST LAND (forest lands; lands covered with forest; forest lands, forest territory; forested land, forested area);
FOREST PLANTATIONS (forest plantations, forest plantations, afforestation);
EDGE OF THE FOREST (edge, edge);
UNDERGROWTH (undergrowth);
PROSEKA;
DRY WOOD (deadwood).

Here the mark (M) indicates that the text input is ambiguous.

The concept FOREST also has other relations, the so-called dependency relations (called ASC 2, asymmetric association, in the current version): FOREST FIRE (forest fire, fire in the forest); FOREST USE (forest use, use of forest fund areas); FORESTRY; FOREST SCIENCE (forest science). As noted in Section 2, the concept FOREST depends on the concept TREE, which is denoted in the thesaurus by the relation ASC 1.

In total, the concept FOREST is directly connected with 28 other concepts and, taking into account the transitivity of relations, with 235 concepts (more than 650 text inputs in total).

  5. Assessment of the current state of the Russian language thesaurus RuTez

5.1. Lexical composition

Currently, the thesaurus network includes more than 95 thousand linguistic expressions, of which 61 thousand are single-word.

This volume of work forced us to decide which words and linguistic expressions needed to be included in the Thesaurus. The natural wish was to see how the most frequent words of the Russian language are represented in the thesaurus. For this purpose, the text collection of the University Information System RUSSIA (400 thousand documents) was used. The collection contains official documents from various bodies of the Russian Federation (55 thousand documents since 1992), press materials since 1999 (the newspapers Izvestia, Nezavisimaya Gazeta, Komsomolskaya Pravda and Arguments and Facts, the magazine Expert and others), and materials from scientific journals ("Bulletin of Moscow University", "Sociological Journal"). The list of lemmas included in the Thesaurus was compared with the list of the 100,000 most frequent lemmas in the text collection (frequency above 25).

Lexeme-by-lexeme annotation of the list showed that, of these hundred thousand lemmas, 35 thousand are described in RuTez and only about 7 thousand of the remaining lexemes deserve inclusion in the Thesaurus; the rest are lemmatized variants of various proper names. Replenishment has therefore ceased to be a priority task and is carried out gradually, starting with the most frequent words. Once this list is mostly exhausted, another comparison with the text collection of the information system will be made and new lexemes with a frequency above 25 will be selected. After that, the frequency threshold is to be lowered. The extensive set of text examples in the collection makes it possible to respond quickly to “lexical innovations” (for example, installation, blockbuster, beau monde, thriller) and to include them in the appropriate places in the hierarchical system of the Thesaurus.
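
The comparison of the thesaurus with the frequency list can be sketched as a simple set difference with a threshold. In this Python sketch the function name and the counts are made up for illustration; only the threshold of 25 and the sample words come from the text.

```python
def missing_lemmas(
    corpus_freq: dict[str, int], thesaurus: set[str], threshold: int = 25
) -> list[str]:
    """Lemmas above the frequency threshold that the thesaurus lacks, most frequent first."""
    candidates = [
        lemma
        for lemma, freq in corpus_freq.items()
        if freq > threshold and lemma not in thesaurus
    ]
    return sorted(candidates, key=lambda lemma: (-corpus_freq[lemma], lemma))


# Made-up counts for illustration only.
corpus_freq = {"forest": 900, "blockbuster": 140, "thriller": 60, "beau monde": 12}
thesaurus = {"forest"}
```

The resulting list still needs manual review, since, as noted above, most frequent lemmas missing from the thesaurus turn out to be variants of proper names.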

Constant work with a current text collection provides unique opportunities to check the significance and quality of the lexical descriptions offered in dictionaries. For example, the word Mother See turned out to be used unusually often (more than 400 times). A check of the collection showed that the word is indeed often used as a synonym for the word Moscow, while explanatory dictionaries often mark it as obsolete. Another frequently used word (more than 300 times) that dictionaries mark as obsolete is blissful.

5.2. Description of word meanings

Comparison with the text collection shows that many of the frequent words in the collection are well represented in the Thesaurus in at least one of their (usually basic) meanings. Finding out to what extent the Thesaurus covers the range of meanings of polysemous Russian words is currently our primary task.

As is known, different dictionary sources often give different sets of meanings for polysemous words and distinguish shades of meaning, and the same type of polysemy can be described differently for different words even within one dictionary. Therefore, describing the meanings of lexemes consistently and representatively is an important task for the creators of any dictionary resource.

However, if the resource is intended for automatic processing, a balanced description of meanings becomes much more important. An excessive proliferation of meanings may leave the computer system unable to select the right one, which in turn significantly reduces the efficiency of the automatic text processing system. Thus, one of the disadvantages of WordNet as a resource for automatic text processing is the excessive number of meanings described for some words (in WordNet 1.6: 53 meanings for run, 47 for play and so on). These meanings are difficult to distinguish even for humans who annotate texts semantically. Clearly, a computer system cannot cope with choosing the appropriate meaning either. Various authors therefore propose ways of combining meanings to improve processing quality.

At the same time, an opposite factor is at work: if the meanings really differ in their sets of dictionary connections (in our case, thesaurus connections), they cannot be merged into one unit (one concept), since this too would degrade the quality of automatic processing.

Let us take the example of the words school and church, each of which can be considered both as an organization and as a building.

Each school organization has a building (most often one). All parts of a school building (classrooms, blackboards) are related to the school as an organization. There are no specific types of school buildings. It is therefore inappropriate to split off the description of a school as a building into a separate concept. However, the description of such a collective concept SCHOOL, as an organization and as a building, must include a specially marked relation to the concept BUILDING. To describe such relations in the Thesaurus, a mark on the relation is used: the modifier “A” (“aspect”; during automatic analysis, “confirmation” by other concepts is required for this relation to be taken into account).

SCHOOL
ABOVE EDUCATIONAL INSTITUTION
ABOVE (A) PUBLIC BUILDING

The corresponding meanings of the word church are not as close. A church as an organization can have a large number of church buildings in different places, and it also has many other buildings. A church building is closely related to a religion and confession, but it can pass from one church organization to another. Church as an organization and church as a building have different subtypes. That is why CHURCH (ORGANIZATION) and CHURCH (BUILDING) are presented in RuTez as different concepts.

The significant divergence in thesaurus connections correlates in an interesting way with the ability of the denotations corresponding to the meanings to exist separately from each other. Thus, a church-building does not cease to exist and even be called a church even when its use changes, unlike a school-building.
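
The behaviour of the aspect modifier for SCHOOL can be sketched as follows. This Python sketch rests on assumptions: the text only says that an aspect relation needs “confirmation” by other concepts, so the table of confirming concepts, the concept ORGANIZATION as the unmarked upper concept and the function name are all hypothetical.

```python
# (concept, modifier, upper concept); modifier "A" marks an aspect relation.
RELATIONS = [
    ("SCHOOL", None, "ORGANIZATION"),
    ("SCHOOL", "A", "BUILDING"),
]

# Hypothetical: concepts whose presence in a text confirms the building aspect.
CONFIRMING = {"BUILDING": {"ROOF", "RENOVATION", "CONSTRUCTION"}}


def active_upper(concept, text_concepts):
    """Upper concepts of `concept` that apply to a text with the given concepts."""
    result = []
    for source, modifier, target in RELATIONS:
        if source != concept:
            continue
        confirmed = CONFIRMING.get(target, set()) & set(text_concepts)
        if modifier == "A" and not confirmed:
            continue  # aspect relation is not confirmed by the text
        result.append(target)
    return result
```

A text about teachers then relates SCHOOL only to ORGANIZATION, while a text that also mentions a roof activates the BUILDING aspect as well.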

The representation of meanings in the Thesaurus is being verified continuously, starting with the most frequent lemmas. For each frequent lexeme, we check how its meanings are described in explanatory dictionaries, which meanings are used in the collection and how they are presented in the Thesaurus. As a result, a list of 10,000 lexemes has now been compiled whose ambiguity still requires either additional analysis or additional description. The list was obtained from the 30 thousand most frequent lemmas.

It should be noted that in the Thesaurus the problem of ambiguity is partially alleviated because the different meanings of a word can be linked by thesaurus relations, so the concept highest in the hierarchy can be selected by default: it was certainly discussed in the text. For example, the word photo has three meanings: photography as a field of activity, a photograph as a photographic image, and a photo studio:

PHOTOGRAPHY (photographing, photo business, ..., photo)
PART PHOTOGRAPHIC IMAGE (photo, photograph, photo)
PART PHOTO STUDIO (photo).

Thus, if it was not possible to determine in which meaning the word photo was used, the default assumption is that photography was discussed (as a process, a result, or a place), which is sufficient for many automatic text processing applications.
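
The default choice for the word photo can be sketched as picking the sense that lies above all the others in the hierarchy. In this Python sketch the table and function names are assumptions; the three concepts and their PART links are those of the entry above.

```python
# Each concept mapped to the concept it is a PART of (None for the top one).
PARENT = {
    "PHOTOGRAPHIC IMAGE": "PHOTOGRAPHY",
    "PHOTO STUDIO": "PHOTOGRAPHY",
    "PHOTOGRAPHY": None,
}
SENSES = {"photo": ["PHOTOGRAPHY", "PHOTOGRAPHIC IMAGE", "PHOTO STUDIO"]}


def ancestors(concept):
    """The concept itself followed by everything above it in the hierarchy."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT.get(concept)
    return chain


def default_concept(word):
    """Return the sense that is equal to or above every other sense, if one exists."""
    senses = SENSES[word]
    for candidate in senses:
        if all(candidate in ancestors(sense) for sense in senses):
            return candidate
    return None
```

If the senses of a word do not share such a common upper concept, the function returns `None` and the ambiguity has to be resolved by other means.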

  6. Application of the RuTez thesaurus for automatic text processing

Since 1995, the socio-political terminology of RuTez (the socio-political thesaurus) has been actively and successfully used in various automatic text processing applications, such as automatic conceptual indexing, automatic rubrication using several rubricators, and automatic annotation of texts, including English ones. The socio-political thesaurus (27 thousand concepts, 62 thousand text inputs) is a basic search tool in the search engine of UIS RUSSIA (www.cir.ru).

All vocabulary of the RuTez thesaurus is used in procedures for automatically categorizing texts using complex hierarchical rubricators. In the existing technology, each category is described as a Boolean expression of terms, after which the original formula is expanded along the thesaurus hierarchy. The resulting Boolean expression may already include hundreds and thousands of conjuncts and disjuncts.

Let us give, as an example, a fragment of a description using thesaurus concepts (and linguistic expressions after expanding the formula) of the “Image of a Woman” rubric of the SOFIST 2 rubricator, used by VTsIOM to classify public opinion poll questionnaires:

(WOMAN[N]
|| GIRL[N]
|| RELATIVE[L] (grandmother, granddaughter, cousin, daughter, sister-in-law, mother, stepmother, daughter-in-law, stepdaughter, ...))
(CHARACTER TRAIT[L] (thrifty, heartless, forgetful, frivolous, mocking, intolerant, sociable, ...)
|| IMAGE[E] (representation, appearance, appearance, appearance, appearance, image, appearance)
|| PLEASANT[L] (..., interesting, beautiful, cute, attractive, cute, attractive, ...)
|| UNPLEASANT[L] (unsympathetic, rude, nasty, ...)
|| APPRECIATE[L] (to revere, adore, adore, worship, adore, ...)
|| PREFER[N]

The symbol “E” denotes full expansion along the thesaurus hierarchy, the symbol “L” - according to species relations (“BELOW”), the symbol “N” - do not expand.
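
The three expansion modes can be sketched as a graph walk that follows different sets of links. This is a hypothetical Python illustration, not the rubrication software itself: the two small link tables and the function name are assumptions, and only the meaning of the symbols E, L and N comes from the text.

```python
# Hypothetical fragment of a hierarchy: concept -> concepts directly BELOW it.
BELOW = {
    "RELATIVE": ["MOTHER", "SISTER"],
    "MOTHER": ["STEPMOTHER"],
}
# Hypothetical non-species links, followed only in full expansion.
OTHER = {"IMAGE": ["APPEARANCE"]}


def expand(concept, mode):
    """Expand a concept: 'N' no expansion, 'L' BELOW links only, 'E' all links."""
    if mode == "N":
        return {concept}
    tables = [BELOW] if mode == "L" else [BELOW, OTHER]
    seen, stack = {concept}, [concept]
    while stack:
        node = stack.pop()
        for table in tables:
            for child in table.get(node, []):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
    return seen
```

A disjunction such as the first group of the formula would then match a text if any concept from the union of the expanded sets is found in it.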

Research is being carried out to develop a combined technology for automatic text categorization, combining thesaurus knowledge and machine learning procedures.

The issues of using a thesaurus to expand a query formulated in natural language are being explored (currently, only the socio-political part of the thesaurus is used to expand a terminological query in the information retrieval system of the UIS RUSSIA), and searching for answers to questions in large text collections.

  7. Conclusion

The paper presents the basic principles of developing linguistic resources for automatic processing of large text collections. The created linguistic resource - Thesaurus of the Russian language RuTez - is intended for use in such automatic text processing applications as conceptual indexing of documents, automatic rubrication according to complex hierarchical rubricators, automatic expansion of natural language queries.

This work is partially supported by the Russian Humanitarian Foundation grant No. 00-04-00272a.

Thesaurus of Russian language for natural language processing of large text collections

Natalia V. Loukachevitch, Boris V. Dobrov

Keywords: thesaurus, natural language processing, information retrieval

In our presentation we consider the main principles of developing lexical resources for automatic processing of large text collections and describe the structure of the Thesaurus of Russian Language, which has been developed since 1997 specifically as a tool for automatic text processing. The Thesaurus is now a hierarchical net of 42 thousand concepts. We describe the current stage of development of the Thesaurus in comparison with the 100,000 most frequent lemmas of the text collection of the University Information System RUSSIA (www.cir.ru), which includes 400 thousand documents. We also consider the use of the Thesaurus in different automatic text processing applications.

The first stage of creating a thesaurus was a search for information about the structure of thesauri, their types and existing programs. The second stage was choosing a language and a layout for the future thesaurus. The third stage was a search for information to fill it; for this I used the “Educational and Methodological Complex Computer Networks”.

Here are a couple of examples of thesauri (see Figure 1.1 and Figure 1.2):

Figure 1.1 - Information retrieval system “Thesaurus.com”

Figure 1.2 - Dictionary of gender terms

After the necessary information had been collected, work on creating the thesaurus began. HTML was chosen as the language for the thesaurus. HyperText Markup Language (HTML) is a markup language, and many have long ceased to regard it as merely a language, since the very concept of HTML covers various methods of designing hypertext documents, hypertext editors, browsers and much more. A user who has mastered this language can do serious things with simple methods and, most importantly, quickly, which is highly valued in the modern world.

In HTML you can create your own multimedia products and distribute them on any media. Products made as sets of HTML pages do not require specialized software, since everything needed to work with the data (a Web browser) has become part of the standard software of most personal computers.

The code for a Web page is usually typed in a standard text editor, but other tools and languages exist as well, for example the editor Adobe Dreamweaver CS3 and the programming languages JavaScript, Pascal, C, C++, BASIC and Prolog.

To begin with, the thesaurus will consist of three frames: a title frame, a links frame, and a content frame, as shown in Figure 1.3.

Figure 1.3 - Thesaurus diagram

The following tags and attributes of the HTML language were used to create the thesaurus sketch:

text- site title;

- two frames horizontally measuring 120px and the remaining space;

- canceling the ability to stretch frame boundaries;

- vertical frames;

- specifies the name of the frame for the possibility of sending information to this frame.

To fill the frames with information, we write the code in the documents: “new.txt” - the “Title” frame, “nav.txt” - the “Links” frame, “main.txt” - the “Contents” frame.
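
The three-frame layout described above can be sketched by generating the frameset page from a short Python script. This is an illustration under assumptions rather than the original code: the 120px title row and the three document names come from the text, while the width of the links column, the frame names and the page title are hypothetical. Note that frames belong to HTML 4.01 and are obsolete in HTML5.

```python
def frameset_page(title: str, top_height: int = 120, nav_width: int = 200) -> str:
    """Build an HTML 4.01 frameset page: a title row above a links/content pair."""
    return (
        '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" '
        '"http://www.w3.org/TR/html4/frameset.dtd">\n'
        "<html>\n"
        f"<head><title>{title}</title></head>\n"
        # Two rows: the title frame of fixed height and the remaining space.
        f'<frameset rows="{top_height},*">\n'
        '  <frame src="new.txt" name="title" noresize>\n'
        # Two columns inside the second row: links and content.
        f'  <frameset cols="{nav_width},*">\n'
        '    <frame src="nav.txt" name="links" noresize>\n'
        '    <frame src="main.txt" name="content">\n'
        "  </frameset>\n"
        "</frameset>\n"
        "</html>\n"
    )


page = frameset_page("Thesaurus")
```

Giving the content frame a name is what allows links in the links frame to open their targets there through the `target` attribute.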

The document “new.txt” contains the code responsible for the name of the thesaurus itself. Main tags: