Operational analytical processing (OLAP)


For many years, information technology focused on building systems to support the processing of corporate transactions. Such systems must be highly fault-tolerant and provide fast response times. An effective solution was OLTP (Online Transaction Processing), oriented toward a distributed relational database environment.

A more recent development in this area was the adoption of client-server architecture. Many tools for developing OLTP applications have appeared on the market.

Both OLTP applications and decision support information systems often need access to the same data. Unfortunately, trying to serve both types of workload from one database can be problematic, so some companies have chosen to split their databases into an OLTP part and an OLAP part.

OLAP (Online Analytical Processing) is an information process that allows the user to query the system, conduct analysis, and so on in online mode, with results generated within seconds.

An OLTP system, by contrast, processes huge volumes of transactions as quickly as they arrive as input.

OLAP systems are designed for end users, while OLTP systems are made for professional IS users. OLAP covers such activities as generating queries, requesting ad hoc reports, performing statistical analysis, and building multimedia applications.

Providing OLAP requires working with a data warehouse (or a multidimensional warehouse) as well as a set of tools, typically with multidimensional capabilities. These tools may be query tools, spreadsheets, data mining (Data Mining) tools, data visualization tools, and so on.

The OLAP concept is based on the principle of multidimensional data representation. E. Codd examined the shortcomings of the relational model, pointing first of all to the inability to combine, view, and analyze data from the point of view of multiple dimensions, that is, in the way most understandable to corporate analysts. He then formulated general requirements for OLAP systems that extend the functionality of relational DBMSs and include multidimensional analysis as one of their characteristics.

In many publications, the acronym OLAP denotes not only a multidimensional view of data but also the storage of the data itself in a multidimensional database. Generally speaking, this is not correct, since Codd himself notes that relational databases were, are, and will remain the most suitable technology for storing corporate data. What is needed is not a new database technology, but rather analysis tools that complement the functions of existing DBMSs and are flexible enough to provide for and automate the various kinds of intelligent analysis inherent in OLAP.

According to Codd, a multidimensional conceptual view is a multiple perspective consisting of several independent dimensions along which specific sets of data can be analyzed. Simultaneous analysis across multiple dimensions is defined as multidimensional analysis. Each dimension includes directions of data consolidation, consisting of a series of successive levels of generalization, where each higher level corresponds to a greater degree of aggregation along that dimension. Thus, the Performer dimension can be defined by a consolidation path consisting of the generalization levels “enterprise - division - department - employee”. The Time dimension can even include two consolidation paths, “year - quarter - month - day” and “week - day”, since counting time by month and by week is incompatible. It then becomes possible to select an arbitrary level of detail for each dimension. The drill-down operation corresponds to movement from higher levels of consolidation to lower ones; conversely, the roll-up operation means movement from lower levels to higher ones.
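
As a sketch of these operations in SQL (assuming a hypothetical SALES fact table with YEAR_ID, QUARTER_ID, and MONTH_ID columns and a VALUE measure), roll-up simply means grouping by a coarser level of the consolidation path:

-- Totals at the month level of the Time dimension
SELECT YEAR_ID, MONTH_ID, SUM(VALUE) AS TOTAL
FROM SALES
GROUP BY YEAR_ID, MONTH_ID;

-- The same data rolled up to the quarter level
SELECT YEAR_ID, QUARTER_ID, SUM(VALUE) AS TOTAL
FROM SALES
GROUP BY YEAR_ID, QUARTER_ID;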

Codd defined 12 rules that an OLAP-class software product must satisfy. These rules are:

1. Multidimensional conceptual representation of data.

2. Transparency.

3. Availability.

4. Steady performance.

5. Client-server architecture.

6. Equality of dimensions.

7. Dynamic processing of sparse matrices.

8. Support for multi-user mode.

9. Unlimited support for cross-dimensional operations.

10. Intuitive data manipulation.

11. Flexible report generation mechanism.

12. Unlimited number of dimensions and aggregation levels.

This set of requirements, which served as the de facto definition of OLAP, should be treated as recommendations, and specific products should be assessed by how closely they approach full compliance with all of them.

Data mining.

Data mining, or intelligent data analysis (IDA), is a term used to describe knowledge discovery in databases, knowledge extraction, data archaeology, data pattern processing, data cleaning, and data harvesting; it also denotes the accompanying software. All these activities are carried out automatically and allow even non-programmers to obtain quick results.

The query is made by the end user, possibly in natural language, and is converted into SQL. The SQL query travels over the network to the DBMS that manages the database or data warehouse. The DBMS finds the answer to the query and delivers it back. The user can then format the presentation or report as required.

Many important decisions in almost any area of business and the social sphere are based on the analysis of large and complex databases. IDA can be very helpful in such cases.

Data mining methods are closely related to OLAP technologies and data warehouse technologies, so the best option is an integrated approach to their implementation.

For existing data warehouses to support management decision making, information must be presented to the analyst in the required form; that is, the analyst must have well-developed tools for accessing and processing warehouse data.

Very often, information and analytical systems created for direct use by decision makers turn out to be extremely easy to use but severely limited in functionality. Such static systems are called Executive Information Systems (EIS). They contain predefined sets of queries and, while sufficient for everyday review, are unable to answer all the questions about the available data that may arise when making decisions. The output of such a system, as a rule, is a multi-page report, the careful study of which raises a new series of questions for the analyst. However, each new query that was not foreseen when such a system was designed must first be formally described and coded by a programmer, and only then executed. The waiting time in this case can be hours or days, which is not always acceptable. Thus, the outward simplicity of static decision support information systems, for which most customers of information and analytical systems actively fight, turns into a loss of flexibility.

Dynamic decision support systems, on the contrary, are focused on processing unregulated (ad hoc) analyst requests for data. The work of analysts with these systems consists of an interactive sequence of forming queries and studying their results.

But dynamic decision support systems can operate not only in the field of online analytical processing (OLAP). Support for making management decisions based on accumulated data can be performed in three basic areas.

1. The scope of detailed data. This is the domain of most information retrieval systems. In most cases, relational DBMSs cope well with the tasks that arise here; the generally accepted standard language for manipulating relational data is SQL. Information retrieval systems that provide an end-user interface for searching detailed information can be used as add-ons both over individual databases of transactional systems and over a common data warehouse.

2. The scope of aggregated indicators. A comprehensive view of the information collected in a data warehouse, its generalization and aggregation, and multidimensional analysis are the tasks of OLAP systems. Here one can either rely on special multidimensional DBMSs or remain within relational technologies. In the second case, pre-aggregated data can be collected in a star schema database, or the aggregation can be performed on the fly while scanning the detailed tables of the relational database.

3. The scope of patterns. Intelligent processing is carried out by data mining methods, whose main objectives are to find functional and logical patterns in the accumulated information and to build models and rules that explain the anomalies found and/or predict the development of certain processes.

The complete information structure of an analytical system built on the basis of a data warehouse is shown in Fig. 3.2. In specific implementations, individual components of this scheme are often missing.

Fig. 3.2. Structure of a corporate information and analytical system.

The structure of the warehouse database is usually designed to make the analysis of information as easy as possible. It should be convenient to “slice” the data along different directions (called dimensions). For example, today a user may want to see a summary of parts shipments by supplier in order to compare their performance. Tomorrow the same user may need a picture of the change in shipment volumes by month in order to track the dynamics of supplies. The database structure must support both kinds of analysis by allowing the extraction of data corresponding to a given set of dimensions.

Operational analytical data processing rests on the principle of organizing information into a hypercube model. The simplest three-dimensional data cube for parts supplies in the previously discussed test database is shown in Fig. 3.11. Each cell corresponds to a “fact”, for example, the delivery volume of a part. Along one edge of the cube (one dimension) lie the months during which the deliveries reflected by the cube were made. The second dimension consists of part types, and the third corresponds to suppliers. Each cell contains the delivery quantity for the corresponding combination of values along all three dimensions. Note that when filling the cube, the delivery values for each month from the test database were aggregated.


Fig. 3.11. A simplified hypercube for analyzing parts supplies

OLAP class systems differ in the way they present data.

Multidimensional OLAP (MOLAP) – these systems are based on a multidimensional data structure built on dynamic arrays with appropriate access methods. MOLAP is implemented with proprietary technologies for organizing multidimensional DBMSs. The advantage of this approach is the convenience of performing calculations on hypercube cells, because cells are created for all combinations of dimension values (as in a spreadsheet). Classic representatives of such systems include Oracle Express and SAS Institute MDDB.



Relational OLAP (ROLAP) – supports multidimensional analytical models over relational databases. This class of systems includes Informix MetaCube, Microsoft OLAP Services, Hyperion Solutions, and SAS Institute Relational OLAP.

Desktop OLAP – tools for generating multidimensional queries and reports over local information sources (spreadsheets, flat files). Notable systems include Business Objects and Cognos PowerPlay.




The most common systems are those of the ROLAP class. They make it possible to organize an information model over relationally complete storage of any structure or over a special data mart.

Fig. 3.12. Star schema of an analytical data mart for parts supplies

For most data warehouses, the most effective way to model an N-dimensional cube is a “star”. Fig. 3.12 shows a model for analyzing parts supplies in which information is consolidated along four dimensions (supplier, part, month, year). The star schema is based on a fact table. The fact table contains a column with the delivery quantity, as well as columns with foreign keys for all the dimension tables. Each dimension of the cube is represented by a table of values that serves as a lookup table for the fact table. To organize levels of generalization, categorical attributes (for example, “material - part”, “supplier - city”) are defined over the dimension lookup tables.
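
A minimal sketch of such a star schema in SQL (the table and column names here are assumptions chosen to match the sample query below; the month and year dimensions are kept as plain columns of the fact table for brevity):

-- Dimension (lookup) tables: the points of the star
CREATE TABLE SUPPLIER (
  SUPPLIER_CODE INTEGER PRIMARY KEY,
  SUPPLIER_NAME VARCHAR(60),
  CITY          VARCHAR(60)   -- supports the "supplier - city" generalization level
);

CREATE TABLE PART (
  PART_CODE INTEGER PRIMARY KEY,
  PART_NAME VARCHAR(60),
  MATERIAL  VARCHAR(60)       -- supports the "material - part" generalization level
);

-- Fact table: one row per aggregated delivery fact
CREATE TABLE FACT (
  SUPPLIER_CODE INTEGER REFERENCES SUPPLIER(SUPPLIER_CODE),
  PART_CODE     INTEGER REFERENCES PART(PART_CODE),
  MONTH_ID      INTEGER,      -- 1..12
  YEAR_ID       INTEGER,
  VALUE         NUMERIC       -- delivery quantity (the measure)
);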

The reason why the diagram in Fig. 3.12 is called a “star” is quite obvious. The ends of the star are formed by the dimension tables, and their links to the fact table located in the center form the rays. With this database structure, most business analysis queries join the central fact table with one or more dimension tables. For example, a query returning the shipment volumes of all parts in 2004 by month, broken down by supplier, looks like this:

SELECT SUPPLIER.SUPPLIER_NAME, FACT.MONTH_ID, SUM(FACT.VALUE)
FROM FACT, SUPPLIER
WHERE FACT.YEAR_ID = 2004
  AND FACT.SUPPLIER_CODE = SUPPLIER.SUPPLIER_CODE
GROUP BY SUPPLIER.SUPPLIER_NAME, FACT.MONTH_ID
ORDER BY SUPPLIER.SUPPLIER_NAME, FACT.MONTH_ID;

Fig. 3.13 shows a fragment of the report generated by this query.


The main concepts of a multidimensional data model are the data hypercube, dimension, members (labels), cell, and measure.

A data hypercube contains one or more dimensions and is an ordered collection of cells. Each cell is defined by one and only one set of dimension values (labels). A cell either contains data (a measure) or is empty.

A dimension is a set of labels forming one face of the hypercube. An example of a time dimension is a list of days, months, and quarters. An example of a geographical dimension is a list of territorial objects: settlements, districts, regions, countries, and so on.

To access the data, the user must specify one or more cells by selecting the dimension values that correspond to the required cells. The process of selecting dimension values is called fixing labels, and the set of selected dimension values is called a set of fixed labels.
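
In relational terms (a sketch against the FACT table assumed above), fixing a label corresponds to an equality constraint on a dimension value, while the dimensions left free form the axes of the resulting section:

-- The Supplier and Part labels are fixed; the Time dimension stays free,
-- so the result is a one-dimensional section of the hypercube.
SELECT YEAR_ID, MONTH_ID, SUM(VALUE) AS CELL_VALUE
FROM FACT
WHERE SUPPLIER_CODE = 1   -- fixed label on the Supplier dimension
  AND PART_CODE = 7       -- fixed label on the Part dimension
GROUP BY YEAR_ID, MONTH_ID;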


3.4 Methods of analytical data processing


On-line analytical processing (On-Line Analytical Processing, OLAP) is a key component of data warehouse organization. The OLAP concept was described in 1993 by Edgar Codd and imposes the following requirements on multidimensional analysis applications:

– multidimensional conceptual representation of data, including full support for hierarchies and multiple hierarchies (a key requirement of OLAP);

– providing the user with analysis results in an acceptable time (usually no more than 5 s), even at the cost of a less detailed analysis;

– the ability to perform any logical and statistical analysis specific to a given application and save it in a form accessible to the end user;

– multi-user access to data with support for appropriate locking mechanisms and authorized access means;

– the ability to access any necessary information, regardless of its volume and storage location.

An OLAP system consists of many components. At the highest level of presentation, the system includes a data source, a multidimensional database (MDB), which provides the ability to implement a reporting mechanism based on OLAP technology, an OLAP server and a client. The system is built on the client-server principle and provides remote and multi-user access to the MDB server.

Let's look at the components of an OLAP system.

Sources. The source in OLAP systems is the server that supplies the data for analysis. Depending on the area of use of the OLAP product, the source can be a data warehouse, a legacy database containing shared data, a set of tables combining financial data, or any combination of the above.

Data store. Source data is collected and stored in a warehouse designed according to data warehousing principles. The data warehouse is a relational database (RDB). Its main table (the fact table) contains the numerical values of the indicators for which statistical information is collected.

Multidimensional database. The data warehouse serves as the information provider for a multidimensional database (MDB), which is a collection of objects. The main classes of these objects are dimensions and measures. Dimensions are sets of values (parameters) by which the data is indexed, for example, time, region, type of institution, and so on. Each dimension is filled with values from the corresponding dimension tables of the data warehouse. The set of dimensions defines the space of the process under study. Indicators (measures) are the multidimensional data cubes (hypercubes). A hypercube contains the data itself as well as aggregate totals over the dimensions included in the indicator. Indicators constitute the main content of the MDB and are filled in according to the fact table. Along each axis of a hypercube, data can be organized into a hierarchy representing different levels of detail. This makes it possible to create hierarchical dimensions, which are used to aggregate or drill down the data presentation during subsequent analysis. A typical example of a hierarchical dimension is a list of territorial objects grouped by district and region.

Server. The application part of the OLAP system is the OLAP server. This component does all the work (depending on the system model) and stores all the information to which active access is provided. Server architectures follow different concepts; in particular, the main functional characteristic of an OLAP product is whether it uses an MDB or an RDB for data storage.

Client application. Data structured appropriately and stored in the MDB is available for analysis through a client application. The user gets the ability to access data remotely, formulate complex queries, generate reports, and obtain arbitrary subsets of the data. Obtaining a report comes down to selecting specific dimension values and constructing a section of the hypercube. The cross-section is determined by the selected dimension values; data along the other dimensions are summarized.

OLAP on the client and on the server. Multidimensional data analysis can be carried out using various tools, which can be divided into client and server OLAP tools.

OLAP client tools (for example, Pivot Tables in Excel 2000 from Microsoft or ProClarity from Knosys) are applications that calculate aggregate data and display them. At the same time, the aggregate data itself is contained in a cache inside the address space of such an OLAP tool.

If the source data is contained in a desktop DBMS, the calculation of aggregate data is performed by the OLAP tool itself. If the source of the initial data is a server DBMS, many of the client OLAP tools send SQL queries to the server and as a result receive aggregate data calculated on the server.

Typically, OLAP functionality is implemented in statistical data processing tools and in some spreadsheets.

Many development tools contain libraries of classes or components that allow you to create applications that implement simple OLAP functionality (such as the Decision Cube components in Borland Delphi and Borland C++ Builder). In addition, many companies offer ActiveX controls and other libraries that implement similar functionality.

Client OLAP tools are used, as a rule, with a small number of dimensions (usually no more than six) and a small variety of values for them, since the resulting aggregate data must fit in the tool's address space, and their number grows exponentially with the number of dimensions.

Many client OLAP tools allow the contents of the cache of aggregate data to be saved as a file, so that it need not be recalculated. This capability is often used, however, to hand aggregate data over to other organizations or to publish it.

The idea of saving a cache of aggregate data in a file was developed further in server OLAP tools (for example, Oracle Express Server or Microsoft OLAP Services), in which the storage and updating of aggregate data, as well as maintenance of the store containing it, are carried out by a separate application or process called the OLAP server. Client applications can query such multidimensional storage and receive the requested data in response. Some client applications can also create such stores or update them as the source data changes.

The advantages of server OLAP tools over client OLAP tools are similar to the advantages of server DBMSs over desktop ones: with server tools, the calculation and storage of aggregate data take place on the server, and the client application receives only the results of queries against them, which generally reduces network traffic, query execution time, and the resources consumed by the client application.

3.5 Technical aspects of multidimensional data storage

Multidimensionality in OLAP applications can be divided into three levels:

1. Multidimensional data representation – end-user tools that provide multidimensional visualization and manipulation of data; the multidimensional representation layer abstracts away from the physical structure of the data and treats the data as multidimensional.

2. Multidimensional processing – a means (a language) for formulating multidimensional queries (the traditional relational language SQL is unsuitable here) and a processor that can process and execute such queries.

3. Multidimensional storage – means of physically organizing the data that ensure efficient execution of multidimensional queries.

The first two levels are mandatory in all OLAP tools. The third level, although widespread, is not necessary, since data for a multidimensional representation can also be extracted from ordinary relational structures. The multidimensional query processor, in this case, translates multidimensional queries into SQL queries that are executed by the relational DBMS.

In any data warehouse - both regular and multidimensional - along with detailed data extracted from operational systems, aggregated indicators (total indicators), such as the sum of sales volumes by month, by product category, etc., are also stored. Aggregates are stored explicitly for the sole purpose of speeding up query execution. After all, on the one hand, a very large amount of data is usually accumulated in the warehouse, and on the other hand, analysts in most cases are interested in generalized rather than detailed indicators. And if millions of individual sales had to be added up each time to calculate the total sales for the year, the speed would most likely be unacceptable. Therefore, when loading data into a multidimensional database, all total indicators or part of them are calculated and saved.

However, there are disadvantages to using aggregated data. The main disadvantages are the increase in the volume of stored information (when adding new dimensions, the volume of data that makes up the cube grows exponentially) and the time it takes to load them. Moreover, the volume of information can increase tens and even hundreds of times. For example, in one of the published standard tests, a complete calculation of aggregates for 10 MB of raw data required 2.4 GB, i.e. the data grew 240 times!

The degree of increase in data volume when aggregates are calculated depends on the number of dimensions of the cube and the structure of those dimensions, that is, on the ratio of the number of “parents” and “children” at the different levels of each dimension. To solve the problem of storing aggregates, sophisticated schemes are used that achieve a significant increase in query performance while computing only some of all the possible aggregates.
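
As a rough illustration (an assumption, for a cube that stores an “all” total for every dimension in addition to its ordinary members), a cube over k dimensions with n1, n2, ..., nk members contains

N = (n1 + 1) * (n2 + 1) * ... * (nk + 1)

cells including aggregates, so every added dimension multiplies the cell count again; three dimensions of 100 members each already give 101^3, about 1.03 million cells, before any hierarchy levels are counted.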

Both raw and aggregate data can be stored either in relational or in multidimensional structures. Accordingly, three methods of storing multidimensional data are currently used:

MOLAP (Multidimensional OLAP) – source and aggregate data are stored in a multidimensional database. Storing data in multidimensional structures allows you to manipulate the data as a multidimensional array, due to which the speed of calculating aggregate values ​​is the same for any of the dimensions. However, in this case, the multidimensional database is redundant, since the multidimensional data entirely contains the original relational data.

These systems provide a full cycle of OLAP processing. They either include, in addition to the server component, their own integrated client interface, or use external spreadsheet programs to communicate with the user.

ROLAP (Relational OLAP) – the original data remains in the same relational database where it was originally located. Aggregate data is placed in service tables specially created for storing it in the same database (a sketch of such a service table is given after this overview).

HOLAP (Hybrid OLAP) – the original data remains in the same relational database where it was originally located, and the aggregate data is stored in a multidimensional database.

Some OLAP tools support storing data only in relational structures, some only in multidimensional ones. However, most modern server OLAP tools support all three data storage methods. The choice of storage method depends on the volume and structure of the source data, requirements for the speed of query execution and the frequency of updating OLAP cubes.
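
To illustrate the ROLAP variant, here is a minimal sketch of such a service (aggregate) table for the parts-supply example, assuming the FACT table introduced earlier and a DBMS that supports CREATE TABLE ... AS SELECT:

-- Pre-aggregated totals by supplier and month, stored in the same RDB;
-- a ROLAP engine can answer monthly queries from this table instead of
-- rescanning the detailed fact rows.
CREATE TABLE AGG_SUPPLIER_MONTH AS
SELECT SUPPLIER_CODE, YEAR_ID, MONTH_ID, SUM(VALUE) AS TOTAL_VALUE
FROM FACT
GROUP BY SUPPLIER_CODE, YEAR_ID, MONTH_ID;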

3.6 Data mining (Data Mining)

The term Data Mining denotes the process of finding correlations, trends, and relationships by means of various mathematical and statistical algorithms (clustering, regression and correlation analysis, and so on) for decision support systems. In the process, the accumulated information is automatically generalized into information that can be characterized as knowledge.

Modern Data Mining technology is based on the concept of patterns, which reflect regularities inherent in subsamples of the data and constitute so-called hidden knowledge.

The search for patterns is carried out by methods that make no a priori assumptions about these subsamples. An important feature of Data Mining is the non-standard and non-obvious nature of the patterns sought. In other words, Data Mining tools differ from statistical data processing tools and OLAP tools in that, instead of verifying relationships presupposed by users, they are able to find such relationships independently from the available data and to build hypotheses about their nature.

In general, the Data Mining process consists of three stages:

1) identifying patterns (free search);

2) using the identified patterns to predict unknown values (predictive modeling);

3) exception analysis, designed to detect and interpret anomalies in the patterns found.

Sometimes an intermediate stage of checking the reliability of the found patterns between their discovery and use (validation stage) is explicitly identified.

There are five standard types of patterns identified by Data Mining methods:

1. Association allows stable groups of objects to be identified between which implicit connections exist. The frequency of occurrence of an individual item or group of items, expressed as a percentage, is called its prevalence. A low prevalence (less than one thousandth of a percent) suggests that such an association is not significant. Associations are written as rules A => B, where A is the premise and B is the consequence. To determine the importance of each association rule obtained, a value called confidence (of A to B) must be computed. Confidence shows how often B appears when A appears. For example, if the confidence is 20%, then when product A is purchased, product B is also purchased in every fifth case. (A SQL sketch of this computation is given after this list.)

A typical example of the use of association is the analysis of purchase patterns. For example, when conducting a study in a supermarket, you can find that 65% of those who buy potato chips also buy Coca-Cola, and if there is a discount for such a set, they buy Coke in 85% of cases. Such results are valuable in shaping marketing strategies.

2. Sequence is a method of identifying associations over time. In this case, rules are defined that describe the sequential occurrence of certain groups of events. Such rules are needed for building scenarios; they can also be used, for example, to define a typical set of prior sales that may lead to subsequent sales of a particular product.

3. Classification is a generalization tool. It allows one to move from considering individual objects to generalized concepts that characterize certain collections of objects and are sufficient for recognizing objects belonging to those collections (classes). The essence of concept formation is finding the patterns characteristic of the classes. Objects are described by many different features (attributes). The problem of forming concepts from feature descriptions was formulated by M. M. Bongard. Its solution is based on two main procedures: training and verification. In the training procedure, a classification rule is built by processing a training set of objects. The verification (examination) procedure consists of using the resulting classification rule to recognize objects from a new (examination) sample. If the test results are satisfactory, the learning process ends; otherwise, the classification rule is refined through repeated training.

4. Clustering is the distribution of the information (records) in a database into groups (clusters) or segments, with the simultaneous definition of those groups. Unlike classification, the analysis here does not require prior assignment of classes.

5. Time series forecasting is a tool for determining trends in the attributes of the objects under consideration over time. Analysis of the behavior of a time series makes it possible to predict the values of the characteristics under study.
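
As promised above, here is a minimal SQL sketch of computing prevalence and confidence for a rule such as “chips => cola”, over a hypothetical PURCHASES(BASKET_ID, ITEM) table (the table and item names are assumptions for illustration):

-- Prevalence of the premise and confidence of the rule chips => cola
SELECT
  COUNT(DISTINCT a.BASKET_ID) * 100.0 /
    (SELECT COUNT(DISTINCT BASKET_ID) FROM PURCHASES) AS PREVALENCE_PCT,
  COUNT(DISTINCT b.BASKET_ID) * 100.0 /
    COUNT(DISTINCT a.BASKET_ID) AS CONFIDENCE_PCT
FROM PURCHASES a
LEFT JOIN PURCHASES b
  ON b.BASKET_ID = a.BASKET_ID AND b.ITEM = 'cola'
WHERE a.ITEM = 'chips';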

To solve such problems, various Data Mining methods and algorithms are used. Due to the fact that Data Mining has developed and is developing at the intersection of such disciplines as statistics, information theory, machine learning, and database theory, it is quite natural that most Data Mining algorithms and methods were developed based on various methods from these disciplines.

From the variety of existing data mining methods, the following can be distinguished:

– regression, variance, and correlation analysis (implemented in most modern statistical packages, in particular in the products of SAS Institute, StatSoft, and others);

– analysis methods for a specific subject area based on empirical models (often used, for example, in inexpensive financial analysis tools);

– neural network algorithms – a method of simulating processes and phenomena that can reproduce complex dependencies. The method relies on a simplified model of the biological brain: the initial parameters are treated as signals that are transformed according to the existing connections between “neurons”, and the response of the whole network to the input data is treated as the result of the analysis. The connections are created by so-called training of the network on a large sample containing both input data and correct answers. Neural networks are widely used for classification problems;

– fuzzy logic, used to process data with fuzzy truth values that can be represented by various linguistic variables. Fuzzy knowledge representation is widely used for classification and forecasting problems, for example in the XpertRule Miner system (Attar Software Ltd., UK) as well as in AIS, NeuFuz, and others;

– inductive inference, which makes it possible to obtain generalizations of the facts stored in the database. A specialist supplying hypotheses may take part in the inductive learning process; this is called supervised learning. The search for generalization rules can also be performed without a teacher, by automatic generation of hypotheses. Modern software tools, as a rule, combine both methods and use statistical methods to test the hypotheses. An example of a system using inductive inference is XpertRule Miner, developed by Attar Software Ltd. (UK);

– case-based reasoning (the “nearest neighbor” method, CBR), based on searching the database for situations whose descriptions are similar in a number of respects to the given situation. The principle of analogy suggests that the outcomes of similar situations will also be close to each other. The drawback of this approach is that it builds no models or rules generalizing previous experience. In addition, the reliability of the inferred results depends on the completeness of the situation descriptions, as in inductive inference. Examples of systems using CBR are KATE Tools (Acknosoft, France) and Pattern Recognition Workbench (Unica, USA);

– decision trees – a method of structuring a problem as a tree graph whose vertices correspond to production rules that allow the data to be classified or the consequences of decisions to be analyzed. The method gives a clear picture of the system of classification rules, provided there are not too many of them. Simple problems are solved by this method much faster than with neural networks, but for complex problems and for some kinds of data, decision trees may be unsuitable. The method also suffers from a significance problem: one consequence of hierarchical data splitting is the lack of a large number of training examples for many special cases, so the classification there cannot be considered reliable. Decision tree methods are implemented in many software tools, among them C5.0 (RuleQuest, Australia), Clementine (Integral Solutions, UK), SIPINA (University of Lyon, France), and IDIS (Information Discovery, USA);

– evolutionary programming – the search for and generation of an algorithm expressing the interdependencies in the data, starting from an initially specified algorithm that is modified during the search; the search for interdependencies is sometimes carried out among particular classes of functions (for example, polynomials);

– limited search algorithms, which compute combinations of simple logical events in subgroups of the data.

3.7 Integration of OLAP and Data Mining

Online analytical processing (OLAP) and data mining (Data Mining) are two components of the decision support process. Today, however, most OLAP systems focus only on providing access to multidimensional data, and most data mining tools work with one-dimensional views of the data. To increase the efficiency of data processing for decision support systems, these two kinds of analysis must be combined.

Currently, the compound term “OLAP Data Mining” (multidimensional mining) is emerging to refer to such a combination.

There are three main ways of forming “OLAP Data Mining”:

1. “Cubing then mining” – the ability to perform intelligent analysis should be provided over any result of a query against the multidimensional conceptual representation, that is, over any fragment of any projection of the hypercube of indicators.

2. “Mining then cubing” – like data extracted from a warehouse, mining results must be presented in hypercube form for subsequent multidimensional analysis.

3. “Cubing while mining” – this flexible method of integration makes it possible to automatically activate the same kind of intelligent processing over the result of each step of multidimensional analysis (a transition between levels of generalization, extraction of a new fragment of the hypercube, and so on).



Data mining and knowledge extraction. Management and analysis of large volumes of data (Big Data). Business intelligence (BI) systems.

Intelligent data analysis (IDA) is a general term for data analysis that makes active use of mathematical methods and algorithms (optimization methods, genetic algorithms, pattern recognition, statistical methods, Data Mining, and so on), together with methods for the visual presentation of the results.



In general, the IDA process consists of three stages:

1) identifying patterns (free search);

2) using identified patterns to predict unknown values ​​(forecasting);

3) analysis of exceptions to identify and interpret anomalies in the found patterns.

Sometimes there is an intermediate stage of checking the reliability of the found patterns (validation stage) between their discovery and use.

All IDA methods, based on the principle of working with source data, are divided into two groups:

Case-based reasoning techniques – the raw data can be stored in explicit detailed form and used directly for prediction and/or exception analysis. The disadvantage of this group of methods is the difficulty of using them on large volumes of data.

Methods that identify and use formalized patterns – these require extracting information from the primary data and converting it into formal constructs whose form depends on the specific method.

Data Mining (DM) is a technology for discovering in “raw” data previously unknown, non-trivial, practically useful, and interpretable knowledge needed for decision making in various areas of human activity. The algorithms used in Data Mining require a large amount of computation, which used to be a limiting factor for the wide practical application of these methods; however, the growth in performance of modern processors has alleviated this problem.

The Business Intelligence market consists of 5 sectors:

1. OLAP products;

2. Data mining tools;

3. Tools for building data warehouses and data marts;

4. Management information systems and applications;

5. End user tools for querying and reporting.

Currently, the leading corporate BI platforms include MicroStrategy, Business Objects, Cognos, Hyperion Solutions, Microsoft, Oracle, SAP, SAS Institute, and others (see Appendix B for a comparative analysis of some of the functionality of BI systems).






