Introduction to modern Data Mining. Data Mining Methods


Data Mining

Data Mining is a methodology and process for discovering previously unknown, non-trivial, practically useful and interpretable knowledge in large amounts of data accumulated in companies’ information systems, knowledge necessary for decision-making in various areas of human activity. Data Mining is one of the stages of the larger Knowledge Discovery in Databases methodology.

The knowledge discovered in the Data Mining process must be non-trivial and previously unknown. Non-triviality implies that such knowledge cannot be discovered through simple visual analysis. It must describe relationships between the properties of business objects, predict the values of some characteristics based on others, and so on. The knowledge found should also be applicable to new objects.

The practical usefulness of knowledge is due to the possibility of its use in the process of supporting management decision-making and improving the company’s activities.

Knowledge must be presented in a form understandable to users who do not have special mathematical training. For example, logical constructs “if, then” are easiest for humans to perceive. Moreover, such rules can be used in various DBMSs as SQL queries. In the case where the extracted knowledge is not transparent to the user, there must be post-processing methods to bring it into an interpretable form.

Data Mining is not one method but a combination of a large number of different knowledge discovery methods. All problems solved by Data Mining methods can be divided into six types: classification, regression, clustering, association, sequential patterns and deviation analysis.

Data Mining is multidisciplinary in nature, as it includes elements of numerical methods, mathematical statistics and probability theory, information theory and mathematical logic, artificial intelligence and machine learning.

Business analysis tasks are formulated in different ways, but the solution to most of them comes down to one or another Data Mining task or a combination of them. For example, risk assessment is a solution to a regression or classification problem, market segmentation is clustering, demand stimulation is association rules. In fact, Data Mining tasks are the elements from which you can “assemble” a solution to most real business problems.

To solve the above problems, various Data Mining methods and algorithms are used. Due to the fact that Data Mining has developed and is developing at the intersection of disciplines such as mathematical statistics, information theory, machine learning and databases, it is quite natural that most Data Mining algorithms and methods were developed based on various methods from these disciplines. For example, the k-means clustering algorithm was borrowed from statistics.

What is Data Mining

Classification of Data Mining tasks

Association rules search problem

Clustering problem

Features of Data Miner in Statistica 8

Analysis tools STATISTICA Data Miner

Example of working in Data Miner

Creating reports and summaries

Sorting information

Analysis of prices of residential plots

Analysis of predictors of survival

Conclusion


What is Data Mining

The modern computer term Data Mining translates as “information extraction” or “data mining”. Often, along with Data Mining, the terms Knowledge Discovery and Data Warehouse are used. The emergence of these terms, which are an integral part of Data Mining, is associated with a new round in the development of tools and methods for processing and storing data. So, the goal of Data Mining is to identify hidden rules and patterns in large (very large) volumes of data.

The fact is that the human mind is simply not adapted to perceiving huge amounts of heterogeneous information. The average person, with rare exceptions, is not able to grasp more than two or three relationships, even in small samples. But traditional statistics, which has long claimed to be the main tool for data analysis, also often fails when solving real-life problems. It operates with average characteristics of the sample, which are often fictitious values (the average solvency of a client, when, depending on the risk or loss function, you need to be able to predict the solvency and intentions of the client; the average signal intensity, while you are interested in the characteristic features and preconditions of signal peaks; and so on).

Therefore, methods of mathematical statistics turn out to be useful mainly for testing pre-formulated hypotheses, while formulating a hypothesis is sometimes a rather complex and time-consuming task. Modern Data Mining technologies process information in order to automatically search for patterns characteristic of any fragments of heterogeneous multidimensional data. Unlike online analytical processing (OLAP), Data Mining shifts the burden of formulating hypotheses and identifying unexpected patterns from humans to computers. Data Mining is not one method but a combination of a large number of different knowledge discovery methods. The choice of method often depends on the type of data available and what information you are trying to obtain. Here, for example, are some of these methods: association, classification, clustering, time series analysis and forecasting, neural networks, and so on.


The scope of Data Mining is not limited in any way: Data Mining is needed wherever there is data. The experience of many enterprises shows that the return on data mining can reach 1000%. For example, there are reports of an economic effect 10-70 times higher than the initial costs of 350 to 750 thousand dollars. Information is provided about a $20 million project that paid for itself in just 4 months. Another example is annual savings of $700 thousand through the implementation of Data Mining in a chain of supermarkets in the UK. Data Mining is of great value to managers and analysts in their daily activities. Business people have realized that with the help of Data Mining methods they can gain tangible competitive advantages.

Classification of Data Mining tasks

Data Mining methods allow you to solve many problems that an analyst faces. The main ones are: classification, regression, search for association rules and clustering. Below is a brief description of the main tasks of data analysis.

1) The classification task comes down to determining the class of an object based on its characteristics. It should be noted that in this problem the set of classes to which an object can be classified is known in advance.

2) The regression problem, like the classification problem, makes it possible to determine the value of some parameter of an object based on its known characteristics. Unlike the classification problem, the set of values of the parameter is not a finite set of classes but the set of real numbers.

3) Association task. When searching for association rules, the goal is to find frequent dependencies (or associations) between objects or events. The found dependencies are presented in the form of rules and can be used both to better understand the nature of the analyzed data and to predict the occurrence of events.

4) The task of clustering is to search for independent groups (clusters) and their characteristics in the entire set of analyzed data. Solving this problem helps you understand the data better. In addition, grouping homogeneous objects makes it possible to reduce their number and, therefore, facilitate analysis.

5) Sequential patterns - establishing patterns between events related in time, i.e. detection of the dependence that if event X occurs, then after a given time event Y will occur.

6) Analysis of deviations - identifying the most uncharacteristic patterns.

The listed tasks are divided into descriptive and predictive according to their purpose.

Descriptive tasks focus on improving understanding of the data being analyzed. The key point in such models is the ease and transparency of the results for human perception. It is possible that the patterns discovered will be a specific feature of the particular data being studied and will not be found anywhere else, but it can still be useful and therefore should be known. This type of task includes clustering and searching for association rules.

Solving predictive problems is divided into two stages. At the first stage, a model is built based on a data set with known results. In the second stage, it is used to predict results based on new data sets. In this case, it is naturally required that the constructed models work as accurately as possible. This type of task includes classification and regression problems. This can also include the problem of searching for association rules, if the results of its solution can be used to predict the occurrence of certain events.

Based on the methods of solving problems, they are divided into supervised learning (learning with a teacher) and unsupervised learning (learning without a teacher). This name comes from the term Machine Learning, often used in English literature and denoting all Data Mining technologies.

In the case of supervised learning, the problem of data analysis is solved in several stages. First, using some Data Mining algorithm, a model of the analyzed data – a classifier – is built. The classifier is then trained. In other words, the quality of its work is checked and, if it is unsatisfactory, additional training of the classifier occurs. This continues until the required level of quality is achieved or it becomes clear that the selected algorithm does not work correctly with the data, or the data itself does not have a structure that can be identified. This type of task includes classification and regression problems.

Unsupervised learning combines tasks that identify descriptive patterns, such as patterns in purchases made by customers at a large store. Obviously, if these patterns exist, then the model should represent them and it is inappropriate to talk about its training. Hence the name - unsupervised learning. The advantage of such problems is the possibility of solving them without any prior knowledge about the analyzed data. These include clustering and searching for association rules.

Classification and Regression Problem

When analyzing, it is often necessary to determine which of the known classes the objects under study belong to, i.e., to classify them. For example, when a person approaches a bank for a loan, a bank employee must decide whether the potential client is creditworthy or not. Obviously, such a decision is made on the basis of data about the object under study (in this case, a person): his place of work, salary, age, family composition, etc. As a result of analyzing this information, the bank employee must assign the person to one of two known classes: "creditworthy" and "uncreditworthy".

Another example of a classification task is email filtering. In this case, the filtering program must classify the incoming message as spam (unsolicited email) or as a letter. This decision is made based on the frequency of occurrence of certain words in the message (for example, the recipient’s name, impersonal address, words and phrases: purchase, “earn,” “advantageous offer,” etc.).
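For illustration, the word-frequency idea can be sketched in a few lines of Python; the list of "suspicious" words and the threshold below are invented for the example and are not taken from any real filter.

```python
# A minimal illustration (not a production spam filter): classify a message
# as spam if the share of "suspicious" words exceeds a chosen threshold.
# The word list and the threshold are hypothetical values for illustration.

SUSPICIOUS_WORDS = {"purchase", "earn", "advantageous", "offer", "free"}

def classify_message(text: str, threshold: float = 0.1) -> str:
    words = text.lower().split()
    if not words:
        return "letter"
    share = sum(w.strip(".,!?") in SUSPICIOUS_WORDS for w in words) / len(words)
    return "spam" if share >= threshold else "letter"

print(classify_message("Earn money fast - advantageous offer, purchase now!"))  # spam
print(classify_message("Hi Anna, see you at the meeting tomorrow."))            # letter
```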

OLAP systems provide the analyst with a means of testing hypotheses when analyzing data; that is, the main task of the analyst is to generate hypotheses, which he solves based on his knowledge and experience. However, knowledge is contained not only in the analyst's head but also in the accumulated data being analyzed. Such knowledge is contained in a huge amount of information that a person cannot examine on his own. Because of this, there is a risk of missing hypotheses that could provide significant benefits.

To detect such "hidden" knowledge, special methods of automatic analysis are used, with the help of which knowledge has to be practically extracted from the "debris" of information. The term "data mining" (knowledge mining) has been assigned to this area.

There are many definitions of Data Mining that complement each other. Here are some of them.

Data Mining is the process of discovering non-trivial and practically useful patterns in databases. (BaseGroup)

Data Mining is the process of extracting, exploring and modeling large volumes of data to discover previously unknown patterns in order to achieve business advantages. (SAS Institute)

Data Mining is a process that aims to discover new significant correlations, patterns and trends by sifting through large amounts of stored data using pattern recognition techniques plus the application of statistical and mathematical techniques. (Gartner Group)

Data Mining is the research and discovery by a "machine" (algorithms, artificial intelligence tools) of hidden knowledge in raw data that is previously unknown, non-trivial, practically useful and accessible to human interpretation. (A. Bargesyan, "Data Analysis Technologies")

Data Mining is the process of discovering useful knowledge about business. (N.M. Abdikeev, "KBA")

Properties of discovered knowledge

Let's consider the properties of the discovered knowledge.

  • The knowledge must be new, previously unknown. The effort spent on discovering knowledge that is already known to the user does not pay off. Therefore, it is new, previously unknown knowledge that is valuable.
  • Knowledge must be non-trivial. The results of the analysis should reflect non-obvious, unexpected patterns in the data, which constitute the so-called hidden knowledge. Results that could be obtained by simpler methods (for example, visual inspection) do not justify the use of powerful Data Mining methods.
  • Knowledge must be practically useful. The knowledge found must be applicable, including on new data, with a sufficiently high degree of reliability. Usefulness lies in the fact that this knowledge can bring certain benefits when applied.
  • Knowledge must be accessible to human understanding. The patterns found must be logically explainable, otherwise there is a possibility that they are random. In addition, the discovered knowledge must be presented in a form that is understandable to humans.

In Data Mining, models are used to represent the acquired knowledge. The types of models depend on the methods used to create them. The most common are: rules, decision trees, clusters and mathematical functions.

Data Mining Tasks

Let us recall that Data Mining technology is based on the concept of patterns (templates). As a result of discovering these patterns, which are hidden from the naked eye, Data Mining problems are solved. Different types of patterns that can be expressed in a human-readable form correspond to specific Data Mining tasks.

There is no consensus on which tasks should be classified as Data Mining. Most authoritative sources list the following: classification, clustering, prediction, association, visualization, deviation analysis and detection, estimation, link analysis, and summarization.

The purpose of the description that follows is to give a general idea of Data Mining problems, compare some of them, and also present some methods by which these problems are solved. The most common Data Mining tasks are classification, clustering, association, forecasting and visualization. Thus, tasks are divided according to the type of information produced; this is the most general classification of Data Mining tasks.

Classification

This is the task of dividing a set of objects or observations into a priori specified groups called classes, within each of which the objects are assumed to be similar to one another, having approximately the same properties and characteristics. The solution is obtained by analyzing the values of attributes (features).

Classification is one of the most important Data Mining tasks. It is used in marketing when assessing the creditworthiness of borrowers, determining customer loyalty, in pattern recognition, medical diagnostics and many other applications. If the analyst knows the properties of the objects of each class, then when a new observation is assigned to a certain class, these properties automatically extend to it.

If the number of classes is limited to two, we have binary classification, to which many more complex problems can be reduced. For example, instead of defining such degrees of credit risk as "High", "Medium" or "Low", you can use only two: "Issue" or "Refuse".

Data Mining uses many different classification models: neural networks, decision trees, support vector machines, the k-nearest neighbors method, covering algorithms, etc. Their construction uses supervised learning, when the output variable (class label) is specified for each observation. Formally, classification is based on partitioning the feature space into regions, within each of which the multidimensional vectors are considered identical. In other words, if an object falls into a region of space associated with a certain class, it belongs to that class.
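As an illustration of supervised classification, here is a minimal sketch using a scikit-learn decision tree (assuming scikit-learn is installed); the tiny "credit" dataset and its feature values are invented for the example.

```python
# A minimal sketch of supervised classification (assumes scikit-learn).
# The tiny "credit" dataset below is made up purely for illustration.
from sklearn.tree import DecisionTreeClassifier

# Features: [monthly income (conventional units), age]
X_train = [[20, 25], [55, 40], [70, 35], [15, 50], [60, 28], [30, 45]]
y_train = ["refuse", "issue", "issue", "refuse", "issue", "refuse"]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# A new applicant falls into one of the regions of the feature space
# learned by the tree and receives the class label of that region.
print(model.predict([[65, 33]]))   # e.g. ['issue']
```

A real application would, of course, train on a far larger labeled sample and validate the model on held-out data.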

Clustering

Short description. Clustering is a logical continuation of the idea of classification. It is a more complex task; the peculiarity of clustering is that the classes of objects are not predefined. The result of clustering is the division of objects into groups.

An example of a method for solving a clustering problem: “unsupervised” training of a special type of neural networks - self-organizing Kohonen maps.

Associations

Short description. When solving the problem of searching for association rules, patterns are found between related events in a data set.

The difference between association and the two previous Data Mining tasks: the search for patterns is carried out not on the basis of the properties of the analyzed object, but between several events that occur simultaneously. The most well-known algorithm for solving the problem of finding association rules is the Apriori algorithm.
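The quantities that Apriori searches for, support and confidence, can be illustrated with a brute-force sketch in Python over a made-up set of transactions (this is not the Apriori algorithm itself, only the measures it relies on).

```python
# A toy illustration of the quantities behind association rules
# (support and confidence), computed by brute force rather than the
# full Apriori algorithm. The transactions are made up.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"bread", "milk", "chips"},
]

def support(itemset):
    # Share of transactions that contain the whole itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule X -> Y: confidence = support(X and Y) / support(X)
for x, y in combinations(["bread", "milk", "butter"], 2):
    conf = support({x, y}) / support({x})
    print(f"{x} -> {y}: support={support({x, y}):.2f}, confidence={conf:.2f}")
```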

Sequence or sequential association

Short description. Sequence allows you to find temporal patterns between transactions. The sequence task is similar to association, but its goal is to establish patterns not between simultaneously occurring events, but between events related in time (i.e., occurring at some specific interval in time). In other words, a sequence is determined by a high probability of a chain of events related in time. In fact, an association is a special case of a sequence with a time lag of zero. This Data Mining task is also called the sequential pattern finding task.

Sequence rule: after event X, event Y will occur after a certain time.

Example. After purchasing an apartment, residents in 60% of cases purchase a refrigerator within two weeks, and within two months in 50% of cases they purchase a TV. The solution to this problem is widely used in marketing and management, for example, in Customer Lifecycle Management.

Regression, forecasting (Forecasting)

Short description. As a result of solving the forecasting problem, missing or future values ​​of target numerical indicators are estimated based on the characteristics of historical data.

To solve such problems, methods of mathematical statistics, neural networks, etc. are widely used.
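A minimal regression sketch, assuming scikit-learn is available and using invented numbers, might look like this:

```python
# A minimal regression sketch (assumes scikit-learn; hypothetical data):
# estimate a continuous target (e.g. next-month sales) from one input.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])      # e.g. advertising spend
y = np.array([110, 135, 162, 180, 210])      # e.g. observed sales

model = LinearRegression().fit(X, y)
print(model.predict(np.array([[6]])))        # forecast for a new input value
```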

Additional tasks

Deviation Detection: analysis of deviations or outliers

Short description. The goal of solving this problem is to detect and analyze data that is most different from the general set of data, identifying so-called uncharacteristic patterns.

Estimation

The estimation task comes down to predicting continuous values ​​of a feature.

Link Analysis

The task of finding dependencies in a data set.

Visualization (GraphMining)

As a result of visualization, a graphic image of the analyzed data is created. To solve the visualization problem, graphical methods are used to show the presence of patterns in the data.

An example of visualization techniques is presenting data in two or three dimensions.

Summarization

A task whose goal is to describe specific groups of objects from the analyzed data set.

Quite close to the above classification is the division of Data Mining tasks into the following: research and discovery, forecasting and classification, explanation and description.

Automatic exploration and discovery (free search)

Example task: discovering new market segments.

To solve this class of problems, cluster analysis methods are used.

Prediction and classification

Example problem: predicting sales growth based on current values.

Methods: regression, neural networks, genetic algorithms, decision trees.

Classification and forecasting tasks constitute a group of so-called inductive modeling tasks, the result of which is the study of the analyzed object or system. In the process of solving these problems, a general model or hypothesis is developed based on a set of data.

Explanation and Description

Example problem: characterizing customers based on demographics and purchasing history.

Methods: decision trees, rule systems, association rules, connection analysis.

An example of such a rule: if the client's income is more than 50 conventional units and his age is more than 30 years, then the client belongs to the first class.
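Such a rule translates directly into code; here is a sketch with the hypothetical threshold values from the example above.

```python
# The example rule from the text expressed as a simple predicate
# (income in conventional units, age in years).
def client_class(income: float, age: int) -> str:
    if income > 50 and age > 30:
        return "first"
    return "other"

print(client_class(income=60, age=35))   # first
print(client_class(income=40, age=35))   # other
```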

Comparison of clustering and classification

Characteristic | Classification | Clustering
Controllability of training | Controlled | Uncontrolled
Strategies | Supervised learning | Unsupervised learning
Availability of a class label | Each observation of the training set is accompanied by a label indicating the class to which it belongs | The class labels of the training set are unknown
Basis for classification | New data is classified on the basis of the training set | A set of data is given in order to establish the existence of classes or clusters of data

Areas of application of Data Mining

It should be noted that today Data Mining technology is most widely used in solving business problems. Perhaps the reason is that it is in this direction that the return on the use of Data Mining tools can reach, according to some sources, 1000%, and the costs of its implementation can quickly pay off.

We will look at four main areas of application of Data Mining technology in detail: science, business, government research and the Web.

Application of Data Mining for solving business tasks. Main areas: banking, finance, insurance, CRM, manufacturing, telecommunications, e-commerce, marketing, the stock market and others. Typical questions and tasks include:

    Should I issue a loan to the client?

    Market segmentation

    Attraction of new clients

    Credit card fraud

Application of Data Mining for solving problems at the state level. Main directions: searching for tax evaders; means in the fight against terrorism.

Application of Data Mining for scientific research. Main areas: medicine, biology, molecular genetics and genetic engineering, bioinformatics, astronomy, applied chemistry, research related to drug addiction, and others.

Using Data Mining to solve Web tasks. Main areas: search engines, counters and others.

E-commerce

In the field of e-commerce, Data Mining is used, in particular, to classify site visitors and customers. This classification allows companies to identify specific customer groups and conduct marketing policies in accordance with the identified interests and needs of customers. Data Mining technology for e-commerce is closely related to Web Mining technology.

The main tasks of Data Mining in industrial production:

· comprehensive system analysis of production situations;

· short-term and long-term forecasting of the development of production situations;

· development of options for optimization solutions;

· forecasting the quality of a product depending on certain parameters of the technological process;

· detection of hidden trends and patterns in the development of production processes;

· forecasting patterns of development of production processes;

· detection of hidden influence factors;

· detection and identification of previously unknown relationships between production parameters and influencing factors;

· analysis of the interaction environment of production processes and forecasting of changes in its characteristics;

· visualization of analysis results, preparation of preliminary reports and draft feasible solutions with assessments of the reliability and effectiveness of possible implementations.

Marketing

In the field of marketing, Data Mining is widely used.

The basic marketing questions are: "What is sold?", "How is it sold?", "Who is the consumer?"

The lecture on classification and clustering problems describes in detail the use of cluster analysis to solve marketing problems, such as consumer segmentation.

Another common set of methods for solving marketing problems is methods and algorithms for searching for association rules.

The search for temporal patterns is also successfully used here.

Retail

In retail trade, as in marketing, the following are used:

· algorithms for searching for association rules (to determine frequently occurring sets of goods that buyers purchase at the same time). Identifying such rules helps place goods on store shelves, develop strategies for purchasing goods and their placement in warehouses, etc.;

· time sequences, for example, to determine the required volumes of goods in the warehouse;

· classification and clustering methods to identify groups or categories of clients, knowledge of which contributes to the successful promotion of goods.

Stock market

Here is a list of stock market problems that can be solved using Data Mining technology:

· forecasting future values of financial instruments and their indicators based on past values;

· trend forecasting (the future direction of movement: growth, decline, flat) for a financial instrument and its strength (strong, moderately strong, etc.);

· identification of the cluster structure of the market, industry or sector according to a certain set of characteristics;

· dynamic portfolio management;

· volatility forecasting;

· risk assessment;

· predicting the onset of a crisis and forecasting its development;

· selection of assets, etc.

In addition to the areas of activity described above, Data Mining technology can be used in a wide variety of business areas where there is a need for data analysis and a certain amount of retrospective information has been accumulated.

Application of Data Mining in CRM

One of the most promising areas for using Data Mining is the use of this technology in analytical CRM.

CRM (Customer Relationship Management) is customer relationship management.

When these technologies are used together, the extraction of knowledge is combined with the “extraction of money” from customer data.

An important aspect of the work of the marketing and sales departments is compiling a holistic view of clients: information about their characteristics and the structure of the client base. CRM uses so-called client profiling, which provides a complete view of all the necessary information about clients.

Customer profiling includes the following components: customer segmentation, customer profitability, customer retention, customer response analysis. Each of these components can be examined using Data Mining, and analyzing them together as profiling components can ultimately provide knowledge that is impossible to obtain from each individual characteristic.

Web Mining

Web Mining can be translated as "data mining on the Web". Web Intelligence is ready to "open a new chapter" in the rapid development of electronic business. The ability to determine the interests and preferences of each visitor by observing his behavior is a serious and critical competitive advantage in the e-commerce market.

Web Mining systems can answer many questions, for example, which of the visitors is a potential client of the Web store, which group of Web store customers brings the most income, what are the interests of a particular visitor or group of visitors.

Methods

Classification of methods

There are two groups of methods:

  • statistical methods based on the use of average accumulated experience, which is reflected in retrospective data;
  • cybernetic methods, including many heterogeneous mathematical approaches.

The disadvantage of this classification is that both statistical and cybernetic algorithms rely in one way or another on a comparison of statistical experience with the results of monitoring the current situation.

The advantage of this classification is its ease of interpretation - it is used to describe the mathematical means of a modern approach to extracting knowledge from arrays of initial observations (operative and retrospective), i.e. in Data Mining tasks.

Let's take a closer look at the groups presented above.

Statistical Data Mining methods

These methods comprise four interrelated sections:

  • preliminary analysis of the nature of statistical data (testing hypotheses of stationarity, normality, independence, homogeneity, assessing the type of distribution function, its parameters, etc.);
  • identifying connections and patterns (linear and nonlinear regression analysis, correlation analysis, etc.);
  • multivariate statistical analysis (linear and nonlinear discriminant analysis, cluster analysis, component analysis, factor analysis, etc.);
  • dynamic models and forecast based on time series.

The arsenal of statistical methods for Data Mining is classified into four groups of methods:

  1. Descriptive analysis and description of source data.
  2. Relationship analysis (correlation and regression analysis, factor analysis, analysis of variance).
  3. Multivariate statistical analysis (component analysis, discriminant analysis, multivariate regression analysis, canonical correlations, etc.).
  4. Time series analysis (dynamic models and forecasting).
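The first two groups can be illustrated with a short pandas sketch (assuming pandas is installed; the data frame below is invented): descriptive statistics of the source data and a pairwise correlation matrix.

```python
# A small sketch of descriptive analysis and relationship analysis
# with pandas (assumes pandas is installed; the data is made up).
import pandas as pd

df = pd.DataFrame({
    "income": [20, 55, 70, 15, 60, 30],
    "age":    [25, 40, 35, 50, 28, 45],
    "spend":  [5, 18, 25, 4, 21, 9],
})

print(df.describe())        # group 1: descriptive statistics of the source data
print(df.corr())            # group 2: pairwise (Pearson) correlation matrix
```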

Cybernetic Data Mining Methods

The second direction of Data Mining is a variety of approaches united by the idea of ​​computer mathematics and the use of artificial intelligence theory.

This group includes the following methods:

  • artificial neural networks (recognition, clustering, forecast);
  • evolutionary programming (including algorithms for the method of group accounting of arguments);
  • genetic algorithms (optimization);
  • associative memory (search for analogues, prototypes);
  • fuzzy logic;
  • decision trees;
  • expert knowledge processing systems.

Cluster analysis

The purpose of clustering is to search for existing structures.

Clustering is a descriptive procedure, it does not make any statistical inferences, but it does provide an opportunity to conduct exploratory analysis and study the “structure of the data.”

The very concept of "cluster" is defined ambiguously: each study has its own "clusters". The English word cluster means "bunch" or "clump". A cluster can be characterized as a group of objects that have common properties.

A cluster can be characterized by two properties:

  • internal homogeneity;
  • external isolation.

A question that analysts ask when solving many problems is how to organize data into visual structures, i.e., how to develop taxonomies.

Clustering was initially most widely used in sciences such as biology, anthropology, and psychology. Clustering has been little used for solving economic problems for a long time due to the specific nature of economic data and phenomena.

Clusters can be disjoint, or exclusive (non-overlapping, exclusive), and overlapping.

It should be noted that as a result of applying various methods of cluster analysis, clusters of various shapes can be obtained. For example, “chain” type clusters are possible, when the clusters are represented by long “chains”, elongated clusters, etc., and some methods can create clusters of arbitrary shape.

Various methods may strive to create clusters of specific sizes (e.g., small or large) or assume that there are clusters of different sizes in the data set. Some cluster analysis methods are particularly sensitive to noise or outliers, others less so. As a result of using different clustering methods, different results may be obtained; this is normal and is a feature of the operation of a particular algorithm. These features should be taken into account when choosing a clustering method.

Let us give a brief description of approaches to clustering.

Algorithms based on data partitioning (partitioning algorithms), including iterative ones:

  • division of objects into k clusters;
  • iterative redistribution of objects to improve the clustering.

Hierarchical algorithms:

  • agglomerative: each object is initially a separate cluster; clusters, merging with one another, form ever larger clusters, and so on.

Density-based methods:

  • based on the ability to connect objects;
  • ignore noise and find clusters of arbitrary shape.

Grid-based methods:

  • quantization of objects into grid structures.

Model-based methods:

  • use of a model to find the clusters that best fit the data.

Cluster analysis methods. Iterative methods.

With a large number of observations, hierarchical methods of cluster analysis are not suitable. In such cases, non-hierarchical methods based on division are used, which are iterative methods of fragmenting the original population. During the division process, new clusters are formed until the stopping rule is satisfied.

Such non-hierarchical clustering consists of dividing a data set into a certain number of individual clusters. There are two approaches. The first is to determine the boundaries of clusters as the most dense areas in the multidimensional space of the source data, i.e. defining a cluster where there is a large "condensation of points". The second approach is to minimize the measure of difference between objects.

k-means algorithm

The most common non-hierarchical method is the k-means algorithm, also called fast cluster analysis. A complete description of the algorithm can be found in Hartigan and Wong (1978). Unlike hierarchical methods, which do not require preliminary assumptions regarding the number of clusters, to be able to use this method, it is necessary to have a hypothesis about the most likely number of clusters.

The k-means algorithm constructs k clusters located at the greatest possible distances from each other. The main requirement for the problems that the k-means algorithm solves is the presence of assumptions (hypotheses) regarding the number of clusters, which should be as different as possible. The choice of k may be based on previous research, theoretical considerations, or intuition.

The general idea of the algorithm: observations are assigned to a given fixed number k of clusters in such a way that the cluster means (over all variables) differ from one another as much as possible.

Description of the algorithm

1. Initial distribution of objects into clusters.

  • The number k is selected and k initial points are chosen; at the first step these points are considered the "centers" of the clusters.
  • Each cluster corresponds to one center.

The selection of initial centroids can be done as follows:

  • selecting k-observations to maximize initial distance;
  • random selection of k-observations;
  • selection of the first k-observations.

As a result, each object is assigned to a specific cluster.

2. Iterative process.

The cluster centers are recalculated as the coordinate-wise means of the clusters, and the objects are then redistributed among the clusters again.

The process of calculating centers and redistributing objects continues until one of the conditions is met:

  • cluster centers have stabilized, i.e. all observations belong to the cluster to which they belonged before the current iteration;
  • the number of iterations is equal to the maximum number of iterations.

The figure shows an example of the k-means algorithm for k equal to two.

An example of the k-means algorithm (k=2)

Choosing the number of clusters is a complex issue. If there are no assumptions regarding this number, it is recommended to create 2 clusters, then 3, 4, 5, etc., comparing the results obtained.
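A sketch of this advice, assuming scikit-learn is available and using synthetic two-dimensional data: run k-means for k = 2 ... 5 and compare the within-cluster sum of squares (inertia) and the silhouette score.

```python
# Try k = 2..5 and compare the results, here via the within-cluster sum of
# squares (inertia_) and the silhouette score (assumes scikit-learn;
# the data is synthetic).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```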

Checking the quality of clustering

After receiving the results of the k-means cluster analysis, you should check the correctness of the clustering (i.e., assess how different the clusters are from each other).

To do this, average values ​​for each cluster are calculated. Good clustering should produce very different means for all measurements, or at least most of them.
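This check can be sketched as follows (assuming pandas and scikit-learn; the data is synthetic): group the observations by cluster label and compare the per-cluster means of each variable.

```python
# A sketch of the quality check described above: compute the mean of every
# variable within each cluster and compare the clusters (synthetic data).
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
data = pd.DataFrame(
    np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))]),
    columns=["feature_1", "feature_2"],
)
data["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)

# Well-separated clusters should have clearly different means.
print(data.groupby("cluster").mean())
```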

Advantages of the k-means algorithm:

  • ease of use;
  • speed of use;
  • understandability and transparency of the algorithm.

Disadvantages of the k-means algorithm:

  • the algorithm is too sensitive to outliers, which can distort the averages; a possible solution to this problem is to use a modification of the algorithm, the k-median algorithm;
  • the algorithm can be slow on large databases; a possible solution to this problem is to use data sampling.

Bayesian networks

In probability theory, the concept of information dependence is modeled through conditional dependence (or strictly: the absence of conditional independence), which describes how our confidence in the outcome of some event changes when we gain new knowledge about facts, provided that we already knew some set of other facts.

It is convenient and intuitive to represent dependencies between elements through a directed path connecting these elements in a graph. If the relationship between elements x and y is not direct but is mediated by a third element z, then it is logical to expect that there will be an element z on the path between x and y. Such intermediary nodes "cut off" the dependence between x and y, i.e., they model a situation of conditional independence between them given known values of the direct influencing factors. Such modeling languages are Bayesian networks, which are used to describe conditional dependencies between the concepts of a certain subject area.

Bayesian networks are graphical structures for representing probabilistic relationships between a large number of variables and for performing probabilistic inference based on those variables. "Naive" (Bayesian) classification is a fairly transparent and understandable classification method. It is called "naive" because it is based on the assumption of the mutual independence of the features.

Classification properties:

1. Using all variables and determining all dependencies between them.

2. Having two assumptions about the variables:

  • all variables are equally important;
  • all variables are statistically independent, i.e. the value of one variable says nothing about the value of another.

There are two main scenarios for using Bayesian networks:

1. Descriptive analysis. The subject area is displayed as a graph, the nodes of which represent concepts, and the directed arcs, displayed by arrows, illustrate the direct dependencies between these concepts. The relationship between x and y means: knowing the value of x helps you make a better guess about the value of y. The absence of a direct connection between concepts models the conditional independence between them with known values ​​of a certain set of “separating” concepts. For example, a child's shoe size is obviously related to a child's reading ability through age. Thus, a larger shoe size gives greater confidence that the child is already reading, but if we already know the age, then knowing the shoe size will no longer give us additional information about the child’s ability to read.


As another, opposite example, consider such initially unrelated factors as smoking and colds. But if we know a symptom, for example, that a person suffers from a cough in the morning, then knowing that the person does not smoke increases our confidence that the person has a cold.

2. Classification and forecasting. A Bayesian network, by allowing the conditional independence of a number of concepts, makes it possible to reduce the number of parameters of the joint distribution, so that they can be reliably estimated on the available volumes of data. Thus, with 10 variables, each of which can take 10 values, the number of parameters of the joint distribution is 10^10 - 1 (about 10 billion). If we assume that only 2 of these variables depend on each other, then the number of parameters becomes 8 * (10 - 1) + (10 * 10 - 1) = 171. Having a joint distribution model that is realistic in terms of computational resources, we can predict the unknown value of a concept as, for example, the most probable value of this concept given the known values of other concepts.
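The parameter counts in this example are easy to verify directly:

```python
# Reproducing the parameter counts from the text: a full joint distribution
# over 10 variables with 10 values each versus the sparser factorisation
# in which only one pair of variables is dependent.
full_joint = 10 ** 10 - 1                      # 9 999 999 999 parameters
factored = 8 * (10 - 1) + (10 * 10 - 1)        # 8 independent vars + 1 dependent pair
print(full_joint, factored)                    # -> 9999999999 171
```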

The following advantages of Bayesian networks as a Data Mining method are noted:

The model defines the dependencies between all the variables; this makes it easy to handle situations in which the values of some variables are unknown;

Bayesian networks are quite easy to interpret and, at the predictive modeling stage, make it easy to conduct what-if scenario analysis;

The Bayesian approach makes it possible to naturally combine patterns inferred from the data with, for example, expert knowledge obtained in explicit form;

Using Bayesian networks avoids the problem of overfitting, that is, excessive complication of the model, which is a weakness of many methods (for example, decision trees and neural networks).

The Naive Bayes approach has the following disadvantages:

It is correct to multiply conditional probabilities only when all the input variables are truly statistically independent; although this method often shows quite good results when the condition of statistical independence is not met, theoretically such situations should be handled by more complex methods based on training Bayesian networks;

Direct processing of continuous variables is not possible: they require conversion to an interval scale so that the attributes are discrete; however, such transformations can sometimes lead to the loss of significant patterns;

The classification result in the Naive Bayes approach is influenced only by the individual values of the input variables; the combined influence of pairs or triples of values of different attributes is not taken into account. Taking it into account could improve the quality of the classification model in terms of its predictive accuracy, but it would increase the number of variants to be tested.
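To make the multiplication of conditional probabilities concrete, here is a minimal naive Bayes sketch in pure Python for discrete attributes; the toy "credit" data is invented, and no smoothing of zero probabilities is applied.

```python
# A minimal naive Bayes sketch for discrete attributes: the class prior is
# multiplied by the conditional probability of each attribute value,
# assuming the attributes are independent. Toy data, no smoothing.
from collections import Counter, defaultdict

# (income level, age group) -> creditworthiness class
train = [
    (("high", "middle"), "issue"),
    (("high", "young"),  "issue"),
    (("low",  "middle"), "refuse"),
    (("low",  "young"),  "refuse"),
    (("high", "old"),    "issue"),
    (("low",  "old"),    "refuse"),
]

class_counts = Counter(label for _, label in train)
value_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
for features, label in train:
    for i, value in enumerate(features):
        value_counts[(i, label)][value] += 1

def predict(features):
    scores = {}
    for label, n in class_counts.items():
        p = n / len(train)                               # prior P(class)
        for i, value in enumerate(features):
            p *= value_counts[(i, label)][value] / n     # P(value | class)
        scores[label] = p
    return max(scores, key=scores.get)

print(predict(("high", "old")))    # -> issue
print(predict(("low", "middle")))  # -> refuse
```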

Artificial neural networks

Artificial neural networks (hereinafter, neural networks) can be synchronous and asynchronous. In synchronous neural networks, at each moment in time the state of only one neuron changes. In asynchronous networks, the state changes in a whole group of neurons at once, as a rule, in an entire layer. Two basic architectures can be distinguished: layered and mesh (fully connected) networks.

The key concept in layered networks is the concept of a layer. A layer is one or more neurons whose inputs receive the same common signal. Layered neural networks are neural networks in which the neurons are divided into separate groups (layers) so that information is processed layer by layer. In layered networks, the neurons of the i-th layer receive input signals, transform them, and transmit them through branching points to the neurons of the (i+1)-th layer, and so on up to the k-th layer, which produces the output signals for the interpreter and the user. The number of neurons in each layer is not related to the number of neurons in other layers and can be arbitrary. Within one layer, data is processed in parallel, and across the entire network processing is carried out sequentially, from layer to layer. Layered neural networks include, for example, multilayer perceptrons, radial basis function networks, the cognitron and neocognitron, and associative memory networks. However, the signal is not always sent to all neurons in a layer. In a cognitron, for example, each neuron of the current layer receives signals only from the neurons close to it in the previous layer.

Layered networks, in turn, can be single-layer or multi-layer.

A single-layer network is a network consisting of one layer.

A multilayer network is a network with several layers.

In a multilayer network, the first layer is called the input layer, the subsequent layers are called internal or hidden, and the last layer is the output layer. Thus, the intermediate layers are all the layers of a multilayer neural network except the input and output ones. The input layer of the network is connected with the input data, and the output layer with the output. Thus, neurons can be input, output and hidden. The input layer is made up of input neurons, which receive data and distribute it to the inputs of the neurons in the hidden layer of the network. A hidden neuron is a neuron located in a hidden layer of the neural network. The output neurons, which make up the output layer of the network, produce the results of the neural network.

In mesh (fully connected) networks, each neuron transmits its output signal to the other neurons, including itself. The output signals of the network can be all or some of the output signals of the neurons after several cycles of network operation.

All input signals are given to all neurons.

Training neural networks

Before using a neural network, it must be trained. The process of training a neural network consists in adjusting its internal parameters to a specific task. The neural network algorithm is iterative; its steps are called epochs or cycles. An epoch is one iteration of the learning process, including the presentation of all examples from the training set and, possibly, checking the quality of learning on a test set. The learning process is carried out on the training sample. The training sample includes the input values and the corresponding output values of the data set. During training, the neural network finds certain dependencies between the output fields and the input fields. Thus, we face the question of which input fields (features) need to be used. Initially, the choice is made heuristically; then the number of inputs can be changed.

A problem that may arise is the number of observations in the data set. Although there are certain rules describing the relationship between the required number of observations and the size of the network, their correctness has not been proven. The number of required observations depends on the complexity of the problem being solved. As the number of features increases, the number of observations required grows nonlinearly; this problem is called the "curse of dimensionality". If the amount of data is insufficient, it is recommended to use a linear model.

The analyst must determine the number of layers in the network and the number of neurons in each layer. Next, it is necessary to assign values of the weights and biases that can minimize the decision error. The weights and biases are adjusted automatically so as to minimize the difference between the desired and actual output signals, which is called the training error. The training error for the constructed neural network is calculated by comparing the output and target (desired) values. The error function is formed from the resulting differences.

The error function is an objective function that must be minimized in the process of supervised learning of a neural network. Using the error function, you can evaluate the quality of the neural network during training. For example, the sum of squared errors is often used. The quality of training of a neural network determines its ability to solve the tasks assigned to it.

Overtraining (overfitting) of a neural network

When training neural networks, a serious difficulty often arises called the problem of overfitting. Overfitting (overtraining) is an excessively precise fit of the neural network to a specific set of training examples, in which the network loses its ability to generalize. Overfitting occurs when training goes on for too long, when there are not enough training examples, or when the neural network structure is overcomplicated.

Overfitting is connected with the fact that the choice of the training set is random. From the first steps of learning, the error decreases. At subsequent steps, in order to reduce the error (objective function), the parameters adapt to the characteristics of the training set. However, in this case the model "adjusts" not to the general patterns of the series but to the particular features of a part of it, the training subset, and the accuracy of the forecast decreases.

One of the ways to combat network overfitting is to divide the training sample into two sets (training and test). The neural network is trained on the training set, and the constructed model is checked on the test set. These sets must not intersect. With each step, the model parameters change, but a constant decrease in the value of the objective function occurs precisely on the training set. When we split the set in two, we can observe the change in the forecast error on the test set in parallel with the observations on the training set. For some number of steps the forecast error decreases on both sets. However, at a certain step the error on the test set begins to grow, while the error on the training set continues to decrease. This moment is considered the beginning of overfitting.
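A sketch of this train/test procedure, assuming scikit-learn and using synthetic data: train a small neural network and compare its accuracy on the training and test sets; a large gap between the two is a symptom of overfitting.

```python
# A sketch of the train/test split described above (assumes scikit-learn,
# synthetic data): a large gap between training and test accuracy is a
# symptom of overfitting.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0)
net.fit(X_train, y_train)

print("train accuracy:", net.score(X_train, y_train))
print("test accuracy: ", net.score(X_test, y_test))
```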

Data Mining Tools

Both world-famous leaders and new developing companies are involved in the development of the Data Mining sector of the global software market. Data Mining tools can be presented either as a stand-alone application or as an add-on to a main product. The latter option is implemented by many leaders of the software market. Thus, it has already become a tradition for developers of universal statistical packages to include a set of Data Mining methods in the package in addition to traditional methods of statistical analysis. These are packages such as SPSS (SPSS, Clementine), Statistica (StatSoft) and SAS Enterprise Miner (SAS Institute). Some OLAP solution providers also offer a set of Data Mining methods, such as the Cognos family of products. There are vendors that include Data Mining solutions in DBMS functionality: Microsoft (Microsoft SQL Server), Oracle, IBM (IBM Intelligent Miner for Data).

Bibliography

  1. Abdikeev N.M., Danko T.P., Ildemenov S.V., Kiselev A.D., "Business Process Reengineering. MBA Course", Moscow: Eksmo, 2005, 592 p. (MBA)
  2. Abdikeev N.M., Kiselev A.D., "Knowledge Management in a Corporation and Business Reengineering", Moscow: Infra-M, 2011, 382 p., ISBN 978-5-16-004300-5
  3. Barseghyan A.A., Kupriyanov M.S., Stepanenko V.V., Kholod I.I., "Methods and Models of Data Analysis: OLAP and Data Mining", St. Petersburg: BHV-Petersburg, 2004, 336 p., ISBN 5-94157-522-X
  4. Dyuk V., Samoilenko A., "Data Mining. Training Course", St. Petersburg: Piter, 2001, 386 p.
  5. Chubukova I.A., Data Mining course, http://www.intuit.ru/department/database/datamining/
  6. Ian H. Witten, Eibe Frank, Mark A. Hall, "Data Mining: Practical Machine Learning Tools and Techniques" (Third Edition), Morgan Kaufmann, ISBN 978-0-12-374856-0
  7. Petrushin V.A., Khan L., "Multimedia Data Mining and Knowledge Discovery"

The development of methods for recording and storing data has led to a rapid growth in the volume of information collected and analyzed. The volumes of data are so impressive that it is simply impossible for a person to analyze them on their own, although the need for such an analysis is quite obvious, because this “raw” data contains knowledge that can be used in decision-making. In order to carry out automatic data analysis, Data Mining is used.

Data Mining is the process of discovering in “raw” data previously unknown, non-trivial, practically useful and interpretable knowledge necessary for decision-making in various areas of human activity. Data Mining is one of the steps of Knowledge Discovery in Databases.

The information found in the process of applying Data Mining methods must be non-trivial and previously unknown; for example, average sales figures are not such knowledge. The knowledge should describe new connections between properties and predict the values of some features based on others, and it must be applicable to new data with some degree of reliability. The usefulness lies in the fact that this knowledge can bring certain benefits when applied. The knowledge must be in a non-mathematical form understandable to the user. For example, the logical constructions "if ... then ..." are most easily perceived by humans. Moreover, such rules can be used in various DBMSs as SQL queries. In the case where the extracted knowledge is not transparent to the user, there must be post-processing methods to bring it into an interpretable form.

The algorithms used in Data Mining require a lot of calculations. Previously, this was a limiting factor for the widespread practical use of Data Mining, but today's increase in the performance of modern processors has alleviated the severity of this problem. Now, in a reasonable amount of time, you can conduct a high-quality analysis of hundreds of thousands and millions of records.

Problems solved by Data Mining methods:

  1. Classification: the assignment of objects (observations, events) to one of the previously known classes.
  2. Regression, including forecasting tasks: establishing the dependence of continuous output variables on the input variables.
  3. Clustering: grouping objects (observations, events) based on the data (properties) that describe the essence of these objects. Objects within a cluster must be "similar" to each other and different from the objects included in other clusters. The more similar the objects within a cluster and the more differences between clusters, the more accurate the clustering.
  4. Association: identifying patterns between related events. An example of such a pattern is a rule indicating that event Y follows from event X. Such rules are called associative. This problem was first posed for finding typical shopping patterns in supermarkets, so it is sometimes also called market basket analysis.
  5. Sequential patterns: establishing patterns between events related in time, i.e., detecting a dependence of the form: if event X occurs, then after a given time event Y will occur.
  6. Deviation analysis: identifying the most uncharacteristic patterns.

Business analysis problems are formulated differently, but the solution to most of them comes down to one or another Data Mining problem or a combination of them. For example, risk assessment is a solution to a regression or classification problem, market segmentation is clustering, demand stimulation is association rules. In fact, Data Mining tasks are the elements from which a solution to the vast majority of real business problems can be assembled.

To solve the above problems, various Data Mining methods and algorithms are used. Due to the fact that Data Mining has developed and is developing at the intersection of such disciplines as statistics, information theory, machine learning, and database theory, it is quite natural that most Data Mining algorithms and methods were developed based on various methods from these disciplines. For example, the k-means clustering procedure was simply borrowed from statistics. The following Data Mining methods have become very popular: neural networks, decision trees, clustering algorithms, including scalable ones, algorithms for detecting associative connections between events, etc.

Deductor is an analytical platform that includes a full set of tools for solving Data Mining problems: linear regression, supervised neural networks, unsupervised neural networks, decision trees, search for association rules and many others. For many mechanisms, specialized visualizers are provided, which greatly facilitate the use of the resulting model and interpretation of the results. The strength of the platform is not only the implementation of modern analysis algorithms, but also the ability to arbitrarily combine various analysis mechanisms.


Data Mining technologies are a powerful tool of modern business analytics and data research for detecting hidden patterns and building predictive models. Data Mining, or knowledge extraction, rests not on speculative reasoning but on real data.

Fig. 1. Data Mining application scheme

Problem Definition – statement of the problem: data classification, segmentation, construction of predictive models, forecasting.
Data Gathering and Preparation – collection and preparation of data: cleaning, verification, removal of duplicate records.
Model Building – building the model and assessing its accuracy.
Knowledge Deployment – applying the model to solve the given problem.
A minimal end-to-end sketch of these four stages is given below.
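The sketch below walks through the four stages on a hypothetical customers.csv table with numeric feature columns and a binary churn column; all file and column names are invented, and pandas with scikit-learn is used only as a convenient stand-in for whatever tooling a real project employs.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Problem definition: predict the binary column "churn" from the other (numeric) columns.

# Data gathering and preparation: load, drop duplicates, drop incomplete rows.
df = pd.read_csv("customers.csv")            # hypothetical file
df = df.drop_duplicates().dropna()

X = df.drop(columns=["churn"])
y = df["churn"]

# Model building: fit on one part of the data, check accuracy on the rest.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Knowledge deployment: apply the model to new, unlabeled records
# (assumed to contain the same feature columns as the training table).
new_records = pd.read_csv("new_customers.csv")   # hypothetical file
new_records["churn_predicted"] = model.predict(new_records)
```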

Data Mining is used to implement large-scale analytical projects in business, marketing, the Internet, telecommunications, industry, geology, medicine, pharmaceuticals and other areas.

Data Mining makes it possible to search for significant correlations and relationships by sifting through huge amounts of data with modern pattern-recognition methods and specialized analytical techniques, including decision trees and classification, clustering, neural-network methods and others.

A user encountering Data Mining technology for the first time is struck by the abundance of methods and efficient algorithms available for tackling difficult problems of analyzing large volumes of data.

In general, Data Mining can be characterized as a technology designed to search large volumes of data for non-obvious, objective and practically useful patterns.

Data Mining is based on effective methods and algorithms developed for analyzing unstructured data of large volume and dimension.

The key point is that high-volume, high-dimensional data appears to lack structure and connections. The goal of data mining technology is to identify these structures and find patterns where, at first glance, chaos and arbitrariness reign.

Here is a topical example of the application of Data Mining in the pharmaceutical industry.

Drug interactions are a growing problem facing modern healthcare.

Over time, the number of medications a patient takes (prescription drugs, over-the-counter drugs and all kinds of supplements) grows, making drug-drug interactions that can cause serious side effects, unknown to doctors and patients, ever more likely.

This area belongs to post-marketing (post-clinical) research, when the drug has already been released to the market and is in wide use.

Clinical trials evaluate the effectiveness of a drug but do not take into account its interactions with other drugs already on the market.

Researchers at Stanford University in California examined the FDA's database of drug side effects and found that two commonly used drugs, the antidepressant paroxetine and the cholesterol-lowering drug pravastatin, increase the risk of developing diabetes when used together.

A similar analysis of FDA data identified 47 previously unknown adverse interactions.
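To make the idea of such screening concrete, here is a heavily simplified sketch that flags drug pairs whose adverse-event rate in reports mentioning both drugs exceeds the rate for either drug alone; the report data and the "signal" criterion are invented, and real FAERS-style analyses use far more careful statistics.

```python
from itertools import combinations

# Each report: the set of drugs mentioned and whether the adverse event occurred.
# The data below are invented purely for illustration.
reports = [
    ({"paroxetine", "pravastatin"}, True),
    ({"paroxetine"}, False),
    ({"pravastatin"}, False),
    ({"paroxetine", "pravastatin"}, True),
    ({"pravastatin", "aspirin"}, False),
    ({"paroxetine", "aspirin"}, False),
]

def event_rate(reports, drugs):
    """Share of reports containing all the given drugs in which the event occurred."""
    hits = [event for mentioned, event in reports if drugs <= mentioned]
    return sum(hits) / len(hits) if hits else 0.0

all_drugs = set().union(*(mentioned for mentioned, _ in reports))
for a, b in combinations(sorted(all_drugs), 2):
    pair_rate = event_rate(reports, {a, b})
    single_rate = max(event_rate(reports, {a}), event_rate(reports, {b}))
    if pair_rate > single_rate:          # crude "interaction signal"
        print(f"{a} + {b}: pair rate {pair_rate:.2f} vs best single rate {single_rate:.2f}")
```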

These results are encouraging, with the caveat that many adverse effects noticed by patients still go unreported and undetected; it is precisely here that analysis of online search queries can prove most useful.


We begin our introduction to Data Mining using the amazing Data Science Academy videos.

Be sure to watch our videos and you will understand what Data Mining is!

Video 1. What is Data Mining?


Video 2. Review of data mining methods: decision trees, generalized predictive models, clustering and much more



Before starting a research project, we need to organize the process of obtaining data from external sources; we will now show how this is done.

This video introduces the unique STATISTICA In-place database processing technology and shows how to connect Data Mining to real data.

Video 3. The order of interaction with databases: graphical interface for building SQL queries, In-place database processing technology



Next we get acquainted with interactive drilling techniques, which are effective for exploratory data analysis. The term "drilling" itself reflects the link between Data Mining technology and geological exploration; a rough code-level sketch of the drill-down idea is given after the video.

Video 4: Interactive Drilling: Exploration and Graphics Techniques for Interactive Data Exploration

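As a rough, code-level analogue of interactive drilling (outside the STATISTICA interface), the following pandas sketch summarizes a hypothetical sales table by one attribute and then "drills" into a single group to summarize it by a finer attribute; all column names are invented.

```python
import pandas as pd

# Hypothetical sales table; in STATISTICA this kind of drilling is done interactively.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [100, 150, 80, 120, 60],
})

# Top level: revenue by region.
print(sales.groupby("region")["revenue"].sum())

# Drill down: inside the "South" region, break revenue out by product.
south = sales[sales["region"] == "South"]
print(south.groupby("product")["revenue"].sum())
```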


Next we turn to association analysis (association rules); these algorithms find relationships that exist in real data. The key point is that the algorithms remain efficient on large volumes of data.

The result of association-analysis algorithms such as Apriori is a set of rules linking the objects under study, each holding with a given confidence, for example 80%. A minimal sketch of what support and confidence mean follows this paragraph.
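Here is a minimal sketch of what support and an 80% confidence threshold mean, computed directly over a toy list of baskets; a real Apriori implementation additionally prunes candidate itemsets by minimum support, which this sketch omits.

```python
from itertools import permutations

# Toy market baskets, invented for illustration.
baskets = [
    {"nail polish", "perfume"},
    {"nail polish", "perfume", "mascara"},
    {"nail polish", "perfume"},
    {"nail polish", "perfume", "mascara"},
    {"mascara"},
]

def support(itemset):
    """Share of baskets containing every item of the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """P(consequent is in the basket | antecedent is in the basket)."""
    return support(antecedent | consequent) / support(antecedent)

items = set().union(*baskets)
for a, b in permutations(sorted(items), 2):
    conf = confidence({a}, {b})
    if conf >= 0.8:                      # report rules at 80% confidence, as in the text
        print(f"{a} -> {b}: support {support({a, b}):.2f}, confidence {conf:.2f}")
```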

In geology, these algorithms can be used in exploration analysis of minerals, for example, how feature A is related to features B and C.

Specific examples of such solutions include the following:

In retail, Apriori algorithms or their modifications make it possible to study the relationship between different products, for example, when selling perfumes (perfume - nail polish - mascara, etc.) or products of different brands.

Analysis of the most interesting sections on the site can also be effectively carried out using association rules.

So check out our next video.

Video 5. Association rules


Here are examples of the application of Data Mining in specific areas.

E-commerce (online retail):

  • analysis of customer trajectories from the first visit to the site through to purchase
  • assessment of service efficiency and analysis of lost sales caused by out-of-stock goods
  • relationships between products that interest visitors

Retail: analysis of customer information based on credit cards, discount cards, etc.

Typical retail tasks solved by Data Mining tools:

  • shopping (market) basket analysis;
  • construction of predictive and classification models for buyers and purchased goods;
  • creation of customer profiles (a minimal profiling sketch follows this list);
  • CRM: assessing the loyalty of different categories of customers and planning loyalty programs;
  • analysis of time series and temporal dependencies, identification of seasonal factors, and assessment of promotion effectiveness on large volumes of real data.
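As a hedged illustration of building customer profiles from card-level data, this pandas sketch aggregates hypothetical transactions into per-customer features (visits, total spend, average check, favorite category) that could then feed the classification or clustering models discussed earlier; all column names are invented.

```python
import pandas as pd

# Hypothetical card transactions; column names are invented for illustration.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount":      [25.0, 40.0, 10.0, 15.0, 5.0, 300.0],
    "category":    ["food", "cosmetics", "food", "food", "toys", "electronics"],
})

# One row per customer: simple behavioral profile features.
profiles = tx.groupby("customer_id").agg(
    visits=("amount", "size"),
    total_spend=("amount", "sum"),
    avg_check=("amount", "mean"),
    favorite_category=("category", lambda s: s.mode().iloc[0]),
)
print(profiles)
```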

The telecommunications sector opens up unlimited opportunities for the use of data mining methods, as well as modern big data technologies:

  • classification of clients based on key characteristics of calls (frequency, duration, etc.), SMS frequency;
  • identifying customer loyalty;
  • fraud detection, etc.

Insurance:

  • risk analysis. By identifying combinations of factors associated with paid claims, insurers can reduce their liability losses. There is a case where an insurance company discovered that the amounts paid out on claims of married people were twice as high as the amounts paid out on claims by single people. The company responded to this by revising its discount policy for family customers.
  • fraud detection. Insurance companies can reduce fraud by looking for certain patterns in claims that characterize the relationships between lawyers, doctors and claimants.

The practical application of data mining and solving specific problems is presented in our next video.

Webinar 1. "Practical tasks of Data Mining: problems and solutions"


Webinar 2. "Data Mining and Text Mining: examples of solving real problems"



You can get more in-depth knowledge of data mining methodology and technology in StatSoft courses.






