The role of big data in the modern world


It was predicted that the total global volume of data created and replicated in 2011 would be about 1.8 zettabytes (1.8 trillion gigabytes), roughly nine times more than was created in 2006.

A more complete definition

However, big data involves more than just analyzing huge amounts of information. The problem is not that organizations create huge volumes of data, but that most of it arrives in formats that fit poorly into the traditional structured database model: web logs, video, text documents, machine code, or, for example, geospatial data. All of this is stored in many different repositories, sometimes even outside the organization. As a result, corporations may have access to a huge amount of their data yet lack the tools needed to establish relationships within that data and draw meaningful conclusions from it. Add the fact that data is now updated more and more frequently, and you get a situation in which traditional methods of information analysis cannot keep up with huge volumes of constantly refreshed data, which ultimately opens the way for big data technologies.

Best definition

In essence, the concept of big data means working with information of huge volume and diverse composition, which is updated very frequently and located in different sources, with the aim of increasing operational efficiency, creating new products, and improving competitiveness. The consulting company Forrester puts it briefly: "Big data brings together techniques and technologies that extract meaning from data at the extreme limits of practicality."

How big is the difference between business analytics and big data?

Craig Baty, executive director of marketing and chief technology officer of Fujitsu Australia, pointed out that business analysis is a descriptive process of analyzing the results a business achieved over a certain period, whereas the processing speed of big data makes analysis predictive, capable of offering the business recommendations for the future. Big data technologies also make it possible to analyze more types of data than business intelligence tools can, shifting the focus beyond structured repositories alone.

Matt Slocum of O'Reilly Radar believes that although big data and business analytics have the same goal (finding answers to a question), they differ from each other in three aspects.

  • Big data is designed to handle larger volumes of information than business analytics, which certainly fits the traditional definition of big data.
  • Big data is designed to handle faster-arriving, faster-changing information, which implies deep exploration and interactivity. In some cases, results are generated faster than a web page loads.
  • Big data is designed to handle unstructured data whose uses we are only beginning to explore once we have managed to collect and store it; we need algorithms and query capabilities to make it easier to find the trends contained within these data sets.

According to the white paper "Oracle Information Architecture: An Architect's Guide to Big Data" published by Oracle, when working with big data, we approach information differently than when conducting business analysis.

Working with big data is not like the usual business intelligence process, where simply adding up known values produces a result: for example, summing paid invoices yields sales for the year. When working with big data, the result emerges in the course of cleaning the data through sequential modeling: first a hypothesis is put forward; then a statistical, visual, or semantic model is built; the accuracy of the hypothesis is checked against it; and then the next hypothesis is put forward. This process requires the researcher either to interpret visual meanings, to construct interactive queries based on knowledge, or to develop adaptive machine learning algorithms capable of producing the desired result. Moreover, the lifetime of such an algorithm can be quite short.
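
To make this loop concrete, here is a minimal sketch in Python. The data, the candidate hypotheses, and the choice of logistic regression are all invented for illustration; only the structure (propose a hypothesis, build a model, check it, move on to the next) mirrors the process described above.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))                 # stand-in for cleaned big data
    y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)  # the process we theorize about

    # Each hypothesis names the features believed to drive the outcome.
    hypotheses = {
        "only feature 0 matters": [0],
        "features 0 and 1 matter": [0, 1],
        "features 0 and 2 matter": [0, 2],
    }

    for name, cols in hypotheses.items():
        score = cross_val_score(LogisticRegression(), X[:, cols], y, cv=5).mean()
        print(f"{name}: cross-validated accuracy = {score:.2f}")
    # A hypothesis whose model validates well is kept; otherwise the next is tried.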

Big data analysis techniques

There are many different methods for analyzing data sets, based on tools borrowed from statistics and computer science (for example, machine learning). The list below does not pretend to be complete, but it reflects the most popular approaches across industries. Researchers, of course, continue to create new techniques and improve existing ones. Moreover, some of the techniques listed do not apply exclusively to big data and can be used successfully on smaller arrays (for example, A/B testing and regression analysis). Naturally, the larger and more diverse the array being analyzed, the more accurate and relevant the resulting conclusions.

A/B testing. A technique in which a control sample is compared, one at a time, with other samples. It makes it possible to identify the optimal combination of indicators for achieving, for example, the best consumer response to a marketing offer. Big data makes it possible to run a huge number of iterations and thus obtain a statistically reliable result.
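
As a sketch of how one such comparison might be scored, the snippet below runs a two-proportion z-test on invented conversion counts for a control group A and a variant B. It is a toy illustration, not a production testing framework.

    from math import sqrt, erfc

    def ab_test(conv_a, n_a, conv_b, n_b):
        p_a, p_b = conv_a / n_a, conv_b / n_b
        p = (conv_a + conv_b) / (n_a + n_b)           # pooled conversion rate
        se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # std. error of difference
        z = (p_b - p_a) / se
        p_value = erfc(abs(z) / sqrt(2))              # two-sided p-value
        return p_b - p_a, p_value

    # Hypothetical counts: 480 of 10,000 control users converted vs 560 of 10,000.
    lift, p_value = ab_test(conv_a=480, n_a=10_000, conv_b=560, n_b=10_000)
    print(f"lift = {lift:.4f}, p-value = {p_value:.4f}")

With big data volumes, samples like these are easy to collect, which is what makes the many-iteration testing described above statistically reliable.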

Association rule learning. A set of techniques for identifying relationships, i.e. association rules between variables in large data sets. Used in data mining.
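
A toy illustration of the idea, using a handful of invented shopping transactions: support measures how often items occur together, and confidence measures how often the rule "a buyer of A also buys B" actually holds.

    from itertools import combinations

    transactions = [
        {"bread", "milk"},
        {"bread", "diapers", "beer"},
        {"milk", "diapers", "beer"},
        {"bread", "milk", "diapers"},
        {"bread", "milk", "beer"},
    ]

    def support(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)

    # Candidate rules {a} -> {b}; 0.4 is an arbitrary minimum-support threshold.
    items = set().union(*transactions)
    for a, b in combinations(sorted(items), 2):
        s = support({a, b})
        if s >= 0.4:
            print(f"{{{a}}} -> {{{b}}}: support={s:.2f}, "
                  f"confidence={s / support({a}):.2f}")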

Classification. A set of techniques that allows you to predict consumer behavior in a certain market segment (purchase decisions, churn, consumption volume, etc.). Used in data mining.
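
A small sketch of such a prediction with scikit-learn; the two features (monthly spend and number of support calls) and the churn labels are fabricated purely for illustration.

    from sklearn.tree import DecisionTreeClassifier

    X = [[20, 0], [25, 1], [80, 5], [75, 4], [30, 0], [90, 6]]  # [spend, calls]
    y = [0, 0, 1, 1, 0, 1]                                      # 1 = churned

    model = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(model.predict([[85, 5], [22, 1]]))  # churn forecasts for new customers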

Cluster analysis. A statistical method for classifying objects into groups by identifying common features that are not known in advance. Used in data mining.
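
For example, k-means clustering discovers such groups without being told what they are; in this sketch the points and the choice of two clusters are hypothetical.

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1, 2], [1, 4], [1, 0],      # one natural group
                  [10, 2], [10, 4], [10, 0]])  # another natural group

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)           # discovered group for each object
    print(kmeans.cluster_centers_)  # the common features of each group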

Crowdsourcing. A methodology for collecting data from a large number of sources, typically by enlisting a broad, essentially unlimited circle of contributors.

Data fusion and data integration. A set of techniques for combining and analyzing data from heterogeneous sources, for example matching comments from social network users against real-time sales results.

Data mining. A set of techniques that allows you to determine the categories of consumers most susceptible to the promoted product or service, identify the characteristics of the most successful employees, and predict the behavioral model of consumers.

Ensemble learning. This method uses many predictive models, thereby improving the quality of the forecasts made.

Genetic algorithms. In this technique, possible solutions are represented as "chromosomes" that can be combined and mutated. As in natural evolution, the fittest solutions survive.
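
A minimal sketch of the idea: chromosomes are bit strings, fitness is simply the number of ones, and each generation keeps the fittest half, recombines it, and occasionally mutates a gene. All parameters here are arbitrary.

    import random

    random.seed(0)
    LENGTH, POP, GENERATIONS = 20, 30, 40

    def fitness(chrom):
        return sum(chrom)  # toy objective: maximize the number of ones

    population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
    for _ in range(GENERATIONS):
        population.sort(key=fitness, reverse=True)
        survivors = population[: POP // 2]        # selection of the fittest
        children = []
        while len(children) < POP - len(survivors):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, LENGTH)
            child = a[:cut] + b[cut:]             # crossover (combination)
            if random.random() < 0.1:             # occasional mutation
                i = random.randrange(LENGTH)
                child[i] ^= 1
            children.append(child)
        population = survivors + children

    print(fitness(max(population, key=fitness)))  # best solution found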

Machine learning. A direction in computer science (historically it has been given the name “artificial intelligence”), which pursues the goal of creating self-learning algorithms based on the analysis of empirical data.

Natural language processing (NLP). A set of techniques for recognizing natural human language borrowed from computer science and linguistics.

Network analysis. A set of techniques for analyzing connections between nodes in networks. In relation to social networks, it allows you to analyze the relationships between individual users, companies, communities, etc.

Optimization. A set of numerical methods for redesigning complex systems and processes to improve one or more metrics. Helps in making strategic decisions, for example, the composition of the product line to be launched on the market, conducting investment analysis, etc.

Pattern recognition. A set of techniques with self-learning elements for predicting the behavioral model of consumers.

Predictive modeling. A set of techniques that allow you to create a mathematical model of a predetermined probable scenario for the development of events. For example, analysis of the CRM system database for possible conditions that will prompt subscribers to change providers.

Regression. A set of statistical methods for identifying a pattern between changes in a dependent variable and one or more independent variables. Often used for forecasting and predictions. Used in data mining.
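
A small sketch of the idea, with invented numbers: fit a line relating a dependent variable (sales) to an independent one (advertising spend), then use it to forecast.

    import numpy as np

    ad_spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # independent variable
    sales = np.array([2.1, 3.9, 6.2, 8.1, 9.8])     # dependent variable

    slope, intercept = np.polyfit(ad_spend, sales, deg=1)
    print(f"sales = {slope:.2f} * ad_spend + {intercept:.2f}")
    print("forecast for spend = 6:", slope * 6 + intercept)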

Sentiment analysis. Techniques for assessing consumer sentiment, based on natural language processing. They make it possible to isolate, from the general information flow, messages related to a subject of interest (for example, a consumer product) and then to assess the polarity of each judgment (positive or negative), its degree of emotionality, and so on.
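
Real sentiment analysis relies on trained language models; the deliberately simplified sketch below uses a tiny hand-made polarity lexicon just to show the flow from message to polarity label.

    POSITIVE = {"great", "love", "excellent", "good"}
    NEGATIVE = {"bad", "terrible", "hate", "broken"}

    def polarity(message):
        words = message.lower().split()
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"

    for msg in ["I love this phone, excellent screen",
                "Terrible battery, broken charger"]:
        print(polarity(msg), "->", msg)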

Signal processing. A set of techniques, borrowed from radio engineering, aimed at recognizing a signal against background noise and analyzing it further.

Spatial analysis. A set of methods, partly borrowed from statistics, for analyzing spatial data: terrain topology, geographic coordinates, object geometry. Geographic information systems (GIS) often serve as the source of this big data.

  • Revolution Analytics (based on the R language for mathematical statistics).

Of particular interest on this list is Apache Hadoop, open source software that has proven itself as a data analyzer over the past five years. As soon as Yahoo released the Hadoop code to the open source community, a whole movement of Hadoop-based products sprang up in the IT industry. Almost all modern big data analysis tools offer Hadoop integration; their developers range from startups to well-known global companies.

Markets for Big Data Management Solutions

Big Data Platforms (BDP) as a means of combating digital hoarding

The ability to analyze big data is commonly perceived as an unambiguous benefit. But is this really so? What could the unbridled accumulation of data lead to? Most likely to what psychologists, in relation to humans, call pathological hoarding, or syllogomania, figuratively known as "Plyushkin syndrome." In English, the compulsive urge to collect everything is called hoarding (from "hoard," a stockpile), and the classification of mental illnesses treats it as a mental disorder. In the digital era, digital hoarding joins traditional material hoarding; it can affect both individuals and entire enterprises and organizations.

World and Russian market

Big data landscape: the main suppliers

Almost all leading IT companies have shown interest in tools for the collection, processing, management, and analysis of big data, which is quite natural. Firstly, they encounter this phenomenon directly in their own business; secondly, big data opens up excellent opportunities for developing new market niches and attracting new customers.

Many startups have appeared on the market that make business by processing huge amounts of data. Some of them use ready-made cloud infrastructure provided by large players like Amazon.

Theory and practice of Big Data in industries

History of development

2017

TmaxSoft forecast: the next “wave” of Big Data will require modernization of the DBMS

Businesses know that the vast amounts of data they accumulate contain important information about their business and customers. If a company can successfully apply this information, it will have a significant advantage over its competitors and will be able to offer better products and services. However, many organizations still fail to use big data effectively because their legacy IT infrastructure cannot provide the necessary storage capacity, data exchange processes, utilities, and applications required to process and analyze large amounts of unstructured data and extract valuable information from it, TmaxSoft noted.

Additionally, the increased processing power needed to analyze ever-increasing volumes of data may require significant investment in an organization's legacy IT infrastructure, as well as additional maintenance resources that could be used to develop new applications and services.

2015

On February 5, 2015, the White House released a report discussing how companies use big data to charge different customers different prices, a practice known as "price discrimination" or "personalized pricing." The report describes the benefits of big data for both sellers and buyers, and its authors conclude that many of the issues raised by big data and differential pricing can be addressed through existing anti-discrimination laws and consumer protection regulations.

The report notes that at this time, there is only anecdotal evidence of how companies are using big data in the context of personalized marketing and differentiated pricing. This information shows that sellers use pricing methods that can be divided into three categories:

  • studying the demand curve;
  • steering and differentiated pricing based on demographic data; and
  • targeted behavioral marketing (behavioral targeting) and individualized pricing.

Studying the demand curve: to determine demand and study consumer behavior, marketers often conduct experiments in which customers are randomly assigned to one of two possible price categories. "Technically, these experiments are a form of differential pricing because they result in different prices for customers, even if they are 'non-discriminatory' in the sense that all customers have the same probability of being 'sent' to a higher price."

Steering: It is the practice of presenting products to consumers based on their membership in a specific demographic group. For example, a computer company's website may offer the same laptop to different types of customers at different prices based on their self-reported information (for example, depending on whether the user is a government, academic, or commercial user, or an individual) or on their geographical location (for example, determined by the IP address of a computer).

Targeted behavioral marketing and customized pricing: In these cases, customers' personal information is used to target advertising and tailor prices for certain products. For example, online advertisers use data collected by advertising networks and through third-party cookies about users' online activity to target their advertisements. On the one hand, this approach lets consumers receive advertising for goods and services of interest to them. It may, however, worry consumers who do not want certain types of their personal data (such as information about visits to websites dealing with medical or financial matters) to be collected without their consent.

Although targeted behavioral marketing is widespread, there is relatively little evidence of personalized pricing in the online environment. The report speculates that this may be because the methods are still being developed, or because companies are hesitant to use custom pricing (or prefer to keep quiet about it) - perhaps fearing a backlash from consumers.

The report's authors suggest that "for the individual consumer, the use of big data clearly presents both potential rewards and risks." While acknowledging that big data raises transparency and discrimination issues, the report argues that existing anti-discrimination and consumer protection laws are sufficient to address them. However, the report also highlights the need for “ongoing oversight” when companies use sensitive information in ways that are not transparent or in ways that are not covered by existing regulatory frameworks.

This report continues the White House's efforts to examine the use of big data and discriminatory pricing on the Internet and the resulting consequences for American consumers. It was previously reported that the White House Big Data Working Group published its report on this issue in May 2014. The Federal Trade Commission (FTC) also addressed these issues during its September 2014 workshop on big data discrimination.

2014

Gartner dispels myths about Big Data

A fall 2014 research note from Gartner lists a number of common Big Data myths among IT leaders and provides rebuttals to them.

  • Everyone is implementing Big Data processing systems faster than us

Interest in Big Data technologies is at an all-time high: 73% of the organizations surveyed by Gartner analysts this year are already investing in them or planning to do so. But most of these initiatives are still at a very early stage, and only 13% of respondents have actually deployed such solutions. The hardest part is figuring out how to extract revenue from Big Data and deciding where to start. Many organizations get stuck at the pilot stage because they cannot tie the new technology to specific business processes.

  • We have so much data that there is no need to worry about small errors in it

Some IT managers believe that small data flaws do not affect the overall results of analyzing huge volumes. When there is a lot of data, each individual error actually has less of an impact on the result, analysts note, but the errors themselves also become more numerous. In addition, most of the analyzed data is external, of unknown structure or origin, so the likelihood of errors increases. So in the world of Big Data, quality is actually much more important.

  • Big Data technologies will eliminate the need for data integration

Big Data promises the ability to process data in its original format, with automatic schema generation as it is read. It is believed that this will allow information from the same sources to be analyzed using multiple data models. Many believe that this will also enable end users to interpret any data set as they see fit. In reality, most users often want the traditional way with a ready-made schema, where the data is formatted appropriately and there are agreements on the level of integrity of the information and how it should relate to the use case.

  • There is no point in using data warehouses for complex analytics

Many information management system administrators believe that there is no point in spending time creating a data warehouse, given that complex analytical systems rely on new types of data. In fact, many complex analytics systems use information from a data warehouse. In other cases, new types of data need to be additionally prepared for analysis in Big Data processing systems; decisions have to be made about the suitability of the data, the principles of aggregation and the required level of quality - such preparation may occur outside the warehouse.

  • Data warehouses will be replaced by data lakes

In reality, vendors mislead customers by positioning data lakes as a replacement for storage or as critical elements of the analytical infrastructure. Underlying data lake technologies lack the maturity and breadth of functionality found in warehouses. Therefore, managers responsible for data management should wait until lakes reach the same level of development, according to Gartner.

Accenture: 92% of those who implemented big data systems are satisfied with the results

Among the main advantages of big data, respondents named:

  • “searching for new sources of income” (56%),
  • “improving customer experience” (51%),
  • “new products and services” (50%) and
  • “an influx of new customers and maintaining the loyalty of old ones” (47%).

When introducing new technologies, many companies are faced with traditional problems. For 51%, the stumbling block was security, for 47% - budget, for 41% - lack of necessary personnel, and for 35% - difficulties in integrating with the existing system. Almost all companies surveyed (about 91%) plan to soon solve the problem of staff shortages and hire big data specialists.

Companies are optimistic about the future of big data technologies. 89% believe they will change business as much as the Internet. 79% of respondents noted that companies that do not engage in big data will lose their competitive advantage.

However, respondents disagreed about what exactly should be considered big data. 65% of respondents believe that these are “large data files”, 60% believe that this is “advanced analytics and analysis”, and 50% believe that this is “data visualization tools”.

Madrid spends €14.7 million on big data management

In July 2014, it became known that Madrid would use big data technologies to manage city infrastructure. The cost of the project is 14.7 million euros, the basis of the implemented solutions will be technologies for analyzing and managing big data. With their help, the city administration will manage work with each service provider and pay accordingly depending on the level of services.

We are talking about the administration's contractors, who monitor the condition of streets, lighting, irrigation, and green spaces, clean the territory, and remove and recycle waste. During the project, 300 key performance indicators of city services were developed for specially designated inspectors, on the basis of which 1.5 thousand various checks and measurements will be carried out daily. In addition, the city will begin using an innovative technology platform called Madrid iNTeligente (MiNT) - Smarter Madrid.

2013

Experts: Big Data is in fashion

Without exception, all vendors in the data management market are currently developing technologies for Big Data management. This new technological trend is also actively discussed by the professional community, both developers and industry analysts and potential consumers of such solutions.

As DataSift found, by January 2013 the wave of discussion around big data had exceeded all imaginable dimensions. Analyzing the number of mentions of Big Data on social networks, DataSift calculated that in 2012 the term was used about 2 billion times in posts created by about 1 million different authors around the world. This is equivalent to 260 posts per hour, with a peak of 3,070 mentions per hour.

Gartner: Every second CIO is ready to spend money on Big data

After several years of experimentation with Big Data technologies and the first implementations in 2013, the adoption of such solutions will increase significantly, Gartner predicts. Researchers surveyed IT leaders around the world and found that 42% of respondents had already invested in Big Data technologies or planned to do so within the next year (data as of March 2013).

Companies have to spend money on big data processing technologies because the information landscape is changing rapidly and demands new approaches to information processing. Many companies have already realized that large data sets are critically important, and working with them yields benefits unavailable through traditional information sources and processing methods. In addition, the constant discussion of the topic of big data in the media fuels interest in the relevant technologies.

Frank Buytendijk, a vice president at Gartner, even urged companies to moderate their zeal, since some worry they are falling behind competitors in adopting Big Data.

“There is no need to worry; the possibilities for implementing ideas based on big data technologies are virtually endless,” he said.

Gartner predicts that by 2015, 20% of Global 1000 companies will have a strategic focus on “information infrastructure.”

In anticipation of the new opportunities that big data processing technologies will bring, many organizations are already organizing the process of collecting and storing various types of information.

For education, government, and industrial organizations, the greatest potential for business transformation lies in combining accumulated data with so-called dark data: email messages, multimedia, and other similar content. According to Gartner, the winners in the data race will be those who learn to deal with the widest variety of information sources.

Cisco survey: Big Data will help increase IT budgets

The Spring 2013 Cisco Connected World Technology Report, conducted in 18 countries by the independent research firm InsightExpress, surveyed 1,800 college students and an equal number of young professionals aged 18 to 30. The survey aimed to gauge how ready IT departments are to implement Big Data projects and to gain insight into the challenges involved, the technological shortcomings, and the strategic value of such projects.

Most companies collect, record and analyze data. However, the report says, many companies face a range of complex business and information technology challenges with Big Data. For example, 60 percent of respondents admit that Big Data solutions can improve decision-making processes and increase competitiveness, but only 28 percent said that they are already receiving real strategic benefits from the accumulated information.

More than half of the IT executives surveyed believe that Big Data projects will help increase IT budgets in their organizations, as there will be increased demands on technology, personnel and professional skills. At the same time, more than half of respondents expect that such projects will increase IT budgets in their companies as early as 2012. 57 percent are confident that Big Data will increase their budgets over the next three years.

81 percent of respondents said that all (or at least some) Big Data projects will require the use of cloud computing. Thus, the spread of cloud technologies may affect the speed of adoption of Big Data solutions and the business value of these solutions.

Companies collect and use many different types of data, both structured and unstructured, drawing on a wide range of sources (Cisco Connected World Technology Report).

Nearly half (48 percent) of IT leaders predict the load on their networks will double over the next two years. (This is especially true in China, where 68 percent of respondents share this view, and in Germany – 60 percent). 23 percent of respondents expect network load to triple over the next two years. At the same time, only 40 percent of respondents declared their readiness for explosive growth in network traffic volumes.

27 percent of respondents admitted that they need better IT policies and information security measures.

21 percent need more bandwidth.

Big Data opens up new opportunities for IT departments to add value and build strong relationships with business units, allowing them to increase revenue and strengthen the company's financial position. Big Data projects make IT departments a strategic partner to business departments.

According to 73 percent of respondents, the IT department will become the main driver of the implementation of the Big Data strategy. At the same time, respondents believe that other departments will also be involved in the implementation of this strategy. First of all, this concerns the departments of finance (named by 24 percent of respondents), research and development (20 percent), operations (20 percent), engineering (19 percent), as well as marketing (15 percent) and sales (14 percent).

Gartner: Millions of new jobs needed to manage big data

Global IT spending will reach $3.7 trillion by 2013, which is 3.8% more than spending on information technology in 2012 (the year-end forecast is $3.6 trillion). The big data segment will develop at a much faster pace, says a Gartner report.

By 2015, 4.4 million jobs in information technology will be created to service big data, of which 1.9 million will be in the United States. Moreover, each such job will entail the creation of three additional jobs outside the IT sector, so that in the United States alone, 6 million people will be working to support the information economy over the next four years.

According to Gartner experts, the main problem is that there is not enough talent in the industry for this: both the private and public educational systems, for example in the United States, are not able to supply the industry with a sufficient number of qualified personnel. So of the new IT jobs mentioned, only one out of three will be staffed.

Analysts believe that the role of nurturing qualified IT personnel should be taken directly by companies that urgently need them, since such employees will be their ticket to the new information economy of the future.

2012

The first skepticism regarding "Big Data"

Analysts from Ovum and Gartner suggest that for big data, the fashionable topic of 2012, the time of liberation from illusions may be coming.

At this time, the term "Big Data" typically referred to the ever-growing volume of information flowing online from social media, sensor networks, and other sources, as well as the growing range of tools used to process this data and identify business-relevant trends in it.

"Because of (or despite) the hype around the idea of big data, manufacturers in 2012 looked at this trend with great hope," said Tony Baer, an analyst at Ovum.

Baer reported that DataSift had conducted a retrospective analysis of big data mentions in social media.

The term "Big Data" may be recognizable today, but there is still quite a bit of confusion surrounding it as to what it actually means. In truth, the concept is constantly evolving and being redefined as it remains the driving force behind many ongoing waves of digital transformation, including artificial intelligence, data science, and the Internet of Things. But what is Big-Data technology and how is it changing our world? Let's try to understand the essence of Big Data technology and what it means in simple words.

The Amazing Growth of Big Data

It all started with an explosion in the amount of data we have created since the dawn of the digital age. This is largely due to the development of computers, the Internet and technologies that can “snatch” data from the world around us. Data in itself is not a new invention. Even before the age of computers and databases, we used paper transaction records, customer records, and archival files that constitute data. Computers, especially spreadsheets and databases, have made it easy for us to store and organize data on a large scale. Suddenly information was available with just one click.

However, we have come a long way from those first tables and databases. Today, every two days we create as much data as was created from the beginning of time up to the year 2000. That's right, every two days. And the amount of data we create continues to grow exponentially; by 2020, the volume of available digital information is expected to grow from approximately 5 zettabytes to 20 zettabytes.

Nowadays, almost every action we take leaves its mark. We generate data every time we go online, when we carry our smartphones equipped with a search engine, when we talk to our friends through social networks or chats, etc. In addition, the amount of machine-generated data is also growing rapidly. Data is generated and shared when our smart home devices communicate with each other or with their home servers. Industrial equipment in plants and factories is increasingly equipped with sensors that accumulate and transmit data.

The term "Big-Data" refers to the collection of all this data and our ability to use it to our advantage in a wide range of areas, including business.

How does Big-Data technology work?

Big Data works on the principle that the more you know about a subject or phenomenon, the more reliably you can reach new understanding and predict what will happen in the future. Comparing more data points reveals relationships that were previously hidden, and these relationships let us learn and make better decisions. Most often this is done through a process of building models from the data we can collect, then running simulations that tweak the values of the data points each time and track how they affect the results. This process is automated: modern analytics technology will run millions of these simulations, tweaking every possible variable, until it finds a model, or idea, that helps solve the problem at hand.
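
A toy version of that loop, assuming a hypothetical pricing model fitted to collected data: the code sweeps two variables (price and advertising budget) and keeps the combination the simulation scores best.

    import itertools

    def simulate_profit(price, ad_budget):
        # Stand-in for a model fitted to collected data: demand falls with
        # price and rises, with diminishing returns, with advertising.
        demand = max(0.0, 1000 - 8 * price) * (1 + 0.002 * ad_budget) ** 0.5
        return demand * (price - 40) - ad_budget  # 40 = assumed unit cost

    best = max(
        itertools.product(range(45, 120), range(0, 20_000, 500)),
        key=lambda args: simulate_profit(*args),
    )
    print("best (price, ad budget) found:", best)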

(Photo: Bill Gates towers over a paper printout of the contents of a single CD.)

Until recently, data was limited to spreadsheets or databases - and everything was very organized and neat. Anything that couldn't be easily organized into rows and columns was considered too complex to work with and was ignored. However, advances in storage and analytics mean that we can capture, store and process large amounts of different types of data. As a result, “data” today can mean anything from databases to photographs, videos, sound recordings, written texts and sensor data.

To make sense of all this messy data, Big Data projects often use cutting-edge analytics involving artificial intelligence and machine learning. By teaching computers to determine what a particular piece of data represents, through pattern recognition or natural language processing, for example, we can teach them to identify patterns much faster and more reliably than we can ourselves.

How is Big Data used?

This ever-increasing flow of sensor data, text, voice, photo and video data means that we can now use data in ways that would have been unimaginable just a few years ago. This is bringing revolutionary changes to the business world in almost every industry. Today, companies can predict with incredible accuracy which specific categories of customers will want to make a purchase and when. Big Data also helps companies carry out their activities much more efficiently.

Even outside of business, projects related to Big Data are already helping to change our world in various ways:

  • Improving healthcare. Data-driven medicine can analyze vast amounts of medical information and images for patterns that help detect disease at an early stage and develop new drugs.
  • Predicting and responding to natural and man-made disasters. Sensor data can be analyzed to predict where earthquakes are likely to occur, and human behavior patterns provide clues that help organizations provide assistance to survivors. Big Data technology is also used to track and protect the flow of refugees from war zones around the world.
  • Preventing crime. Police forces are increasingly using data-driven strategies that incorporate their own intelligence information and publicly available information to use resources more effectively and take deterrent action where necessary.

The best books about Big-Data technology

  • Everybody lies. Search engines, Big Data and the Internet know everything about you.
  • BIG DATA. All technology in one book.
  • Happiness industry. How Big Data and new technologies help add emotion to products and services.
  • Revolution in analytics. How to improve your business in the era of Big Data using operational analytics.

Problems with Big Data

Big Data gives us unprecedented ideas and opportunities, but also raises problems and questions that need to be addressed:

  • Data privacy. The Big Data we generate today contains a great deal of information about our personal lives, which we have every right to keep private. More and more often, we are asked to strike a balance between the amount of personal data we disclose and the convenience offered by Big Data-based apps and services.
  • Data Security - Even if we decide we are happy with someone having our data for a specific purpose, can we trust them to keep our data safe and secure?
  • Data discrimination - once all the information is known, will it be acceptable to discriminate against people based on data from their personal lives? We already use credit scores to decide who can borrow money, and insurance is also heavily data-driven. We should expect to be analyzed and assessed in more detail, but care must be taken to ensure that this does not make life more difficult for those with fewer resources and limited access to information.

Addressing these challenges is an important part of working with Big Data, and organizations that want to use such data must confront them. Failure to do so can leave a business vulnerable, not only in terms of its reputation but also legally and financially.

Looking to the future

Data is changing our world and our lives at an unprecedented pace. If Big Data is capable of all this today, just imagine what it will be capable of tomorrow. The amount of data available to us will only increase, and analytics technology will become even more advanced.

For businesses, the ability to apply Big Data will become increasingly critical in the coming years. Only those companies that view data as a strategic asset will survive and thrive. Those who ignore this revolution risk being left behind.



In the Russian-speaking environment, both the English term Big Data and the phrase "big data" are used; the latter is a calque of the English term. Big data has no strict definition. It is impossible to draw a clear line: is it 10 terabytes or 10 megabytes? The name itself is highly subjective. The word "big" is like the "one, two, many" of primitive tribes.

However, the established view is that big data is a set of technologies designed to perform three operations. First, to process volumes of data larger than in "standard" scenarios. Second, to work with rapidly arriving data in very large volumes: there is not just a lot of data, but ever more of it. Third, to work with structured and poorly structured data in parallel and from different angles. Big data assumes that algorithms receive a stream of information that is not always structured and that more than one idea can be extracted from it. A sketch of the second point, handling a constant stream, follows below.
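
In the sketch, a generator stands in for an unbounded stream of events (logs, sensors, user actions), and the consumer keeps only running aggregates in memory rather than ever materializing the whole data set. The event fields are invented.

    import random

    def event_stream(n):  # stand-in for an endless feed of logs or sensor data
        for _ in range(n):
            yield {"user": random.randrange(5), "bytes": random.randrange(1, 500)}

    count, total = 0, 0
    per_user = {}
    for event in event_stream(100_000):  # never held in memory as one big list
        count += 1
        total += event["bytes"]
        per_user[event["user"]] = per_user.get(event["user"], 0) + event["bytes"]

    print("events:", count, "average bytes:", total / count)
    print("traffic per user:", per_user)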

A typical example of big data is information coming from physical experimental facilities, such as the Large Hadron Collider, which produces a huge amount of data and does so constantly. The installation continuously produces large volumes of data, and scientists use them to solve many problems in parallel.

Big data appeared in the public space because it came to affect almost everyone, not just the scientific community, where such problems had long been solved. Big Data technology entered the public sphere when the conversation turned to a very specific number: the planet's population. Seven billion people are gathering on social networks and other projects that aggregate people: YouTube, Facebook, VKontakte, where the number of users is measured in billions and the number of operations they perform simultaneously is enormous. The data flow in this case consists of user actions. For example, data from the same YouTube hosting flows through the network in both directions. Processing means not only interpretation but also the ability to handle each of these actions correctly, that is, to put it in the right place and make the data quickly available to every user, since social networks do not tolerate waiting.

Much of what concerns big data, and the approaches used to analyze it, has actually existed for quite some time. For example, processing images from surveillance cameras, where we are talking not about a single picture but about a stream of data. Or robot navigation. All of this has existed for decades, but now data processing tasks affect a far larger number of people and ideas.

Many developers are accustomed to working with static objects and thinking in terms of states. In big data the paradigm is different. You have to be able to work with a constant flow of data, and this is an interesting task. It affects more and more areas.

In our lives, more and more hardware and software are beginning to generate large amounts of data - for example, the Internet of Things.

Things are already generating huge flows of information. The Potok police system sends information from all cameras and allows you to find cars using this data. Fitness bracelets, GPS trackers and other things that serve the needs of individuals and businesses are becoming increasingly fashionable.

The Moscow Department of Informatization is recruiting a large number of data analysts, because a great deal of multi-criteria statistics about people accumulates (that is, statistics covering a very large number of criteria for each person and each group of people). Patterns and trends must be found in this data. Such tasks require mathematicians with an IT education, because ultimately the data is stored in structured DBMSs, and one must be able to access it and obtain information.

Previously, we did not treat big data as a problem for the simple reason that there was nowhere to store it and no networks to transmit it. When these capabilities appeared, the data immediately filled all the volume provided. But no matter how much bandwidth and storage capacity expand, there will always be sources, such as physics experiments or simulations of the airflow around a wing, that produce more information than we can transmit. According to Moore's law, the performance of modern parallel computing systems is steadily increasing, and the speeds of data transmission networks are also growing. However, data must also be stored and retrieved quickly from storage media (hard drives and other types of memory), and this is yet another challenge in big data processing.

Big data - what is it in simple words

In 2010, the first attempts to solve the growing problem of big data began to appear. Software products were released aimed at minimizing the risks of working with huge amounts of information.

By 2011, such large companies as Microsoft, Oracle, EMC and IBM became interested in big data - they became the first to use Big data developments in their development strategies, and quite successfully.

Universities began studying big data as a separate subject already in 2013 - now not only data science, but also engineering, coupled with computing subjects, deals with problems in this area.

The main methods of data analysis and processing include the following:

  1. Methods of the Data Mining class (deep data analysis).

These methods are quite numerous, but they have one thing in common: the mathematical tools used in conjunction with achievements from the field of information technology.

  2. Crowdsourcing.

This technique allows you to obtain data simultaneously from several sources, and the number of the latter is practically unlimited.

  3. A/B testing.

From the entire volume of data, a control set of elements is selected, which is alternately compared with other similar sets where one of the elements was changed. Conducting such tests helps determine which parameter fluctuations have the greatest impact on the control population. Thanks to the volume of Big Data, it is possible to carry out a huge number of iterations, with each of them getting closer to the most reliable result.

  4. Predictive analytics.

Specialists in this field try to predict and plan in advance how the controlled object will behave in order to make the most profitable decision in this situation.

  5. Machine learning (artificial intelligence).

It is based on empirical analysis of information and the subsequent construction of self-learning algorithms for systems.

  6. Network analysis.

This is the most common method for studying social networks: after statistical data is obtained, the nodes of the network are analyzed, that is, the interactions between individual users and their communities.

Prospects and trends for the development of Big data

In 2017, when big data ceased to be something new and unknown, its importance not only did not decrease, but increased even more. Experts are now betting that big data analytics will become available not only to giant organizations, but also to small and medium-sized businesses. This approach is planned to be implemented using the following components:

  • Cloud storage.

Data storage and processing are becoming faster and more economical - compared to the costs of maintaining your own data center and possible expansion of staff, renting a cloud seems to be a much cheaper alternative.

  • Using Dark Data.

So-called "dark data" is all the undigitized information about a company that plays no key role in day-to-day use but can serve as a reason to switch to a new format for storing information.

  • Artificial Intelligence and Deep Learning.

Deep learning, a machine learning technique that imitates the structure and operation of the human brain, is ideally suited to processing large amounts of constantly changing information. In this case, the machine does everything a person would do, with a significantly lower likelihood of error.

  • Blockchain

This technology makes it possible to speed up and simplify numerous online transactions, including international ones. Another advantage of Blockchain is that it reduces transaction costs.

  • Self-service and reduced prices.

In 2017, it is planned to introduce “self-service platforms” - these are free platforms where representatives of small and medium-sized businesses can independently evaluate the data they store and systematize it.

VISA has likewise used Big Data to track fraudulent attempts to carry out transactions. Thanks to this, the company saves more than $2 billion annually from leakage.

The German Labor Ministry managed to cut costs by 10 billion euros by introducing a big data system into its work on issuing unemployment benefits. At the same time, it was revealed that a fifth of citizens receive these benefits without reason.

Big Data has not spared the gaming industry either. Thus, the World of Tanks developers conducted a study of information about all players and compared the available indicators of their activity. This helped predict the possible future outflow of players - based on the assumptions made, representatives of the organization were able to interact more effectively with users.

Notable organizations using big data also include HSBC, Nasdaq, Coca-Cola, Starbucks and AT&T.

Big Data problems

The biggest problem with big data is the cost of processing it. This can include both expensive equipment and wage costs for qualified specialists capable of servicing huge amounts of information. Obviously, the equipment will have to be updated regularly so that it does not lose minimum functionality as the volume of data increases.

The second problem again relates to the large amount of information to be processed. If a study produces not two or three but a great many results, it is very difficult to remain objective and select from the general flow of data only those results that will have a real impact on the state of a phenomenon.

The privacy problem. With most customer services moving their data use online, it is very easy to become the next target of cybercriminals. Even simply storing personal information, without making any online transactions, can carry undesirable consequences for cloud storage clients.

The problem of information loss. Precautionary measures require not limiting yourself to a simple one-time data backup, but making at least 2-3 backup copies of the storage. However, as the volume increases, the difficulties with redundancy increase - and IT specialists are trying to find the optimal solution to this problem.

Big data technology market in Russia and the world

As of 2014, 40% of the big data market volume is made up of services. Revenue from the use of Big Data in computer equipment is slightly inferior (38%) to this indicator. The remaining 22% comes from software.

According to statistics, the most useful products in the global segment for solving Big Data problems are in-memory and NoSQL analytical platforms. Log-file analytical software and columnar platforms occupy 15 and 12 percent of the market, respectively. Hadoop/MapReduce, however, copes with big data problems in practice not very effectively.

Results of implementing big data technologies:

  • increasing the quality of customer service;
  • optimization of supply chain integration;
  • optimization of organization planning;
  • acceleration of interaction with clients;
  • increasing the efficiency of processing customer requests;
  • reduction in service costs;
  • optimization of processing client requests.

Best books on Big Data

"The Human Face of Big Data" by Rick Smolan and Jennifer Erwitt

Suitable for initial study of big data processing technologies - it introduces you easily and clearly. Makes it clear how the abundance of information has influenced everyday life and all its spheres: science, business, medicine, etc. Contains numerous illustrations, so it is perceived without much effort.

"Introduction to Data Mining" by Pang-Ning Tan, Michael Steinbach and Vipin Kumar

Also useful for beginners is a book on Big Data, which explains working with big data according to the principle “from simple to complex.” Covers many important points at the initial stage: preparation for processing, visualization, OLAP, as well as some methods of data analysis and classification.

"Python Machine Learning" by Sebastian Raschka

A practical guide to using and working with big data using the Python programming language. Suitable for both engineering students and professionals who want to deepen their knowledge.

"Hadoop for Dummies", Dirk Derus, Paul S. Zikopoulos, Roman B. Melnik

Hadoop is a project created specifically for working with distributed programs that organize the execution of actions on thousands of nodes simultaneously. Getting to know it will help you understand in more detail the practical application of big data.

What is Big Data (literally, "big data")? Let's look first at the Oxford Dictionary:

Data: quantities, characters, or symbols on which a computer performs operations and which can be stored and transmitted in the form of electrical signals or recorded on magnetic, optical, or mechanical media.

The term Big Data is used to describe a large data set that grows exponentially over time. Processing such an amount of data is impossible without machine learning.

The benefits that Big Data provides:

  1. Collecting data from various sources.
  2. Improving business processes through real-time analytics.
  3. Storing huge amounts of data.
  4. Insights. Big Data reveals information hidden in structured and semi-structured data.
  5. Big data helps you reduce risk and make smart decisions with the right risk analytics.

Big Data Examples

The New York Stock Exchange generates about 1 terabyte of trading data per day for the past session.

Social media: statistics show that about 500 terabytes of new data are uploaded to Facebook every day, generated mainly by photo and video uploads to the social network's servers, messaging, comments under posts, and so on.

A jet engine generates about 10 terabytes of data every 30 minutes of flight. Since thousands of flights take place every day, the volume of data reaches petabytes.
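
A back-of-the-envelope check of that last claim, with the flight count and average duration as explicit assumptions:

    TB_PER_HALF_HOUR = 10
    FLIGHT_HOURS = 2         # assumed average flight duration
    FLIGHTS_PER_DAY = 1_000  # assumed flight count, for illustration only

    tb_per_flight = TB_PER_HALF_HOUR * 2 * FLIGHT_HOURS
    tb_per_day = tb_per_flight * FLIGHTS_PER_DAY
    print(f"{tb_per_flight} TB per flight, {tb_per_day / 1000:.0f} PB per day")
    # -> 40 TB per flight, 40 PB per day: petabytes, as stated above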

Big Data classification

Big data forms:

  • Structured
  • Unstructured
  • Semi-structured

Structured form

Data that can be stored, accessed and processed in a form with a fixed format is called structured. Over time, computer science has made great strides in improving techniques for working with this type of data (where the format is known in advance) and learned how to benefit from it. However, today there are already problems associated with the growth of volumes to sizes measured in the range of several zettabytes.

1 zettabyte equals a billion terabytes

Looking at these numbers, it is easy to see the veracity of the term Big Data and the difficulties associated with processing and storing such data.

Data stored in a relational database is structured; it looks like, for example, a table of company employees.

Unstructured form

Data of unknown structure is classified as unstructured. In addition to its large size, this shape is characterized by a number of difficulties in processing and extracting useful information. A typical example of unstructured data is a heterogeneous source containing a combination of simple text files, images and videos. Today, organizations have access to large amounts of raw or unstructured data, but do not know how to extract value from it.

Semi-structured form

This category combines features of the two forms described above: semi-structured data has some organization, but it is not actually defined by tables in relational databases. An example of this category is personal data presented in an XML file:

    <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
    <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
    <rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
    <rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
    <rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
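
One way such records might be pulled into tabular form with Python's standard library; the snippet wraps two of the records above in a root element so they parse as a document.

    import xml.etree.ElementTree as ET

    xml_data = """<people>
    <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
    <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
    </people>"""

    for rec in ET.fromstring(xml_data).findall("rec"):
        print(rec.findtext("name"), rec.findtext("sex"), int(rec.findtext("age")))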

Characteristics of Big Data

Big Data growth over time: in the typical visualization, structured enterprise data stored in relational databases is the smaller, slowly growing share, while unstructured data from various sources (IP telephony, devices and sensors, social networks and web applications) accounts for most of the growth.

According to Gartner, big data varies in volume, rate of generation, variety, and variability. Let's take a closer look at these characteristics.

  1. Volume. The term Big Data itself is associated with large size. Data size is the most important metric in determining the potential value to be extracted. Every day, 6 million people use digital media, generating an estimated 2.5 quintillion bytes of data. Therefore, volume is the first characteristic to consider.
  2. Variety is the next aspect. It refers to heterogeneous sources and the nature of data, which can be either structured or unstructured. Previously, spreadsheets and databases were the only sources of information considered in most applications. Today, data in the form of emails, photos, videos, PDF files, and audio are also considered in analytical applications. This variety of unstructured data creates problems in storage, mining, and analysis: 27% of companies are not confident that they are working with the right data.
  3. Generation speed. How quickly data is accumulated and processed to meet demands determines its potential. Velocity reflects the speed of the information flow from its sources: business processes, application logs, social networking and media sites, sensors, mobile devices. The flow of data is huge and continuous over time.
  4. Variability describes how data can change at certain points in time, which complicates processing and management. For example, most data is unstructured in nature.

Big Data analytics: what are the benefits of big data

Promotion of goods and services: Access to data from search engines and sites like Facebook and Twitter allows businesses to more accurately develop marketing strategies.

Improving service for customers: Traditional customer feedback systems are being replaced by new ones that use Big Data and Natural Language Processing to read and evaluate customer feedback.

Risk calculation associated with the release of a new product or service.

Operational efficiency: big data is structured so that the necessary information can be extracted quickly and accurate results produced promptly. This combination of Big Data and storage technologies helps organizations optimize their work with rarely used information.






