
STREAMING DATA

From the document Big Data, Mining, and Analytics (pages 184-187)

Billie Anderson and J. Michael Hardin


Situations in which data are streaming, arriving in real time from their sources, are becoming increasingly prevalent. The demand to analyze the data as they are received and to make decisions quickly is growing as well.

This demand comes not only from the search for a competitive advantage (as in the Amazon example), but also from a desire to improve lives. For example, healthcare equipment is designed to continuously create data.

The ability to analyze healthcare data in real time provides benefits unique to the medical field. In one project, the University of Ontario Institute of Technology is helping Toronto hospitals be “smarter” by designing a system that allows doctors and other healthcare providers to analyze data in real time. The real-time system allows the healthcare providers to receive more than 1000 pieces of unique medical diagnostic information per second. This real-time analysis platform is being used as an early warning system, allowing doctors to detect life-threatening infections up to 24 hours sooner than in the past.2

The age of streaming data initiated dramatic changes in how analysts collect and analyze data. Previously, in a traditional data analysis situation, the data would be collected and stored in an enterprise data warehouse (EDW).

EDWs played a critical role in academic research and business development throughout the 1990s. A data warehouse serves as a large repository of historical and current transactional data of an organization. An EDW is a centralized data warehouse that is accessible to the entire organization. When it was time to perform an analysis, the historical data would be pulled from the EDW; then several statistical analyses were performed, and a report was generated and given to the decision maker in the organization.

The problem with a traditional EDW is that given the volume and velocity of data available to organizations today, traditional approaches to decision making are no longer appropriate. By the time the decision maker has the information needed to take action, updated data and information are available, making the initial decision outdated. For example, social media capture fast-breaking trends on customer sentiments about products and brands. Although companies might be interested in whether a rapid change in online sentiment correlates with changes in sales, by the time a traditional analysis was completed, a host of new data would be available. Therefore, in big data environments, it is important to analyze, decide, and act quickly and often.3

In a streaming data application, data are generated in real time, taking the form of an unbounded sequence of text, video, or values. Since traditional database structures are not applicable in a streaming data environment, new database technology is being developed to store streaming data types. One such system is a data stream management system (DSMS).4 A DSMS is a computer system, combining hardware and software, that maintains and queries streaming data.

The main difference between a DSMS and an EDW data architecture is the data stream. A DSMS differs from a traditional EDW in that new data are generated continually and the arrival rates may differ dramatically, ranging from millions of items per second (e.g., clickstreams of a consumer on a retail website) to several items per hour (e.g., weather station monitoring such as temperature and humidity readings). A traditional EDW queries static data to generate statistics and reports, but a DSMS constantly queries streaming data in an effort to give the end user the most up-to-date real-time information possible.
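The contrast between a one-time query over static data and a query that is re-evaluated on every arrival can be sketched in a few lines of Python. This is a minimal illustration, not the interface of any particular DSMS; the function names and the temperature readings are hypothetical.

```python
from typing import Iterable, Iterator, Sequence


def static_mean(table: Sequence[float]) -> float:
    """EDW-style query: runs once over data that are already stored."""
    return sum(table) / len(table)


def continuous_mean(stream: Iterable[float]) -> Iterator[float]:
    """DSMS-style query: emits an updated answer after every arrival."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count  # the current answer, available immediately


# Hypothetical temperature readings, used only for illustration.
readings = [20.0, 22.0, 21.0, 25.0]

print(static_mean(readings))            # one answer, after all data are stored
print(list(continuous_mean(readings)))  # one answer per arriving item
```

The continuous version keeps only a count and a running total, so it never needs the full history to answer the query.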

Another key difference between a DSMS and an EDW is the way in which data are stored. In a DSMS the data are held in memory only for as long as they are needed for processing. An EDW uses disk space to store the data.
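One common way to keep data in memory only while they are needed is a fixed-size sliding window. The sketch below assumes a window of the most recent items is sufficient for the query at hand; the `SlidingWindow` class and its size are hypothetical and are not taken from the text.

```python
from collections import deque


class SlidingWindow:
    """Holds only the most recent `size` items in memory."""

    def __init__(self, size: int) -> None:
        # A deque with maxlen discards the oldest item automatically
        # when a new one is appended to a full window.
        self._items = deque(maxlen=size)

    def add(self, value: float) -> None:
        self._items.append(value)

    def maximum(self) -> float:
        """Query evaluated over the items currently held in memory."""
        return max(self._items)

    def __len__(self) -> int:
        return len(self._items)


window = SlidingWindow(size=3)
for value in [5.0, 9.0, 2.0, 4.0, 1.0]:
    window.add(value)

print(len(window), window.maximum())
```

Memory use is bounded by the window size regardless of how long the stream runs, which is the property that distinguishes this approach from writing every item to disk.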

The EDW data architecture is more expensive since hardware has to be purchased to store the massive amounts of data that are being generated.

Figures 8.1 and 8.2 show a schematic difference between the EDW and DSMS data architectures.

Developing and implementing streaming data algorithms for any environment requires scalable storage within a flexible computing environment; only then can an effective streaming algorithm be built for that environment. The algorithm will go through many iterations, but should culminate in an efficient method for the end user to analyze the data stream effectively and make a decision in a timely manner. This sequential development will require a statistician, a domain expert, and computer engineers who will develop the system. One methodology of developing a real-time analysis system is profiled in the following paragraphs.

FIGURE 8.1

Enterprise data warehouse architecture. Sources of business data (customer transaction data, internal data such as financial reports, customer service data) -> static storage system (data warehouse) -> query reports from the end user.

FIGURE 8.2

Data stream management system architecture. Sources of business data streaming from multiple sources (unstructured data such as text, video, pictures, social media information, web data, financial logs, sensor network data) -> data arrive in real time and are stored in the computer’s memory -> queries are constantly being generated for the end user.

A group of researchers from the Los Alamos National Laboratory recently detailed their efforts to develop a streaming data algorithm in which radio astronomy is the motivating example.5 They sought the mitigation of noise, or radio frequency interference (RFI), to enable the real-time identification of radio transients or transient astronomical events that emit electromagnetic energy in the radio frequency domain. RFI, that is, man-made radio signals, can hinder the identification of astronomical signals of interest. Real-time identification of radio transients is crucial because it would permit the telescope observing an interesting radio event to redirect other assets to collect additional data pertaining to that event. Further, systems may collect so much data that it is prohibitively expensive or impossible to save all of it for later analysis. In this case, the data of interest must be identified quickly so that they may be saved for future study.
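The idea of deciding, as each item arrives, whether it is worth saving can be illustrated with a simple online filter. The sketch below is not the Los Alamos method: it swaps in a generic running mean and standard deviation (Welford's update) and keeps only values that deviate strongly from the baseline. The function name, the threshold, the warm-up length, and the sample values are all assumptions made for illustration.

```python
import math
from typing import Iterable, List


def select_interesting(
    stream: Iterable[float], threshold: float = 3.0, warmup: int = 5
) -> List[float]:
    """Keep only values far from the running baseline; discard the rest."""
    kept: List[float] = []
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for x in stream:
        if count >= warmup:
            std = math.sqrt(m2 / (count - 1))
            if std > 0 and abs(x - mean) > threshold * std:
                kept.append(x)  # save the item of interest
                continue        # do not let it distort the baseline
        # Welford's online update of the mean and squared deviations.
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return kept


# Hypothetical signal: a steady baseline with one large excursion.
signal = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 9.0, 1.0, 1.1]
print(select_interesting(signal))
```

Each item is examined once and only the flagged items are retained, so storage grows with the number of interesting events rather than with the length of the stream.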

The authors of the Los Alamos study note that when developing a real-time framework for analysis, the computational details are only one part of the overall system. Network bandwidth and storage are key components of the real-time structure for analysis. Network bandwidth is the rate at which data can be received over a network connection or interface. In the authors’ study, network bandwidth was exposed as a limiting factor. Scalable storage systems are needed for sustaining high data streaming rates.
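A back-of-the-envelope check shows how bandwidth can become the limiting factor. The figures below are hypothetical and are not taken from the Los Alamos study; the sketch only multiplies an assumed arrival rate by an assumed item size and compares the result with an assumed link capacity.

```python
def required_bandwidth_mbps(items_per_second: float, bytes_per_item: float) -> float:
    """Bandwidth needed to receive the stream, in megabits per second."""
    bits_per_second = items_per_second * bytes_per_item * 8
    return bits_per_second / 1_000_000


def can_sustain(items_per_second: float, bytes_per_item: float, link_mbps: float) -> bool:
    """True if the link can carry the stream without falling behind."""
    return required_bandwidth_mbps(items_per_second, bytes_per_item) <= link_mbps


# Hypothetical stream: one million items per second, 200 bytes each,
# arriving over an assumed 1000 Mbit/s link.
needed = required_bandwidth_mbps(1_000_000, 200)
print(needed, can_sustain(1_000_000, 200, link_mbps=1000))
```

Under these assumed numbers the stream needs more capacity than the link provides, so data would back up no matter how fast the analysis itself runs.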
