
STREAMING DATA CASE STUDIES


Billie Anderson and J. Michael Hardin

Researchers from various fields have developed intriguing real-world applications using streaming data. This section outlines those applications, which come from the medical, marketing, and financial fields.

Healthcare Streaming Data Case Study

Influenza is a respiratory disease known to be a global health concern due to the disease’s ability to cause serious illness or death.6,7 Much work has been done in developing and implementing surveillance systems for influenza. A multitude of ways exist to develop an influenza surveillance tracking method. A traditional influenza surveillance system would begin with a patient admitted to a hospital; the patient would be monitored while in the hospital.8 A more contemporary approach links Google researchers to healthcare officials from the Centers for Disease Control, using Internet searches to monitor what part of the population has influenza.9 The purpose of these surveillance systems is to alert healthcare workers to a potential influenza outbreak so that medical practitioners can intervene before the outbreak reaches a pandemic level.

One group of researchers has recently developed a digital dashboard allowing it to monitor influenza outbreaks using data from multiple sources in Hong Kong.10 A dashboard is a visual user interface that summarizes pertinent information in a way that is easy to read. The use of dashboards was initially part of business applications that help managers and high-level executives make decisions. With a healthcare focus, the researchers in Hong Kong used a dashboard system that would update with real-time data sources such as

• Weekly consultation rate of influenza-like illness reported by general outpatient clinics in Hong Kong

• Weekly consultation rate of influenza-like illness reported by general practitioners

• Weekly influenza virus isolation rate

• Weekly overall school absenteeism rate

• Weekly hospital admission rate of children aged 4 and under with principal diagnosis of influenza

These metrics were tracked in real time, and if the researchers noted a spike in any of the dashboard graphics, they could alert local healthcare officials.
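
To make the alerting idea concrete, the following sketch shows one simple way a spike in a streaming weekly metric could be flagged: compare each new value against the mean and standard deviation of the most recent weeks. It is only an illustration; the eight-week history, the two-standard-deviation threshold, and the sample rates are assumed values, not details of the Hong Kong dashboard itself.

from collections import deque
import statistics

def make_spike_detector(history_weeks=8, threshold_sd=2.0):
    # Flag a spike when a new weekly value exceeds the recent mean
    # by threshold_sd standard deviations.
    window = deque(maxlen=history_weeks)

    def check(new_value):
        spike = False
        if len(window) >= 4:  # wait for a few weeks of history first
            mean = statistics.mean(window)
            sd = statistics.pstdev(window) or 1e-9
            spike = new_value > mean + threshold_sd * sd
        window.append(new_value)
        return spike

    return check

# Hypothetical weekly influenza-like-illness consultation rates (per 1,000 visits).
check_ili_rate = make_spike_detector()
for week, rate in enumerate([3.1, 2.9, 3.4, 3.0, 3.2, 3.3, 7.8], start=1):
    if check_ili_rate(rate):
        print(f"Week {week}: possible outbreak signal; alert health officials")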

Marketing Streaming Data Case Study

With the growth of the Internet, online advertising has come to play an important role in marketing strategies. Currently, one of the most widely used revenue models for online advertising involves charging for each click based on the popularity of keywords and the number of competing advertisers.

In this pay-per-click (PPC) model, the advertiser pays only if its advertisement is clicked. The PPC model leaves room for individuals or rival companies to generate false clicks (that is, click fraud), which poses serious problems to the development of a reliable online advertising market.

PPC is a relatively new advertising model; it emerged onto the Internet in 1998. Originally introduced by Goto.com, the idea was that sponsored listings for a search term would be auctioned off to the highest bidder.

Beginning in 2000, Goto.com began selling its services to many of the largest search engines of the time. Goto.com changed its name to Overture in 2001 and controlled nearly the entire PPC search market. Yahoo purchased Overture in 2003. There was one exception to Overture’s dominance: Google also had a PPC model that it developed internally.11

In 2002, Google altered its model and began offering its paid search results on other search engines (Overture had made a similar move the year before). Google’s new model was known as AdWords. AdWords was an adaptation of Overture’s model in which advertisers bid on how much they would pay per click. This model allows advertisers to buy their way to the top of a search posting; that is, the highest bid gets the most exposure.

Google strategists realized there was a problem with this approach: if an advertiser bid its way to the top of the ranking with an irrelevant ad and no one clicked on it, then no one made any money from the advertising. Google introduced click-through rate, as a measure of the ad’s relevance, into the ranking algorithm. If an ad with a lower bid per click got clicked more often, it would rank higher. The rest is history; the click-through rate approach is what has turned Google into the billion-dollar company it is today.12
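
As a toy illustration of why click-through rate matters in the ranking, consider scoring each ad by its bid multiplied by its click-through rate. This is a deliberate simplification, not Google’s actual auction mechanics, and the figures below are invented.

# Illustrative only: rank ads by bid * click-through rate (CTR).
ads = [
    {"name": "Ad A", "bid": 2.00, "ctr": 0.01},  # high bid, rarely clicked
    {"name": "Ad B", "bid": 1.20, "ctr": 0.04},  # lower bid, clicked more often
]
ranked = sorted(ads, key=lambda ad: ad["bid"] * ad["ctr"], reverse=True)
print([ad["name"] for ad in ranked])  # ['Ad B', 'Ad A']: the cheaper but more relevant ad wins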

With such intense competition, one can see why detecting click fraud is a major concern. Researchers in the Department of Electrical and Computer Engineering at Iowa State University have proposed a method of detecting click fraud in the PPC model using streaming data.13 Possible sources of click fraud come from

• Search engines or online ad publishers

• Ad subdistributors

• Competitors

• Web page crawlers

When assessing which clicks are valid or invalid in a clickstream, duplicate clicks are a major consideration. The authors from Iowa consider two scenarios for the duplicate click issue.

Scenario 1: A normal client visited an advertiser’s website by clicking the ad link of a publisher. One week later, the client visited the same website again by clicking on the same ad link.

Scenario 2: The competitors or even the publishers control thousands of computers, each of which initiates many clicks to the ad links every day.

Obviously the first scenario is not fraud, but the second scenario is click fraud. The authors used a time-based jumping window as the basis of their analysis. In this scenario, a time dimension has to be specified; for example, one should examine clicks over a unit of time (second, minute, hour, day, etc.). The window of time that is specified is then broken down into subwindows. The information and statistics gathered on the clickstream over the entire time window are obtained by combining the statistics from the smaller subwindows. The statistics that are calculated to detect whether a click is fraudulent are based on a probabilistic data structure known as the Bloom filter. In summary, the Bloom filter detects whether a click is a duplicate over a jumping window. If the click is determined to be a duplicate, it is marked fraudulent. When new clicks are added to the stream (or another subwindow), the oldest clickstream data is deleted. Clickstream data is constantly being fed into the algorithm to determine whether clicks are fraudulent or not.

The main advantage of the authors’ approach is that high-speed clickstreams can be processed in a way that uses less computer memory, thus speeding up the determination of whether a click is fraudulent.
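
The sketch below illustrates the general idea of duplicate detection with Bloom filters over a jumping window; it is not the Iowa State authors’ exact algorithm. One small Bloom filter is kept per subwindow, a click is flagged as a duplicate if any subwindow’s filter may already contain it, and the oldest filter is discarded each time the window jumps. The filter size, the number of subwindows, and the click identifier format are assumptions made for illustration.

import hashlib
from collections import deque

class BloomFilter:
    # Minimal Bloom filter: k hash positions over an m-bit array.
    def __init__(self, m=10_000, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

class JumpingWindowDuplicateDetector:
    # One Bloom filter per subwindow; the oldest is discarded as the
    # window jumps forward, so memory use stays bounded.
    def __init__(self, subwindows=6):
        self.filters = deque((BloomFilter() for _ in range(subwindows)),
                             maxlen=subwindows)

    def jump(self):
        # The window advances: drop the oldest subwindow, start a fresh one.
        self.filters.append(BloomFilter())

    def is_duplicate(self, click_id):
        duplicate = any(f.might_contain(click_id) for f in self.filters)
        self.filters[-1].add(click_id)  # record the click in the current subwindow
        return duplicate

# Hypothetical click identifier combining IP address, ad ID, and cookie.
detector = JumpingWindowDuplicateDetector()
print(detector.is_duplicate("203.0.113.7|ad42|cookie9"))  # False: first occurrence
print(detector.is_duplicate("203.0.113.7|ad42|cookie9"))  # True: flagged as duplicate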

Credit Card Fraud Detection Case Study

Credit cards are responsible for $25 trillion in transactions a year at more than 24 million locations in 200 countries. It is estimated that there are 10,000 payment transactions every second around the world.15 Credit card fraud costs the United States $8.6 billion annually.16 Since there is a global economic impact for credit card fraud, researchers around the world in the data mining community are actively developing methods to thwart this problem. One statistical method that is being used to detect credit card fraud is clustering. Sorin (2012) provides a comprehensive overview of the contemporary statistical clustering techniques that are being used to detect financial fraud.17 One of the techniques he describes in his publication is a clustering method that operates on streaming data to detect credit card fraud.17

Clustering is an unsupervised data mining technique. An unsupervised technique is one in which there is no outcome or target variable that is modeled using predictor variables. Clustering isolates transactions into homogeneous groups. The clustering algorithm can work on multidimensional data and aims to minimize the distance within clusters while maximizing the distance between clusters. With an arbitrary starting point of k centers, the algorithm proceeds iteratively by assigning each point to its nearest center based on Euclidean distance, then recomputing the central points of each cluster. This continues until convergence, where every cluster assignment has stabilized.
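
The procedure just described is essentially k-means. A minimal sketch follows; the transaction features (amount and hour of day) and the choice of k = 2 are hypothetical and would require proper scaling and selection in practice.

import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    # Plain k-means: assign each point to its nearest center (Euclidean
    # distance), recompute the centers, repeat until assignments stabilize.
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # arbitrary starting centers
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:           # convergence: centers unchanged
            break
        centers = new_centers
    return centers, clusters

# Hypothetical transaction features: (amount, hour of day), already scaled.
transactions = [(12.5, 9), (14.0, 10), (11.8, 9), (950.0, 3), (870.0, 2)]
centers, clusters = kmeans(transactions, k=2)
print(centers)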

Often the purpose of clustering is descriptive. For example, segmenting existing transactions into groups and associating a distinct profile with each group will help determine what transactions need to be looked at more closely.

Traditional clustering methods cannot analyze streaming data since one of the underlying assumptions for clustering analysis is that the data reside in a traditional static database. Tasoulis et al. (2008) have developed a clustering algorithm that will generate clusters on streaming data.18 The main idea of the algorithm is to cluster transactions in moving time windows. The data in the time windows are recentered to the mean of the data points they include at each time point, in a manner that depends on each transaction’s time stamp. Every time a new transaction (data streaming point) arrives, the clusters are updated. The algorithm has a “forgetting factor” that deletes data points when they are no longer informative in building the clusters.
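
The fragment below is a loose sketch of that idea rather than the algorithm of Tasoulis et al.: each arriving transaction pulls its nearest cluster center toward it, every cluster’s weight decays over time (the forgetting factor), and clusters whose weight falls below a cutoff are dropped. The distance radius, decay rate, cutoff, and two-feature transactions are illustrative assumptions.

import math

class StreamingClusterer:
    # Sketch of windowed clustering with a forgetting factor: each new
    # transaction pulls its nearest center toward it, while older points
    # count for less because cluster weights decay over time.
    def __init__(self, radius=50.0, forgetting=0.9, min_weight=0.05):
        self.centers = []            # each entry is [center_vector, weight]
        self.radius = radius         # distance beyond which a new cluster is opened
        self.forgetting = forgetting
        self.min_weight = min_weight

    def update(self, x):
        # Decay every cluster's weight and drop clusters that are no
        # longer informative (the "forgetting factor").
        for c in self.centers:
            c[1] *= self.forgetting
        self.centers = [c for c in self.centers if c[1] > self.min_weight]

        if self.centers:
            nearest = min(self.centers, key=lambda c: math.dist(x, c[0]))
            if math.dist(x, nearest[0]) <= self.radius:
                # Re-center toward the new point, weighted by the cluster's mass.
                w = nearest[1]
                nearest[0] = tuple((w * ci + xi) / (w + 1)
                                   for ci, xi in zip(nearest[0], x))
                nearest[1] = w + 1
                return self.centers.index(nearest)
        self.centers.append([tuple(x), 1.0])   # open a new cluster
        return len(self.centers) - 1

# Hypothetical stream of (amount, merchant category) transactions; points
# assigned to a small, isolated cluster would warrant a closer look.
clusterer = StreamingClusterer()
for txn in [(20, 1), (22, 1), (25, 1), (900, 7), (21, 1), (880, 7)]:
    cluster_id = clusterer.update(txn)
    print(txn, "-> cluster", cluster_id)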

Creating a clustering technique for streaming data in a credit card application has the potential to uncover hidden fraudulent transactions in massive streams of credit card data. As new transactions occur, finding out how clusters evolve can prove crucial in identifying what factors make a transaction fraudulent. Several issues that are important in streaming credit card fraud detection are

• Detecting the fraudulent transaction as soon as it occurs

• Detecting equally well both types of changes, abrupt and gradual

• Distinguishing between real evolution of the clusters and noise

If the transaction is found to be fraudulent, the analyst must do the following:

• Make sure the transaction is not out of date (the importance of using moving time windows)

• Archive some of the old clusters and examples

The second point relates to creating a data stream analysis technique that is efficient. Streaming data is like a river: data flows in and data flows out.

The analyst only gets to see the data one time, so it is important that the algorithm is efficient and has been trained properly to detect fraudulent transactions. The algorithm needs to “remember” what a fraudulent transaction looks like.
