
Steven Barber


CONTENTS

Introduction
Quantitative Approaches to Streaming Data Analysis
Event Stream Processing Evolution in Recent History
There Is Always More Data, and It Isn’t Getting Any Simpler
Processing Model Evolution
The Orthodox History of CEP: Long-Needed Switch from RDBMS-Based Real-Time Processing to Event Stream Processing
Another View: CEP Is a Message Transformation Engine That’s a Service Attached to a Distributed Messaging Bus
The Wheel Turns Again: Big Data, Web Applications, and Log File Analysis
Rise of Inexpensive Multicore Processors and GPU-Based Processing
Advantages of CEP
Visual Programming for Communicating with Stakeholders
CEP Example: A Simple Application Using StreamBase
EventFlow Concepts and Terminology
Step-by-Step through the MarketFeedMonitor App
Uses of CEP in Industry and Applications
Automated Securities Trading
Market Data Management
Signal Generation/Alpha-Seeking Algorithms
Execution Algorithms
Smart Order Routing
Real-Time Profit and Loss
Transaction Cost Analysis
CEP in Other Industries
Intelligence and Security
Multiplayer Online Gaming
Retail and E-Commerce Transaction Analysis
Network and Software Systems Monitoring
Bandwidth and Quality of Service Monitoring
Effects of CEP
Decision Making
Proactive Rather Than Reactive, Shorter Event Time Frames, Immediate Feedback
Strategy
From Observing and Discovering Patterns to Automating Actions Based on Pattern
Operational Processes
Reduced Time to Market Due to Productivity Increases
Shortened Development Means Ability to Try Out More New Ideas
Summary

INTRODUCTION

Complex event processing (CEP) is a relatively new category of software whose purpose is to process streams of events, perform a set of computations on them, and then publish new streams of events based on the result of these computations.

On the surface, this discipline sounds like every other kind of computer programming since Lovelace (input, compute, and output), and of course, that is true. What makes CEP interesting is its narrow focus on input events that arrive in streams, and on computations that generally seek to take in very large numbers of relatively raw events and, by operating on them, produce streams of output events that have more meaning to the enterprise—that represent events at a higher level of abstraction or value than the events that went in. With CEP, what we’re interested in, primarily, are the events that are arriving right now. The CEP system receives an event, acts on it immediately, and then there’s another event arriving to take care of. There’s no attempt to hold all the previous events in memory for a long time—the CEP process squeezes what’s interesting out of the event as it arrives and discards what’s not interesting. This focus on real-time (as it arrives) push as opposed to pull is at the core of what CEP is.

All this is not to say that CEP can’t be used to look at stored or historical data—it often is, especially during application testing. The usual case, however, is for the CEP system to play back stored historical data as if it were events arriving in streams, not to scan over large historical stores of now-static data.

CEP, being a streaming data-oriented technology, is not often used for pattern discovery in the way a data warehouse is. CEP is very often used to match already known patterns against arriving data and then, once the match occurs, to take some action based on the recognition of that pattern. For example, if a securities trader believes that three upticks followed by two downticks in the same second means something, then a CEP application can easily be made to look for that pattern and generate an output event whenever it occurs. Another common use of CEP is to create caches of information that feed interactive, discovery-oriented tools. With these, new patterns or conditions can be discovered more or less live and acted upon manually before it is too late to act, or the data can be explored visually in order to perceive interesting patterns in the first place; once identified and validated against longer-term trends, those patterns can become the basis for future automation by another CEP application, as in the sketch below.
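To make the uptick/downtick example concrete, here is a minimal sketch in Python rather than in a CEP engine’s own language. The tick format, the UpDownPatternDetector class, and the on_pattern callback are hypothetical names, not StreamBase APIs; the sketch keeps only the last five tick directions and fires when they match the pattern within one second.

    from collections import deque

    # Hypothetical sketch: watch a stream of (timestamp, price) ticks for
    # three upticks followed by two downticks within one second. Ticks with
    # an unchanged price are ignored. All names are illustrative only.

    PATTERN = ["up", "up", "up", "down", "down"]

    class UpDownPatternDetector:
        def __init__(self, on_pattern):
            self.on_pattern = on_pattern      # callback receiving the output event
            self.last_price = None
            self.window = deque(maxlen=5)     # last five (timestamp, direction) pairs

        def on_tick(self, timestamp, price):
            if self.last_price is not None and price != self.last_price:
                direction = "up" if price > self.last_price else "down"
                self.window.append((timestamp, direction))
                self._check(timestamp)
            self.last_price = price

        def _check(self, now):
            directions = [d for _, d in self.window]
            # Fire only if all five directions match and fit inside one second.
            if directions == PATTERN and now - self.window[0][0] <= 1.0:
                self.on_pattern(list(self.window))   # emit a higher-level event
                self.window.clear()

    detector = UpDownPatternDetector(lambda ticks: print("pattern matched:", ticks))
    for ts, px in [(0.1, 10), (0.2, 11), (0.3, 12), (0.4, 13), (0.5, 12), (0.6, 11)]:
        detector.on_tick(ts, px)

Note that the detector holds only five tick directions at a time, in keeping with the squeeze-and-discard model described above.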

QUANTITATIVE APPROACHES TO STREAMING DATA ANALYSIS

CEP systems provide the ability to operate on, or transform, input events into output events in a number of ways. The model for CEP processing tends to be declarative rather than imperative, data flow-oriented rather than control flow-oriented, and push-oriented rather than pull-oriented.

Events arrive or happen, and each event is presented to the first operation in the application, which operates on it and in turn emits zero or more events to the next operation in the chain of operators, which processes the events generated by the upstream operation, and so on, until no more events are emitted or an output is reached.

It is the arrival of the event that triggers the CEP processing—the CEP processor does not periodically scan for events, but rather reacts to events immediately when they arrive. This property makes CEP systems very responsive, with minimal processing latency.
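As a rough sketch of this push model, in plain Python with made-up operator names (real CEP engines wire such operators graphically or in a SQL-like dialect): each operator is a callable that receives one event and emits zero or more events downstream, and nothing runs until an event arrives.

    # Each operator receives one event and pushes zero or more events onward.

    def filter_op(predicate, downstream):
        def on_event(event):
            if predicate(event):          # pass the event onward, else emit nothing
                downstream(event)
        return on_event

    def map_op(fn, downstream):
        def on_event(event):
            downstream(fn(event))         # emit exactly one transformed event
        return on_event

    def output(event):
        print("output event:", event)

    # Wire the pipeline back to front: filter -> map -> output.
    pipeline = filter_op(lambda e: e["qty"] > 100,
                         map_op(lambda e: {**e, "size": "large"}, output))

    # The arrival of each event is what drives the computation;
    # there is no polling loop scanning for work.
    for event in [{"sym": "IBM", "qty": 50}, {"sym": "IBM", "qty": 500}]:
        pipeline(event)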

Once this event-oriented model is grasped, the kinds of operations that may be performed in a CEP application are somewhat different than in traditional imperative languages. Rather than if/then/else and for and while loops, there are event- and stream-oriented operators, like Aggregate, Filter, Map, and Split. Rather than keeping state in variables, state is generally kept in relational tables and retrieved via streaming joins of arriving events with the contents of the tables. Streams are partitioned into windows over which operations are performed. The effect of this model is that operations are performed at a higher level of abstraction than in traditional programming languages—closer to the concerns of application domains, and without developers having to track lots of lower-level housekeeping details.
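The following hypothetical Python sketch (not EventFlow syntax; all names are invented) illustrates two of these ideas: a one-second window over which tick counts are aggregated per symbol, and state kept in a table that results are joined against when the window closes.

    from collections import Counter

    SYMBOL_TABLE = {"IBM": "International Business Machines",
                    "TIBX": "TIBCO Software"}   # state held as a table, not variables

    class WindowedTickCounter:
        def __init__(self, width=1.0):
            self.width = width                  # window length in seconds
            self.window_start = 0.0             # assumes timestamps start near zero
            self.counts = Counter()             # per-window aggregate state

        def on_tick(self, timestamp, symbol):
            # An arriving event past the boundary closes the current window.
            if timestamp - self.window_start >= self.width:
                self.emit()
                self.counts.clear()
                self.window_start = timestamp
            self.counts[symbol] += 1

        def emit(self):
            for sym, count in self.counts.items():
                # Streaming join: enrich each aggregate row from the table.
                name = SYMBOL_TABLE.get(sym, sym)
                print(f"window at {self.window_start:.1f}s: {name}: {count} ticks")

    agg = WindowedTickCounter()
    for ts, sym in [(0.1, "IBM"), (0.4, "TIBX"), (0.9, "IBM"), (1.2, "IBM")]:
        agg.on_tick(ts, sym)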

EVENT STREAM PROCESSING EVOLUTION IN RECENT HISTORY

There Is Always More Data, and It Isn’t Getting Any Simpler

The need for event processing systems arises from the observation that with modern communication and information technology, there is always going to be more data than humans can contend with directly, and that every year there is more data: more in volume and more varied in type. Further, the relationships between arriving data are arbitrarily complex. Making sense of the data at our fingertips is not easy. Being able to recognize which events are interesting, or which match some pattern we are interested in, is something that simply must be automated. Events may be arriving thousands or even millions of times per second, and often it is valuable to respond to them as quickly as we can, perhaps within microseconds. For example, these days most of the securities trading in the major equities and currency markets occurs within a few hundred microseconds of an order’s placement in a trading venue. The ability to respond that fast to an order tends to encourage the placement of yet more orders, more quickly, in order to gain tiny advantages in trading.

Processing Model Evolution

The Orthodox History of CEP: Long-Needed Switch from RDBMS-Based Real-Time Processing to Event Stream Processing

In the social history of technology, CEP seems to have grown out of the relational database world. (To the extent that attempts to standardize CEP languages have been made, the resulting languages tend to look like Structured Query Language (SQL).)

RDBMS: Store, Then Analyze

In an off-the-shelf relational database management system (RDBMS), data is stored in relational tables, usually kept on disk or some other relatively slow persistent storage medium. Adding to the relative slowness of RDBMSs, not only is the base data stored on disk, but so are the table indexes and the transaction logs. Once the data is written to a table, subsequent read queries reextract it for analysis—reading data usually involves scanning indexes and then reading the data back off the disk.

In the early days of creating event-oriented systems, RDBMSs were pretty much what was available for structured storage, and the default was to put stuff there. Unfortunately, as data arrival rates and volumes began to rise, traditionally structured RDBMSs were unable to keep up. For example, when doing automated securities transaction processing with, say, the usual North American equities trade and quote messages, RDBMSs are happy to process maybe a few thousand transactions per second, if well tuned. However, the equities markets now generate hundreds of thousands of messages per second, and the cost of an RDBMS that can handle that amount of traffic is astronomical, if not infinite. This style of processing is sometimes called an outbound processing model (Figure 9.1).

CEP proceeds from this fairly simple high-level observation: not every event need be stored by every process that receives it. If we need to store every event for later, it might be better to do that separately and in parallel with the analysis of the event, and perhaps we don’t need our event persister to operate with the Atomicity, Consistency, Isolation, Durability (ACID) transactionality of a traditional relational database: a simple file appender might be just fine and will give us a couple of orders of magnitude of additional performance on the same hardware—from thousands of events per second to hundreds of thousands per second. The CEP engine receives the events in parallel with the persister, and indeed, since it is operating against memory instead of disk, can easily process millions of messages per second.
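A minimal sketch of this split, assuming JSON-serializable events and a hypothetical log file name: every raw event is appended to a plain file with no ACID machinery, while the in-memory path analyzes the same event. In a real system the two consumers would run on separate threads or processes fed by a messaging bus; here they are called in turn for brevity.

    import json

    class FileAppender:
        """Simple persister: one append per event, no indexes, no ACID."""
        def __init__(self, path):
            self.fh = open(path, "a")   # hypothetical path; handle left open for brevity

        def on_event(self, event):
            self.fh.write(json.dumps(event) + "\n")

    class InMemoryAnalyzer:
        """The CEP path: operates purely against memory."""
        def __init__(self):
            self.volume = 0

        def on_event(self, event):
            self.volume += event["qty"]

    def dispatch(event, handlers):
        # Every consumer sees every event; a real deployment would fan out
        # over a message bus rather than a loop in one process.
        for handler in handlers:
            handler.on_event(event)

    persister = FileAppender("events.log")
    analyzer = InMemoryAnalyzer()
    for ev in [{"sym": "IBM", "qty": 100}, {"sym": "IBM", "qty": 250}]:
        dispatch(ev, [persister, analyzer])
    print("running volume:", analyzer.volume)   # -> 350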
