• Aucun résultat trouvé

Processing Forecasting Queries

N/A
N/A
Protected

Academic year: 2022

Partager "Processing Forecasting Queries"

Copied!
25
0
0

Texte intégral

(1)

Processing Forecasting Queries Processing Forecasting Queries

Songyun Duan, Shivnath Babu

Duke University

(2)

Motivation Motivation

Real-time forecasting of future events based on historical data is useful in many domains

Proactive system management

If a performance problem is forecast, take corrective actions in advance to avoid it

Adaptive query processing

Inventory planning

Environmental monitoring

And many others

Need a framework to process forecasting queries automatically and efficiently

(3)

Forecasting Queries Forecasting Queries

Select

From D

Forecast L Xi

0.1 1.3

0.8

1.3

0.3

2.2 0.4

2

0.1 0.2

1

Xi

X1 Xn

¿ + L

L

?

¿

¿

Lead Time

T

D (T; X 1; X2; : : : ; Xn ) : historical time-series data up

(4)

An Example Forecasting Query An Example Forecasting Query

Select C

From Usage

Forecast 1 day

25 47 25 46 46 46 46 46 25

B

12 13 35 36 35 13 13 35 35

A

18 17 16 15 14 13 12 11 10 9

Day

16 68 16 16 68 68 16 68 17

C

Table: Usage

?

Lead time

(5)

Example Query Processing Example Query Processing

---- a Naa Naïïve Approachve Approach

16 68 16 16 68 68 16 68

?

17

26.6

C1

Class attribute

25 47 25 46 46 46 46 46 25 B

12 13 35 36 35 13 13 35 35 A

18 17 16 15 14 13 12 11 10 9 Day

16 68 16 16 68 68 16 68 17 C

?

(6)

Example Query Processing Example Query Processing

17 16 15 14 13 12 11 10 9 Day

16 68 16 16 68 68 16 68

?

A¡ 2 B¡ 1 C1

25 47 25 46 46 46 46 46 25 B

12 13 35 36 35 13 13 35 35 A

18 17 16 15 14 13 12 11 10 9 Day

16 68 16 16 68 68 16 68 17 C

?

13 35

12 36 35 13 13 35 35

47 25

25 46 46 46 46 46 25

-- --

--

Bayesian network

Previous prediction=26.6 C1 = 1.24* A¡ 2 + 0:3 ¤ B¡ 1

54 68

(7)

Challenges Challenges

To process real-time forecasting queries,

Challenge 1: generate a good processing strategy automatically and efficiently

Apply appropriate transformations to the data

E.g., shift, discretization, normalization, aggregation

Pick the right type of statistical model

E.g., multivariate linear regression (MLR), classification and regression tree (CART), Bayesian network (BN)

Challenge 2: for continuous forecasting over streaming data, adapt processing strategies when necessary

(8)

Outline Outline

Space of execution plans

Plan Search Algorithm

Processing continuous forecasting query

Experimental evaluation

Related work

Summary

(9)

Execution Plan Execution Plan

D

Synopsis

Time-series data

Forecasting result

Predictor

u

P (Syn; u) ) u:Z

Query: Forecast (D, X

i

; L )

Syn(f Y1; ¢¢¢; Yn g ! Z )

B (D ; Z ) ) Syn(f Y1; ¢¢¢; Yng ! Z )

T

1

T

k

Transformers D10 Dk0

Synopsis Builder Logical operators

Transformer:

E.g., Shift (X, δ)

Synopsis builder:

Predictor:

Synopsis

E.g., linear regression, regression tree

D ) D0

(10)

Sample Execution Plan Sample Execution Plan

Select C

From Usage Forecast 1 day

MLR

MLR learne

r

Predictor

Shi ft ( C ; 1) Usage

Shift (A; ¡ 2)

¼

( A ¡ 2;B ¡ 1 )

Shift (B ; ¡ 1)

D0

16 25

12 17

… … …

68 46

35 13

68 47

13 16

18

16 25

35 15

14 Day

16 46

36

C B

A

68 16

?

16 68 16

?

C1

12 13 35 36 35

A¡ 2 B¡ 1

46

25 47 25 46

?

47 17 35

… … …

25 46 46

16

13

16 16 36

68 15 35

14 Day

16

A¡ 2 B¡ 1 C1

D’

u = (35; 47; ?) Synopsis = multivariate linear regression (MLR)

54

u

= 1.24* A¡ 2 + 0:3 ¤ B¡ 1 C1

(11)

Estimating Accuracy of a Plan Estimating Accuracy of a Plan

Accuracy

How close are forecasting results to real values

Given a dataset D and a plan P

Tk

T1

Synopsis Predictor

P D

Synopsis Builder

RMSE=

q P

i ( ai ¡ bi )2 m

Synopsis type

Training

Test

Example accuracy metric:

Predicted value Actual value

b

1

b

2

b

m

a

2

a

1

a

m

(12)

Find a Good Plan Quickly Find a Good Plan Quickly

Optimization challenge: minimize the number of plans executed before finding a plan with high accuracy

Efficient plan search to balance accuracy Vs. running time

Simplified plan space to describe our algorithms

Two types of transformers: Shift and Project

Accuracy of best execution

plan found so far

Algorithm 2

Reasonably-good execution plan

available Algorithm 1

Elapsed processing time

Lead Time

0

(13)

Fa Fa ’ ’ s Plan Search (FPS) Algorithm s Plan Search (FPS) Algorithm

Dataset (n = 3)

Query: Forecast(Usage, C, 1)

7 8 9

4 5 6

1 2 3

A B C

Class attribute

A¡ 1B¡ 1C¡ 1A¡ 2B¡ 2C¡ 2 C1

7 8 9

4 5 6

1 2 3

7 8 9

4 5 6

1 2 3

8 9

?

Ranked list A¡ 2B¡ 1B¡ 2A¡ 1 C B C¡ 1 A C¡ 2 C1

Attribute Ranking

Linear-correlation-based

Shift( , )X i δ

Learn a synopsis and generate the plan?

Imagine n = 100, ∆ = 90

Extended data has 9000+ attributes

(-∆<=δ<0) (∆=2)

(14)

BN Predictor Shi f t (A ; ¡ 2) Shi f t (B ; ¡ 1)

P1

¼

A ¡ 2 ;B ¡ 1

Fa Fa ’ ’ s Plan Search (FPS) Algorithm s Plan Search (FPS) Algorithm

A¡ 1B¡ 1C¡ 1A¡ 2B¡ 2C¡ 2 C1 A B C

Ranked list A¡ 2B¡ 1B¡ 2A¡ 1 C B C¡ 1 A C¡ 2 C1 (n=3,∆=2)

Shifted data

Attribute selection

Fast Correlation-Based Filter (FCBF)

Correlation-based Feature Selection (CFS)

Wrapper

( , )A¡ 2 B¡ 1 ( , , )A¡ 2 C¡ 1 C¡ 2

BN Predictor Shi f t (A ; ¡ 2)

P2

Shi f t(C; ¡ 1) Shi f t(C; ¡ 2)

¼A¡ 2;C¡ 1;C¡ 2

Stop

Do forecasting using the plan with highest accuracy , otherwise increase ∆

(Acc1) (Acc2)

(15)

Adaptive Fa

Adaptive Fa ’ ’ s Plan Search (FPS s Plan Search (FPS - - A) A)

Forecast (S[W], Xi ; L )

Stream S

… … … …

W

L

The ranked list and plans for S[W]

A¡ 1

B¡ 1 C¡ 1

A¡ 2 B A C¡ 2 C1

Plan

P

1 Plan

P

2

(A¡ 2; B¡ 1) (A¡ 2; C¡ 1; C¡ 2)

B¡ 2 C

(A¡ 2; C)

Plan

P

01

(A¡ 2; C; C¡ 1)

Plan

P

02

B¡ 2 C

Continuous query:

(16)

Outline Outline

Space of execution plans

Plan Search Algorithm

Processing continuous forecasting query

Experimental evaluation

Related work

Summary

(17)

Experimental Setting Experimental Setting

Target domain: system and database monitoring

Datasets (#attributes 3~250, #instances 700~15000)

Aging dataset from a departmental cluster

Aging behavior: progressive degradation in performance

A real dataset parsed from logs of 98’ World Cup web-site Periodic segments – characteristic of most popular web-sites

5 testbed datasets

Our testbed runs OLTP applications using MySQL

Simulated periodic workloads, aging behavior, and multiple resource contentions

(18)

Accuracy metric = balanced accuracy B A = 1 ¡ 0:5 ¤ ( # f al se posi t i v es

# n egat i v es + # f al se n egat i v es

# posi t i v es )

Multiple Chunks Vs. One Chunk Multiple Chunks Vs. One Chunk

0 5 0 1 0 0 1 5 0 2 0 0 2 5 0 3 0 0 3 5 0 4 0 0 4 5 0 5 0 0

0 . 5 0 . 6 0 . 7 0 . 8 0 . 9 1

E l a p s e d p r o c e s s in g t im e ( s e c )

BA of current best plan

M u lt ip le c h u n k s O n e c h u n k

(19)

Synopsis Comparison Synopsis Comparison

3200 933 22339

Time

.85 .91 .86

BA

FPS(RF)

482 1948

Time

.86 0.51

BA

FPS(SVM)

24 19 130 201 36

Time

.80 .85 .80 .84 .64

BA

FPS(MLR)

109 50 249

37 135

Time

.81 .91 .85 .85 .71

BA

FPS(CART)

.82 .91 .84 .87 .71

BA

FPS(BN)

14 53 45 29 62

Time

Aging-variant-tb Multi-small-tb Periodic-small-tb

FIFA-real Aging-real

Dataset

FPS using BN or CART can achieve accuracy comparable to more sophisticated synopses

Lesson: More important to find right transformations

(20)

FPS Vs. State

FPS Vs. State - - of of - - the the - - Art Synopsis Art Synopsis

1 00 101 102 103

0 .5 0 .6 0 .7 0 .8 0 .9 1

E la p s e d p ro c e s s in g tim e (s e c , lo g s c a le )

BA of current best plan

F P S

R F -b as e R F -s hif ts

Synthetic dataset, Lead time = 25, n = 3,∆= 90

(21)

FPS FPS - - adaptive Vs. FPS adaptive Vs. FPS - - nonadaptive nonadaptive

Runtime overhead < 2%

Adaptability and Convergence

0 . 5 5 0 . 6 0 . 6 5 0 . 7 0 . 7 5 0 . 8 0 . 8 5 0 . 9 0 . 9 5

BA of current best plan

F P S -A d a p tiv e F P S -N o n a d a p tiv e

(Lead time = 25, ∆= 90)

(22)

Sharing Computation in FPS and FPS Sharing Computation in FPS and FPS - - A A

0 5 0 0 1 0 0 0 1 5 0 0

0 . 5 0 . 6 0 . 7 0 . 8 0 . 9 1

E l a p s e d p r o c e s s in g t im e ( s e c )

BA of current best plan

F P S - n o s h a r in g F P S - s h a r in g

FPS and FPS-A aggressively share computation as multiple plans are explored during plan search

(23)

Related Work Related Work

Time-series forecasting, performance problem forecasting, and self-tuning

Difference: our automated framework balances accuracy against running time

Machine learning techniques on data transformation and modeling

Can be incorporated as operators in our framework

Integration of synopses and data mining algorithms with DBMS and DSMS

Used for processing conventional SQL/XML queries

(24)

Summary Summary

Defined declarative one-time and continuous forecasting queries

Proposed an automatic plan search algorithm for processing one-time forecasting queries

Proposed an adaptive algorithm for continuous forecasting over streaming data

Extensive experimental evaluation of both algorithms

Thanks!

Thanks!

(25)

Execution Plans for Example Query Execution Plans for Example Query

Query: Forecast (Usage, C, 1)

Predictor

P1 Usage

Shi f t(C; 1)

26.6

MLR

Predictor

Shi f t (A ; ¡ 2) Shi f t (B ; ¡ 1)

P2

Shi f t(C; 1)

54

MLR

¼

A ¡ 2 ;B ¡ 1

CART Predictor

Shi f t (A ; ¡ 2) Shi f t (B ; ¡ 1)

P3

Shi f t(C; 1)

68

¼

A ¡ 2 ;B ¡ 1

Références

Documents relatifs

Average weekly traffic intensity and pollution parame- ters measured in Pista Silla

While the Bayesian approach to NN modelling offers significant advantages over the classical NN learning methods, it will be shown that the use of GP regression models will

We also establish a partition polytope analogue of the lower bound theorem, indicating that the output- sensitive enumeration algorithm can be far superior to previously

La doctrine partage l'opinion du tribunal de commerce de la Seine 260, qui a déclaré dans son jugement antérieur « Il est vrai, au revirement de la jurisprudence, nous déplorons,

2 Department of Information Technology, Alexander Technological Educational Institute of Thessaloniki, P.O. The developed forecasting algorithm creates trend models based on

Dans cet expos´ e, on cherche ` a r´ esoudre le probl` eme de pr´ ediction aveugle, c’est ` a dire lorsque la covariance est inconnue et que l’on doit, avec un unique ´

The goal of the present study is to therefore to ex- amine how stress and anxiety were experienced during lockdown as a function of depression history in a sample of university

The study led to the design of two adaptive multi-agent systems: AMAWind-Turbine forecasting the production of a wind farm using wind turbine data, and AMAWind-Farm forecasting