Processing Forecasting Queries
Songyun Duan, Shivnath Babu
Duke University
Motivation
Real-time forecasting of future events based on historical data is useful in many domains
Proactive system management
If a performance problem is forecast, take corrective actions in advance to avoid it
Adaptive query processing
Inventory planning
Environmental monitoring
And many others
Need a framework to process forecasting queries automatically and efficiently
Forecasting Queries
Select $X_i$
From D
Forecast L
D(T, $X_1$, $X_2$, ..., $X_n$): historical time-series data up to the current time τ
L: lead time; the query forecasts the value of $X_i$ at time τ + L
[Figure: table D with time column T and attributes $X_1$, ..., $X_n$; the value of $X_i$ at time τ + L is unknown (?)]
An Example Forecasting Query
Select C
From Usage
Forecast 1 day
Table: Usage (lead time: 1 day; the value of C on day 18 is the forecast target)
Day | 9  | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18
A   | 35 | 35 | 13 | 13 | 35 | 36 | 35 | 13 | 12 |
B   | 25 | 46 | 46 | 46 | 46 | 46 | 25 | 47 | 25 |
C   | 17 | 68 | 16 | 68 | 68 | 16 | 16 | 68 | 16 | ?
Example Query Processing: a Naïve Approach
[Figure: the Usage table with the class attribute $C_1$ (C shifted by the lead time of 1 day) added as a new column; the naïve approach forecasts 26.6 for the unknown value (?)]
Example Query Processing
[Figure: the Usage table transformed into columns $A_{-2}$ (A shifted back 2 days), $B_{-1}$ (B shifted back 1 day), and the class attribute $C_1$; the last row holds the unknown $C_1$ value (?)]
Previous prediction = 26.6
Multivariate linear regression: $C_1 = 1.24 \cdot A_{-2} + 0.3 \cdot B_{-1}$, forecast = 54
Bayesian network: forecast = 68
Challenges
To process real-time forecasting queries,
Challenge 1: generate a good processing strategy automatically and efficiently
Apply appropriate transformations to the data
E.g., shift, discretization, normalization, aggregation
Pick the right type of statistical model
E.g., multivariate linear regression (MLR), classification and regression tree (CART), Bayesian network (BN)
Challenge 2: for continuous forecasting over streaming data, adapt processing strategies when necessary
Outline
Space of execution plans
Plan Search Algorithm
Processing continuous forecasting query
Experimental evaluation
Related work
Summary
Execution Plan
Query: Forecast(D, $X_i$, L)
[Figure: time-series data D → transformers $T_1, \ldots, T_k$ (producing $D'_1, \ldots, D'_k$) → synopsis builder → synopsis → predictor, which takes a tuple u and outputs the forecasting result]
Logical operators
Transformer: $D \Rightarrow D'$
E.g., Shift(X, δ)
Synopsis builder: $B(D, Z) \Rightarrow Syn(\{Y_1, \ldots, Y_n\} \rightarrow Z)$
Synopsis: $Syn(\{Y_1, \ldots, Y_n\} \rightarrow Z)$
E.g., linear regression, regression tree
Predictor: $P(Syn, u) \Rightarrow u.Z$
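The three logical operators compose into a pipeline. The Python sketch below is only an illustration under stated assumptions: the function names (`shift`, `build_synopsis`, `predict`), the list-based data layout, and the choice of one-variable least-squares regression as the synopsis are hypothetical and are not the framework's actual interfaces.

```python
def shift(series, delta):
    """Transformer Shift(X, delta): row t receives X[t + delta], or None if out of range."""
    n = len(series)
    return [series[t + delta] if 0 <= t + delta < n else None for t in range(n)]


def build_synopsis(ys, zs):
    """Synopsis builder B(D, Z): fit z = slope * y + intercept by least squares.

    Assumes at least two usable rows and a non-constant input column.
    """
    pairs = [(y, z) for y, z in zip(ys, zs) if y is not None and z is not None]
    m = len(pairs)
    mean_y = sum(y for y, _ in pairs) / m
    mean_z = sum(z for _, z in pairs) / m
    var_y = sum((y - mean_y) ** 2 for y, _ in pairs)
    cov = sum((y - mean_y) * (z - mean_z) for y, z in pairs)
    slope = cov / var_y
    return slope, mean_z - slope * mean_y


def predict(synopsis, u):
    """Predictor P(Syn, u): fill in the unknown class value u.Z."""
    slope, intercept = synopsis
    return slope * u + intercept


# Toy series (made up for illustration): forecast X one step ahead (lead time 1).
x = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
z = shift(x, 1)                  # class attribute: the value one step later
syn = build_synopsis(x, z)       # learned on rows where the class value is known
forecast = predict(syn, x[-1])   # forecast for the latest row
```

Keeping the transformer, builder, and predictor as separate callables mirrors the plan structure: a different plan swaps in other transformers or another synopsis type without touching the rest.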
Sample Execution Plan
Select C
From Usage
Forecast 1 day
Plan: Usage → Shift(A, -2), Shift(B, -1), Shift(C, 1) → $\pi_{A_{-2}, B_{-1}}$ → D' → MLR learner → MLR → Predictor
[Figure: the Usage table (Day, A, B, C) and the transformed table D' with columns $A_{-2}$, $B_{-1}$, $C_1$]
Synopsis = multivariate linear regression (MLR): $C_1 = 1.24 \cdot A_{-2} + 0.3 \cdot B_{-1}$
u = (35, 47, ?)
Forecast for u: 54
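The transformer stage of this plan can be traced on the Usage table. The sketch below is a minimal illustration, assuming the table values align with days 9 to 17 in ascending order. The helper names `shift` and `transform` are hypothetical. It applies Shift(A, -2), Shift(B, -1), and Shift(C, 1), then separates the rows with a known class value from the tuple u to be forecast.

```python
# Usage table from the example, days 9..17 in ascending order (alignment assumed).
USAGE = {
    "A": [35, 35, 13, 13, 35, 36, 35, 13, 12],
    "B": [25, 46, 46, 46, 46, 46, 25, 47, 25],
    "C": [17, 68, 16, 68, 68, 16, 16, 68, 16],
}


def shift(series, delta):
    """Shift(X, delta): row t receives X[t + delta], or None if out of range."""
    n = len(series)
    return [series[t + delta] if 0 <= t + delta < n else None for t in range(n)]


def transform(table):
    """Apply Shift(A, -2), Shift(B, -1), Shift(C, 1) and keep only those columns."""
    rows = list(zip(shift(table["A"], -2),
                    shift(table["B"], -1),
                    shift(table["C"], 1)))
    train = [row for row in rows if None not in row]  # rows with a known C_1
    a2, b1, _ = rows[-1]                              # latest day: C_1 is unknown
    return train, (a2, b1)


train_rows, u = transform(USAGE)
print(len(train_rows), u)  # prints: 6 (35, 47)
```

The tuple (35, 47) matches the u = (35, 47, ?) shown for the plan; the six training rows are what a synopsis builder would learn from.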
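Estimating Accuracy of a Plan
Accuracy
How close are forecasting results to “real values”
Given a dataset D and a plan P
[Figure: D is split into training and test data; the training data passes through the plan's transformers $T_1, \ldots, T_k$ and the synopsis builder for the chosen synopsis type; the predictor is then applied to the test data, and predicted values are compared with actual values]
Example accuracy metric:
$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{m} (a_i - b_i)^2}{m}}$
where $a_i$ and $b_i$ are the predicted and actual values for the $i$-th of $m$ test instances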
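As a small illustration of the RMSE metric on this slide, the following sketch computes it over paired predicted and actual values. The function name `rmse` and the input validation are assumptions made for the example.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error over m paired predicted/actual values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    m = len(predicted)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(predicted, actual)) / m)
```

A lower value means the plan's forecasts on the test data are closer to the actual values; identical sequences give 0.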
Find a Good Plan Quickly
Optimization challenge: minimize the number of plans executed before finding a plan with high accuracy
Efficient plan search to balance accuracy vs. running time
Simplified plan space to describe our algorithms
Two types of transformers: Shift and Project
[Figure: accuracy of best execution plan found so far (y-axis) vs. elapsed processing time (x-axis, from 0 to the lead time) for Algorithm 1 and Algorithm 2; a marker shows when a reasonably good execution plan becomes available]
Fa's Plan Search (FPS) Algorithm
Query: Forecast(Usage, C, 1)
Dataset (n = 3): attributes A, B, C, e.g., rows (1, 2, 3), (4, 5, 6), (7, 8, 9)
Shift($X_i$, δ) for $-\Delta \le \delta < 0$ (Δ = 2)
Extended data: A, B, C, $A_{-1}$, $B_{-1}$, $C_{-1}$, $A_{-2}$, $B_{-2}$, $C_{-2}$, plus the class attribute $C_1$
Learn a synopsis and generate the plan?
Imagine n = 100, Δ = 90
Extended data has 9000+ attributes
Attribute Ranking
Linear-correlation-based
Ranked list: $A_{-2}$, $B_{-1}$, $B_{-2}$, $A_{-1}$, C, B, $C_{-1}$, A, $C_{-2}$ (class attribute: $C_1$)
Plan $P_1$: Shift(A, -2), Shift(B, -1) → $\pi_{A_{-2}, B_{-1}}$ → BN → Predictor
Fa's Plan Search (FPS) Algorithm
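Shifted data (n = 3, Δ = 2): A, B, C, $A_{-1}$, $B_{-1}$, $C_{-1}$, $A_{-2}$, $B_{-2}$, $C_{-2}$, plus the class attribute $C_1$
Ranked list: $A_{-2}$, $B_{-1}$, $B_{-2}$, $A_{-1}$, C, B, $C_{-1}$, A, $C_{-2}$ (class attribute: $C_1$)
Attribute selection
Fast Correlation-Based Filter (FCBF)
Correlation-based Feature Selection (CFS)
Wrapper
Selected attributes ($A_{-2}$, $B_{-1}$) → plan $P_1$ (accuracy $Acc_1$)
Selected attributes ($A_{-2}$, $C_{-1}$, $C_{-2}$) → plan $P_2$: Shift(A, -2), Shift(C, -1), Shift(C, -2) → $\pi_{A_{-2}, C_{-1}, C_{-2}}$ → BN → Predictor (accuracy $Acc_2$)
Stop? If so, do forecasting using the plan with the highest accuracy; otherwise, increase Δ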
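The attribute-ranking step can be sketched as follows. This is a hedged illustration only: it covers generating shifted attributes for shifts up to Δ and ranking them by absolute linear (Pearson) correlation with the class attribute, and it omits attribute selection (FCBF, CFS, Wrapper), synopsis learning, and the stopping test. The names `rank_attributes` and `pearson`, the `"A-2"` naming scheme, and the day alignment of the Usage values are assumptions. Because the input is only the nine-day Usage excerpt, the resulting order need not match the ranked list shown on the slide.

```python
USAGE = {  # days 9..17 in ascending order (alignment assumed)
    "A": [35, 35, 13, 13, 35, 36, 35, 13, 12],
    "B": [25, 46, 46, 46, 46, 46, 25, 47, 25],
    "C": [17, 68, 16, 68, 68, 16, 16, 68, 16],
}


def shift(series, delta):
    """Shift(X, delta): row t receives X[t + delta], or None if out of range."""
    n = len(series)
    return [series[t + delta] if 0 <= t + delta < n else None for t in range(n)]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def rank_attributes(table, target, lead, max_shift):
    """Rank original and shifted attributes by |correlation| with the class attribute."""
    n = len(table[target])
    cls = shift(table[target], lead)              # class attribute, e.g. C_1
    candidates = {}
    for name, series in table.items():
        candidates[name] = series                 # unshifted attribute
        for d in range(1, max_shift + 1):
            candidates[f"{name}-{d}"] = shift(series, -d)
    rows = range(max_shift, n - lead)             # rows where every value is defined
    z = [cls[t] for t in rows]
    scores = {name: abs(pearson([col[t] for t in rows], z))
              for name, col in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)


ranked = rank_attributes(USAGE, target="C", lead=1, max_shift=2)
```

Ranking first keeps later steps cheap: with n = 100 and Δ = 90 the extended data has thousands of attributes, so only the top of the list needs to be passed to attribute selection.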
Adaptive Fa's Plan Search (FPS-A)
Continuous query: Forecast(S[W], $X_i$, L)
[Figure: window W sliding over stream S, with lead time L]
The ranked list and plans for S[W]
[Figure: the ranked list of shifted attributes for S[W], with plans $P_1$ = ($A_{-2}$, $B_{-1}$) and $P_2$ = ($A_{-2}$, $C_{-1}$, $C_{-2}$); after the ranked list changes, the plans become $P'_1$ = ($A_{-2}$, C) and $P'_2$ = ($A_{-2}$, C, $C_{-1}$)]
Outline
Space of execution plans
Plan Search Algorithm
Processing continuous forecasting query
Experimental evaluation
Related work
Summary
Experimental Setting
Target domain: system and database monitoring
Datasets (#attributes 3~250, #instances 700~15000)
Aging dataset from a departmental cluster
Aging behavior: progressive degradation in performance
A real dataset parsed from logs of the '98 World Cup web-site
Periodic segments – characteristic of most popular web-sites
5 testbed datasets
Our testbed runs OLTP applications using MySQL
Simulated periodic workloads, aging behavior, and multiple resource contentions
Accuracy metric = balanced accuracy: $BA = 1 - 0.5 \cdot \left( \frac{\#\text{false positives}}{\#\text{negatives}} + \frac{\#\text{false negatives}}{\#\text{positives}} \right)$
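The balanced accuracy metric can be computed directly from paired actual and predicted labels. The sketch below is illustrative; the function name `balanced_accuracy` and the use of truthy/falsy values for positive/negative labels are assumptions.

```python
def balanced_accuracy(actual, predicted):
    """BA = 1 - 0.5 * (#false positives / #negatives + #false negatives / #positives)."""
    false_pos = sum(1 for a, p in zip(actual, predicted) if not a and p)
    false_neg = sum(1 for a, p in zip(actual, predicted) if a and not p)
    positives = sum(1 for a in actual if a)
    negatives = len(actual) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("BA needs at least one positive and one negative instance")
    return 1 - 0.5 * (false_pos / negatives + false_neg / positives)
```

Unlike plain accuracy, BA weights the error rate on each class equally, so a predictor that always outputs the majority class scores only 0.5.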
Multiple Chunks vs. One Chunk
[Figure: BA of current best plan (y-axis, 0.5 to 1) vs. elapsed processing time (sec, 0 to 500); curves: Multiple chunks, One chunk]
Synopsis Comparison
Datasets: Aging-variant-tb, Multi-small-tb, Periodic-small-tb, FIFA-real, Aging-real
Synopsis  | BA                  | Time
FPS(BN)   | .82 .91 .84 .87 .71 | 14 53 45 29 62
FPS(CART) | .81 .91 .85 .85 .71 | 109 50 249 37 135
FPS(MLR)  | .80 .85 .80 .84 .64 | 24 19 130 201 36
FPS(SVM)  | .86 0.51            | 482 1948
FPS(RF)   | .85 .91 .86         | 3200 933 22339
FPS using BN or CART can achieve accuracy comparable to more sophisticated synopses
Lesson: More important to find right transformations
FPS vs. State-of-the-Art Synopsis
[Figure: BA of current best plan (y-axis, 0.5 to 1) vs. elapsed processing time (sec, log scale, 10^0 to 10^3); curves: FPS, RF-base, RF-shifts]
Synthetic dataset, Lead time = 25, n = 3, Δ = 90
FPS-adaptive vs. FPS-nonadaptive
Runtime overhead < 2%
Adaptability and Convergence
[Figure: BA of current best plan (0.55 to 0.95); curves: FPS-Adaptive, FPS-Nonadaptive]
(Lead time = 25, Δ = 90)
Sharing Computation in FPS and FPS-A
[Figure: BA of current best plan (y-axis, 0.5 to 1) vs. elapsed processing time (sec, 0 to 1500); curves: FPS-no sharing, FPS-sharing]
FPS and FPS-A aggressively share computation as multiple plans are explored during plan search
Related Work
Time-series forecasting, performance problem forecasting, and self-tuning
Difference: our automated framework balances accuracy against running time
Machine learning techniques on data transformation and modeling
Can be incorporated as operators in our framework
Integration of synopses and data mining algorithms with DBMS and DSMS
Used for processing conventional SQL/XML queries
Summary
Defined declarative one-time and continuous forecasting queries
Proposed an automatic plan search algorithm for processing one-time forecasting queries
Proposed an adaptive algorithm for continuous forecasting over streaming data
Extensive experimental evaluation of both algorithms
Thanks!
Execution Plans for Example Query
Query: Forecast(Usage, C, 1)
Plan $P_1$: Usage → Shift(C, 1) → MLR → Predictor; forecast = 26.6
Plan $P_2$: Usage → Shift(A, -2), Shift(B, -1), Shift(C, 1) → $\pi_{A_{-2}, B_{-1}}$ → MLR → Predictor; forecast = 54
Plan $P_3$: Usage → Shift(A, -2), Shift(B, -1), Shift(C, 1) → $\pi_{A_{-2}, B_{-1}}$ → CART → Predictor; forecast = 68