Incremental and Decremental Training for Linear Classification


Supplement Materials for “Incremental and Decremental Training for Linear Classification”

Cheng-Hao Tsai

Dept. of Computer Science National Taiwan Univ., Taiwan

Chieh-Yen Lin

Chih-Jen Lin

Details of Deriving (25)

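From Section 3.2, $\Delta_D \bar{\alpha}$ is the dual initial solution. Then

\begin{align*}
&\text{dual initial value} \\
&= \sum_{i=k+1}^{l} h(\Delta_D \alpha_i^*, \Delta_D C)
 - \frac{1}{2} \sum_{i=k+1}^{l} \sum_{j=k+1}^{l} \Delta_D^2 \alpha_i^* \alpha_j^* K(i,j)
 - \sum_{i=k+1}^{l} (\Delta_D \alpha_i^*)^2 \frac{d}{2\Delta_D} \\
&= \Delta_D \sum_{i=k+1}^{l} h(\alpha_i^*, C)
 - \frac{\Delta_D^2}{2} \left( \sum_{i=k+1}^{l} \sum_{j=k+1}^{l} \alpha_i^* \alpha_j^* K(i,j)
 + \sum_{i=k+1}^{l} (\alpha_i^*)^2 \frac{d}{\Delta_D} \right) \\
&= \Delta_D \left( \frac{1}{2} \bar{w}^T \bar{w} + C \sum_{i=k+1}^{l} \xi(\bar{w}; x_i, y_i)
 - \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} \alpha_i^* \alpha_j^* K(i,j)
 + \frac{1}{2} \left( \sum_{i=k+1}^{l} \sum_{j=k+1}^{l} \alpha_i^* \alpha_j^* K(i,j)
 + \sum_{i=k+1}^{l} (\alpha_i^*)^2 d \right) \right) \\
&\quad - \frac{\Delta_D^2}{2} \left( \sum_{i=k+1}^{l} \sum_{j=k+1}^{l} \alpha_i^* \alpha_j^* K(i,j)
 + \sum_{i=k+1}^{l} (\alpha_i^*)^2 \frac{d}{\Delta_D} \right) \\
&= \frac{1}{2} \bar{w}^T \bar{w} + \Delta_D C \sum_{i=k+1}^{l} \xi(\bar{w}; x_i, y_i)
 - \frac{\Delta_D}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} \alpha_i^* \alpha_j^* K(i,j) \\
&\quad - \frac{\Delta_D^2 - \Delta_D}{2} \sum_{i=k+1}^{l} \sum_{j=k+1}^{l} \alpha_i^* \alpha_j^* K(i,j)
 + \frac{\Delta_D - 1}{2} \bar{w}^T \bar{w}.
\end{align*}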

The second and third equalities respectively use (31) and (23).
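As a sanity check on the second equality, the following sketch compares the dual value at the scaled point with the factored form on random data. It is only an illustration under stated assumptions: the forms of h (h(α, C) = C log C − α log α − (C − α) log(C − α) for logistic regression with d = 0, and h(α, C) = α for L2-loss SVM with d = 1/(2C)), the kernel K(i, j) = y_i y_j x_i^T x_j, and all function names are assumptions made here, not taken from the paper. The vector alpha is an arbitrary feasible point, since this equality is purely algebraic and does not need optimality.

```python
import math
import random

def h_lr(alpha, C):
    """Assumed logistic-regression dual term, valid for 0 < alpha < C."""
    return (C * math.log(C) - alpha * math.log(alpha)
            - (C - alpha) * math.log(C - alpha))

def h_l2svm(alpha, C):
    """Assumed L2-loss SVM dual term (linear in alpha)."""
    return alpha

def quad(alpha, K):
    """Double sum of alpha_i * alpha_j * K(i, j)."""
    n = len(alpha)
    return sum(alpha[i] * alpha[j] * K[i][j]
               for i in range(n) for j in range(n))

def first_line(alpha, K, C, d, h, delta):
    """Dual value at delta * alpha, with C -> delta * C and d -> d / delta."""
    scaled = [delta * a for a in alpha]
    return (sum(h(a, delta * C) for a in scaled)
            - 0.5 * quad(scaled, K)
            - sum(a * a for a in scaled) * d / (2.0 * delta))

def second_line(alpha, K, C, d, h, delta):
    """Factored form: delta * sum h - delta^2 / 2 * (quadratic + d / delta term)."""
    return (delta * sum(h(a, C) for a in alpha)
            - 0.5 * delta ** 2 * (quad(alpha, K)
                                  + sum(a * a for a in alpha) * d / delta))

def make_problem(n=6, dim=3, C=2.0, seed=0):
    """Random toy data; alpha is kept strictly inside (0, C)."""
    rng = random.Random(seed)
    x = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
    y = [rng.choice([-1, 1]) for _ in range(n)]
    K = [[y[i] * y[j] * sum(a * b for a, b in zip(x[i], x[j]))
          for j in range(n)] for i in range(n)]
    alpha = [rng.uniform(0.1, C - 0.1) for _ in range(n)]
    return alpha, K, C

def check(delta=0.7):
    """Return {loss name: (first-line value, second-line value)}."""
    alpha, K, C = make_problem()
    losses = (("lr", h_lr, 0.0), ("l2svm", h_l2svm, 1.0 / (2.0 * C)))
    return {name: (first_line(alpha, K, C, d, h, delta),
                   second_line(alpha, K, C, d, h, delta))
            for name, h, d in losses}

if __name__ == "__main__":
    for name, (lhs, rhs) in check().items():
        print(name, lhs, rhs)
```

For both losses the two values should agree up to floating-point rounding. The third equality is not checked here because it relies on the optimality of $\alpha^*$ and $\bar{w}$.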

Experiments for Logistic Regression with Larger C

In the paper we present figures comparing optimization methods using logistic regression with C = 1. Figures I and II give the results of using C = 128.

Experiments for L2-loss Support Vector Machine

We have conducted experiments on L2-loss SVM by following the same settings as for logistic regression. Table I shows the relative difference between initial and optimal function values for incremental and decremental learning, respectively. The comparison of the three optimization methods is shown in Figures III and IV for C = 1 and in Figures V and VI for C = 128.
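To make the reported quantity concrete, here is a minimal sketch of a relative difference between an initial and an optimal function value, together with its base-10 logarithm. The definition |f(initial) − f(optimal)| / |f(optimal)|, the assumption that the figures' vertical axis uses a base-10 logarithm, the function names, and the sample numbers are all assumptions for illustration, not values or definitions taken from the experiments.

```python
import math

def relative_difference(f_initial, f_optimal):
    """|f(initial) - f(optimal)| / |f(optimal)| (assumed definition)."""
    return abs(f_initial - f_optimal) / abs(f_optimal)

def log_relative_difference(f_current, f_optimal):
    """Base-10 log of the relative difference (assumed scale of the plots)."""
    return math.log10(relative_difference(f_current, f_optimal))

if __name__ == "__main__":
    # Hypothetical objective values, chosen only to exercise the functions.
    print(relative_difference(0.0, -12.5))
    print(log_relative_difference(1.0001, 1.0))
```

Under this definition, a starting point whose objective value is 0 always gives a relative difference of exactly 1. If the dual run without warm start begins from α = 0 (an assumption) and h(0, C) = 0 as for L2-loss SVM, this would be consistent with the 1.0e+00 entries in the wo-ws column of the Dual rows.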


Data set     Formulation  C = 1                               C = 128
                          wo-ws    r = 5    r = 50   r = 500  wo-ws    r = 5    r = 50   r = 500
ijcnn        Primal       3.2e+00  3.3e-04  2.4e-05  1.7e-06  3.2e+00  3.4e-04  2.5e-05  1.7e-06
             Dual         1.0e+00  1.9e-01  1.9e-02  1.3e-03  1.0e+00  1.9e-01  1.9e-02  1.3e-03
webspam      Primal       2.9e+00  2.4e-04  2.6e-05  3.8e-06  3.0e+00  3.2e-04  3.8e-05  4.3e-06
             Dual         1.0e+00  2.0e-01  2.0e-02  2.4e-03  1.0e+00  2.0e-01  2.0e-02  2.4e-03
news20       Primal       9.3e+00  1.8e-01  1.5e-02  1.3e-03  2.0e+02  5.1e+00  4.4e-01  3.3e-02
             Dual         1.0e+00  1.6e-01  1.2e-02  1.0e-03  1.0e+00  2.3e-01  2.3e-02  2.1e-04
rcv1         Primal       1.3e+01  1.8e-01  1.4e-02  2.6e-03  3.1e+02  9.0e+00  7.5e-01  1.7e-01
             Dual         1.0e+00  1.8e-01  1.4e-02  2.0e-03  1.0e+00  2.3e-01  2.6e-02  9.9e-04
real-sim     Primal       1.5e+01  1.1e-01  1.1e-02  1.4e-03  1.5e+02  3.9e+00  4.5e-01  3.7e-02
             Dual         1.0e+00  1.8e-01  1.7e-02  2.0e-03  1.0e+00  3.0e-01  3.3e-02  8.0e-04
yahoo-japan  Primal       5.8e+00  1.3e-01  1.1e-02  1.1e-03  5.5e+01  4.6e+00  4.4e-01  4.9e-02
             Dual         1.0e+00  2.2e-01  2.1e-02  2.3e-03  1.0e+00  2.7e-01  2.7e-02  2.9e-03

(a) Incremental learning

Data set     Formulation  C = 1                               C = 128
                          wo-ws    r = 5    r = 50   r = 500  wo-ws    r = 5    r = 50   r = 500
ijcnn        Primal       3.2e+00  3.3e-04  3.7e-05  2.5e-06  3.2e+00  3.4e-04  3.7e-05  2.6e-06
             Dual         1.0e+00  6.8e-01  8.0e-02  3.5e-03  1.0e+00  8.5e+01  8.3e+00  4.4e-01
webspam      Primal       2.9e+00  2.3e-04  3.4e-05  5.8e-06  3.0e+00  3.2e-04  5.1e-05  6.8e-06
             Dual         1.0e+00  5.1e-02  9.6e-03  1.4e-03  1.0e+00  3.1e+00  7.6e-01  1.9e-01
news20       Primal       8.5e+00  7.7e-02  8.6e-03  7.4e-04  2.1e+02  1.0e-01  1.7e-02  3.2e-04
             Dual         1.0e+00  9.1e-02  6.1e-03  3.3e-04  1.0e+00  1.6e+01  1.4e+00  1.6e-04
rcv1         Primal       1.3e+01  8.6e-02  8.1e-03  1.4e-03  3.2e+02  1.9e-01  2.4e-02  1.3e-03
             Dual         1.0e+00  1.0e-01  7.5e-03  7.7e-04  1.0e+00  2.1e+01  2.9e-01  7.3e-04
real-sim     Primal       1.5e+01  6.3e-02  7.8e-03  1.2e-03  1.7e+02  1.9e-01  1.8e-02  1.4e-03
             Dual         1.0e+00  1.5e-01  1.6e-02  2.4e-03  1.0e+00  2.4e+01  1.3e-01  4.2e-02
yahoo-japan  Primal       5.9e+00  7.5e-02  8.5e-03  8.7e-04  5.9e+01  2.0e-01  2.0e-02  2.8e-03
             Dual         1.0e+00  2.0e-01  2.4e-02  2.8e-03  1.0e+00  3.2e+01  5.3e+00  2.4e-01

(b) Decremental learning

[Figure I: plots omitted. Panels: (a) ijcnn, (b) webspam, (c) news20, (d) rcv1, (e) real-sim, (f) yahoo-japan, each with three plots labeled TRON, DCD, PCD. x-axis: training time (seconds); y-axis: relative function value difference (log). Legend: no-warm-start, warm-start-r5, warm-start-r50, warm-start-r500.]

Figure I: Incremental learning: running time (in seconds) versus the relative objective value difference. Logistic regression with C = 128 is used.

[Figure II: plots omitted. Panels: (a) ijcnn, (b) webspam, (c) news20, (d) rcv1, (e) real-sim, (f) yahoo-japan, three plots each. Legend: warm-start-r5, warm-start-r50, warm-start-r500.]

Figure II: Decremental learning: running time (in seconds) versus the relative objective value difference. Logistic regression with C = 128 is used.

[Figure III: plots omitted. Panels: (a) ijcnn, (b) webspam, (c) news20, (d) rcv1, (e) real-sim, (f) yahoo-japan, each with three plots labeled TRON, DCD, PCD.]

Figure III: Incremental learning: running time (in seconds) versus the relative objective value difference. L2-SVM with C = 1 is used.

[Figure IV: plots omitted. Panels: (a) ijcnn, (b) webspam, (c) news20, (d) rcv1, (e) real-sim, (f) yahoo-japan, three plots each.]

Figure IV: Decremental learning: running time (in seconds) versus the relative objective value difference.

[Figure V: plots omitted. Panels: (a) ijcnn, (b) webspam, (c) news20, (d) rcv1, (e) real-sim, (f) yahoo-japan, each with three plots labeled TRON, DCD, PCD.]

Figure V: Incremental learning: running time (in seconds) versus the relative objective value difference.

[Figure VI: plots omitted. Panels: (a) ijcnn, (b) webspam, (c) news20, (d) rcv1, (e) real-sim, (f) yahoo-japan, three plots each.]

Figure VI: Decremental learning: running time (in seconds) versus the relative objective value difference.
