
(1)

Introduction of Symbolic Data Analysis and its Applications

Huiwen Wang (王惠文)

School of Economic Management,

Beijing Univ. of Aeronautics and Astronautics, Beijing 100083, China

E-mail: [email protected]

(2)

Outline

• A Short Introduction on Symbolic Data

• Concept of Interval Data

• Descriptive Statistics of Interval Data

• Principal Component Analysis of Interval Data

• Application I: Analysis on Speculation in China’s Stock Market

• Factor Interval Data Analysis

• Application II: Data Analysis on Futures Market of China

(3)

I. A Short Introduction on Symbolic Data

Founder of SDA: Edwin Diday (Paris IX)

• Introduced at the first conference of the International Federation of Classification Societies (IFCS), 1987.

• A large body of literature on symbolic data has been published in journals, proceedings and working reports.

• About 17 European research groups cooperated in a European ESPRIT project (No. 20821) named Symbolic Official Data Analysis System. As a result of this project, the software package SODAS was developed.

• H.-H. Bock, E. Diday (Editors), Analysis of Symbolic Data, Springer, 2000.

(4)

Motivation of SDA

In many domains of human activity it is quite common to record huge sets of data.

Data mining: how to treat the huge sets of data

(1) Classify the dataset into several groups.
(2) Summarize each group by symbolic data.

Knowledge mining: extend the classical statistical methods

(1) Descriptive statistics of symbolic data: mean, variance, histogram.
(2) Multivariate analysis of symbolic data: factor analysis, classification, regression.

(5)

Example: Analysis on Stock Market of China

Problem:

• The total number of stocks is very large.

• The number of individuals changes every year.

• Some individuals may be renamed.

Total number of stocks in the market:

| Year | 1990 | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of stocks | 10 | 306 | 509 | 714 | 819 | 912 | 1088 | 1137 | 1134 | 1215 | 1234 |

(6)

[Figure: scatter plot with axes "REGR factor score 1 for analysis 1" and "REGR factor score 2 for analysis 1"; each point is labelled with a stock number.]

PCA on 1088 stocks of China (2001)

Variables: Negotiable Market Capitalization (NMC), Return rate, Turnover rate, P/E ratio, Volatility

(7)

[Figure: scatter plot with axes "REGR factor score 1 for analysis 1" and "REGR factor score 2 for analysis 1"; each point is labelled with a stock number.]

PCA after deleting 15 outliers

(8)

[Figure: scatter plot with axes "REGR factor score 1 for analysis 1" and "REGR factor score 2 for analysis 1".]

PCA after deleting 113 outliers

(9)

Stock Style Classification Based On CITIC Style Indices

Tab. 4-1 The classification of the six CITIC style indices

| | Low B/P value | High B/P value |
|---|---|---|
| High market capitalization | Large-cap growth | Large-cap value |
| Mid-market capitalization | Mid-cap growth | Mid-cap value |
| Low market capitalization | Small-cap growth | Small-cap value |

Size: represented by Negotiable Market Capitalization (NMC)

B/P ratio: net assets / market value

CITIC: China International Trust and Investment Corporation

(10)

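Interval Data Table

| Object | Market Capitalization | Turnover Rate | Return Rate | P/E Ratio | Volatility |
|---|---|---|---|---|---|
| Large-Cap Growth | $[\underline{x}_{11}, \overline{x}_{11}]$ | $[\underline{x}_{12}, \overline{x}_{12}]$ | $[\underline{x}_{13}, \overline{x}_{13}]$ | $[\underline{x}_{14}, \overline{x}_{14}]$ | $[\underline{x}_{15}, \overline{x}_{15}]$ |
| Large-Cap Value | $[\underline{x}_{21}, \overline{x}_{21}]$ | $[\underline{x}_{22}, \overline{x}_{22}]$ | $[\underline{x}_{23}, \overline{x}_{23}]$ | $[\underline{x}_{24}, \overline{x}_{24}]$ | $[\underline{x}_{25}, \overline{x}_{25}]$ |
| Mid-Cap Growth | $[\underline{x}_{31}, \overline{x}_{31}]$ | $[\underline{x}_{32}, \overline{x}_{32}]$ | $[\underline{x}_{33}, \overline{x}_{33}]$ | $[\underline{x}_{34}, \overline{x}_{34}]$ | $[\underline{x}_{35}, \overline{x}_{35}]$ |
| Mid-Cap Value | $[\underline{x}_{41}, \overline{x}_{41}]$ | $[\underline{x}_{42}, \overline{x}_{42}]$ | $[\underline{x}_{43}, \overline{x}_{43}]$ | $[\underline{x}_{44}, \overline{x}_{44}]$ | $[\underline{x}_{45}, \overline{x}_{45}]$ |
| Small-Cap Growth | $[\underline{x}_{51}, \overline{x}_{51}]$ | $[\underline{x}_{52}, \overline{x}_{52}]$ | $[\underline{x}_{53}, \overline{x}_{53}]$ | $[\underline{x}_{54}, \overline{x}_{54}]$ | $[\underline{x}_{55}, \overline{x}_{55}]$ |
| Small-Cap Value | $[\underline{x}_{61}, \overline{x}_{61}]$ | $[\underline{x}_{62}, \overline{x}_{62}]$ | $[\underline{x}_{63}, \overline{x}_{63}]$ | $[\underline{x}_{64}, \overline{x}_{64}]$ | $[\underline{x}_{65}, \overline{x}_{65}]$ |

The rows are the objects (style classes) and the columns are the variables. Each cell reflects the central tendency and the dispersion tendency of the class:

\[
\underline{x}_{ij} = \min_{k = 1, \ldots, m_i} \{ a_{ij}^{k} \}, \qquad
\overline{x}_{ij} = \max_{k = 1, \ldots, m_i} \{ a_{ij}^{k} \}
\]

** Here we use the interquartile range to prevent the influence of outliers.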

(11)

The Scatter Plot of PCA

[Figure: PCA scatter plot of the six style classes, labelled LargeCapG., LargeCapV., MidCapG., MidCapV., SmallCapG. and SmallCapV.]

Quick Impression:

• Size style is the primary factor influencing the activity of market deals.

• Growth versus value style is also an important factor that can describe the behaviour of the stock market in China.

(12)

Definition of Symbolic Data

E. Diday, 1987, 1996

A cell of the data table may contain:

(1) Traditional data: a single quantitative or qualitative value.

(2) A set of values or categories (multivalued variable). Ex: Height(w) = {3.5, 2.1, 5} means that the height of w can be either 3.5 or 2.1 or 5.

(3) An interval. Ex: Height(w) = [3, 5] means that the height of w varies in the interval [3, 5].

(4) A set of values with associated weights (distribution variable): histogram, membership function.

(13)

Example:

We consider three relations: Supplier, Delivery, and Customer

| Supplier | Weekly working time | Customer | Delivery’s Town | Weight of delivery |
|---|---|---|---|---|
| S1 | 34 | SG1, SG3 | Paris (0.5), Lannion (0.5) | [12, 15] |
| S2 | 45 | SG2 | Toulouse (1) | [20, 25] |
| S3 | 40 | SG4, SG5 | Lille (0.4), Lyon (0.6) | [34, 40] |

Customer is a multivalued variable, Delivery’s Town is a distribution variable, and Weight of delivery is an interval variable.

Note: Compositional data is a kind of distribution data.

(14)

Characteristics

• Reduces the complexity of the data system, since SDA reduces the aggregation scale by first classifying the original dataset.

• Powerful in summarizing global features of the data set, because it can hold the overall characteristics of each sample group and extract internal rules of the data set.

• Visualizes analysis results; it is especially suitable for knowledge mining and visualization research on large-scale data sets.

(15)

II. Concept of Interval Data

Interval Data: the elements of the matrix are intervals of real numbers.

The interval data may result from three sources:

– Imprecision of the measurement: $[a_{ij} - \delta,\ a_{ij} + \delta]$, where $\delta$ is the range of error.
– Expert’s knowledge: including a range of uncertainty.
– Summary of a class of a dataset (second-order object): the object i to be described is a class of elements.

Let $\{ a_{ij}^{1}, a_{ij}^{2}, \ldots, a_{ij}^{m_i} \}$ be the $m_i$ observations of the feature j for the object i. Then

\[
\xi_{ij} = [\underline{x}_{ij}, \overline{x}_{ij}], \qquad \text{where} \quad
\underline{x}_{ij} = \min_{k = 1, \ldots, m_i} \{ a_{ij}^{k} \}, \quad
\overline{x}_{ij} = \max_{k = 1, \ldots, m_i} \{ a_{ij}^{k} \}
\]
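The min/max summary of a class of observations into intervals can be sketched in a few lines of Python. This is only an illustration: the function names `summarize_class` and `interval_matrix` and the toy classes are assumptions made for the example, not part of the slides.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]


def summarize_class(observations: List[List[float]]) -> List[Interval]:
    """Summarize the m_i observations of one class into p intervals.

    observations[k][j] plays the role of a_ij^k: the value of feature j
    for the k-th element of the class.
    """
    if not observations:
        raise ValueError("a class needs at least one observation")
    p = len(observations[0])
    columns = [[row[j] for row in observations] for j in range(p)]
    # Lower bound = min over the class, upper bound = max over the class.
    return [(min(col), max(col)) for col in columns]


def interval_matrix(classes: Dict[str, List[List[float]]]) -> Dict[str, List[Interval]]:
    """Build one row of intervals per class (one row of the interval data matrix)."""
    return {name: summarize_class(rows) for name, rows in classes.items()}


# Hypothetical toy data: two classes described by two features.
toy_classes = {
    "class_A": [[1.0, 10.0], [3.0, 12.0], [2.0, 11.0]],
    "class_B": [[5.0, 20.0], [4.0, 25.0]],
}
matrix = interval_matrix(toy_classes)
for name, row in matrix.items():
    print(name, row)
```

Plain min and max are sensitive to outliers; a quantile-based bound could be substituted inside `summarize_class` without changing the rest of the sketch.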

(16)

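Interval Data Matrix

• Let $i = 1, 2, \ldots, n$ denote n objects (classes) described by p features $Y_1, Y_2, \ldots, Y_p$ of interval type.

• Let $\xi_{ij} = [\underline{x}_{ij}, \overline{x}_{ij}]$ be the interval which includes all the possible values of feature j for object i. The resulting symbolic data matrix is given by:

\[
X_{n \times p} =
\begin{pmatrix}
\xi_{11} & \cdots & \xi_{1p} \\
\vdots & & \vdots \\
\xi_{n1} & \cdots & \xi_{np}
\end{pmatrix}
=
\begin{pmatrix}
[\underline{x}_{11}, \overline{x}_{11}] & \cdots & [\underline{x}_{1p}, \overline{x}_{1p}] \\
\vdots & & \vdots \\
[\underline{x}_{n1}, \overline{x}_{n1}] & \cdots & [\underline{x}_{np}, \overline{x}_{np}]
\end{pmatrix}
\]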

(17)

III. Descriptive Statistics of Interval Data

Basic set: E = {1, 2, …, n}

X is a univariate interval-valued variable measured for each observed unit of the basic set E = {1, 2, …, n}.

For each k ∈ E, we denote

\[
X(k) = [\underline{x}_k, \overline{x}_k]
\]

where $\underline{x}_k$ is the lower bound of the interval and $\overline{x}_k$ is the upper bound of the interval.

(18)

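The Equi-Distribution Hypothesis:

(1) Each observed unit k ∈ E is equally likely, i.e. each observed unit k ∈ E has been selected with the same probability 1/n.

(2) The values of x are uniformly distributed in the interval $[\underline{x}_k, \overline{x}_k]$:

\[
\Pr\left( x \le \xi \mid x \in [\underline{x}_k, \overline{x}_k] \right) =
\begin{cases}
0 & \text{if } \xi < \underline{x}_k \\[4pt]
\dfrac{\xi - \underline{x}_k}{\overline{x}_k - \underline{x}_k} & \text{if } \underline{x}_k \le \xi < \overline{x}_k \\[4pt]
1 & \text{if } \overline{x}_k \le \xi
\end{cases}
\qquad \forall\, \xi \in \mathbb{R}
\]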

(19)

• Empirical Distribution Function of X

The EDF is the distribution function of the uniform mixture of the n uniform distributions defined on the intervals $X(k) = [\underline{x}_k, \overline{x}_k]$, for all k ∈ E:

\[
F_X(\xi) = \frac{1}{n} \sum_{k \in E} \Pr\left( x \le \xi \mid x \in [\underline{x}_k, \overline{x}_k] \right)
\]

Thus:

\[
F_X(\xi) = \frac{1}{n} \left[ \sum_{k :\, \xi \in X(k)} \frac{\xi - \underline{x}_k}{\overline{x}_k - \underline{x}_k} + \#\{ j \in E \mid \xi \ge \overline{x}_j \} \right]
\]

where $\#\{ j \in E \mid \xi \ge \overline{x}_j \}$ is the number of intervals $X(j) = [\underline{x}_j, \overline{x}_j]$ with $\xi \ge \overline{x}_j$.
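As an illustration of the empirical distribution function under the equi-distribution hypothesis, here is a minimal Python sketch. The function name `edf` and the three-interval example are assumptions chosen for the illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def edf(intervals: List[Interval], xi: float) -> float:
    """Empirical distribution function of an interval-valued variable.

    Each interval has weight 1/n and contributes:
      1                      if xi is at or beyond its upper bound,
      (xi - lo) / (hi - lo)  if xi lies strictly inside the interval,
      0                      otherwise.
    """
    n = len(intervals)
    total = 0.0
    for lo, hi in intervals:
        if xi >= hi:
            total += 1.0          # interval entirely at or below xi
        elif xi > lo:
            total += (xi - lo) / (hi - lo)  # uniform share of the interval
    return total / n


# Hypothetical example with three intervals.
example = [(0.0, 2.0), (1.0, 3.0), (2.0, 4.0)]
for point in (-1.0, 1.0, 2.0, 4.0):
    print(point, edf(example, point))
```

At ξ = 2 the first interval counts fully, the second contributes (2 − 1)/(3 − 1) = 1/2 and the third contributes nothing, which gives F(2) = 1.5/3 = 0.5.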

(20)

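• Empirical Density Function of X

From the EDF

\[
F_X(\xi) = \frac{1}{n} \left[ \sum_{k :\, \xi \in X(k)} \frac{\xi - \underline{x}_k}{\overline{x}_k - \underline{x}_k} + \#\{ j \in E \mid \xi \ge \overline{x}_j \} \right]
\]

we get, by taking the derivative, the empirical density function of X:

\[
f_X(\xi) = \frac{1}{n} \sum_{k :\, \xi \in X(k)} \frac{1}{\overline{x}_k - \underline{x}_k}
= \frac{1}{n} \sum_{k \in E} \frac{\mathbf{1}_{X(k)}(\xi)}{\overline{x}_k - \underline{x}_k},
\qquad \xi \in \mathbb{R}
\]

Here $\mathbf{1}_{X(k)}$ is an indicator function:

\[
\mathbf{1}_{X(k)}(\xi) =
\begin{cases}
1 & \text{if } \xi \in X(k) \\
0 & \text{if } \xi \notin X(k)
\end{cases}
\]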

(21)

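• Empirical Mean of X

With the empirical density function

\[
f_X(\xi) = \frac{1}{n} \sum_{k \in E} \frac{\mathbf{1}_{X(k)}(\xi)}{\overline{x}_k - \underline{x}_k},
\qquad
\mathbf{1}_{X(k)}(\xi) =
\begin{cases}
1 & \text{if } \xi \in [\underline{x}_k, \overline{x}_k] \\
0 & \text{if } \xi \notin [\underline{x}_k, \overline{x}_k]
\end{cases}
\]

the empirical mean is defined and computed as

\[
\begin{aligned}
\overline{X} &:= \int_{-\infty}^{+\infty} \xi\, f_X(\xi)\, d\xi
= \frac{1}{n} \sum_{k \in E} \frac{1}{\overline{x}_k - \underline{x}_k} \int_{-\infty}^{+\infty} \xi\, \mathbf{1}_{X(k)}(\xi)\, d\xi \\
&= \frac{1}{n} \sum_{k \in E} \frac{1}{\overline{x}_k - \underline{x}_k} \int_{\underline{x}_k}^{\overline{x}_k} \xi\, d\xi
= \frac{1}{n} \sum_{k \in E} \frac{\underline{x}_k + \overline{x}_k}{2}
\end{aligned}
\]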

(22)

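• Empirical Standard Deviation:

\[
S_X^2 := \int_{-\infty}^{+\infty} (\xi - \overline{X})^2 f_X(\xi)\, d\xi
= \int_{-\infty}^{+\infty} \xi^2 f_X(\xi)\, d\xi - \overline{X}^2
\]

where

\[
\int_{-\infty}^{+\infty} \xi^2 f_X(\xi)\, d\xi
= \frac{1}{n} \sum_{k \in E} \frac{1}{\overline{x}_k - \underline{x}_k} \int_{\underline{x}_k}^{\overline{x}_k} \xi^2\, d\xi
= \frac{1}{3n} \sum_{k \in E} \frac{\overline{x}_k^3 - \underline{x}_k^3}{\overline{x}_k - \underline{x}_k}
= \frac{1}{3n} \sum_{k \in E} \left( \underline{x}_k^2 + \underline{x}_k \overline{x}_k + \overline{x}_k^2 \right)
\]

Hence

\[
S_X^2 = \frac{1}{3n} \sum_{k \in E} \left( \underline{x}_k^2 + \underline{x}_k \overline{x}_k + \overline{x}_k^2 \right)
- \frac{1}{4n^2} \left[ \sum_{k \in E} \left( \underline{x}_k + \overline{x}_k \right) \right]^2
\]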

(23)

Example: We observe an interval-valued variable X defined on a set E = {1, 2, …, 8}. The observed values of X at set E are:

X(E) = {[0,2]; [1,3]; [1.5,2.5]; [2,4]; [3.5,5]; [4.5,5.5]; [5,7]; [6.5,7.5]}

\[
\overline{X} = \frac{1}{8} \left( \frac{0 + 2}{2} + \frac{1 + 3}{2} + \cdots + \frac{5 + 7}{2} + \frac{6.5 + 7.5}{2} \right) \cong 3.781
\]

\[
S_X = \sqrt{ \frac{443.5}{24} - 3.781^2 } \cong 2.045
\]
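The empirical mean and standard deviation of this eight-interval example can be checked numerically. The sketch below is an illustration in Python; the helper names `interval_mean` and `interval_std` are assumptions for the example.

```python
from math import sqrt
from typing import List, Tuple

Interval = Tuple[float, float]


def interval_mean(intervals: List[Interval]) -> float:
    """Empirical mean: the average of the interval midpoints."""
    return sum((lo + hi) / 2 for lo, hi in intervals) / len(intervals)


def interval_std(intervals: List[Interval]) -> float:
    """Empirical standard deviation under the equi-distribution hypothesis."""
    n = len(intervals)
    # Second moment: (1 / 3n) * sum(lo^2 + lo*hi + hi^2).
    second_moment = sum(lo * lo + lo * hi + hi * hi for lo, hi in intervals) / (3 * n)
    return sqrt(second_moment - interval_mean(intervals) ** 2)


# The eight observed intervals of the example.
X_E = [(0, 2), (1, 3), (1.5, 2.5), (2, 4), (3.5, 5), (4.5, 5.5), (5, 7), (6.5, 7.5)]
mean = interval_mean(X_E)
std = interval_std(X_E)
print(round(mean, 3), round(std, 3))
```

The sum of squares term equals 443.5, so the second moment is 443.5/24, in agreement with the hand calculation.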

(24)

• Histogram:

The histogram depends on the equi-distribution hypothesis:

(1) Each observed unit k ∈ E has been selected with the same probability 1/n.

(2) The values of x are uniformly distributed in the interval.

Example: An interval-valued variable X defined on a set E = {1, 2, 3}.

The observed values of X at set E are: X(E) = {[0,2]; [2,4]; [1,3]}

\[
p_{[0,2)} = p_{[2,4)} = p_{[1,3)} = \frac{1}{3}
\]

The three intervals, with their densities, are $([0,2), \tilde{f}_{[0,2)})$, $([2,4), \tilde{f}_{[2,4)})$ and $([1,3), \tilde{f}_{[1,3)})$, where

\[
\tilde{f}_{[0,2)} = p_{[0,2)} \times \frac{1}{2 - 0} = \frac{1}{3} \times \frac{1}{2} = \frac{1}{6}, \qquad
\tilde{f}_{[2,4)} = p_{[2,4)} \times \frac{1}{4 - 2} = \frac{1}{3} \times \frac{1}{2} = \frac{1}{6}, \qquad
\tilde{f}_{[1,3)} = p_{[1,3)} \times \frac{1}{3 - 1} = \frac{1}{3} \times \frac{1}{2} = \frac{1}{6}
\]

[Figure: histogram of f over [0, 4], with the levels 1/6 and 1/3 marked on the vertical axis.]

(25)

X(E) = {[0,2]; [1,3]; [2,4]}

(1) The interval that contains all the observed single values of X is given by I = [0, 4].

(2) There are 5 distinct observed boundaries, ranked in increasing order:

\[
u_0 = 0, \quad u_1 = 1, \quad u_2 = 2, \quad u_3 = 3, \quad u_4 = 4
\]

(3) Consider the partition of I by the 4 intervals

\[
([0,1), f_1), \quad ([1,2), f_2), \quad ([2,3), f_3), \quad ([3,4], f_4)
\]

Calculation Process:

\[
\begin{aligned}
f_1 &= \tilde{f}_{[0,2)} = \frac{1}{3} \times \frac{1}{2 - 0} = \frac{1}{6} \\
f_2 &= \tilde{f}_{[0,2)} + \tilde{f}_{[1,3)} = \frac{1}{3} \times \left( \frac{1}{2 - 0} + \frac{1}{3 - 1} \right) = \frac{1}{3} \\
f_3 &= \tilde{f}_{[1,3)} + \tilde{f}_{[2,4)} = \frac{1}{3} \times \left( \frac{1}{3 - 1} + \frac{1}{4 - 2} \right) = \frac{1}{3} \\
f_4 &= \tilde{f}_{[2,4)} = \frac{1}{3} \times \frac{1}{4 - 2} = \frac{1}{6}
\end{aligned}
\]

[Figure: histogram of f over [0, 4], with the levels 1/6 and 1/3 marked on the vertical axis.]
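The calculation process for the histogram on the partition induced by the observed boundaries can be sketched as follows. This is a minimal Python illustration using exact fractions; the function name `boundary_histogram` is an assumption for the example.

```python
from fractions import Fraction
from typing import List, Tuple


def boundary_histogram(intervals: List[Tuple[int, int]]):
    """Density of each bar on the partition cut at all distinct boundaries.

    Every observed interval carries probability 1/n, spread uniformly over
    its length, so it adds (1/n) / (hi - lo) to each bar it covers.
    """
    n = len(intervals)
    bounds = sorted({b for interval in intervals for b in interval})
    bars = []
    for left, right in zip(bounds, bounds[1:]):
        density = Fraction(0)
        for lo, hi in intervals:
            if lo <= left and right <= hi:  # the bar lies inside this interval
                density += Fraction(1, n) / (hi - lo)
        bars.append(((left, right), density))
    return bars


# The three observed intervals of the small example.
bars = boundary_histogram([(0, 2), (1, 3), (2, 4)])
for (left, right), density in bars:
    print(f"[{left}, {right}): f = {density}")
```

For this example the four bars have densities 1/6, 1/3, 1/3 and 1/6, matching the hand calculation.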

(26)

Example: We observe an interval-valued variable X defined on a set E = {1, 2, …, 8}. The observed values of X at set E are:

X(E) = {[0,2]; [1,3]; [1.5,2.5]; [2,4]; [3.5,5]; [4.5,5.5]; [5,7]; [6.5,7.5]}

(1) The interval that contains all the observed single values of X is given by I = [0, 7.5].

(2) There are 14 distinct observed boundaries (two boundaries, 2 and 5, are repeated), ranked in increasing order:

\[
u_0 = 0, \quad u_1 = 1, \quad u_2 = 1.5, \quad \ldots, \quad u_{12} = 7, \quad u_{13} = 7.5
\]

(3) Consider the partition of I by the 13 intervals

\[
I_j = [u_{j-1}, u_j), \qquad j = 1, 2, \ldots, 13
\]

The corresponding density function is described by the list of pairs:

\[
\left( [u_{j-1}, u_j),\ f_j \right), \qquad j = 1, 2, \ldots, 13
\]

Let $p_j$ denote the area of the bar of base $I_j$ in the histogram:

\[
p_j = \Pr\left( x \in [u_{j-1}, u_j) \right) = (u_j - u_{j-1}) \times f_j
\]

(27)

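Observed values:

X(E) = {[0,2]; [1,3]; [1.5,2.5]; [2,4]; [3.5,5]; [4.5,5.5]; [5,7]; [6.5,7.5]}, n = 8

Observed boundaries:

\[
u_0 = 0, \quad u_1 = 1, \quad u_2 = 1.5, \quad u_3 = 2, \quad \ldots, \quad u_{12} = 7, \quad u_{13} = 7.5
\]

\[
\begin{aligned}
I_1 = [0, 1) &: \quad f_1 = \frac{1}{8} \times \frac{1}{2 - 0} = \frac{1}{16} \\
I_2 = [1, 1.5) &: \quad f_2 = \frac{1}{8} \times \left[ \frac{1}{2 - 0} + \frac{1}{3 - 1} \right] = \frac{1}{8} \\
I_3 = [1.5, 2) &: \quad f_3 = \frac{1}{8} \times \left[ \frac{1}{2 - 0} + \frac{1}{3 - 1} + \frac{1}{2.5 - 1.5} \right] = \frac{1}{4}
\end{aligned}
\]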

(28)

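Thus the corresponding list $\left( [u_{j-1}, u_j),\ f_j \right)$, $j = 1, 2, \ldots, 13$, is given by:

\[
\begin{aligned}
&\left( [0, 1), \tfrac{1}{16} \right), \ \left( [1, 1.5), \tfrac{1}{8} \right), \ \left( [1.5, 2), \tfrac{1}{4} \right), \ \left( [2, 2.5), \tfrac{1}{4} \right), \\
&\left( [2.5, 3), \tfrac{1}{8} \right), \ \left( [3, 3.5), \tfrac{1}{16} \right), \ \left( [3.5, 4), \tfrac{7}{48} \right), \ \left( [4, 4.5), \tfrac{2}{24} \right), \ \left( [4.5, 5), \tfrac{5}{24} \right), \\
&\left( [5, 5.5), \tfrac{3}{16} \right), \ \left( [5.5, 6.5), \tfrac{1}{16} \right), \ \left( [6.5, 7), \tfrac{3}{16} \right), \ \left( [7, 7.5], \tfrac{1}{8} \right)
\end{aligned}
\]

The areas of the bars follow from $p_j = f_j \times (u_j - u_{j-1})$:

\[
p_1 = \frac{1}{16} \times (1 - 0) = \frac{1}{16}, \qquad
p_2 = \frac{1}{8} \times (1.5 - 1) = \frac{1}{16}, \qquad
p_3 = \frac{1}{4} \times (2 - 1.5) = \frac{1}{8}
\]

[Figure: histogram of the 13 bars over [0, 7.5]; vertical axis ticks 0, 1/20, 1/10, 3/20, 1/5, 1/4, 3/10.]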

(29)

Now, consider a new partition of the interval I = [0, 8] with 5 subintervals:

\[
I' = [0, 1.5) \cup [1.5, 3.5) \cup [3.5, 5) \cup [5, 6.5) \cup [6.5, 8]
\]

Remark that

\[
p'_\alpha = \sum_{I_j \subset I'_\alpha} p_j, \qquad \alpha = 1, 2, 3, 4, 5
\]

For $I'_1 = [0, 1.5) = [0, 1) \cup [1, 1.5)$:

\[
p'_1 = p_1 + p_2 = \frac{1}{16} \times (1 - 0) + \frac{1}{8} \times (1.5 - 1) = \frac{1}{8},
\qquad
f'_1 = \frac{p'_1}{1.5 - 0} = \frac{1/8}{1.5} = \frac{1}{12}
\]

Thus, $\left( [0, 1.5), \tfrac{1}{12} \right)$.

(30)

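[Figure: histogram for the new partition over [0, 8]; vertical axis ticks 0, 1/20, 1/10, 3/20, 1/5.]

With the same methodology, we get the empirical frequency distribution:

\[
\left( [0, 1.5), \tfrac{1}{12} \right); \quad
\left( [1.5, 3.5), \tfrac{11}{64} \right); \quad
\left( [3.5, 5), \tfrac{7}{48} \right); \quad
\left( [5, 6.5), \tfrac{5}{48} \right); \quad
\left( [6.5, 8], \tfrac{5}{48} \right)
\]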
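The passage from the fine partition to a coarser one can be reproduced with exact fractions. The sketch below is an illustration in Python; the helper names `fine_bars` and `rebin` are assumptions for the example, and it assumes that every coarse edge inside the data range is one of the observed boundaries.

```python
from fractions import Fraction as F


def fine_bars(intervals):
    """Bars (left, right, density) on the partition cut at all boundaries."""
    n = len(intervals)
    bounds = sorted({b for interval in intervals for b in interval})
    bars = []
    for left, right in zip(bounds, bounds[1:]):
        density = sum((F(1, n) / (hi - lo) for lo, hi in intervals
                       if lo <= left and right <= hi), F(0))
        bars.append((left, right, density))
    return bars


def rebin(bars, edges):
    """Aggregate bar areas into coarser bins and convert back to densities."""
    coarse = []
    for a, b in zip(edges, edges[1:]):
        # Area of the coarse bin = sum of the areas of the fine bars inside it.
        area = sum((f * (right - left) for left, right, f in bars
                    if a <= left and right <= b), F(0))
        coarse.append((a, b, area / (b - a)))
    return coarse


# The eight observed intervals and the five-bin partition of [0, 8].
X_E = [(F(0), F(2)), (F(1), F(3)), (F(3, 2), F(5, 2)), (F(2), F(4)),
       (F(7, 2), F(5)), (F(9, 2), F(11, 2)), (F(5), F(7)), (F(13, 2), F(15, 2))]
edges = [F(0), F(3, 2), F(7, 2), F(5), F(13, 2), F(8)]
coarse = rebin(fine_bars(X_E), edges)
for a, b, f in coarse:
    print(f"[{float(a)}, {float(b)}): f = {f}")
```

Working with `Fraction` keeps the densities exact, so they can be compared directly with the values 1/12, 11/64, 7/48, 5/48 and 5/48.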

(31)

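IV. PCA of Interval Data

(Cazes, Chouakria, Diday & Schektman, 1997)

Interval Data Matrix

Let $i = 1, 2, \ldots, n$ denote n objects described by p features $Y_1, Y_2, \ldots, Y_p$ of interval type.

\[
X =
\begin{pmatrix}
\xi_{11} & \cdots & \xi_{1p} \\
\vdots & & \vdots \\
\xi_{n1} & \cdots & \xi_{np}
\end{pmatrix}
=
\begin{pmatrix}
[\underline{x}_{11}, \overline{x}_{11}] & \cdots & [\underline{x}_{1p}, \overline{x}_{1p}] \\
\vdots & & \vdots \\
[\underline{x}_{n1}, \overline{x}_{n1}] & \cdots & [\underline{x}_{np}, \overline{x}_{np}]
\end{pmatrix}
\]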

(32)

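1. The VERTICES Method

• Let $x_i = (\xi_{i1}, \ldots, \xi_{ip}) = ([\underline{x}_{i1}, \overline{x}_{i1}], \ldots, [\underline{x}_{ip}, \overline{x}_{ip}])$ denote the interval data vector of object i. This data point can be visualized as a hyper-rectangle $R_i$ in the space $\mathbb{R}^p$ with $2^p$ vertices.

• For p = 2, the object vector is $O_i = ([\underline{x}_{i1}, \overline{x}_{i1}], [\underline{x}_{i2}, \overline{x}_{i2}])$,

• with $2^2$ vertices as follows:

\[
M_i =
\begin{bmatrix}
\underline{x}_{i1} & \underline{x}_{i2} \\
\underline{x}_{i1} & \overline{x}_{i2} \\
\overline{x}_{i1} & \underline{x}_{i2} \\
\overline{x}_{i1} & \overline{x}_{i2}
\end{bmatrix}
\]

[Figure: the rectangle $R_i$ in the $(X_1, X_2)$ plane, with its vertices given by the rows of $M_i$.]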

(33)

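Transformation to Numerical Matrix

• Describe each interval-valued object $O_i$ by a numerical data matrix $M_i$ with $2^p$ rows and p columns.

• Compile the matrices $M_1, \ldots, M_n$ into the numerical matrix M with $n \times 2^p$ rows and p columns.

Original symbolic data matrix:

\[
X_{n \times p} =
\begin{pmatrix} O_1 \\ \vdots \\ O_n \end{pmatrix}
=
\begin{pmatrix}
[\underline{x}_{11}, \overline{x}_{11}] & \cdots & [\underline{x}_{1p}, \overline{x}_{1p}] \\
\vdots & & \vdots \\
[\underline{x}_{n1}, \overline{x}_{n1}] & \cdots & [\underline{x}_{np}, \overline{x}_{np}]
\end{pmatrix}
\]

Numerical matrix from the vertices:

\[
M_{n 2^p \times p} =
\begin{pmatrix} M_1 \\ \vdots \\ M_n \end{pmatrix}
=
\begin{pmatrix}
\begin{bmatrix}
\underline{x}_{11} & \cdots & \underline{x}_{1p} \\
\vdots & & \vdots \\
\overline{x}_{11} & \cdots & \overline{x}_{1p}
\end{bmatrix} \\
\vdots \\
\begin{bmatrix}
\underline{x}_{n1} & \cdots & \underline{x}_{np} \\
\vdots & & \vdots \\
\overline{x}_{n1} & \cdots & \overline{x}_{np}
\end{bmatrix}
\end{pmatrix}
\]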

(34)

• Apply the classical PCA method to the numerical matrix M, with a suitable choice of dimension s.

• Let $Y_1^*, Y_2^*, \ldots, Y_s^*$ denote the first s numerical principal components:

\[
Y^* = [Y_1^*, \ldots, Y_s^*]_{n 2^p \times s}, \qquad n 2^s < n 2^p
\]

[Diagram: $X_{n \times p}$ → transformation to numerical matrix → $M_{n 2^p \times p} = (M_1; \ldots; M_n)$ → PCA → $Y^*$.]

How to construct the interval-type principal components from these numerical principal components $Y_1^*, Y_2^*, \ldots, Y_s^*$?

\[
Y_{n \times s} =
\begin{pmatrix}
[\underline{y}_{11}, \overline{y}_{11}] & \cdots & [\underline{y}_{1s}, \overline{y}_{1s}] \\
\vdots & & \vdots \\
[\underline{y}_{n1}, \overline{y}_{n1}] & \cdots & [\underline{y}_{ns}, \overline{y}_{ns}]
\end{pmatrix}
\]

(35)

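[Diagram: $X_{n \times p}$ → Tran. → $M_{n 2^p \times p}$ → PCA → $Y^*_{n 2^p \times s}$ → ??? → $Y_{n \times s}$, whose vertex matrix $M^Y$ (Tran. of Y) has only $n 2^s \times s$ entries, with $n 2^s \ll n 2^p$. Illustration: for p = 3 each object is a cube with $2^3 = 8$ vertices; for s = 2 it is a square with $2^2 = 4$ vertices.]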

(36)

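Construct the interval-type principal components from these numerical principal components $Y_1^*, Y_2^*, \ldots, Y_s^*$.

Let $L_i$ be the set of row indices in matrix M that refer to the matrix $M_i$ corresponding to the i-th symbolic object.

For $k \in L_i$, let $y_{kh}$ be the value of the numerical principal component $Y_h^*$ with row index k. The value of the interval-type principal component $Y_h^I$ for the i-th object is then defined by

\[
y_{ih} = [\underline{y}_{ih}, \overline{y}_{ih}]
\]

where

\[
\underline{y}_{ih} = \min_{k \in L_i} (y_{kh}), \qquad
\overline{y}_{ih} = \max_{k \in L_i} (y_{kh})
\]

[Diagram: the blocks $M_1, M_2, \ldots$ of M are mapped by PCA to the row sets $L_1, L_2, \ldots$ of $Y^*$, which yield the rows of Y for the objects $O_1, O_2, \ldots$]

\[
Y_{n \times s} =
\begin{pmatrix}
[\underline{y}_{11}, \overline{y}_{11}] & \cdots & [\underline{y}_{1s}, \overline{y}_{1s}] \\
\vdots & & \vdots \\
[\underline{y}_{n1}, \overline{y}_{n1}] & \cdots & [\underline{y}_{ns}, \overline{y}_{ns}]
\end{pmatrix}
\]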
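The whole vertices procedure (build M, run a classical PCA, take the min and max score of each object's vertices) can be sketched in plain Python. This is an illustration only, not the SODAS implementation: it computes just the first principal axis by power iteration, and the names `leading_axis`, `vertices_pca` and the toy objects are assumptions for the example.

```python
from itertools import product


def leading_axis(rows, iterations=500):
    """Column means and leading covariance eigenvector (power iteration)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    cov = [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / n
            for b in range(p)] for a in range(p)]
    u = [1.0] * p  # assumes this start is not orthogonal to the leading axis
    for _ in range(iterations):
        v = [sum(cov[a][b] * u[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in v) ** 0.5
        u = [x / norm for x in v]
    return means, u


def vertices_pca(objects):
    """First interval-valued principal component by the vertices method."""
    rows, owner = [], []
    for i, obj in enumerate(objects):
        for vertex in product(*obj):  # the 2^p vertices of hyper-rectangle i
            rows.append(vertex)
            owner.append(i)           # row index set L_i, stored per row
    means, u = leading_axis(rows)
    scores = [sum((x - m) * w for x, m, w in zip(row, means, u)) for row in rows]
    intervals = []
    for i in range(len(objects)):
        mine = [s for s, o in zip(scores, owner) if o == i]
        intervals.append((min(mine), max(mine)))  # [min, max] over L_i
    return u, intervals


# Hypothetical toy data: three objects described by two interval features.
objects = [[(1, 2), (1, 3)], [(3, 5), (4, 5)], [(6, 7), (6, 9)]]
axis, intervals = vertices_pca(objects)
print(axis)
print(intervals)
```

The matrix M built here has n × 2^p rows, which is why the method becomes expensive as p grows. In practice a linear-algebra library would replace the power iteration and return all s components.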

(37)

Possible Limitations of the Vertices Method

1. NP-hard problem: the computation becomes very heavy when the number of variables p is large, since the numerical matrix M has $n \times 2^p$ rows.

2. Accuracy problem: the hyper-rectangle may enlarge the range of the original dataset and reduce the precision of the analysis.

Example: Regression

(38)

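2. The CENTERS Method

To resolve the NP-hard problem of the vertices method, the centers method is proposed for interval data sets.

PCA Method

• Let

\[
x_i^c = (x_{i1}^c, \ldots, x_{ip}^c)' \in \mathbb{R}^p, \qquad i = 1, \ldots, n
\]

be the centers of the hyper-rectangles $R_i$, and collect them in the matrix

\[
\tilde{X}_{n \times p} =
\begin{pmatrix}
x_{11}^c & \cdots & x_{1p}^c \\
\vdots & & \vdots \\
x_{n1}^c & \cdots & x_{np}^c
\end{pmatrix}
= (X_1^c, \ldots, X_p^c)
\]

where

\[
x_{ij}^c = \frac{\underline{x}_{ij} + \overline{x}_{ij}}{2}, \qquad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, p
\]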

(39)

• The classical PCA method is applied to the centers matrix $\tilde{X}$ to find the factorial axes.

• Determine for each object i its interval values of principal components as follows.

Let

\[
\overline{x}_j^c = \frac{1}{n} \sum_{i=1}^{n} x_{ij}^c, \qquad j = 1, 2, \ldots, p
\]

be the mean of feature $X_j^c$.

Thus the h-th principal component of centre i is given by

\[
y_{ih}^c = \sum_{j=1}^{p} \left( x_{ij}^c - \overline{x}_j^c \right) u_{jh}
\]

where $u_h = (u_{1h}, \ldots, u_{ph})$ is the h-th eigenvector of $S = \mathrm{Cov}(\tilde{X})$.

(40)

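Definition:

\[
\begin{aligned}
\underline{y}_{ih} &= \sum_{j :\, u_{jh} < 0} \left( \overline{x}_{ij} - \overline{x}_j^c \right) u_{jh}
+ \sum_{j :\, u_{jh} > 0} \left( \underline{x}_{ij} - \overline{x}_j^c \right) u_{jh} \\
\overline{y}_{ih} &= \sum_{j :\, u_{jh} < 0} \left( \underline{x}_{ij} - \overline{x}_j^c \right) u_{jh}
+ \sum_{j :\, u_{jh} > 0} \left( \overline{x}_{ij} - \overline{x}_j^c \right) u_{jh}
\end{aligned}
\qquad i = 1, \ldots, n, \quad h = 1, \ldots, s
\]

\[
Y_{n \times s} =
\begin{pmatrix}
[\underline{y}_{11}, \overline{y}_{11}] & \cdots & [\underline{y}_{1s}, \overline{y}_{1s}] \\
\vdots & & \vdots \\
[\underline{y}_{n1}, \overline{y}_{n1}] & \cdots & [\underline{y}_{ns}, \overline{y}_{ns}]
\end{pmatrix}
\]

The variation of the principal components can be estimated from the variability of the descriptive features, since

\[
\left( \overline{x}_{ij} - \overline{x}_j^c \right) \ge \left( \underline{x}_{ij} - \overline{x}_j^c \right)
\]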
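A minimal Python sketch of the centers method is given below for illustration. It computes only the first principal axis of the centers matrix, by power iteration, and then applies the sign rule of the definition to obtain the interval scores; the name `centers_pca` and the toy objects are assumptions for the example.

```python
def centers_pca(objects, iterations=500):
    """First interval-valued principal component by the centers method."""
    n, p = len(objects), len(objects[0])
    # Centers matrix: midpoint of every interval.
    centers = [[(lo + hi) / 2 for lo, hi in obj] for obj in objects]
    means = [sum(c[j] for c in centers) / n for j in range(p)]
    cov = [[sum((c[a] - means[a]) * (c[b] - means[b]) for c in centers) / n
            for b in range(p)] for a in range(p)]
    u = [1.0] * p  # assumes this start is not orthogonal to the leading axis
    for _ in range(iterations):
        v = [sum(cov[a][b] * u[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in v) ** 0.5
        u = [x / norm for x in v]
    intervals = []
    for obj in objects:
        low = high = 0.0
        for (lo, hi), m, w in zip(obj, means, u):
            a, b = (lo - m) * w, (hi - m) * w
            # For w > 0 the lower bound gives the smaller term; for w < 0
            # the roles swap, which is the sign rule of the definition.
            low += min(a, b)
            high += max(a, b)
        intervals.append((low, high))
    return means, u, intervals


# Hypothetical toy data: three objects described by two interval features.
objects = [[(1, 2), (1, 3)], [(3, 5), (4, 5)], [(6, 7), (6, 9)]]
means, axis, intervals = centers_pca(objects)
print(axis)
print(intervals)
```

Only an n × p matrix is decomposed here, instead of the n × 2^p rows of the vertices matrix. The score of each centre is the midpoint of the resulting interval.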

(41)

Possible Limitations of CENTERS Method

Some possible difficulties in applications:

(1) The numerical matrix may have too few elements.
(2) The information about variation is lost.

Real-valued matrix from the centers:

\[
C =
\begin{pmatrix}
c_{11} & \cdots & c_{1p} \\
\vdots & & \vdots \\
c_{n1} & \cdots & c_{np}
\end{pmatrix}
\]

[Figure: example of the PCA method, comparing the first PCA axis of the centers dataset with the real PCA axis.]

(42)

V. Analysis on Speculation in China’s Stock Market

Emergency Project of NSFC (2001)

1. Debate about the excess speculation of China’s stock market in 2001

What: How to demonstrate the existence of excess speculation?

Way: What are the essential issues in market management?

How: Overthrow and rebuild the market? Or improve and enhance the criteria of management?

Total number of stocks in China:

| Year | 1990 | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Market size | 10 | 306 | 509 | 714 | 819 | 912 | 1088 | 1137 | 1134 | 1215 | 1234 |
