Introduction of Symbolic Data Analysis and its Applications
Huiwen Wang (王惠文)
School of Economic Management,
Beijing Univ. of Aeronautics and Astronautics, Beijing 100083, China
E-mail: [email protected]
Outline
A Short Introduction on Symbolic Data
Concept of Interval Data
Descriptive Statistics of Interval Data
Principal Component Analysis of Interval Data
Application I:
Analysis on Speculation in China’s Stock Market
Factor Interval Data Analysis
Application II:
Data Analysis on Futures Market of China
Founder of SDA: Edwin Diday (Paris IX)
- SDA was first presented at the first conference of the International Federation of Classification Societies (IFCS), 1987.
- A large body of literature on symbolic data has been published in journals, proceedings and working reports.
- About 17 European research groups cooperated in a European ESPRIT project (No. 20821) named Symbolic Official Data Analysis System. As a result of this project, the software package SODAS was developed.
- H.-H. Bock, E. Diday (Editors), Analysis of Symbolic Data, Springer, 2000
I. A Short Introduction on Symbolic Data
In many domains of human activity it is quite common to record huge sets of data.
Data Mining: how to treat huge sets of data
(1) Classify the dataset into several groups
(2) Summarize each group by a symbolic datum
Knowledge mining: extend the classical statistical methods
(1) Descriptive statistics of symbolic data: mean, variance, histogram
(2) Multivariate analysis of symbolic data: factor analysis, classification, regression
Motivation of SDA
Example: Analysis on Stock Market of China
The total number of stocks is very large.
The number of individuals changes every year.
Some individuals may be renamed.
Total number of stocks in the market:
Year              1990  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005
Number of stocks    10   306   509   714   819   912  1088  1137  1134  1215  1234
Problem:
[Figure: PCA scatter plot; axes "REGR factor score 1 for analysis 1" and "REGR factor score 2 for analysis 1"; each point is labeled with its stock number]
Variables: Negotiable Market Capitalization (NMC), Return rate, Turnover rate, P/E ratio, Volatility
PCA on 1088 stocks of China (2001)
[Figure: PCA scatter plot; axes "REGR factor score 1 for analysis 1" and "REGR factor score 2 for analysis 1"; each point is labeled with its stock number]
PCA after deleting 15 outliers
[Figure: PCA scatter plot; axes "REGR factor score 1 for analysis 1" and "REGR factor score 2 for analysis 1"]
PCA after deleting 113 outliers
Stock Style Classification Based on CITIC Style Indices
Tab.4-1 The classification of the six CITIC style indices
                              Low B/P value       High B/P value
High market capitalization    Large-cap growth    Large-cap value
Mid-market capitalization     Mid-cap growth      Mid-cap value
Low market capitalization     Small-cap growth    Small-cap value
Size: represented by Negotiable Market Capitalization (NMC)
B/P ratio: net assets / market value
CITIC: China International Trust and Investment Corporation
Interval Data Table
Objects (rows, $i = 1, \dots, 6$): Large-Cap Growth, Large-Cap Value, Mid-Cap Growth, Mid-Cap Value, Small-Cap Growth, Small-Cap Value
Variables (columns, $j = 1, \dots, 5$): Market Capitalization, Turnover Rate, Return Rate, P/E Ratio, Volatility
Cell $(i, j)$: the interval $[\underline{x}_{ij}, \overline{x}_{ij}]$
The intervals reflect the central tendency and the dispersion of each class:
$$\underline{x}_{ij} = \min_{k=1,\dots,m_i} \{ a_{ij}^{k} \}, \qquad \overline{x}_{ij} = \max_{k=1,\dots,m_i} \{ a_{ij}^{k} \}$$
Note: here we use the interquartile range to prevent the influence of outliers.
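The aggregation step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `to_interval` and `interval_table`, the toy numbers, and the reading of "interquartile range" as the interval from the first to the third quartile (computed with the "inclusive" quantile method) are all assumptions, not taken from the slides.

```python
from statistics import quantiles

def to_interval(values, robust=False):
    """Summarize one class's observations of one variable as (lower, upper)."""
    a = sorted(float(v) for v in values)
    if robust:
        # Assumed robust variant: first and third quartiles instead of min/max.
        q1, _, q3 = quantiles(a, n=4, method="inclusive")
        return (q1, q3)
    return (a[0], a[-1])

def interval_table(classes, robust=False):
    """classes: {class_name: {variable_name: [observations]}} -> interval cells."""
    return {
        name: {var: to_interval(obs, robust) for var, obs in feats.items()}
        for name, feats in classes.items()
    }

# Hypothetical toy data (not from the slides): one variable, two classes.
classes = {
    "LargeCapGrowth": {"turnover": [1.0, 2.0, 3.0, 4.0, 100.0]},
    "SmallCapValue": {"turnover": [5.0, 7.0, 6.0]},
}
table = interval_table(classes)
robust_table = interval_table(classes, robust=True)
```

With the min/max rule the single extreme value 100.0 stretches the first interval, while the quartile-based variant ignores it, which is the motivation given in the note above.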
The Scatter Plot of PCA
[Figure: PCA scatter plot of the six style classes: LargeCapG., LargeCapV., MidCapG., MidCapV., SmallCapG., SmallCapV.]
Quick Impression:
- Size style is the primary factor influencing the activity of market deals.
- Growth versus value style is also an important factor in describing the behaviour of the stock market in China.
Definition of Symbolic Data
E. Diday, 1987, 1996
(1) Traditional data: a single quantitative or qualitative value
(2) A set of values or categories (multivalued variable)
    EX: Height(w) = {3.5, 2.1, 5} means that the height of w can be either 3.5 or 2.1 or 5
(3) An interval
    EX: Height(w) = [3, 5] means that the height of w varies in the interval [3, 5]
(4) A set of values with associated weights (distribution variable): histogram, membership function
Example:
We consider three relations: Supplier, Delivery, and Customer
Data Table (each cell may hold a symbolic value):
Supplier   Weekly working time   Customer    Delivery's Town               Weight of delivery
S1         34                    SG1, SG3    Paris(0.5), Lannion(0.5)      [12,15]
S2         45                    SG2         Toulouse(1)                   [20,25]
S3         40                    SG4, SG5    Lille(0.4), Lyon(0.6)         [34,40]
Variable types: Customer is a multivalued variable, Delivery's Town is a distribution variable, Weight of delivery is an interval variable.
Note: Compositional data is a kind of distribution data.
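One possible in-memory encoding of such a table row is sketched below. The choice of Python containers (frozenset, dict, tuple), the key names, and the helper `cell_kind` are illustrative assumptions, not a format defined by SDA or by the SODAS software.

```python
# Hypothetical encoding of the first table row (supplier S1).
row_s1 = {
    "supplier": "S1",
    "weekly_working_time": 34,                        # classical single value
    "customer": frozenset({"SG1", "SG3"}),            # multivalued variable
    "delivery_town": {"Paris": 0.5, "Lannion": 0.5},  # distribution variable
    "delivery_weight": (12, 15),                      # interval variable
}

def cell_kind(cell):
    """Classify a cell by the container type used in this sketch."""
    if isinstance(cell, (set, frozenset)):
        return "multivalued"
    if isinstance(cell, dict):
        return "distribution"
    if isinstance(cell, tuple) and len(cell) == 2:
        return "interval"
    return "classical"

kinds = {name: cell_kind(value) for name, value in row_s1.items()}
```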
Characteristics
- Reduces the complexity of the data system, since SDA reduces the scale of the data by first classifying and aggregating the original dataset.
- Powerful in summarizing global features of the dataset, because it retains the overall characteristics of each sample group and extracts the internal regularities of the data.
- Supports visualization of analysis results; it is especially suitable for knowledge mining and visualization research on large-scale datasets.
II. Concept of Interval Data
Interval data: the elements of the data matrix are intervals of real numbers.
Interval data may result from three sources:
- Imprecision of the measurement: $[a_{ij} - \delta, a_{ij} + \delta]$, where $\delta$ is the range of error
- Expert's knowledge, including a range of uncertainty
- Summary of a class of data (second-order object)
The object i to be described is a class of elements.
Let $\{ a_{ij}^{1}, a_{ij}^{2}, \dots, a_{ij}^{m_i} \}$ be the $m_i$ observations of the feature j for the object i:
$$\xi_{ij} = [\underline{x}_{ij}, \overline{x}_{ij}], \quad \text{where } \underline{x}_{ij} = \min_{k=1,\dots,m_i} \{ a_{ij}^{k} \}, \quad \overline{x}_{ij} = \max_{k=1,\dots,m_i} \{ a_{ij}^{k} \}$$
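Interval Data Matrix
- Let $i = 1, 2, \dots, n$ denote n objects (classes) described by p features $Y_1, Y_2, \dots, Y_p$ of interval type.
- Let $\xi_{ij} = [\underline{x}_{ij}, \overline{x}_{ij}]$ be the interval which includes all the possible values of feature j for object i. The resulting symbolic data matrix is given by:
$$X_{n \times p} = \begin{pmatrix} \xi_{11} & \cdots & \xi_{1p} \\ \vdots & & \vdots \\ \xi_{n1} & \cdots & \xi_{np} \end{pmatrix} = \begin{pmatrix} [\underline{x}_{11}, \overline{x}_{11}] & \cdots & [\underline{x}_{1p}, \overline{x}_{1p}] \\ \vdots & & \vdots \\ [\underline{x}_{n1}, \overline{x}_{n1}] & \cdots & [\underline{x}_{np}, \overline{x}_{np}] \end{pmatrix}$$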
III. Descriptive Statistics of Interval Data
Basic set: E = {1, 2, …, n}
X is a univariate interval-valued variable measured for each observed unit of the basic set E = {1, 2, …, n}.
For each k ∈ E, we denote
$$X(k) = [\underline{x}_k, \overline{x}_k]$$
where $\underline{x}_k$ is the lower bound of the interval and $\overline{x}_k$ is the upper bound of the interval.
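The Equi-Distribution Hypothesis:
(1) Each observed unit k ∈ E is equally likely, i.e. each observed unit k ∈ E has been selected with the same probability 1/n.
(2) The values of x are uniformly distributed in the interval $[\underline{x}_k, \overline{x}_k]$:
$$\Pr\left( x \le \xi \mid x \in [\underline{x}_k, \overline{x}_k] \right) = \begin{cases} 0 & \text{if } \xi < \underline{x}_k \\ \dfrac{\xi - \underline{x}_k}{\overline{x}_k - \underline{x}_k} & \text{if } \underline{x}_k \le \xi < \overline{x}_k \\ 1 & \text{if } \xi \ge \overline{x}_k \end{cases} \qquad \forall \xi \in \mathbb{R}$$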
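As a small illustration of hypothesis (2), the within-interval probability can be written as a three-case function. The name `uniform_cdf` is hypothetical, and the sketch assumes a plain (lower, upper) pair with lower ≤ upper.

```python
def uniform_cdf(xi, lower, upper):
    """Pr(x <= xi | x uniformly distributed on [lower, upper])."""
    if xi < lower:
        return 0.0
    if xi >= upper:
        return 1.0
    # Here lower <= xi < upper, so upper - lower is strictly positive.
    return (xi - lower) / (upper - lower)

# Value at the midpoint of [0, 2].
half = uniform_cdf(1.0, 0.0, 2.0)
```

The order of the tests means a degenerate interval with lower equal to upper never reaches the division.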
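Empirical Distribution Function of X
The EDF is the distribution function of the uniform mixture of the n uniform distributions defined on the intervals $X(k) = [\underline{x}_k, \overline{x}_k]$, for all k ∈ E:
$$F_X(\xi) = \frac{1}{n} \sum_{k \in E} \Pr\left( x \le \xi \mid x \in [\underline{x}_k, \overline{x}_k] \right)$$
Thus:
$$F_X(\xi) = \frac{1}{n} \left[ \#\{ i \in E \mid \xi < \underline{x}_i \} \cdot 0 + \sum_{k \,:\, \xi \in X(k)} \frac{\xi - \underline{x}_k}{\overline{x}_k - \underline{x}_k} + \#\{ j \in E \mid \xi \ge \overline{x}_j \} \cdot 1 \right] = \frac{1}{n} \left[ \sum_{k \,:\, \xi \in X(k)} \frac{\xi - \underline{x}_k}{\overline{x}_k - \underline{x}_k} + \#\{ j \in E \mid \xi \ge \overline{x}_j \} \right]$$
where the notation $\#\{ j \in E \mid \xi \ge \overline{x}_j \}$ is the number of intervals $X(j) = [\underline{x}_j, \overline{x}_j]$ with $\xi \ge \overline{x}_j$.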
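A minimal sketch of the empirical distribution function, assuming intervals are given as (lower, upper) pairs. The function name `edf` is hypothetical, and the middle term is taken over lower ≤ ξ < upper, following the three-case definition of the within-interval probability.

```python
def edf(xi, intervals):
    """Empirical distribution function of an interval-valued variable."""
    n = len(intervals)
    total = 0.0
    for lower, upper in intervals:
        if xi >= upper:
            total += 1.0                                # interval lies entirely below xi
        elif xi >= lower:
            total += (xi - lower) / (upper - lower)     # xi falls inside the interval
        # intervals entirely above xi contribute 0
    return total / n

# Three intervals on [0, 4], used only for illustration.
X = [(0.0, 2.0), (2.0, 4.0), (1.0, 3.0)]
F_at_2 = edf(2.0, X)
```

At ξ = 2 the first interval contributes 1, the second 0 and the third 1/2, so the mixture gives 1.5/3 = 0.5.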
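By taking the derivative of the empirical distribution function
$$F_X(\xi) = \frac{1}{n} \left[ \sum_{k \,:\, \xi \in X(k)} \frac{\xi - \underline{x}_k}{\overline{x}_k - \underline{x}_k} + \#\{ j \in E \mid \xi \ge \overline{x}_j \} \right]$$
we get the Empirical Density Function of X:
$$f_X(\xi) = \frac{1}{n} \sum_{k \,:\, \xi \in X(k)} \frac{1}{\overline{x}_k - \underline{x}_k}$$
Here $\mathbf{1}_{X(k)}(\xi)$ is an indicator function:
$$\mathbf{1}_{X(k)}(\xi) = \begin{cases} 1 & \text{if } \xi \in X(k) \\ 0 & \text{if } \xi \notin X(k) \end{cases}$$
Empirical Density Function of X (from the EDF):
$$f_X(\xi) = \frac{1}{n} \sum_{k \in E} \frac{\mathbf{1}_{X(k)}(\xi)}{\overline{x}_k - \underline{x}_k}, \qquad \xi \in \mathbb{R}$$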
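The density formula translates directly into a sum over the intervals that contain ξ. This is a sketch under assumptions: `density` is a hypothetical name, intervals are (lower, upper) pairs of positive width, and the indicator is taken on the closed interval, so the value at a shared boundary point depends on that convention.

```python
def density(xi, intervals):
    """Empirical density: (1/n) * sum of 1/(upper - lower) over intervals containing xi."""
    n = len(intervals)
    return sum(1.0 / (upper - lower)
               for lower, upper in intervals
               if lower <= xi <= upper) / n

# Three intervals on [0, 4], used only for illustration.
X = [(0.0, 2.0), (2.0, 4.0), (1.0, 3.0)]
values = [density(x, X) for x in (0.5, 1.5, 2.5, 3.5)]
```

Evaluating at interior points avoids the boundary convention: at 0.5 only [0, 2] contributes, at 1.5 both [0, 2] and [1, 3] do.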
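Empirical Mean of X
$$\bar{X} := \int_{-\infty}^{+\infty} \xi \, f_X(\xi) \, d\xi$$
Using the empirical density function
$$f_X(\xi) = \frac{1}{n} \sum_{k \in E} \frac{\mathbf{1}_{X(k)}(\xi)}{\overline{x}_k - \underline{x}_k}, \qquad \mathbf{1}_{X(k)}(\xi) = \begin{cases} 1 & \text{if } \xi \in [\underline{x}_k, \overline{x}_k] \\ 0 & \text{if } \xi \notin [\underline{x}_k, \overline{x}_k] \end{cases} \quad \text{(indicator function)}$$
we obtain
$$\bar{X} = \frac{1}{n} \sum_{k \in E} \int_{-\infty}^{+\infty} \frac{\xi \, \mathbf{1}_{X(k)}(\xi)}{\overline{x}_k - \underline{x}_k} \, d\xi = \frac{1}{n} \sum_{k \in E} \frac{1}{\overline{x}_k - \underline{x}_k} \int_{\underline{x}_k}^{\overline{x}_k} \xi \, d\xi = \frac{1}{n} \sum_{k \in E} \frac{\underline{x}_k + \overline{x}_k}{2}$$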
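Empirical Standard Deviation:
The empirical standard deviation $s_X$ is the square root of the empirical variance
$$s_X^2 := \int_{-\infty}^{+\infty} (\xi - \bar{X})^2 f_X(\xi) \, d\xi = \int_{-\infty}^{+\infty} \xi^2 f_X(\xi) \, d\xi - \bar{X}^2 = M_2 - \bar{X}^2$$
where
$$M_2 = \frac{1}{n} \sum_{k \in E} \frac{1}{\overline{x}_k - \underline{x}_k} \int_{\underline{x}_k}^{\overline{x}_k} \xi^2 \, d\xi = \frac{1}{n} \sum_{k \in E} \frac{\overline{x}_k^3 - \underline{x}_k^3}{3 (\overline{x}_k - \underline{x}_k)} = \frac{1}{3n} \sum_{k \in E} \left( \underline{x}_k^2 + \underline{x}_k \overline{x}_k + \overline{x}_k^2 \right)$$
Hence
$$s_X^2 = \frac{1}{3n} \sum_{k \in E} \left( \underline{x}_k^2 + \underline{x}_k \overline{x}_k + \overline{x}_k^2 \right) - \frac{1}{4n^2} \left[ \sum_{k \in E} \left( \underline{x}_k + \overline{x}_k \right) \right]^2$$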
Example: We observe an interval-valued variable X defined on a set E = {1, 2, …, 8}. The observed values of X on E are:
X(E) = {[0,2]; [1,3]; [1.5,2.5]; [2,4]; [3.5,5]; [4.5,5.5]; [5,7]; [6.5,7.5]}
$$\bar{X} = \frac{1}{8} \left( \frac{0 + 2}{2} + \frac{1 + 3}{2} + \dots + \frac{5 + 7}{2} + \frac{6.5 + 7.5}{2} \right) \cong 3.781$$
$$s_X^2 = \frac{443.5}{24} - 3.781^2 \cong 4.18, \qquad s_X \cong 2.045$$
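The closed-form mean and variance can be checked numerically on the eight intervals of the example. The function names `interval_mean` and `interval_variance` are hypothetical; the formulas are the closed forms obtained by integrating against the empirical density.

```python
from math import sqrt

def interval_mean(intervals):
    """Empirical mean: average of the interval midpoints."""
    return sum((lo + hi) / 2.0 for lo, hi in intervals) / len(intervals)

def interval_variance(intervals):
    """Empirical variance: second moment M2 minus the squared mean."""
    n = len(intervals)
    m2 = sum(lo * lo + lo * hi + hi * hi for lo, hi in intervals) / (3.0 * n)
    return m2 - interval_mean(intervals) ** 2

# The eight observed intervals from the example.
X = [(0, 2), (1, 3), (1.5, 2.5), (2, 4), (3.5, 5), (4.5, 5.5), (5, 7), (6.5, 7.5)]
mean = interval_mean(X)          # 3.78125
std = sqrt(interval_variance(X)) # about 2.045
```

The sum of squares term equals 443.5 for these data, matching the intermediate value in the example.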
The following construction depends on the equi-distribution hypothesis:
(1) Each observed unit k ∈ E has been selected with the same probability 1/n.
(2) The values of x are uniformly distributed in the interval.
Example: An interval-valued variable X defined on a set E = {1, 2, 3}.
The observed values of X on E are: X(E) = {[0,2]; [2,4]; [1,3]}
$$p_{[0,2)} = p_{[2,4)} = p_{[1,3)} = \frac{1}{3}$$
$$\tilde{f}_{[0,2)} = p_{[0,2)} \times \frac{1}{2 - 0} = \frac{1}{3} \times \frac{1}{2} = \frac{1}{6}$$
$$\tilde{f}_{[2,4)} = p_{[2,4)} \times \frac{1}{4 - 2} = \frac{1}{3} \times \frac{1}{2} = \frac{1}{6}$$
$$\tilde{f}_{[1,3)} = p_{[1,3)} \times \frac{1}{3 - 1} = \frac{1}{3} \times \frac{1}{2} = \frac{1}{6}$$
[Figure: histograms over [0, 4]; vertical axis f with marked heights 1/6 and 1/3]
Histogram:
X(E) = {[0,2]; [1,3]; [2,4]}
(1) The interval that contains all the observed single values of X is given by I = [0, 4].
(2) There are 5 distinct observed boundaries that are ranked in increasing order:
$$u_0 = 0, \quad u_1 = 1, \quad u_2 = 2, \quad u_3 = 3, \quad u_4 = 4$$
(3) Consider the partition of I by the 4 intervals, each paired with its density:
$$([0,1), f_1), \quad ([1,2), f_2), \quad ([2,3), f_3), \quad ([3,4], f_4)$$
The densities are:
$$f_1 = \tilde{f}_{[0,2)} = \frac{1}{3} \times \frac{1}{2 - 0} = \frac{1}{6}$$
$$f_2 = \tilde{f}_{[0,2)} + \tilde{f}_{[1,3)} = \frac{1}{3} \times \left( \frac{1}{2 - 0} + \frac{1}{3 - 1} \right) = \frac{1}{3}$$
$$f_3 = \tilde{f}_{[1,3)} + \tilde{f}_{[2,4)} = \frac{1}{3} \times \left( \frac{1}{3 - 1} + \frac{1}{4 - 2} \right) = \frac{1}{3}$$
$$f_4 = \tilde{f}_{[2,4)} = \frac{1}{3} \times \frac{1}{4 - 2} = \frac{1}{6}$$
[Figure: histogram over [0, 4] with bar heights 1/6, 1/3, 1/3, 1/6]
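The three histogram steps (collect the boundaries, partition, sum the densities of the covering intervals) can be sketched as follows. The function name `interval_histogram` is hypothetical, and the sketch assumes every interval has positive width.

```python
def interval_histogram(intervals):
    """Return bars (left, right, f) on the partition induced by all observed boundaries."""
    n = len(intervals)
    # Distinct observed boundaries, ranked in increasing order.
    cuts = sorted({b for iv in intervals for b in iv})
    bars = []
    for left, right in zip(cuts, cuts[1:]):
        # Every interval that covers the whole bar adds 1/(upper - lower) to its density.
        f = sum(1.0 / (upper - lower)
                for lower, upper in intervals
                if lower <= left and right <= upper) / n
        bars.append((left, right, f))
    return bars

# The three intervals of the example.
X = [(0, 2), (1, 3), (2, 4)]
bars = interval_histogram(X)
total_area = sum((right - left) * f for left, right, f in bars)
```

For this example the bars have heights 1/6, 1/3, 1/3, 1/6, and the bar areas add up to 1, as expected for a density.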
Calculation Process:
Example: We observe an interval-valued variable X defined on a set E = {1, 2, …, 8}. The observed values of X on E are:
X(E) = {[0,2]; [1,3]; [1.5,2.5]; [2,4]; [3.5,5]; [4.5,5.5]; [5,7]; [6.5,7.5]}
(1) The interval that contains all the observed single values of X is given by I = [0, 7.5].
(2) There are 14 distinct observed boundaries (two boundaries, 2 and 5, are repeated), ranked in increasing order:
$$u_0 = 0, \quad u_1 = 1, \quad u_2 = 1.5, \quad \dots, \quad u_{12} = 7, \quad u_{13} = 7.5$$
(3) Consider the partition of I by the 13 intervals
$$I_j = [u_{j-1}, u_j), \qquad j = 1, 2, \dots, 13$$
The corresponding density function is described by the list of pairs:
$$\left( [u_{j-1}, u_j), f_j \right), \qquad j = 1, 2, \dots, 13$$
Let $p_j$ denote the area of the bar of base $I_j$ in the histogram:
$$p_j = \Pr\left( x \in [u_{j-1}, u_j) \right) = (u_j - u_{j-1}) \times f_j$$
Observed values: X(E) = {[0,2]; [1,3]; [1.5,2.5]; [2,4]; [3.5,5]; [4.5,5.5]; [5,7]; [6.5,7.5]}, n = 8
Observed boundaries:
$$u_0 = 0, \quad u_1 = 1, \quad u_2 = 1.5, \quad u_3 = 2, \quad \dots, \quad u_{12} = 7, \quad u_{13} = 7.5$$
$$I_1 = [0, 1): \quad f_1 = \frac{1}{8} \times \frac{1}{2 - 0} = \frac{1}{16}$$
$$I_2 = [1, 1.5): \quad f_2 = \frac{1}{8} \times \left[ \frac{1}{2 - 0} + \frac{1}{3 - 1} \right] = \frac{1}{8}$$
$$I_3 = [1.5, 2): \quad f_3 = \frac{1}{8} \times \left[ \frac{1}{2 - 0} + \frac{1}{3 - 1} + \frac{1}{2.5 - 1.5} \right] = \frac{1}{4}$$
Thus the corresponding list $\left( [u_{j-1}, u_j), f_j \right)$, $j = 1, 2, \dots, 13$, is given by:
$$\left( [0, 1), \tfrac{1}{16} \right), \ \left( [1, 1.5), \tfrac{1}{8} \right), \ \left( [1.5, 2), \tfrac{1}{4} \right), \ \left( [2, 2.5), \tfrac{1}{4} \right), \ \left( [2.5, 3), \tfrac{1}{8} \right), \ \left( [3, 3.5), \tfrac{1}{16} \right), \ \left( [3.5, 4), \tfrac{7}{48} \right),$$
$$\left( [4, 4.5), \tfrac{1}{12} \right), \ \left( [4.5, 5), \tfrac{5}{24} \right), \ \left( [5, 5.5), \tfrac{3}{16} \right), \ \left( [5.5, 6.5), \tfrac{1}{16} \right), \ \left( [6.5, 7), \tfrac{3}{16} \right), \ \left( [7, 7.5], \tfrac{1}{8} \right)$$
The bar areas follow from $p_j = f_j \times (u_j - u_{j-1})$:
$$p_1 = \frac{1}{16} \times (1 - 0) = \frac{1}{16}, \qquad p_2 = \frac{1}{8} \times (1.5 - 1) = \frac{1}{16}, \qquad p_3 = \frac{1}{4} \times (2 - 1.5) = \frac{1}{8}, \qquad \dots$$
[Figure: histogram of the 13 bars; horizontal axis from 0 to 7.5, vertical axis from 0 to 3/10]
Now, consider a new partition of the interval $I' = [0, 8]$ with 5 subintervals:
$$I' = [0, 1.5) \cup [1.5, 3.5) \cup [3.5, 5) \cup [5, 6.5) \cup [6.5, 8]$$
Remark that
$$p'_\alpha = \sum_{I_j \subset I'_\alpha} p_j, \qquad \alpha = 1, 2, 3, 4, 5$$
For $I'_1 = [0, 1.5) = [0, 1) \cup [1, 1.5)$:
$$p'_1 = p_1 + p_2 = \frac{1}{16} (1 - 0) + \frac{1}{8} (1.5 - 1) = \frac{1}{8}$$
Since $p'_1 = f'_1 \times (1.5 - 0) = \frac{1}{8}$, it follows that
$$f'_1 = \frac{1/8}{1.5} = \frac{1}{12}$$
Thus the first pair is $\left( [0, 1.5), \tfrac{1}{12} \right)$.
With the same methodology, we get the empirical frequency distribution:
$$\left( [0, 1.5); \tfrac{1}{12} \right), \ \left( [1.5, 3.5); \tfrac{11}{64} \right), \ \left( [3.5, 5); \tfrac{7}{48} \right), \ \left( [5, 6.5); \tfrac{5}{48} \right), \ \left( [6.5, 8]; \tfrac{5}{48} \right)$$
[Figure: histogram of the 5 bars; horizontal axis from 0 to 8, vertical axis from 0 to 1/5]
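The re-partitioning step can be reproduced exactly with rational arithmetic. This is a sketch under assumptions: `fine_histogram` and `rebin` are hypothetical names, and `rebin` assumes that every new cut point coincides with one of the observed boundaries, so each fine bar lies entirely inside one new bin.

```python
from fractions import Fraction as Fr

def fine_histogram(intervals):
    """Bars (left, right, f) on the partition induced by the observed boundaries."""
    n = len(intervals)
    cuts = sorted({b for iv in intervals for b in iv})
    return [(a, b, sum(Fr(1) / (u - l) for l, u in intervals if l <= a and b <= u) / n)
            for a, b in zip(cuts, cuts[1:])]

def rebin(bars, new_cuts):
    """Sum the areas p_j of the fine bars inside each new bin, then divide by the bin width."""
    out = []
    for a, b in zip(new_cuts, new_cuts[1:]):
        p = sum((right - left) * f for left, right, f in bars if a <= left and right <= b)
        out.append((a, b, p / (b - a)))
    return out

# The eight observed intervals, stored as exact fractions.
raw = [("0", "2"), ("1", "3"), ("1.5", "2.5"), ("2", "4"),
       ("3.5", "5"), ("4.5", "5.5"), ("5", "7"), ("6.5", "7.5")]
X = [(Fr(lo), Fr(hi)) for lo, hi in raw]

fine = fine_histogram(X)   # 13 bars
coarse = rebin(fine, [Fr(0), Fr("1.5"), Fr("3.5"), Fr(5), Fr("6.5"), Fr(8)])
coarse_f = [f for _, _, f in coarse]
```

Using `Fraction` keeps values such as 7/48 and 11/64 exact, so the result can be compared directly with the hand calculation.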
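IV. PCA of Interval Data
(Cazes, Chouakria, Diday & Schektman, 1997)
Interval Data Matrix
Let $i = 1, 2, \dots, n$ denote n objects described by p features $Y_1, Y_2, \dots, Y_p$ of interval type.
$$X = \begin{pmatrix} \xi_{11} & \cdots & \xi_{1p} \\ \vdots & & \vdots \\ \xi_{n1} & \cdots & \xi_{np} \end{pmatrix} = \begin{pmatrix} [\underline{x}_{11}, \overline{x}_{11}] & \cdots & [\underline{x}_{1p}, \overline{x}_{1p}] \\ \vdots & & \vdots \\ [\underline{x}_{n1}, \overline{x}_{n1}] & \cdots & [\underline{x}_{np}, \overline{x}_{np}] \end{pmatrix}$$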
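1. The VERTICES Method
- Let $x_i = (\xi_{i1}, \dots, \xi_{ip})' = \left( [\underline{x}_{i1}, \overline{x}_{i1}], \dots, [\underline{x}_{ip}, \overline{x}_{ip}] \right)'$ denote the interval data vector of object i. This data point can be visualized as a hyper-rectangle in $\mathbb{R}^p$ with $2^p$ vertices.
- For p = 2, the object vector is
$$O_i = \left( [\underline{x}_{i1}, \overline{x}_{i1}], [\underline{x}_{i2}, \overline{x}_{i2}] \right)'$$
- with $2^2$ vertices as follows:
$$M_i = \begin{bmatrix} \underline{x}_{i1} & \underline{x}_{i2} \\ \underline{x}_{i1} & \overline{x}_{i2} \\ \overline{x}_{i1} & \underline{x}_{i2} \\ \overline{x}_{i1} & \overline{x}_{i2} \end{bmatrix}$$
[Figure: the rectangle $R_i$ of object i in the $(X_1, X_2)$ plane and its vertex matrix $M_i$]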
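The vertex enumeration can be sketched with a Cartesian product over the p intervals of an object. The function names `vertices` and `stack_vertices` are hypothetical; the row order produced here (lower bound first, last feature varying fastest) is one convenient convention, and stacking the blocks of several objects is added only as a convenience.

```python
from itertools import product

def vertices(obj):
    """obj: list of p (lower, upper) pairs -> list of the 2**p vertices of its hyper-rectangle."""
    return [tuple(v) for v in product(*obj)]

def stack_vertices(objects):
    """Concatenate the vertex blocks of several objects into one list of rows."""
    rows = []
    for obj in objects:
        rows.extend(vertices(obj))
    return rows

# Hypothetical object with p = 2 interval features.
O_i = [(1, 3), (10, 20)]
M_i = vertices(O_i)
# Two hypothetical objects give 2 * 2**2 = 8 rows.
M = stack_vertices([O_i, [(0, 1), (5, 6)]])
```

Each row of `M_i` is one corner of the rectangle, so a numerical method such as classical PCA can then be applied to the stacked rows.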
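Transformation to Numerical Matrix
- Describe each interval-valued object $O_i$ by a numerical data matrix $M_i$ with $2^p$ rows and p columns.
- Compile the matrices $M_i$ into the numerical matrix M with $n \cdot 2^p$ rows and p columns.
$$X_{n \times p} = \begin{pmatrix} O_1 \\ \vdots \\ O_n \end{pmatrix} = \begin{pmatrix} [\underline{x}_{11}, \overline{x}_{11}] & \cdots & [\underline{x}_{1p}, \overline{x}_{1p}] \\ \vdots & & \vdots \\ [\underline{x}_{n1}, \overline{x}_{n1}] & \cdots & [\underline{x}_{np}, \overline{x}_{np}] \end{pmatrix}$$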
2 p n ×
11 1
11 1
1 2
1
1
p