10.4 Numerical Experiments

To evaluate the performance of the proposed matrix approximation approaches SSP and DSP for the concept matrix decomposition, we apply them to three popular text databases, CRAN, MED, and CISI, and compare them with the concept projection matrix method (CPM). The databases were downloaded from http://www.cs.utk.edu/~lsi/ [15]. Information about the three databases is given in Table 10.1.

A standard way to evaluate the performance of an information retrieval system is to compute precision and recall values. Precision is the proportion of relevant documents in the set returned to the user; recall is the proportion of all relevant documents in the collection that are retrieved by the system. We average the precision of all queries at the fixed recall values 10%, 20%, ..., 90%. The clustering, query, and precision evaluation codes in MATLAB are acquired from [15]. In addition to the precision-recall tests, we compare the storage costs and the CPU time required for the query procedure and for the approximate matrix computation for each of the three approaches. The precision computation and query timing are carried out on a UNIX system using MATLAB. The sparse matrix computation and the matrix inverse are carried out in C on an IBM Power/Intel (Xeon) hybrid system at the University of Kentucky.
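For concreteness, the following MATLAB sketch shows how such averaged precision values can be computed from a ranked retrieval run. It is only an illustration, not the evaluation code from [15]; the variable names scores (an ndocs-by-nq matrix of query-document similarities) and rel (a logical matrix marking the relevant documents of each query) are placeholders.

% Average precision at fixed recall levels for a ranked retrieval run.
recall_levels = 0.1:0.1:0.9;
nq = size(scores, 2);
prec = zeros(nq, numel(recall_levels));
for j = 1:nq
    [~, order] = sort(scores(:, j), 'descend');    % rank documents by score
    hits = cumsum(rel(order, j));                  % relevant docs retrieved so far
    recall = hits / max(hits(end), 1);             % fraction of relevant docs found
    precision = hits ./ (1:numel(order))';         % fraction relevant among returned
    for r = 1:numel(recall_levels)
        idx = find(recall >= recall_levels(r), 1); % first rank reaching this recall
        if ~isempty(idx)
            prec(j, r) = precision(idx);
        end
    end
end
avg_prec = mean(prec, 1);                          % average over all queries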

As presented in [15], suitable numbers of clusters k for the three databases are around 200 and 500. Therefore, in all of the following tests we use k = 256 and k = 500 for the CISI and CRAN databases, and k = 200 and k = 500 for the MED database.

For the DSP approach, three parameters are used in Algorithm 10.3.3. The tolerance ε controls the quality of the approximation matrix; for all the tests we choose ε = 1. The parameter ns is the maximum number of improvement steps per row in DSP, and µ is the maximum number of new nonzero candidates per step. Higher values of ns and µ lead to more work and more fill-in, but usually a more accurate matrix.
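The following MATLAB function is only a minimal sketch of the dynamic-pattern idea, written column by column for clarity (the chapter's algorithm works row by row, which is the same procedure applied to the transpose). Here eps_tol, ns, and mu play the roles of ε, ns, and µ above, with ε treated as a residual tolerance; all other names are illustrative.

% One dynamic-pattern update: a column m of the sparse approximation M
% to (C'*C)^(-1)*C'*A is improved by repeatedly adding the mu candidate
% positions that most reduce the residual of min ||C*m - a||_2.
function m = dsp_column(C, a, eps_tol, ns, mu)
    n = size(C, 2);
    m = sparse(n, 1);                        % start from an empty pattern
    for step = 1:ns                          % at most ns improvement steps
        r = C*m - a;                         % current residual
        if norm(r) <= eps_tol, break; end    % quality target reached
        g = C' * r;                          % large entries mark promising positions
        g(find(m)) = 0;                      % skip positions already in the pattern
        [~, idx] = maxk(abs(g), mu);         % mu new nonzero candidates
        J = union(find(m), idx);             % enlarged sparsity pattern
        m = sparse(n, 1);
        m(J) = C(:, J) \ a;                  % re-solve LS on the enlarged pattern
    end
end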

We first test the query precisions. The precision test results for all three databases are given in Figs. 10.2–10.4. We are not surprised that CPM has better query results for every database: the CPM matrix is much denser (memory costs are given in later figures) and hence more accurate than the SSP and DSP matrices.

Table 10.1 Information about the three databases

Database   Matrix size   Number of queries   Source
CISI       5609 × 1460   112                 Science indices
CRAN       4612 × 1398   225                 Fluid dynamics
MED        5831 × 1033   30                  Medical documents


Fig. 10.2 Left panel: CRAN database, k = 256. Right panel: CRAN database, k = 500


Fig. 10.3 Left panel: CISI database, k = 256. Right panel: CISI database, k = 500

We then compare the storage costs in the previous tests by listing the number of nonzeros of the approximate matrix M and of the CPM matrix (C^T C)^{-1} C^T A. The results presented in the left and right panels of Fig. 10.5 correspond to those in the left and right panels of Figs. 10.2–10.4, respectively. From this test we see that SSP and DSP use much less memory than CPM: with about 90% less memory cost than CPM, SSP and DSP lose only about 6% precision. They may therefore be more attractive when storage cost is a bottleneck. We also tried to sparsify the CPM matrix by dropping small entries; reducing the memory storage by 20% cost more than 40% of the precision in the CISI test with k = 500.
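A minimal MATLAB sketch of this entry-dropping experiment follows; the threshold selection shown is illustrative, with C and A standing for the concept and term-document matrices.

% Build the dense CPM matrix and drop its smallest entries; the cutoff
% tau is chosen so that roughly 20% of the entries are removed.
P = (C'*C) \ (C'*A);                   % CPM matrix (C'*C)^(-1)*C'*A
vals = sort(abs(P(:)));                % entry magnitudes, ascending
tau = vals(round(0.2*numel(vals)));    % 20th-percentile cutoff
Ps = sparse(P .* (abs(P) > tau));      % dropped entries cost no storage
fprintf('kept %d of %d entries\n', nnz(Ps), numel(P));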

In the previous precision-recall tests, SSP looks slightly better than DSP, especially when the number of clusters k is small. However, the SSP matrix has more than double the number of nonzeros of the DSP matrix. To see which sparsity pattern is better, we compare SSP and DSP by increasing the density of the DSP matrix.


Fig. 10.4 Left panel: MED database, k = 200. Right panel: MED database, k = 500

Fig. 10.5 Storage cost comparison of the three retrieval methods (series: CPM, SSP, DSP var1, DSP var2) on the CRAN, CISI, and MED databases; the vertical axes give the number of nonzeros in thousands

In the tests of all three databases with k = 500, we either increase the number of update steps or the number of fill-ins per step to compute the DSP matrix M more accurately. In the right panels of Figs. 10.2–10.4, the precision-recall curves of this variant, labeled DSP var2, are better than or very close to those of SSP, while DSP still uses less memory than SSP. In other words, given the same storage constraint, the dynamic sparsity pattern strategy computes a more accurate sparse approximate matrix than the static sparsity pattern strategy. Figure 10.5 shows the storage cost comparison of the retrieval methods.

Finally, we give the query time and the total CPU time required for the query procedures and for the construction of the CPM, SSP, and DSP matrices. The query time is one of the most important aspects of an information retrieval system, since it directly affects the retrieval performance. We see from the upper panel of Fig. 10.6 that SSP and DSP greatly reduce the query time required by the CPM approach, owing to their very sparse matrices.

Fig. 10.6 CPU time in seconds for the query procedure with k = 500 (upper panel) and for matrix construction (bottom panel) on the CRAN, CISI, and MED databases

The total CPU time for matrix construction and the query procedure is given in the bottom panel of Fig. 10.6. If k = 256 or k = 200, the matrix C^T C is only 256 × 256 or 200 × 200; since it is so small, inverting C^T C takes less time than constructing the SSP and DSP matrices. However, for k = 500, CPM takes much more time than SSP. SSP is also much faster than DSP when the sparse matrix M is constructed with the same density.
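The two costs can be illustrated with a small MATLAB timing sketch; P, M, and the query vector q are placeholders for the CPM matrix, a sparse SSP/DSP matrix, and a query already projected to concept space, and the exact query procedure follows [15] rather than this simplified scoring step.

% Construction cost: CPM needs a solve with the k-by-k matrix C'*C,
% which grows quickly with k (cheap for k = 200/256, costly for k = 500).
tic; P = (C'*C) \ (C'*A); t_build = toc;

% Query cost: the dominant work is a product with the k-by-ndocs matrix,
% so a sparse M cuts the time roughly in proportion to its nonzeros.
tic; s_dense  = P' * q; t_dense  = toc;    % dense CPM scoring
tic; s_sparse = M' * q; t_sparse = toc;    % sparse SSP/DSP scoring
fprintf('build %.2fs, dense query %.4fs, sparse query %.4fs\n', ...
        t_build, t_dense, t_sparse);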

10.5 Summary

We have developed numerical matrix computation methods based on a static sparsity pattern (SSP) and a dynamic sparsity pattern (DSP) to compute a sparse approximation to the concept decomposition matrix. We tested them against the CPM-based retrieval scheme. In our numerical experiments, the sparsity pattern based approaches SSP and DSP turn out to be competitive in terms of query precision, computational cost, and memory space.

Comparing the SSP and DSP sparsity pattern strategies, DSP displays the following advantages. First, it computes a more accurate approximate matrix than SSP under the same density constraint. Second, the accuracy of the approximate decomposition matrix can be controlled easily: by allowing more update steps, we can compute a more accurate approximate decomposition matrix. Third, as an information retrieval database may be updated periodically to accommodate new documents or to remove outdated ones, the computed sparsity pattern may be reused as a static sparsity pattern in some intermediate database updates. On the other hand, the dynamic sparsity pattern strategy is more expensive in terms of computational cost. Note that in information retrieval these computations are called pre-processing computations: they prepare the database for retrieval. Thus an expensive one-time cost is acceptable if the prepared database enables more accurate and faster retrieval. If storage costs and query time are the bottleneck during the query procedure, the dynamic sparsity pattern strategy looks more attractive.

References

1. I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering. Mach. Learn., 42(1):143–175, 2001.

2. G. H. Golub and C. F. van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD, 3rd edition, 1996.

3. J. Gao and J. Zhang. Clustered SVD strategies in latent semantic indexing. Inform. Process. Manage., 41(5):1051–1063, 2005.

4. E. Chow. A priori sparsity patterns for parallel sparse approximate inverse preconditioners. SIAM J. Sci. Comput., 21(5):1804–1822, 2000.

5. M. Grote and T. Huckle. Parallel preconditioning with sparse approximate inverses. SIAM J. Sci. Comput., 18:838–853, 1997.

6. J. Hartigan. Clustering Algorithms. Wiley, New York, 1975.

7. J. Hartigan and M. Wong. Algorithm AS136: A k-means clustering algorithm. Appl. Stat., 28:100–108, 1979.

8. A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Comput. Surv., 31(3):264–323, 1999.

9. E. Chow and Y. Saad. Approximate inverse preconditioners via sparse-sparse iterations. SIAM J. Sci. Comput., 19(3):995–1023, 1998.

10. J. D. F. Cosgrove, J. C. Diaz, and A. Griewank. Approximate inverse preconditionings for sparse linear systems. Int. J. Comput. Math., 44:91–110, 1992.

11. L. Y. Kolotilina. Explicit preconditioning of systems of linear algebraic equations with dense matrices. J. Soviet Math., 43:2566–2573, 1988.

12. K. Wang, J. Zhang, and C. Shen. A class of parallel multilevel sparse approximate inverse preconditioners for sparse linear systems. J. Scalable Comput. Pract. Exper., 7:93–106, 2006.

13. K. Wang and J. Zhang. MSP: a class of parallel multistep successive sparse approximate inverse preconditioning strategies. SIAM J. Sci. Comput., 24(4):1141–1156, 2003.

14. S. T. Barnard, L. M. Bernardo, and H. D. Simon. An MPI implementation of the SPAI preconditioner on the T3E. Int. J. High Perform. Comput. Appl., 13:107–128, 1999.

15. J. Gao and J. Zhang. Text retrieval using sparsified concept decomposition matrix. Technical Report No. 412-04, Department of Computer Science, University of Kentucky, KY, 2004.
