
In Chapter 2 we touched upon how we might implement the ideas presented in this thesis when we discussed estimation algorithms. In this chapter we return to this topic and further explore the issues and possibilities that arise when writing the computer programs that implement the style of statistical modeling we would like to do. Generally speaking, the advent of fast, massively parallel computer systems encourages us to pursue this nonlinear, multivariate, exploratory style of modeling. The increased size of these systems allows us to work on much larger and more realistic data sets, and their increased speed enables us to use computationally intense algorithms that in some cases ameliorate our lack of analytic or theoretical results.

However, we must take care to use the machines in sensible and efficient ways if we are to profitably pursue this style of modeling. In addition to the data mining worries discussed in Chapter 4, the nonlinearity and large size of some problem domains we might consider will cause extra numerical problems that we will have to guard against.

6.1 Parallel Computer Implementations

Much of the computation necessary for exploratory statistical modeling can be found directly in the equations describing the various estimators used. These equations typically involve the repeated evaluation of various functions, some fairly complex, across large sets of data. This style of repetitive computation has a natural analog in the data parallel paradigm of parallel computing, a paradigm commonly supported on massively parallel computers such as the Connection Machine system that was used to generate some of the results in this thesis. The data parallel paradigm logically assigns one data element (e.g. of an array) to each processor in the parallel machine, and evaluates functions of that data simultaneously across selected subsets of those processors. This contrasts with process level parallel computing, where different stages in an algorithm are allocated to different processors and run simultaneously, passing data from one stage to the next.
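A minimal modern sketch of the distinction may help (nothing here is specific to the Connection Machine; the kernel function, data size, and worker count are hypothetical stand-ins): in the data parallel style the same operation is applied simultaneously across partitions of the data, rather than different stages of the algorithm being assigned to different processors.

    import numpy as np
    from multiprocessing import Pool

    def kernel(chunk):
        # Apply the same (hypothetical) model evaluation to every element of a
        # chunk of the data; the operation is shared, the data is partitioned.
        return np.tanh(chunk) ** 2

    if __name__ == "__main__":
        data = np.random.default_rng(0).standard_normal(1_000_000)
        chunks = np.array_split(data, 8)     # partition the data across workers
        with Pool(processes=8) as pool:      # one worker per data partition
            results = np.concatenate(pool.map(kernel, chunks))
        print(results.shape)                 # (1000000,)

A process level version of the same computation would instead pipeline distinct stages (e.g. preprocessing, evaluation, accumulation) through separate workers, passing partial results between them.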

Although on a superficial level the data parallel computing paradigm is a nice match for the kinds of computations we would like to do, it is nontrivial to come up with applications that use this style in an efficient and sensible way. The novelty of the paradigm and of the machines that support it means that much of the software necessary must be written from a low level. On one hand, writing software "from the ground up" on these powerful machines provides an opportunity to rethink algorithms, but on the other hand we will also be forced to reconsider issues such as data layout, floating point precision, and numerical stability of algorithms. Before undertaking such an effort it makes sense to understand where the motivations and potential benefits lie in doing so.

6.1.1 Large Data Sets

Many studies that have applied learning networks to time series analysis have used quite small data sets, such as the annual record of sunspot activity. Why do we need fast and/or parallel computers to study a data set with only a few hundred points? Indeed, a fast workstation may be sufficient in this particular case even for the most demanding learning algorithms. In general, however, we argue that for real world problems our tendency will be to try to use as much data as possible in our modeling efforts, both to improve our situation from an estimation standpoint and to increase the complexity of the functions we can learn. Extensions to the option pricing application from Chapter 3, for instance, could involve sampling across the operating range of all the feasible inputs (e.g. stock price, strike price, time to expiration, risk free rates, volatility), which could easily lead one to use data sets with millions of examples.
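As a rough back-of-the-envelope illustration (the grid sizes below are purely hypothetical), even a fairly coarse sampling of each input multiplies out to millions of examples:

    # Hypothetical sampling densities for each option pricing input; the actual
    # ranges and grid sizes would depend on the application.
    grid_points = {
        "stock price": 50,
        "strike price": 20,
        "time to expiration": 30,
        "risk free rate": 10,
        "volatility": 15,
    }

    n_examples = 1
    for points in grid_points.values():
        n_examples *= points

    print(n_examples)  # 4,500,000 examples from even this coarse grid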

Thus, despite initial impressions, we will often want to use much larger sets of data than we might imagine at first. Convergence results for learning networks such as those found in Barron (1991) and Girosi and Anzellotti (1993), for instance, show that the size of the errors a statistical model makes is a function of the complexity of the underlying relationship being learned and the number of data points available for training. For a fixed underlying relationship, then, the way we get better models is to use more data points. However, if for whatever reason we must be mostly passive in our collection of data, we must go about finding this extra data in smarter ways than simply collecting more of it.
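Schematically, and only as a sketch of the general form such bounds take (not a statement of the precise constants or conditions in the cited papers), the expected error of a fitted network decomposes into an approximation term that shrinks with model complexity and an estimation term that shrinks with sample size:

    \[
      \mathbb{E}\,\bigl\| f - \hat{f}_{n,N} \bigr\|^{2}
        \;\lesssim\;
        \frac{C_f^{2}}{n} \;+\; \frac{n\,d}{N}\,\log N ,
    \]

where n is the number of network units, d the input dimension, N the number of training examples, and C_f a measure of the complexity of the underlying relationship f. Balancing the two terms in n makes the achievable error fall as N grows, which is the sense in which more data buys better models for a fixed target relationship.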

One way of doing this in our modeling context is to use observations from similar relationships in some way, either to help set some prior for the model (for example, by setting the general functional form of the cross-sectional models of Chapter 5), or directly to solve a joint estimation problem (for example, formulating one model for all stocks in the market). Another possible solution to the sample size problem is to create "new" data, which can be done in a number of ways. If we know the noise model of the data, we can simply resample the existing data and add the appropriate noise to obtain more data.¹ If our estimate of the underlying relationship is good, it may be possible to do even better than this by using the function estimate to interpolate between examples. Techniques such as the latter have been used to analyze chaotic time series; see, for instance, Farmer and Sidorowich (1989).

¹This is the idea behind naive bootstrap methods.
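A minimal sketch of the noise-based resampling idea, assuming (purely for illustration) additive Gaussian noise with a known standard deviation sigma:

    import numpy as np

    rng = np.random.default_rng(0)

    def noisy_resample(x, y, n_new, sigma):
        # Draw existing examples with replacement and add fresh draws from the
        # assumed noise model to create "new" training data.
        idx = rng.integers(0, len(x), size=n_new)
        return x[idx], y[idx] + rng.normal(0.0, sigma, size=n_new)

    # Usage: double a small data set under an assumed noise level of 0.1.
    x = np.linspace(0.0, 1.0, 200)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=200)
    x_new, y_new = noisy_resample(x, y, n_new=200, sigma=0.1)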

From an implementation viewpoint, then, we will often generate data sets large enough to use a few hundred processors efficiently, provided we are careful to lay out the data so that the data-expanding style of resampling and interpolation does not cause unnecessary data motion. Another overriding efficiency consideration related to this is how much time we will have to spend reading the data into the machine and writing out the results: if the I/O performance of our parallel machine is not fast enough, for instance, it may not be worth the effort of writing highly efficient computational routines.
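One crude way to frame that trade-off (every rate and count below is a hypothetical placeholder, not a measurement of any particular machine) is to compare the time spent moving the data against the time spent computing on it:

    # Back-of-the-envelope comparison of I/O time and compute time.
    n_examples   = 5_000_000      # examples in the data set
    bytes_per_ex = 8 * 10         # ten double precision fields per example
    flops_per_ex = 50_000         # arithmetic work per example per training pass

    io_bandwidth = 10e6           # sustained bytes/second from mass storage
    flop_rate    = 1e9            # sustained floating point operations/second

    io_time      = n_examples * bytes_per_ex / io_bandwidth
    compute_time = n_examples * flops_per_ex / flop_rate

    print(f"I/O time:     {io_time:7.1f} s")
    print(f"Compute time: {compute_time:7.1f} s")
    # When io_time dominates, tuning the computational kernels buys little.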

6.1.2 Computationally Intense Algorithms

The ability to handle large data sets is nice for solving real world problems, but there is another, perhaps more interesting, way that parallel computers can help with exploratory statistical modeling: they provide the speed to use algorithms too computationally intense to consider otherwise. These algorithms fall into two broad classes: first, learning network specific ways to parallelize the networks' iterative, time-consuming estimation algorithms; and second, general methods for identifying good models and estimating our confidence in them.