Signals and Numerical Information: Interpretation and Analysis

A. J. PERLIS

Carnegie Institute of Technology

Inside a computer all information is numerical, and this implies that its use and evaluation must be accomplished by numerical transactions within the machine. These transactions are generally organized and described by what we call programming languages.

It is a truism that a fool can ask more questions in an hour than all the wise men in the world can answer in a hundred years. The whole problem of information retrieval, I think, is related to that particular point. In this case, the wise men are the set of programming and formatting techniques that we are capable of bringing to bear, and the fools are the (so far, fortunately) largely mythical people who hope to sit at computers and ask any old question that comes along. Nevertheless, starts are being made in various places on various small problems to solve the problem of retrieval of information in those areas. Some of them are rather trivial, others more complex. All are partial and will undoubtedly remain so for the foreseeable future.

Mention was made of language translation by the previous speaker. The information that is in so far, from all fronts engaged in doing language translation on computers, is that effectively no progress has been made toward producing usable translations for technical people in various fields. The reason for this is, I think, summed up in a nutshell in the two words "context" and "semantics," and how they relate to one another.

Semantics, or what meaning we give (either operationally or purely intellectually) to information received from some source, has not yet been suitably formalized either in the field of logic or, unfortunately, in the field of computation, so that obtaining information as to the meaning of processing within a computer is a very difficult proposition, and at the moment we can say that very few positive results have been obtained outside of a few restricted areas. These restricted areas are those where the classification of information has been going through a sifting process for centuries, and I refer to certain restricted parts of mathematics. In these restricted parts of mathematics where the information transformations are arithmetic in nature, an increasingly useful amount of information is obtained and it is the success in this area that has led people to predict the ultimate success in other areas which at the moment share nothing in computers are what we might call problem-oriented. Now, what I mean by problem-oriented is that someone or some group assumes that a problem can be described in a certain way, such as "A library can be operated on a computer: it's all bits, has fast input, has a higher-speed printer than any I've ever seen before. We can get photographic output, and in a few years, I'm told, we'll get photographic input."

We have a problem: how do we store a library in a computer? Stated in this way, such problems always can obtain partial solutions, which ultimately fall far short of the dreams of the proposers, but are at the extreme limit of the abilities of the people who actually achieve them. Some people look at tasks not as problems but as procedures, and all the success we have had in computation to date has been because certain specific areas have been attacked from the standpoint of obtaining procedures.

Indeed, all computation is based on procedures. It is only when we are able specifically to describe procedures that we get any mileage at all out of computers.

How do we describe procedures? The first place to start is with the concept of data representation or format, in the words of the previous speaker. This, I think, is the key to all successful use of a computer in any problem, be it information retrieval, Monte Carlo work, or simulation of traffic systems. The basic key is data representation.

What is the data that we choose to use in a computer? How big is it? What is its precision? What do we wish to do with it?

In the outside world we have one picture, a very heavily context-dependent picture of information and hence of data representation, which is constantly organized and parsed, if you will, by our mind. The first stage in the use of the computer is in effect to deduce the appropriate, approximate information representation that is going to take place inside the computer. Now immediately we can eliminate this problem by stating that all information inside the computer is a string of zeroes and ones. But it is precisely because we do not have to say that, and can still have the computer process at a rapid rate, that we are able to make progress in the use of computers in information problems. For example, real numbers are abstractions in the outside world and approximations in the inside world, but in certain problems they are the natural data to be used in describing the data, the natural format to be used for describing data.

For example, scientific computations are of this form. Those of you who have had sufficient experience in computing will recall the history of computation, where we started out with numbers that were merely integers; then we had an internal set of transformations programmed, if you will, which allowed us to represent approximate real numbers by pairs of integers, and thereafter deduced a set of operations on these pairs that were natural for the operations on reals. All of the internal transformations were then procedurized once and for all.
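The step from integers to pairs of integers can be illustrated with a small sketch. It is only an assumption about one possible scheme, not a description of any particular machine: the pair is read as (mantissa, exponent) in base ten, the precision limit `DIGITS` is arbitrary, and the names `normalize`, `mul`, `add`, and `to_float` are hypothetical.

```python
from typing import Tuple

# A pair (mantissa, exponent) stands for mantissa * 10**exponent.
# Base ten and the six-digit limit are illustrative assumptions.
Pair = Tuple[int, int]
DIGITS = 6


def normalize(m: int, e: int) -> Pair:
    """Cut the mantissa down to DIGITS digits (truncating) and strip trailing zeros."""
    if m == 0:
        return (0, 0)
    sign = -1 if m < 0 else 1
    m = abs(m)
    while m >= 10 ** DIGITS:  # too many digits: drop the lowest one
        m //= 10
        e += 1
    while m % 10 == 0:  # canonical form: no trailing zeros in the mantissa
        m //= 10
        e += 1
    return (sign * m, e)


def mul(a: Pair, b: Pair) -> Pair:
    """Multiply mantissas, add exponents, then renormalize."""
    return normalize(a[0] * b[0], a[1] + b[1])


def add(a: Pair, b: Pair) -> Pair:
    """Align both pairs to the smaller exponent, add mantissas, renormalize."""
    e = min(a[1], b[1])
    ma = a[0] * 10 ** (a[1] - e)
    mb = b[0] * 10 ** (b[1] - e)
    return normalize(ma + mb, e)


def to_float(p: Pair) -> float:
    """Convert a pair to a Python float, for inspection only."""
    return p[0] * 10.0 ** p[1]


print(mul((15, -1), (2, 0)))    # 1.5 * 2
print(add((15, -1), (25, -2)))  # 1.5 + 0.25
```

Once `mul` and `add` are written, a user of the pairs never has to think about the underlying integers again, which is the sense in which the transformations are procedurized once and for all.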

Those who came along later and worked on alphanumeric information were aware of the fact that these too were represented as strings of digits which, however, could be procedurized as soon as we knew what the operations were that we wished to perform upon them. Lately, we find that computers are being considered (one computer has even been constructed) whose basic data representations are what we call list-structures. The class of problems for which we need these structures, as the natural internal data, is a more complex, and indeed, a newer class of problem than those for which real numbers were sufficient.
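A list-structure can be sketched as a chain of cells, each holding an item and a link to the rest of the list, where an item may itself be another list. The following is a minimal illustration under that assumption; the names `Cell`, `build`, and `atoms` are hypothetical and do not describe any specific machine or language.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Union

Atom = Union[str, int]


@dataclass
class Cell:
    """One cell of a list-structure: an item plus a link to the next cell."""
    head: Union[Atom, "Cell", None]  # an atom, a sublist, or None for an empty sublist
    tail: Optional["Cell"] = None    # the rest of the list, or None at the end


def build(items: list) -> Optional[Cell]:
    """Build a list-structure from nested Python lists."""
    cell = None
    for item in reversed(items):
        head = build(item) if isinstance(item, list) else item
        cell = Cell(head, cell)
    return cell


def atoms(cell: Optional[Cell]) -> Iterator[Atom]:
    """Yield every atom from left to right, descending into sublists."""
    while cell is not None:
        if isinstance(cell.head, Cell):
            yield from atoms(cell.head)
        elif cell.head is not None:
            yield cell.head
        cell = cell.tail


structure = build(["a", ["b", "c"], "d"])
print(list(atoms(structure)))
```

The point of the sketch is that the natural operations here (take the head, follow the tail, descend into a sublist) are different in kind from arithmetic on numbers.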

Other forms of data representations will be found in time. It may indeed turn out that the basic importance of a computer in the intellectual life of mankind is through the fact that it places a problem before the mathematicians of our society: to develop a whole host of new arithmetics, arithmetics which, in the same way that the Peano postulates provide us with the basis for manipulating ordinary integer arithmetic, will enable us to manipulate in the same natural, easy way trees of information and list-structures of information which at the present time are handled by means of non-formalized procedures. The real intelligent use of computers in information retrieval and other problems will await the solution of at least this problem.

Now for each data structure that we happen to deduce as appropriate for our problem there is a natural set of operations which seem to occur, and the understanding that one has of a particular problem is, in a large part, determined by how totally he is able to define the set of natural operations. In arithmetic we all know what they are. When we get over to more complex structures like matrices or lists, we find that other operations have to be added to the compendium in order for us to say in a precise and in a brief way the basic computations that we wish to have performed.

It is the job of the programmer at the present time to find the set of internal procedures which will carry out the transformations from one representation, the natural one that we as users would like to possess, to the unnatural one, the one that the computer actually does possess. All information processing inside computers, or almost all, is concerned with these procedure transformations.

Now once we have decided on a data structure and the basic operations for a problem, the next step, if the problem is large enough (and one which, so to speak, assures continuing support), is the definition of a language; at least, this is the usual chain of events which takes place. I think this is the second act of intellectual import which has occurred in the past several years with respect to computers. It is the fact that we now design or invent languages almost on a moment's notice. Language, instead of being something which is studied au naturel, is now designed to fit a specific purpose in a computer, and there is no limit to the number of such languages that can be designed for specific purposes.

Why design a language? Well, that's a good question. People who have already designed one always ask it of those who are about to start. The reason, of course, is to cut down the amount of explicit relationships we must explain to a computer if the number of such explicit relationships is large, either because it is large in a single problem or because the number of users who have to so express themselves is large. As soon as that situation occurs, along comes the need for the design of a language, just to increase the flow of communication between man and machine. These languages all follow much the same sort of path. They proceed from internal representation of desired data, to applications of the appropriate operations, to the creation of sentences, and from sentences, the creation of programs, the specification of the sequencing rules by which these programs are to be executed, the specification of a library by which these programs are to be accumulated and indexed and accessed, and the imbedding of all of this in a kind of operating system on a machine. If one looks at a large part of the intellectual effort now going on in the United States in the so-called programming area, one finds that it is involved in one or more of these areas and not much else.

What does it mean now in these terms to recognize information inside a computer? If we wanted to be very blunt, we are able to recognize information only when we can make a selection of one of two programs to be executed as a result of this recognition. Thus recognition is a selection process of one of two programs. This isn't very helpful because all complex problems are ultimately broken down this way. No complex problem would probably ever be programmed if it had to depend on such a recognition definition.
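The blunt definition of recognition as a choice between two programs can be written down directly. The sketch below is only an illustration: the function `select` and the three handlers are hypothetical names, and the tests used (`str.isdigit`, `str.isalpha`) are arbitrary stand-ins for whatever is being recognized.

```python
from typing import Callable

Program = Callable[[str], str]


def select(test: Callable[[str], bool], if_yes: Program, if_no: Program) -> Program:
    """Recognition in the blunt sense: a test decides which of two programs runs."""
    def run(item: str) -> str:
        return if_yes(item) if test(item) else if_no(item)
    return run


def handle_number(item: str) -> str:
    return f"number:{int(item)}"


def handle_word(item: str) -> str:
    return f"word:{item.lower()}"


def handle_other(item: str) -> str:
    return f"other:{item}"


# A three-way recognition is already a nest of two-way selections.
classify = select(str.isdigit, handle_number,
                  select(str.isalpha, handle_word, handle_other))

for sample in ["42", "Tree", "a-1"]:
    print(sample, "->", classify(sample))
```

Even this toy case needs nested selections for three outcomes, which suggests why the two-program definition, though always available, says little about how a complex recognition problem should be organized.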

Let's consider one specific problem. There has been proposed from time to time the development of so-called information-retrieval systems.

An information-retrieval system depends on, it seems to me, several things. One is a corpus. This corpus is the set of facts and relationships which is stored in the computer. Second, there is a set of allowable queries. Third, there is the processing of these queries to produce the desired in-

... information-retrieval problem. Given the corpus, the number of relationships is ever increasing. Given the input language, it starts out simple and gets more complex, as we learn more and more about our abilities to parse these grammars. And finally, on the education side, it becomes more and more essential that we ourselves learn this language independent of English or whatever other language we use in order that we can make use of this mechanism.

Thus, when we talk about information-retrieval systems, we can break it down, I think, into several disjoint parts and several problems. First, there is a problem of accessing this corpus of information. It is clear that we do not wish to access it in most computers by direct table look-up. Somehow or other we have to derive a code for information from which we can deduce approximately where to find what we are looking for. This code will inevitably not be a constant code. It will inevitably lead to redundancies. That is, there will be several pieces of information which fall roughly in the same ballpark. It is inevitably the case that we will miss some information. No code will be perfect if it's going to be interesting.
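A toy version of such a code shows both effects, redundancy and missed information. The coding rule below (first letter plus the next two consonants) is purely an assumption made for illustration, and the names `code`, `build_index`, and `lookup` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def code(term: str) -> str:
    """Hypothetical lossy code: the first letter plus the next two consonants."""
    t = term.lower()
    consonants = [c for c in t[1:] if c.isalpha() and c not in "aeiou"]
    return (t[:1] + "".join(consonants[:2])).upper()


def build_index(terms: Iterable[str]) -> Dict[str, List[str]]:
    """File every term under its code; several terms may share one code."""
    index: Dict[str, List[str]] = defaultdict(list)
    for term in terms:
        index[code(term)].append(term)
    return index


def lookup(index: Dict[str, List[str]], term: str) -> List[str]:
    """Return everything filed under the same code as the term asked for."""
    return index.get(code(term), [])


index = build_index(["retrieval", "retrieve", "retort", "storage"])

# Redundancy: unrelated "retort" lands in the same ballpark as "retrieve".
print(lookup(index, "retrieve"))

# A miss: "recall" is related in meaning to "retrieval" but codes differently.
print(lookup(index, "recall"))
```

The code tells us approximately where to look without a direct table look-up on the full term, at the price of pulling in neighbors we did not want and overlooking items filed under a different code.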

Having devised this code, there is then the problem of transforming questions, appropriately written, into sequences of codes and sequences of procedures which pick out, in some sense, the best candidate or candidates from this corpus. This means to me as a programmer that the information-retrieval problem can probably, at least in certain worthwhile instances, only be solved by both passing the corpus through a prescanner, human, and by teaching people to ask questions in a fixed and generally context-free way.
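The two requirements named here, a prescanned corpus and questions asked in a fixed form, can be sketched together. Everything in the example is an assumption for illustration: the descriptors stand for what a human prescanner might assign, the query form `FIND term AND term` is invented, and "best candidate" is taken to mean the document sharing the most descriptors with the query.

```python
from typing import Dict, List, Set

# Document id -> descriptors assigned by a (hypothetical) human prescanner.
CORPUS: Dict[str, Set[str]] = {
    "doc1": {"translation", "russian", "syntax"},
    "doc2": {"retrieval", "indexing", "library"},
    "doc3": {"retrieval", "translation", "statistics"},
}


def parse_query(query: str) -> Set[str]:
    """Accept only the fixed form 'FIND term [AND term]...' and reject the rest."""
    words = query.split()
    if len(words) < 2 or words[0] != "FIND":
        raise ValueError("query must have the form: FIND term [AND term]...")
    rest = words[1:]
    # Terms sit at even positions of `rest`; every odd position must be AND.
    if len(rest) % 2 == 0 or any(w != "AND" for w in rest[1::2]):
        raise ValueError("terms must be separated by AND")
    return {w.lower() for w in rest[0::2]}


def best_candidates(query: str, corpus: Dict[str, Set[str]] = CORPUS) -> List[str]:
    """Return the documents sharing the most descriptors with the query."""
    wanted = parse_query(query)
    scored = [(len(wanted & descriptors), doc) for doc, descriptors in corpus.items()]
    best = max((score for score, _ in scored), default=0)
    if best == 0:
        return []
    return sorted(doc for score, doc in scored if score == best)


print(best_candidates("FIND retrieval AND translation"))
print(best_candidates("FIND retrieval"))
```

A free-form question such as "what about retrieval?" is rejected outright by `parse_query`, which is the price of a fixed, context-free query form.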

All of the experience with language translation to date has shown that we get very little information out of language translations, precisely because the computer and the processing techniques we have are context-bound and the languages on which we seem to operate are not. Several experiments in language translation and in information retrieval have given us a glimmer of hope that partial solutions can be obtained, but these solutions are going to depend in large part on the development of explicit languages in whose terms we ask questions of this corpus.

I am not particularly interested myself personally in the information-retrieval problem. It is one of these big messy problems that's going to take a long time to solve, and there are lots of smaller, nicer, nonmessy problems which are easier to solve. And it is of course the case that this field, like all others, dare not wait for formalization to commence. It should also not expect to get good solutions for some time to come. It should certainly not expect to find a solution in hardware. What recognition you are going to be able to buy and what processing you are going to be able to buy, you are going to be able to buy through programming, and not through hardware. Toward that end I would like to recommend that all of you who are in the information-retrieval field become very familiar with the subject of mechanical languages and become very familiar with the subject of statistics. The mechanical languages will teach you how to format queries and programs. The statistics will teach you how to organize a corpus.

Finally, with respect to the one big information-retrieval problem, language translation, one of the things that seems to come up as soon as you dig a little deeply into this problem is the fact that there isn't one. We find that as usual we have overemphasized an urgency which only existed by virtue of extrapolations. There does not seem to be any great shortage of human bodies to translate information these days. There does not seem to be any urgent need for machine translations on a production basis. I leave with you a question which I cannot answer, though I have a sneaking suspicion what the answer is. Is there an urgent need for total mechanical systems at this time for doing information retrieval, or is it possible that the most rational information-retrieval systems we can create at this time are complexes of man and machine in which the man part is by far the biggest and most important? There is no prior reason why all problems, merely because they are large and because they involve millions of bits of information, have to go on computers. They seem to start that way but gradually sanity and the size of the tasks cause them to be replaced.

There is a basic law with respect to computing which states that anything you want to pay enough for can be done on one or more machines currently on order.
