STRING MATCHING - Data Mining

String matching is an important area of research for the successful development of data mining systems, particularly for text databases and for mining data on the Internet through text-based search engines. In this section, we briefly introduce the string matching problem [24].

Let \(P = a_1 a_2 \ldots a_m\) and \(T = b_1 b_2 \ldots b_n\) denote finite strings (or sequences) of characters (or symbols) over a finite alphabet \(\Sigma\), where \(m\) and \(n\) are positive integers. In its simplest form, the pattern or string matching problem consists of searching the text \(T\) to find the occurrence(s) of the pattern \(P\) in \(T\) (\(m \le n\)).
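The simplest form of the problem can be illustrated with a brute-force scan. The following is a minimal sketch, not one of the linear-time algorithms cited later; the function name `find_occurrences` and the use of 0-based indices are assumptions made for illustration.

```python
def find_occurrences(pattern, text):
    """Return every 0-based index at which pattern occurs in text.

    Brute-force scan: O(m * n) character comparisons in the worst case.
    """
    m, n = len(pattern), len(text)
    hits = []
    # Slide the pattern over every possible alignment with the text.
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            hits.append(i)
    return hits


if __name__ == "__main__":
    # Overlapping occurrences are reported separately.
    print(find_occurrences("aba", "ababa"))
```

Each alignment is checked independently, so overlapping occurrences are all reported; if the pattern is longer than the text, the result is simply empty.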

Several variants of the basic problem can be considered. The pattern may consist of a finite set of sequences \(P = \{P_1, P_2, \ldots, P_k\}\), where each \(P_i\) is a pattern over the same alphabet, and the problem is to search for occurrence(s) of any member of the set in the text. The patterns may be fully or partially specified.

• Let $ denote a "don't care" or "wild card" character; then the pattern A$B denotes a set of patterns AAB, ABB, ACB, etc. - that is, any pattern that begins with A, ends with B, and has a single unspecified character in the middle. The character $ is called a "fixed length don't care" (FLDC) character and may appear at any place in the pattern.

• A special character 0 is used to denote the infinite set of patterns {$, $$, $$$, ...} and is called a "variable length don't care" (VLDC) character.

Patterns containing the special characters $ or 0 are called partially specified; otherwise, they are termed fully specified.
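Matching a pattern that contains FLDC characters can be sketched as a brute-force scan in which the wild card is allowed to match any single text character. The function name `find_with_fldc` and its `wildcard` parameter are hypothetical; the sketch handles only the fixed-length case and does not attempt the VLDC character.

```python
def find_with_fldc(pattern, text, wildcard="$"):
    """Return 0-based indices where pattern matches text.

    Each occurrence of `wildcard` in the pattern matches exactly one
    arbitrary character (a fixed length don't care). Assumption: the
    wildcard symbol does not occur as an ordinary character in the pattern.
    """
    m, n = len(pattern), len(text)
    hits = []
    for i in range(n - m + 1):
        window = text[i:i + m]
        # A position agrees if the pattern has a wild card there
        # or the two characters are equal.
        if all(p == wildcard or p == t for p, t in zip(pattern, window)):
            hits.append(i)
    return hits


if __name__ == "__main__":
    # "A$B" stands for AAB, ABB, ACB, and so on.
    print(find_with_fldc("A$B", "AABXACBAB"))
```

In the usage line, the windows "AAB" and "ACB" both satisfy the pattern A$B, while "ABX" fails because its last character is not B.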

The string matching problem has been extensively studied in the literature. Several linear-time algorithms for the exact pattern matching problem (involving fully specified patterns) have been developed by researchers [41]-[43].
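As an illustration of how linear time can be achieved, the sketch below implements the Knuth-Morris-Pratt (KMP) algorithm, a well-known linear-time method for exact matching. It is offered as a representative example only; this section does not state which algorithms Refs. [41]-[43] describe, and the function names are chosen here for illustration.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of pattern[:i+1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        # Fall back to shorter borders until the next character fits.
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(pattern, text):
    """Return all 0-based match positions in O(m + n) time."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return hits
```

The text pointer never moves backward: on a mismatch, the failure table tells the algorithm how much of the already matched prefix can be reused, which is what bounds the total work by the combined length of pattern and text.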

No linear-time algorithm is yet known for the string matching problem with a partially specified pattern. The best known result for pattern matching using a pattern containing wild card characters is by Fischer and Paterson [44], with complexity \(O(n \log^2 m \, \log\log m \, \log c)\), where \(c\) is the size of the alphabet. Several two-dimensional exact pattern matching algorithms have been proposed in Refs. [45]-[47].

There are other variations of string matching in which the pattern is not fully specified, for example, finding occurrences in the text of similar patterns with small differences. Consider trying to find the occurrences of patterns similar to (say) "birth," differing from it in at most two character positions. Here the patterns "birth," "broth," "booth," "worth," "dirty," etc., are all considered valid occurrences in the text. All of the above variations of the string matching problem are usually known in the literature as Approximate String Matching.
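The "birth" example can be sketched as a scan that tolerates a bounded number of mismatched positions. The sketch assumes that "difference" means substitution only (Hamming distance), so insertions and deletions are not handled; the function name `hamming_matches` is hypothetical.

```python
def hamming_matches(pattern, text, max_mismatches):
    """Return (index, substring) pairs for every window of text that
    differs from pattern in at most max_mismatches positions."""
    m, n = len(pattern), len(text)
    hits = []
    for i in range(n - m + 1):
        window = text[i:i + m]
        mismatches = 0
        for p, t in zip(pattern, window):
            if p != t:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too different, abandon this window
        else:
            # The loop finished without exceeding the mismatch budget.
            hits.append((i, window))
    return hits


if __name__ == "__main__":
    for word in ["birth", "broth", "booth", "worth", "dirty", "build"]:
        print(word, bool(hamming_matches("birth", word, 2)))
```

With a budget of two mismatches, each of the five words from the example is accepted, whereas "build" differs from "birth" in four positions and is rejected.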

The string (or pattern) matching problem becomes even more interesting when one attempts to match a pattern directly in a compressed text or database. String matching finds widespread application in diverse areas such as text editing, text search, information retrieval, text mining, Web mining, and Bioinformatics. It is an essential component of text analysis and retrieval, where it is used to automatically extract the words, keywords, and sets of terms in a document, and of query processing in text mining.


We have devoted Chapter 4 to string matching, encompassing a detailed description of the classical algorithms along with a number of examples for each of them.

1.12 BIOINFORMATICS

A gene is a fundamental constituent of any living organism. The sequence of genes in a human body represents the signature(s) of the person. Genes are portions of deoxyribonucleic acid, or DNA for short. J. D. Watson and F. H. Crick proposed a structure of DNA in 1953, consisting of two strands or chains. Each of these chains is composed of phosphate and deoxyribose sugar molecules joined together by covalent bonds. A nitrogenous base is attached to each sugar molecule. There are four bases: adenine [A], cytosine [C], guanine [G], and thymine [T]. From an information-theoretic perspective, DNA can be considered as a string or sequence of symbols, each of which is one of the four bases A, C, G, or T.

In the human body there are approximately 3 billion such base pairs. The whole stretch of the DNA is called the genome of an organism. Obviously, such a long stretch of DNA cannot be sequenced all at once. Mapping, search, and analysis of patterns in such long sequences can be combinatorially explosive and impractical to process even on today's powerful digital computers.

Typically, a DNA sequence may be 40,000-100,000 base pairs long. In practice, such a long stretch of DNA is first broken up into 400-2000 small fragments. Each such small fragment typically consists of approximately 1000 base pairs. These fragments are sequenced experimentally and then reassembled to reconstruct the original DNA sequence. Genes are encoded in these fragments of DNA. Understanding which parts of the genome encode which genes is a major area of study in computational molecular biology or Bioinformatics [7, 48]. The results of string matching algorithms and their derivatives have been applied in search, analysis, and sequencing of DNA, and in other developments in Bioinformatics.
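The role of string matching in reassembling fragments can be illustrated with a greedy overlap-merging sketch. This is a toy model under strong assumptions: the fragments are error-free, all come from the same strand, and no fragment lies strictly inside another. It is not the procedure used by any particular sequencing pipeline, and the function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Repeatedly merge the pair of fragments with the largest overlap
    until a single sequence remains."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Join the best pair, writing the shared region only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


if __name__ == "__main__":
    print(greedy_assemble(["GGATTC", "ACGTAC", "TACGGA"]))
```

The suffix-prefix test in `overlap` is itself a small exact string matching problem, which is why faster matching algorithms translate directly into faster assembly. Greedy merging does not guarantee the shortest possible reconstruction, and repeated regions in real genomes can mislead it.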

Microarray experiments are performed to produce gene expression patterns, which provide dynamic information about cell function. The huge volume of such data, and their high dimensionality, make gene expression data suitable candidates for the application of data mining functions such as clustering, visualization, and string matching. Visualization is used to transform these high-dimensional data into a lower-dimensional, human-understandable form. This aids subsequent analysis, leading to efficient knowledge discovery. Microarray technologies are utilized to evaluate the level of expression of thousands of genes, with applications in colon, breast, and blood cancer treatment [48].

Proteins are made up of polypeptide chains of amino acids, whose sequence is encoded by the DNA. General principles of protein structure, stability, and folding kinetics are being explored in Bioinformatics using lattice models. These models represent protein chains in terms of a small number of parameters, and they allow complete exploration of conformational and sequence spaces. Interactions among spatially neighboring amino acids during folding are controlled by such factors as bond length, bond angle, electrostatic forces, hydrogen bonding, hydrophobicity, entropy, etc. [49]. Determining the optimal three-dimensional conformation of a protein constitutes the protein folding problem. This has wide-ranging applications in pharmacogenomics, and more specifically in drug design.
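One widely studied lattice model is the two-dimensional hydrophobic-polar (HP) model, sketched below to show how a conformation can be scored. It is an assumed example, not necessarily the model intended by Ref. [49]: each residue is reduced to H (hydrophobic) or P (polar), the chain is a self-avoiding walk on a square grid, and the energy is minus the number of H-H pairs that are lattice neighbors without being adjacent in the chain. The function name and move encoding are hypothetical.

```python
def hp_energy(sequence, moves):
    """Score a 2D HP-model conformation.

    sequence: string over {'H', 'P'}.
    moves:    string over {'U', 'D', 'L', 'R'} of length len(sequence) - 1,
              giving the lattice step from each residue to the next.
    Returns minus the number of non-consecutive H-H contacts, or None
    if the walk visits a lattice site twice (an invalid conformation).
    """
    steps = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}
    if len(moves) != len(sequence) - 1:
        raise ValueError("need exactly one move per bond")
    coords = [(0, 0)]
    for mv in moves:
        dx, dy = steps[mv]
        x, y = coords[-1]
        coords.append((x + dx, y + dy))
    if len(set(coords)) != len(coords):
        return None  # chain overlaps itself
    position = {c: i for i, c in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Check only the right and upper neighbors so that each
        # pair of sites is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = position.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                contacts += 1
    return -contacts


if __name__ == "__main__":
    print(hp_energy("HPPH", "RUL"))  # folded into a square
    print(hp_energy("HPPH", "RRR"))  # fully extended chain
```

Folding the four-residue chain into a square brings its two hydrophobic ends together and lowers the energy relative to the extended chain. Searching over all move strings for the minimum-energy conformation is the lattice version of the folding problem described above.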

The different aspects of the applicability of data mining to Bioinformatics are described in detail in Chapter 10.
