
General applications of data mining

From the document DATA MINING IN AGRICULTURE (pages 29-33)

This section presents some general applications of data mining, with the aim of showing that data mining techniques are applicable in many research fields.

An overview of the applications in agriculture discussed in this book is given in Section 1.5.

1.3.1 Data mining for studying brain dynamics

Data mining techniques are successfully applied in the field of medicine. Some recent works include, for instance, the detection of cancers from proteomic profiles [149], the prediction of breast cancer survivability [56], the control of infections in hospitals [27] and the analysis of diseases such as bronchopulmonary dysplasia [199]. In this section we will focus instead on another disease, epilepsy, and on a recently proposed data mining technique for studying this disease [20, 31].

Epilepsy is a disorder of the central nervous system that affects about 1% of the population of the world. The rapid development of synchronous neuronal firing in persons affected by this disease induces seizures, which can strongly affect their quality of life. Seizure symptoms include the well-known uncontrollable shaking, accompanied by loss of awareness, hallucinations and other sensory disturbances.

As a consequence, persons affected by epilepsy can face difficulties in social life and career opportunities, low self-esteem, restricted driving privileges, etc. Epilepsy is mainly treated with anti-epileptic drugs, which unfortunately do not work in about 30% of the patients diagnosed with this disease. In such cases, the seizures could be cured by surgery, but not all patients can be cured in this way. The main problem is that the procedure cannot be performed on brain regions that are essential for the normal functioning of the patient. In order to check the eligibility for surgery, electroencephalographic analysis is performed on the patient’s brain.

Since not all patients can be treated by surgery, and since surgery is a very invasive procedure, especially when it is performed on the brain, there have been other attempts to control epileptic seizures. These attempts involve the electrical stimulation of the brain. One of them is chronic vagus nerve stimulation. A device can be implanted subcutaneously in the left side of the chest for electrical stimulation of the cervical vagus nerve. Such a device is programmed to deliver electrical stimulation with a certain intensity, duration, pulse width, and frequency. This method for controlling epileptic seizures has been successfully applied, and patients have been able to benefit from it once the device has been tuned. Each patient has to be stimulated in his own way, and therefore the stimulation parameters need to be tuned in newly implanted patients. This process is very important, because the device must be personalized to the patient’s needs.

Unfortunately, the only way of tuning the device is currently a trial-and-error procedure. Once the device has been implanted, it is set to initial parameters, and patient reports help in modifying these parameters until the ones that best fit the patient are found. The problem is that, during this process, the patient may continue to experience seizures because the parameter values are not suitable for him, or he may not tolerate some other parameter values. Locating the optimal parameters more rapidly would therefore save money, thanks to fewer doctor visits, and would help the patient at the same time. Data from electroencephalography have been collected from epileptic patients and analyzed by data mining techniques, in order to predict the efficacy of the numerous combinations of stimulation parameters.

In these studies, support vector machines (Chapter 6) have been used in the experiments presented in [20], whereas a biclustering approach (Chapter 7) has been used in [31]. The results of the analysis suggest that patterns can be extracted from electroencephalographic measures that can be used as markers of the optimal stimulation parameters.

1.3.2 Data mining in telecommunications

The telecommunication field has some interesting applications of data mining. In fact, as pointed out in [197], the amount of data generated in the telecommunications field has become unmanageable, and data mining techniques have shown their advantages in helping to manage this information and transform it into useful knowledge. In the quoted paper, a real-time data mining method is proposed for analyzing telecommunications data.

An interesting application in this field consists of detecting the users who are likely to perform fraudulent activities against telecommunication companies.

Millions of dollars are lost every year by telecommunication companies because of fraud. Therefore, detecting users who may behave fraudulently is useful for the companies in order to monitor and avoid such activities. The hope is to identify the fraudulent users as soon as possible, starting from the time they subscribe.

The studies that are the focus of this section are related to a telecommunication company, and details can be found in [69]. The aim of the studies is to develop a system for identifying fraudulent users at the time of application. In this example, a neural network approach is used (see Chapter 5). The data used for training the neural network are collected from different databases managed by the company. The data consist of information regarding each single user and the classification of the user’s behavior as fraudulent or not. For each user, information such as name, address, date of birth, ID number, etc., is collected. The classification of the user’s behavior is performed by an expert who checks the user’s payment history. Once the neural network is trained, it is supposed to do this job on new users, whose payment history is not yet available.

The personal information that each user provides when he subscribes can contain clues about his future behavior. If a user has the same name and ID number as another user in the database who has already behaved fraudulently, then there is a high probability that this behavior will be repeated. In the specific case discussed in [69], a public database is available where insolvency situations, mostly related to banks and stores, are registered. Therefore, the user’s behavior can also be checked in situations beyond the ones related to the telecommunication company itself.

Users having the same address can also behave in similar ways. Moreover, when the application for a new phone line is filled in, the new user is asked to provide an existing phone number as a reference. The new and the existing phone lines have a high probability of being classified in the same way. By using this information, a particular kind of fraudulent behavior can be detected. Before the telecommunication company finds out that a particular line is related to a fraud and blocks it, the fraudster can apply for a new phone line under another name while providing the old line as reference during the application. This could be repeated in a sort of chain, if the line provided in the application is not verified.
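The chain of reference lines described above can be followed mechanically. The following is a minimal sketch, not the system of [69]: it assumes that each application is stored as an entry of a hypothetical `references` dictionary (new line mapped to the reference line it provided) and that a set `known_fraud` of already blocked lines is available. Both names and the data layout are illustrative assumptions.

```python
def flag_by_reference_chain(references, known_fraud):
    """Return the lines whose chain of reference numbers reaches a known fraudulent line.

    references:  dict mapping a phone line to the existing line given as
                 reference in its application (hypothetical data layout).
    known_fraud: set of lines already classified as fraudulent.
    """
    flagged = set()
    for line in references:
        seen = set()
        current = line
        # Walk the chain of references, guarding against cycles.
        while current in references and current not in seen:
            seen.add(current)
            current = references[current]
            if current in known_fraud:
                flagged.add(line)
                break
    return flagged


# Hypothetical example: C was opened with B as reference, and B with A.
references = {"B": "A", "C": "B", "E": "D"}
suspects = flag_by_reference_chain(references, known_fraud={"A"})
```

Here lines B and C are flagged because their chains lead back to the fraudulent line A, whereas E is not, since its reference D has no known fraud history.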

The user’s behavior can be classified as fraudulent or not. This is a simplified classification into 2 classes only. In general, each subscriber can be classified into more than 2 classes when he applies for a new phone line. The first class contains the most fraudulent users: they do not pay bills or their debt/payment ratio is very high, and they have suspicious activities related to long distance calls. The otherwise fraudulent users are instead those that have a sudden change in their calling behavior which generates an abnormal increase of the bill amount. Users having two or more unpaid bills and a debt less than 10 times their monthly bill are classified as insolvent. Finally, users who paid all their bills, or who have only one unpaid bill, can be classified as normal.
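The four classes above can be written as explicit rules. The sketch below is only one possible reading of them: the text does not say how high a "very high" debt/payment ratio is, so the cutoff of 10 monthly bills is borrowed from the insolvency rule as an assumption, and the function name, argument names and the two boolean flags are hypothetical.

```python
def classify_subscriber(unpaid_bills, debt, monthly_bill,
                        suspicious_long_distance=False,
                        sudden_bill_increase=False):
    """Assign one of the four classes described in the text (illustrative rules)."""
    # The factor 10 comes from the insolvency rule; reusing it as the
    # "very high" debt cutoff is an assumption of this sketch.
    very_high_debt = debt >= 10 * monthly_bill
    if unpaid_bills >= 2 and very_high_debt and suspicious_long_distance:
        return "fraudulent"
    if sudden_bill_increase:
        return "otherwise fraudulent"
    if unpaid_bills >= 2 and not very_high_debt:
        return "insolvent"
    if unpaid_bills <= 1:
        return "normal"
    # Cases that the rules in the text do not cover explicitly.
    return "unclassified"


label = classify_subscriber(unpaid_bills=3, debt=200.0, monthly_bill=50.0)
```

In the study itself these labels are assigned by an expert from the payment history; rules of this kind could only approximate that judgment.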

The neural network used in these studies is a multilayer perceptron in which the neurons are organized in three layers (see Section 5.1). The 22 neurons of the input layer correspond to the 22 pieces of information collected from the user during the application. The 2 neurons of the output layer allow the network to distinguish only between two classes: fraudsters and non-fraudsters. The internal layer, the hidden layer, contains 10 neurons. The data obtained from the databases of the telecommunication company and subsequently classified by an expert are divided into a training set, a validation set and a testing set. In this way, the validation set can be used during the training phase to check whether the network is correctly learning how to classify the data. After this process, the network can then be tested on data whose classification is known, the ones in the testing set. For more details about validation techniques, refer to Chapter 8.
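To make the 22-10-2 architecture and the three-way split concrete, here is a minimal sketch in plain Python. It only builds the layers, splits the data and evaluates the network on one record; training (for example by backpropagation) is omitted. The sigmoid activation, the random initial weights, the 60/20/20 split and the synthetic records are all assumptions, since the text does not specify them.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    # One row of weights per neuron, with the bias stored as the last entry.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def layer_output(weights, inputs):
    # Weighted sum plus bias, followed by a sigmoid (assumed activation).
    outputs = []
    for row in weights:
        z = row[-1] + sum(w * x for w, x in zip(row[:-1], inputs))
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    return outputs


def forward(hidden, output, features):
    # Input layer -> hidden layer -> output layer.
    return layer_output(output, layer_output(hidden, features))


def split(records, rng, train=0.6, validation=0.2):
    # Proportions are illustrative; the source does not state them.
    shuffled = records[:]
    rng.shuffle(shuffled)
    a = int(len(shuffled) * train)
    b = a + int(len(shuffled) * validation)
    return shuffled[:a], shuffled[a:b], shuffled[b:]


rng = random.Random(0)
hidden = init_layer(22, 10, rng)   # 22 inputs -> 10 hidden neurons
output = init_layer(10, 2, rng)    # 10 hidden -> 2 output neurons

# Synthetic stand-in for the 22 pieces of information per user.
records = [[rng.random() for _ in range(22)] for _ in range(50)]
train_set, validation_set, test_set = split(records, rng)

scores = forward(hidden, output, train_set[0])
predicted_class = scores.index(max(scores))  # index of the strongest output neuron
```

In a full implementation, the weights would be updated on `train_set`, the error on `validation_set` would be monitored to decide when to stop training, and `test_set` would be used only once at the end.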

1.3.3 Mining market data

Data mining applied to finance is also referred to as financial data mining. Some of the most recent papers on this topic are [240], in which a new adaptive neural network is proposed for studying financial problems, and [247], in which stock market tendency is studied by using a support vector machine approach. In fact, in finance, one of the most important problems is to study the behavior of the market. In the United States alone, the large number of stock markets generates a considerable amount of data every day. These data can be visualized and analyzed by experts. However, because of the quantity of data, only small parts of it can be visualized at a time, which makes the expert’s work difficult. Automated techniques for extracting useful information from these data are therefore needed. Data mining techniques can help solve the problem, as in the application presented in [25].

Recently, stock markets have been represented as networks (or graphs). As discussed in Section 1.2.2, the success of a data mining method strongly depends on the data representation used. In this approach, a network connecting different nodes representing different stocks seems to be the optimal choice. The network representation of a set of data is currently widely used in finance, and also in other applied fields. In this example, each node of the network represents a stock, and two nodes are linked in the network if their market prices behave similarly over a certain period of time. Such a network can be studied with the purpose of revealing the trends that can take place in the stock market.
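As an illustration of how such a network could be built, the sketch below links two stocks when the correlation of their price changes exceeds a threshold. The text does not define the similarity criterion, so the use of the Pearson correlation of relative price changes, the threshold of 0.8, the function names and the price series are all assumptions.

```python
import math


def returns(prices):
    # Relative price change between consecutive observations.
    return [(b - a) / a for a, b in zip(prices, prices[1:])]


def pearson(x, y):
    # Pearson correlation coefficient; assumes neither series is constant.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def build_network(price_series, threshold=0.8):
    """Return the set of links (u, v) between stocks with similar behavior."""
    names = sorted(price_series)
    rets = {name: returns(price_series[name]) for name in names}
    edges = set()
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            if pearson(rets[u], rets[v]) >= threshold:
                edges.add((u, v))
    return edges


# Hypothetical prices over five days for three made-up stocks.
prices = {
    "AAA": [10.0, 10.5, 10.2, 10.8, 11.0],
    "BBB": [20.0, 21.0, 20.4, 21.6, 22.0],  # moves proportionally to AAA
    "CCC": [30.0, 29.0, 30.5, 29.5, 31.0],  # moves against AAA
}
edges = build_network(prices)
```

With these made-up series, only AAA and BBB end up linked. Raising or lowering the threshold makes the network sparser or denser, which in turn changes the groups of stocks that can be found in it.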

Given a certain set of market data, a network can be associated with it. In the network, stocks having similar behaviors are connected by links. Grouping together stocks with similar market properties is useful for studying the market trends. Clustering techniques can be used for this purpose. However, in this case, the problem is different from the usual one. Section 1.2.1 introduces clustering techniques as techniques for grouping data in different clusters. In this case, there is only one complex variable, the network, and its nodes have to be partitioned. Similar nodes can be grouped in the same cluster, which defines a sort of sub-network of the original one. In such sub-networks, all nodes are connected to each other, because they are similar. These kinds of networks are called cliques in graph theory. Thus, this clustering problem can be seen as the problem of finding a clique partition of the original network. Such a problem is considered challenging because the number of clusters and the similarity criterion are usually not known a priori.
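A clique partition can be illustrated on a small network with a simple greedy heuristic. This is only a sketch for intuition: it is not an exact method and does not guarantee a partition with the fewest cliques. The function name and the toy network are hypothetical.

```python
def greedy_clique_partition(nodes, edges):
    """Partition the nodes into cliques with a simple greedy heuristic."""
    neighbors = {n: set() for n in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    unassigned = set(nodes)
    cliques = []
    # Visit nodes by decreasing degree so that dense groups are built first.
    for seed in sorted(nodes, key=lambda n: (-len(neighbors[n]), n)):
        if seed not in unassigned:
            continue
        clique = [seed]
        unassigned.discard(seed)
        for cand in sorted(neighbors[seed] & unassigned):
            # Add a candidate only if it is linked to every current member.
            if all(cand in neighbors[m] for m in clique):
                clique.append(cand)
                unassigned.discard(cand)
        cliques.append(sorted(clique))
    return cliques


# Toy network: A, B, C are mutually linked; D is linked to C and E.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
partition = greedy_clique_partition(nodes, edges)
```

On this toy network the heuristic produces two cliques, {A, B, C} and {D, E}: D is linked to C but not to A or B, so it cannot join the first group. Every node belongs to exactly one clique, which is what distinguishes a clique partition from simply listing all cliques of the network.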

Recently, in [10], the food market in the United States has been analyzed by using this approach. The food market in the United States is one of the largest in the world, since the country is a major exporter and a significant consumer of food products. For instance, the agricultural exports of the US were about $68 billion for the year 2006. The food sector in the US includes retailers, wholesalers and all food services that link the farmers to the consumers. In general, the food market industry in the US has a significant global impact and it provides a representative sample for food economic studies.

In [10], the food market of the US has been represented by a network and its trends have been analyzed by looking for a clique partition of this network. An optimization problem has been formulated for this purpose, and it has been solved by using the software CPLEX9 [114]. The resulting cliques revealed which markets are highly correlated. For instance, the clustering showed that the beverages, grocery stores, and packaged foods markets have significantly high market capitalization.

Such a partition can also help in predicting the behavior of different stock markets. Indeed, if the trend of some market in a clique is known, then the trends of the other markets in the same clique are expected to be similar to the known one.
