4 TANE Algorithm for Noisy Data Detection

Data quality analysis results show that the errors in the training data as well as in the testing data affect the predictive accuracy of the ANFIS based soft sensor models.

This section discusses the TANE algorithm, an efficient algorithm that is used here to identify the noisy data pairs in the dataset.

The TANE algorithm, which discovers functional and approximate dependencies in large data files, is effective in practice [13]. For each set of attributes, the TANE algorithm partitions the set of tuples into equivalence classes: two tuples belong to the same class when they have the same values for those attributes.

By checking whether the tuples that agree on the left-hand side also agree on the right-hand side, one can determine whether a dependency holds. By analyzing the approximate dependencies that are found, one can identify potentially erroneous data in the relation.
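The partition-based check described above can be sketched in a few lines of Python. This is a minimal illustration and not the full TANE search: the helper names `partition` and `dependency_holds` and the toy attributes `a`, `b`, `c` are hypothetical. It relies on the property used by TANE [13] that a dependency X → A holds exactly when adding A to X does not split any equivalence class of X.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    classes = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(row[a] for a in attrs)
        classes[key].append(index)
    return list(classes.values())


def dependency_holds(rows, lhs, rhs):
    """X -> A holds when refining the partition of X by A adds no classes."""
    return len(partition(rows, lhs)) == len(partition(rows, lhs + [rhs]))


# Toy relation (illustrative values only, not taken from the study).
rows = [
    {"a": 1, "b": 1, "c": 10},
    {"a": 1, "b": 1, "c": 10},
    {"a": 2, "b": 1, "c": 20},
    {"a": 2, "b": 1, "c": 21},  # agrees with the previous row on a, b but not on c
]

holds_ab_c = dependency_holds(rows, ["a", "b"], "c")
holds_c_a = dependency_holds(rows, ["c"], "a")
```

In this toy relation the last two rows agree on `a` and `b` but differ on `c`, so `a, b → c` is violated, while `c → a` holds because every value of `c` maps to a single value of `a`.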

In this research, the relationship between the three input parameters (Q_in, Q_sol, and T₀) and the average air temperature (T_avg) is analyzed using the TANE algorithm. To form the equivalence partitions, all four parameters are rounded to zero decimal places, that is, to integer values.
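The rounding step can be illustrated as follows. This is a sketch under assumptions: the rounding convention is not stated in the text, so round-half-up is assumed here (Python's built-in `round` rounds halves to the nearest even integer instead), and the function names and sample values are hypothetical.

```python
import math


def round_half_up(value):
    """Round to the nearest integer, with halves rounded upward (assumed convention)."""
    return math.floor(value + 0.5)


def discretize(record):
    """Round every parameter of a (Q_in, Q_sol, T_0, T_avg) record to an integer."""
    return tuple(round_half_up(v) for v in record)


# Two illustrative measurements (made-up values) that differ slightly
# before rounding but fall into the same equivalence class afterwards.
a = discretize((12.4, 3.6, 5.5, 20.49))
b = discretize((11.6, 4.4, 6.4, 19.5))
same_class = a == b
```

After rounding, nearby measurements share the same integer key, which is what allows them to be grouped into one equivalence class.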

After data pre-processing, four approximate dependencies are discovered, as shown in Table 4. Although all of these dependencies reflect relationships among the parameters, the first dependency is the most important one for average air temperature estimation: it shows that the selected input parameters are consistently associated with the average air temperature, except for a few data pairs.

To identify exceptional tuples by analyzing the approximate dependencies, the equivalence partitions of both the left-hand and the right-hand sides of an approximate dependency must be investigated. This is non-trivial work that can lead to the discovery of problematic data. By analyzing the first dependency, conflicting tuples are identified. The total dataset has 7,132 data pairs. For the first approximate dependency in Table 4, 42 conflicting data pairs are present; these were fixed using data interpolation to improve the performance of the ANFIS-GRID model.
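One possible way to flag and repair conflicting tuples is sketched below. Several choices here are assumptions, since the text does not specify them: a row is flagged when its right-hand-side value differs from the most frequent value in its left-hand-side equivalence class, and flagged values are repaired by linear interpolation between the nearest unflagged neighbours in sample order. The function names, column names, and sample values are hypothetical.

```python
from collections import Counter, defaultdict


def conflicting_rows(rows, lhs, rhs):
    """Indices of rows whose rhs value differs from the majority of their lhs class."""
    classes = defaultdict(list)
    for index, row in enumerate(rows):
        classes[tuple(row[a] for a in lhs)].append(index)
    flagged = []
    for members in classes.values():
        counts = Counter(rows[i][rhs] for i in members)
        majority_value = counts.most_common(1)[0][0]  # ties: first value seen wins
        flagged.extend(i for i in members if rows[i][rhs] != majority_value)
    return sorted(flagged)


def interpolate_flagged(values, flagged):
    """Replace flagged entries by linear interpolation between unflagged neighbours."""
    flagged = set(flagged)
    fixed = list(values)
    good = [i for i in range(len(values)) if i not in flagged]
    if not good:
        return fixed  # nothing reliable to interpolate from
    for i in flagged:
        left = max((g for g in good if g < i), default=None)
        right = min((g for g in good if g > i), default=None)
        if left is None:
            fixed[i] = values[right]  # leading edge: copy nearest good value
        elif right is None:
            fixed[i] = values[left]   # trailing edge: copy nearest good value
        else:
            weight = (i - left) / (right - left)
            fixed[i] = values[left] + weight * (values[right] - values[left])
    return fixed


# Rounded records (illustrative values only).
rows = [
    {"q_in": 12, "q_sol": 4, "t_0": 6, "t_avg": 20},
    {"q_in": 12, "q_sol": 4, "t_0": 6, "t_avg": 20},
    {"q_in": 12, "q_sol": 4, "t_0": 6, "t_avg": 26},  # conflicts with its class
    {"q_in": 12, "q_sol": 4, "t_0": 6, "t_avg": 20},
    {"q_in": 15, "q_sol": 2, "t_0": 3, "t_avg": 22},
]
flagged = conflicting_rows(rows, ["q_in", "q_sol", "t_0"], "t_avg")

# Unrounded T_avg series for the same samples (illustrative values only).
raw_t_avg = [20.1, 19.8, 26.3, 20.2, 22.0]
fixed_t_avg = interpolate_flagged(raw_t_avg, flagged)
```

Counting the rows outside the majority of each class is one way to arrive at a figure like the "number of rows with conflicting tuples", although the counting rule used in the study may differ.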

5 Results

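The developed ANFIS-GRID model is validated using experimental results [7]. The model performance is measured using the root mean square error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(T_{avg,i} - \hat{T}_{avg,i}\right)^{2}} \qquad (3)$$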

Table 4 Approximate functional dependencies detected using the TANE algorithm

Index   Approximate dependency       Number of rows with conflicting tuples
1       Q_in, Q_sol, T₀ → T_avg      42
2       Q_in, T₀, T_avg → Q_sol      47
3       Q_in, Q_sol, T_avg → T₀      43
4       Q_in, T₀, T_avg → Q_sol      54

152 S. Jassar et al.

For Eq. (3), N is the total number of data pairs, T̂_avg is the estimated value and T_avg is the experimental value of the average air temperature.
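As a small worked example, the RMSE can be computed as below, assuming the standard root-mean-square-error definition over the N data pairs. The function name `rmse` and the sample temperatures are illustrative and are not taken from the experimental dataset.

```python
import math


def rmse(measured, estimated):
    """Root mean square error between measured and estimated values."""
    if not measured or len(measured) != len(estimated):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(measured)
    squared_errors = ((m - e) ** 2 for m, e in zip(measured, estimated))
    return math.sqrt(sum(squared_errors) / n)


# Illustrative values in degrees Celsius: every estimate is off by 0.5.
measured = [20.0, 21.0, 19.0, 22.0]
estimated = [20.5, 20.5, 19.5, 21.5]
value = rmse(measured, estimated)
```

Because every estimate in this example deviates by 0.5 °C, the resulting RMSE is also 0.5 °C.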

Initially, the ANFIS-GRID model uses the raw data for both training and testing. Figure 5 compares the average air temperature values estimated by ANFIS-GRID with the experimental results. The first plot in Fig. 5 shows that the estimated values agree with the experimental results overall, with an RMSE of 0.56 °C. However, at some points the estimate does not follow the experimental results. For example, around hours 152–176 and 408–416 of the year, there is a significant difference between the estimated and the experimental results.

To check the effect of data quality on ANFIS-GRID performance, the training and testing datasets are cleaned using the TANE algorithm. The conflicting data pairs are replaced with corrected data pairs. The cleaned dataset is then used for both the training and the testing of the ANFIS-GRID model. The second plot of Fig. 5 compares the model output obtained with the clean data against the experimental results.

Fig. 5 Comparison of ANFIS-GRID estimated and the measured temperature values

12 Data Quality in ANFIS Based Soft Sensors 153

Figure 6 clearly shows the effect of data quality on the predictive accuracy of the ANFIS-GRID model. The RMSE improves by 37.5%, to 0.35 °C. RMSE is used as the measure of predictive accuracy: a lower RMSE means a smaller difference between the estimated and the actual values, so predictive accuracy improves as RMSE decreases. Figure 6 also shows the improvement in the R² values.
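The reported improvement can be checked with simple arithmetic from the two RMSE values given in the text, 0.56 °C with raw data and 0.35 °C with cleaned data. The helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(before, after):
    """Percentage reduction of an error measure relative to its initial value."""
    return (before - after) / before * 100.0


# RMSE in degrees Celsius, as reported for raw and cleaned data.
improvement = relative_improvement(0.56, 0.35)
```

The reduction of 0.21 °C relative to 0.56 °C corresponds to the 37.5% improvement reported for the cleaned dataset.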

6 Conclusion

An ANFIS-GRID based inferential model has been developed to estimate the average air temperature in distributed space heating systems. This model is simpler than the subtractive clustering based ANFIS model [2] and may be used as the air temperature estimator in the development of an inferential control scheme. A grid partition based FIS structure is used because there are only three input variables. The training dataset is also large relative to the number of modifiable parameters of the ANFIS. Because experimental data is used for both the training and the testing of the developed model, some discrepancies in the data are to be expected. The TANE algorithm is used to identify the approximate functional dependencies among the input and output variables. The most important approximate dependency is analyzed to identify the data pairs with uneven patterns.

The identified data pairs are fixed, and the developed model is trained and tested again with the cleaned data. Figure 6 shows that the RMSE is improved by 37.5% and R² is improved by 12%. Therefore, it is highly recommended that the quality of datasets be analyzed before they are applied in ANFIS based modelling.

Fig. 6 Comparison of results


References

1. M.T. Tham, G.A. Montague, A.J. Morris, P.A. Lant, Estimation and inferential control. J. Process Control 1, 3–14 (1993)

2. S. Jassar, Z. Liao, L. Zhao, Adaptive neuro-fuzzy based inferential sensor model for estimating the average air temperature in space heating systems. Build. Environ. 44, 1609–1616 (2009)

3. Z. Liao, A.L. Dexter, An experimental study on an inferential control scheme for optimising the control of boilers in multi-zone heating systems. Energy Build. 37, 55–63 (2005)

4. B.D. Klein, D.F. Rossin, Data errors in neural network and linear regression models: an experimental comparison. Data Quality 5, 33–43 (1999)

5. J.S.R. Jang, ANFIS: adaptive-network-based fuzzy inference system. IEEE Trans. Syst. Man Cybern. 23, 665–685 (1993)

6. Z. Liao, A.L. Dexter, A simplified physical model for estimating the average air temperature in multi-zone heating systems. Build. Environ. 39, 1013–1022 (2004)

7. BRE, ICITE, Controller efficiency improvement for commercial and industrial gas and oil fired boilers, A craft project, Contract JOE-CT98-7010, 1999–2001

8. J. Wang, B. Malakooti, A feed forward neural network for multiple criteria decision making. Comput. Oper. Res. 19, 151–167 (1992)

9. Y. Huh, F. Keller, T. Redman, A. Watkins, Data quality. Inf. Softw. Technol. 32, 559–565 (1990)

10. Bansal, R. Kauffman, R. Weitz, Comparing the modeling performance of regression and neural networks as data quality varies. J. Manage. Inf. Syst. 10, 11–32 (1993)

11. D. O’Leary, The impact of data accuracy on system learning. J. Manage. Inf. Syst. 9, 83–98 (1993)

12. M. Wei, et al., Predicting injection profiles using ANFIS. Inf. Sci. 177, 4445–4461 (2007)

13. Y. Huhtala, J. Karkkainen, P. Porkka, H. Toivonen, TANE: an efficient algorithm for discovering functional and approximate dependencies. Comput. J. 42, 100–111 (1999)


In the document: Lecture Notes in Electrical Engineering, Volume 68 (pages 175–180)