Article ID: | iaor20052316 |
Country: | India |
Volume: | 41 |
Issue: | 3 |
Start Page Number: | 178 |
End Page Number: | 187 |
Publication Date: | Sep 2004 |
Journal: | OPSEARCH |
Authors: | Despotis D.K., Koliastasis D. |
Keywords: | data mining |
In knowledge discovery from databases and warehouses, a growing number of algorithms and practices can be used for the very same application, for example to predict the value of a specific field. These algorithms are trained on a portion of the original data set and tested for accuracy on the remainder. Furthermore, the estimated value of the target field is often tested against a second data set for evaluation purposes. This paper examines the factors affecting the performance, defined as the error rate produced, of some popular predictive data mining algorithms (decision trees, neural nets, regression, etc.) on many data sets from different sources. These factors include the number of attributes, the type of each field and the number of missing values. Finally, the paper tests whether it is possible to gauge a priori which algorithm(s) will produce the lowest error rate for a given data set. As a result, some heuristic rules are listed to help the decision maker select the best possible technique.
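The evaluation setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names (`dataset_factors`, `holdout_split`, `error_rate`, `train_majority`), the 70/30 split and the majority-class baseline are all assumptions chosen for illustration. The sketch computes the data set factors named above (number of attributes, field types, missing values), splits the data into training and test portions, and measures the error rate on the held-out portion.

```python
import random

def dataset_factors(rows, target):
    """Summarise the data set factors named in the abstract (hypothetical helper)."""
    fields = [f for f in rows[0] if f != target]
    # A missing value is represented here as None (an assumption of this sketch).
    missing = sum(1 for r in rows for f in fields if r[f] is None)
    # A field counts as numeric if every non-missing value is an int or float.
    numeric = [f for f in fields
               if all(isinstance(r[f], (int, float)) for r in rows if r[f] is not None)]
    return {
        "n_attributes": len(fields),
        "n_numeric": len(numeric),
        "n_categorical": len(fields) - len(numeric),
        "n_missing": missing,
    }

def holdout_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and split it into training and test portions."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def error_rate(predict, test_rows, target):
    """Fraction of test rows whose predicted target differs from the true value."""
    wrong = sum(1 for r in test_rows if predict(r) != r[target])
    return wrong / len(test_rows)

def train_majority(train_rows, target):
    """Baseline stand-in for a real algorithm: always predict the most common label."""
    counts = {}
    for r in train_rows:
        counts[r[target]] = counts.get(r[target], 0) + 1
    majority = max(counts, key=counts.get)
    return lambda row: majority
```

In a comparison like the one the paper describes, each candidate algorithm would take the place of `train_majority`, and the resulting error rates would be recorded alongside the output of `dataset_factors` for every data set.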