Article ID: | iaor20072633 |
Country: | United Kingdom |
Volume: | 33 |
Issue: | 11 |
Start Page Number: | 3088 |
End Page Number: | 3106 |
Publication Date: | Nov 2006 |
Journal: | Computers and Operations Research |
Authors: | Yang Jaekyung, Olafsson Sigurdur |
Keywords: | heuristics, datamining |
Preprocessing the data to filter out redundant and irrelevant features is one of the most important steps in the data mining process. Careful feature selection may improve both the computational time of inducing subsequent models and the quality of those models. Using fewer features often leads to simpler and easier to interpret models, and selecting important feature can lead to important insights into the application. The feature selection problem is inherently a combinatorial optimization problem. This paper builds on a metaheuristic called the nested partitions method that has been shown to be particularly effective for the feature selection problem. Specifically, we focus on the scalability of the method and show that its performance is vastly improved by incorporating random sampling of instances. Furthermore, we develop an adaptive variant of the algorithm that dynamically determines the required sample rate. The adaptive algorithm is shown to perform very well when applied to a set of standard test problems.