Article ID: | iaor200973331 |
Volume: | 57 |
Issue: | 6 |
Start Page Number: | 1496 |
End Page Number: | 1509 |
Publication Date: | Nov 2009 |
Journal: | Operations Research |
Authors: | Sarkar Sumit, Li Xiao-Bai |
Data-mining techniques can be used not only to study collective behavior about customers, but also to discover private information about individuals. In this study, we demonstrate that decision trees, a popular classification technique for data mining, can be used to effectively reveal individuals' confidential data, even when the identities of the individuals are not present in the data. We propose a novel approach for organizations to protect confidential data from such a classification attack. The key components of this approach include a set of entropy-based measures to evaluate disclosure risks of individual records, an optimal pruning algorithm to identify high-risk records, and a pair of data-swapping procedures to reduce the disclosure risks. The proposed method provides the best trade-off between data utility and privacy protection against classification attacks. It can be applied to data with both numeric and categorical attributes. An experimental study on six real-world data sets shows that the proposed method is very effective in protecting privacy while enabling legitimate data mining and analysis.