Batch Sample Design from Databases for Logistic Regression


Article ID:	iaor201789
Volume:	33
Issue:	1
Start Page Number:	87
End Page Number:	101
Publication Date:	Feb 2017
Journal:	Quality and Reliability Engineering International
Authors:	Mehrotra Sanjay, Apley Daniel W, Ouyang Liwen
Keywords:	statistics: inference, statistics: general, datamining, experiment, statistics: regression, statistics: sampling

Abstract:

The prevalence of large observational databases offers potential for identifying predictive relationships among variables of interest, although observational data are generally far less informative and less reliable than experimental data. We consider the problem of selecting a subset of records from a large observational database, for the purpose of designing a small but powerful experiment involving the selected records. It is assumed that the database contains the predictor variables but is missing the response variable, and that the purpose is to fit a logistic regression model after the response is obtained via the experiment. Active learning methods, which treat a similar problem, usually select records sequentially and focus on the single objective of classification accuracy. In contrast, many emerging applications require batch sample designs and have a variety of objectives that may include classification accuracy or accuracy of the estimated parameters, the latter being more in line with the optimal design of experiments (DOE) paradigm. The aim of this paper is to explore batch sampling from databases from a DOE perspective, particularly regarding the configuration, performance, and robustness of the designs that result from the different criteria. Through extensive simulation, we show that DOE‐based batch sampling methods can substantially outperform random sampling and the entropy method that is popular in active learning. We also provide insight and guidelines for selecting appropriate design criteria and modeling assumptions.
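
To make the DOE perspective concrete, the following is a minimal sketch of one way a batch could be chosen from a database of predictors before any response is observed. It is not the authors' procedure: the D-optimality criterion, the greedy one-record-at-a-time heuristic, the guessed parameter vector `beta_guess`, the small `ridge` term, and the synthetic data are all assumptions made here for illustration.

```python
import numpy as np

def logistic_weights(X, beta):
    """Fisher information weight p(1 - p) of each record under a guessed beta."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return p * (1.0 - p)

def greedy_d_optimal(X, beta_guess, batch_size, ridge=1e-6):
    """Greedily pick rows of X that increase det of the information matrix most.

    Assumption: beta_guess is a prior guess of the parameters; the true
    values are unknown until responses are collected.
    """
    n, d = X.shape
    w = logistic_weights(X, beta_guess)
    M = ridge * np.eye(d)              # ridge keeps M invertible at the start
    available = np.ones(n, dtype=bool)
    chosen = []
    for _ in range(batch_size):
        M_inv = np.linalg.inv(M)
        # Matrix determinant lemma:
        # det(M + w x x^T) = det(M) * (1 + w * x^T M^{-1} x)
        gain = w * np.einsum("ij,jk,ik->i", X, M_inv, X)
        gain[~available] = -np.inf     # each record can be selected only once
        i = int(np.argmax(gain))
        chosen.append(i)
        available[i] = False
        M += w[i] * np.outer(X[i], X[i])
    return chosen

def log_det_information(X, beta, idx, ridge=1e-6):
    """Log-determinant of the (ridged) information matrix for records idx."""
    w = logistic_weights(X, beta)
    Xs = X[idx]
    M = (Xs * w[idx, None]).T @ Xs + ridge * np.eye(X.shape[1])
    return np.linalg.slogdet(M)[1]

# Synthetic "database": an intercept column plus two predictors (hypothetical).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(2000), rng.normal(size=(2000, 2))])
beta_guess = np.array([0.0, 1.0, -1.0])

design = greedy_d_optimal(X, beta_guess, batch_size=20)
random_idx = rng.choice(2000, size=20, replace=False)

print("log det, greedy D-optimal batch:", log_det_information(X, beta_guess, design))
print("log det, random batch:          ", log_det_information(X, beta_guess, random_idx))
```

A larger log-determinant corresponds to a smaller generalized variance of the parameter estimates under the guessed model, which is the sense in which a designed batch can be more informative than a random one. Because the criterion depends on `beta_guess`, the quality of the selected batch depends on how good that guess is, which is one reason robustness to modeling assumptions matters.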
