Publication: Impact of Regressand Stratification in Dataset Shift Caused by Cross-Validation
Date
2022-07-19
Authors
Sáez, José A.
Romero-Béjar, José L.
Publisher
MDPI
Abstract
Data that have not been modeled cannot be correctly predicted. Under this assumption, this research studies how k-fold cross-validation can introduce dataset shift in regression problems. Such shift means that the data distributions of the training and test sets differ, which degrades the estimation of model performance. Although stratification of the output variable is widely used in classification to reduce the impact of dataset shift induced by cross-validation, its use in regression is not widespread in the literature. This paper analyzes how including different regressand stratification schemes in cross-validation affects dataset shift in regression data. The results show that these schemes create more similar training and test sets, reducing the dataset shift related to cross-validation. Using the highest numbers of strata improves both the bias and deviation of the performance estimates obtained by regression algorithms and the number of cross-validation repetitions needed to obtain these better results.
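This record does not describe the specific stratification schemes the paper evaluates. As a rough illustration of the general idea only, the sketch below assumes equal-frequency strata built from the sorted target values, with the samples of each stratum dealt at random across the folds. The function name `stratified_regression_folds` and its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np


def stratified_regression_folds(y, n_folds=5, n_strata=10, seed=0):
    """Return a fold index for every sample, stratifying on the target y.

    Assumption (not from the paper): strata are equal-frequency groups of
    consecutive samples in sorted order of y.
    """
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    order = np.argsort(y, kind="stable")  # sample indices sorted by target
    fold_of = np.empty(len(y), dtype=int)

    # Split the sorted indices into n_strata nearly equal-sized strata.
    for stratum in np.array_split(order, n_strata):
        # Shuffle within the stratum, then deal samples to folds round-robin.
        shuffled = rng.permutation(stratum)
        # A random starting fold avoids always giving leftovers to fold 0.
        offset = rng.integers(n_folds)
        fold_of[shuffled] = (np.arange(len(shuffled)) + offset) % n_folds
    return fold_of


# Illustrative data: a skewed target, where plain random folds can differ a lot.
y_demo = np.random.default_rng(1).lognormal(size=200)
folds_demo = stratified_regression_folds(y_demo, n_folds=5, n_strata=10)
for k in range(5):
    print(k, round(float(y_demo[folds_demo == k].mean()), 3))
```

Because every fold receives samples from every stratum, each test fold covers roughly the same range of target values as its training set, which is the property the abstract attributes to regressand stratification. The number of strata is left as a parameter, since the abstract reports that the larger numbers of strata gave the best results.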
MeSH Terms
Spain
Operations Research
Algorithms
Mathematics
DeCS Terms
Algorithms
Spain
Operations research
Mathematics
Keywords
cross-validation, dataset shift, target shift, stratification, regression, covariate shift, model, adaptation, selection, tutorial, networks, tests
Citation
Sáez, J.A.; Romero-Béjar, J.L. Impact of Regressand Stratification in Dataset Shift Caused by Cross-Validation. Mathematics 2022, 10, 14