Proceedings Paper

Test data reuse for evaluation of adaptive machine learning algorithms: over-fitting to a fixed 'test' dataset and a potential solution
Author(s): Alexej Gossmann; Aria Pezeshk; Berkman Sahiner

Paper Abstract

After the initial release of a machine learning algorithm, the subsequently gathered data can be used to augment the training dataset in order to modify or fine-tune the algorithm. For algorithm performance evaluation that generalizes to a targeted population of cases, test datasets should ideally be drawn at random from that population. To ensure that test results generalize to new data, the algorithm needs to be evaluated on new and independent test data each time a new performance evaluation is required. However, medical test datasets of sufficient quality are often hard to acquire, and it is tempting to reuse a previously used test dataset for a new performance evaluation. With extensive simulation studies, we illustrate how such a "naive" approach to test data reuse can inadvertently result in overfitting the algorithm to the test data, even when only a global performance metric is reported back from the test dataset. The overfitting behavior leads to a loss in generalization and overly optimistic conclusions about the algorithm performance. We investigate the use of the Thresholdout method of Dwork et al. (Ref. 1) to tackle this problem. Thresholdout allows repeated reuse of the same test dataset. It essentially reports a noisy version of the performance metric on the test data, and provides theoretical guarantees on how many times the test dataset can be accessed while ensuring that the reported answers generalize to the underlying distribution. Our simulations show that Thresholdout indeed substantially reduces the problem of overfitting to the test data under the simulation conditions, at the cost of a mild additional uncertainty on the reported test performance. We also extend some of the theoretical guarantees to the area under the ROC curve as the reported performance metric.
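The idea of reporting a noisy version of the test metric can be illustrated with a minimal sketch of a Thresholdout-style query. This is a simplified illustration under stated assumptions, not the implementation used in the paper: the function names, the threshold, and the noise scales are hypothetical, and the budget that limits how many times the test dataset can be accessed is omitted.

```python
import random


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def thresholdout(train_value, test_value, threshold, sigma, rng):
    """Answer one performance query in a simplified Thresholdout style.

    train_value and test_value are the same metric (e.g. accuracy)
    computed on the training and test datasets. The parameter names and
    the noise scales (2 * sigma, sigma) are illustrative assumptions.
    """
    # Compare the train/test gap against a randomly perturbed threshold.
    noisy_threshold = threshold + laplace(2 * sigma, rng)
    if abs(test_value - train_value) > noisy_threshold:
        # The two estimates disagree: report a noisy test-set value.
        return test_value + laplace(sigma, rng)
    # The two estimates agree: report only the training-set value,
    # so nothing about the test dataset is revealed.
    return train_value


rng = random.Random(0)
# Large train/test gap: the reported value is the test metric plus noise.
print(thresholdout(0.95, 0.80, threshold=0.05, sigma=0.01, rng=rng))
# Small gap: the training metric is returned unchanged.
print(thresholdout(0.90, 0.91, threshold=0.05, sigma=0.01, rng=rng))
```

Because the test metric is only revealed, and only in perturbed form, when it differs noticeably from the training metric, an adaptive procedure receives less information it could use to tune itself to the test dataset.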

Paper Details

Date Published: 7 March 2018
PDF: 12 pages
Proc. SPIE 10577, Medical Imaging 2018: Image Perception, Observer Performance, and Technology Assessment, 105770K (7 March 2018); doi: 10.1117/12.2293818
Author Affiliations
Alexej Gossmann, Tulane Univ. (United States)
Aria Pezeshk, U.S. Food and Drug Administration (United States)
Berkman Sahiner, U.S. Food and Drug Administration (United States)

Published in SPIE Proceedings Vol. 10577:
Medical Imaging 2018: Image Perception, Observer Performance, and Technology Assessment
Robert M. Nishikawa; Frank W. Samuelson, Editor(s)

© SPIE