Robust High-Dimensional Mean Estimation with Low Data Size: An Empirical Study

Presenter: Cullen Anderson

Faculty Sponsor: David Mix Barrington

School: UMass Amherst

Research Area: Computer Science

Session: Poster Session 4, 2:15 PM - 3:00 PM, Auditorium, A80

ABSTRACT

Robust statistics aims to compute summary quantities for data in which a fraction of the points may be arbitrarily corrupted. The most fundamental such statistic is the mean, and recent years have seen a flurry of theoretical advances in efficiently estimating the mean of corrupted data in high dimensions. While several proposed algorithms achieve near-optimal error, they all require data sizes that grow as a function of the dimension. In this work, we perform extensive experiments on various mean estimation techniques in settings where, due to the high dimension, the data size may not meet this requirement.

For data whose inliers are generated from a Gaussian with known covariance, we find experimentally that several robust mean estimation techniques can practically improve upon the sample mean. However, this consistent improvement depends on a couple of simple modifications to the outlier-pruning steps, both in the high-dimension, low-data setting and when the inliers deviate significantly from Gaussianity. In fact, with these modifications, these techniques typically achieve roughly the same error as the sample mean of the uncorrupted inlier data, even at very low data sizes. In addition to controlled experiments on synthetic data, we also explore these methods on large language models, deep pretrained image models, and non-contextual word embedding models, whose representations do not necessarily follow a Gaussian distribution. We show both the challenges of achieving this goal and that our updated robust mean estimation methods can provide significant improvement over using the sample mean alone.
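To make the setting concrete, the following is a minimal sketch (not the study's actual implementation) of the filter-style pruning that the above techniques are built on, for inliers assumed drawn from N(mu, I): repeatedly find the direction of largest empirical variance and, if that variance greatly exceeds the known inlier variance of 1, prune the points that project farthest from the mean along it. The function name `filtered_mean`, the pruning batch size, and the stopping threshold `tol` are illustrative choices; in particular, `tol` must account for the eigenvalue inflation of an empirical covariance when the data size is small relative to the dimension.

```python
import numpy as np

def filtered_mean(X, eps, max_iters=20, tol=2.5):
    """Illustrative filter-based robust mean sketch (inliers ~ N(mu, I)).

    eps: assumed upper bound on the corrupted fraction.
    tol: top-eigenvalue threshold below which we stop pruning; chosen
         loosely above 1 to tolerate finite-sample eigenvalue inflation.
    """
    X = np.asarray(X, dtype=float)
    for _ in range(max_iters):
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        w, V = np.linalg.eigh(cov)        # eigenvalues in ascending order
        if w[-1] <= tol:                  # no direction looks corrupted
            break
        v = V[:, -1]                      # top principal direction
        scores = np.abs((X - mu) @ v)     # deviation along that direction
        k = max(1, int(eps * len(X) / 2)) # prune a small batch per round
        X = X[np.argsort(scores)[:-k]]    # drop the k most extreme points
    return X.mean(axis=0)
```

On synthetic data with a 10% cluster of planted outliers, this sketch recovers an estimate much closer to the true mean than the raw sample mean, which is pulled toward the outlier cluster.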
