18 Dec
15:00 - 17:00

Flexible Parsimonious Models for Complex Data and the Challenge of Rare Words

Researchers throughout academia, industry, and government are generating data at scales and levels of complexity far beyond what could previously have been imagined. Complex data demand statistical models that are sufficiently flexible to adapt to meaningful, underlying signals, allowing scientists to discover unexpected patterns. Yet as society relies more heavily on statistical algorithms to make decisions impacting everyday life, it becomes increasingly important for a method's output to be interpretable by non-experts. This demands parsimony: that simpler explanations be favored over more complicated ones.

This talk will begin with an overview of flexible and interpretable statistical modeling. We will see a series of complex data modeling tasks for which computationally efficient, data-adaptive procedures can be designed to yield easily understandable outputs. These tasks include forecasting high-dimensional time series and learning large networks that capture the dependence among large ensembles of measured variables.

The second half of the talk will examine in depth one particular challenge that arises in building prediction models based on text data. In typical data sets, most words appear in only a very small fraction of documents. Such "rare words" are usually discarded by analysts -- either explicitly in a preprocessing step or implicitly by using a method that cannot make use of these words. We argue that such a practice can be highly wasteful: rare words can contain much useful information for a prediction task. In fact, this problem of rareness occurs in many domains beyond text processing, including the study of the microbiome, where many microbial species are rarely observed. The challenge posed by such "rare features" has received little attention despite its prevalence. We show, both theoretically and empirically, that failing to account explicitly for the rareness of features can greatly reduce the effectiveness of an analysis. We then propose a framework for aggregating rare features into denser features in a flexible manner that creates better predictors of the response. An application to online hotel reviews demonstrates the gain in accuracy achievable by proper treatment of rare words.

The first D3M research theme workshop, given by Jacob Bien (University of Southern California), will take place on Tuesday, 18 December, 15:00-17:00, at SBE, TS53 C-1.03 (Colloquium room 1).