A major bottleneck in realizing the full translational potential of Electronic Health Record data is the lack of precise labels for many diseases. Though 'silver-standard' proxies such as ICD-9 billing codes are often used in place of true labels, they frequently suffer from poor specificity. On the other hand, supervised phenotyping approaches require intensive expert effort to manually annotate disease labels via chart review. Previous studies have introduced unsupervised machine learning algorithms that use silver-standard proxies to infer labels. However, these methods consider only a single disease at a time and hence perform suboptimally when the goal is to phenotype multiple diseases simultaneously, as in phenome-wide association studies (PheWAS). Here we introduce sureLDA, an unsupervised multi-disease prediction algorithm that combines clustering and topic modeling methods to infer disease probabilities simultaneously, accurately, and at scale. As we demonstrate, sureLDA outperforms existing unsupervised phenotyping methods and even performs comparably to supervised learning with several hundred labels.

Learning Objectives: After participating in this session, the learner should be better able to:

1) Understand how to use sureLDA to infer disease phenotypes from Electronic Health Record data.
2) Understand the strengths and shortcomings of this method for different data characteristics.


Yuri Ahuja (Presenter)
Harvard University

Doudou Zhou, Harvard University
Zeling He, Harvard University
Jiehuan Sun, Harvard University
Victor Castro, Partners Healthcare
Vivian Gainer, Partners Healthcare
Shawn Murphy, Partners Healthcare
Chuan Hong, Harvard University
Tianxi Cai, Harvard University

Presentation Materials: