MEDLINE is the National Library of Medicine's premier bibliographic database for biomedical literature. A highly valuable feature of the database is that each record is manually indexed with a controlled vocabulary called MeSH. Most MEDLINE journals are indexed cover-to-cover, but there are about 200 selectively indexed journals for which only articles related to biomedicine and life sciences are indexed. In recent years, the selection process has become an increasing burden for indexing staff. This paper presents a machine learning-based system that offers substantial time savings by semi-automating the task. At the core of the system is a high-recall classifier that identifies journal articles that are in-scope for MEDLINE. The system is shown to reduce the number of articles requiring manual review by 54%, equivalent to approximately 40,000 articles per year.
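One way to picture how a high-recall classifier can translate into fewer manual reviews is sketched below. This is an illustration only, not the system described here: it assumes a hypothetical classifier that outputs an in-scope score per article, and the function names, scores, and labels are invented for the example. The sketch picks the highest score threshold that still meets a target recall on known in-scope articles, then measures the share of articles that fall below it and would therefore skip manual review.

```python
import math


def pick_threshold(scores, labels, target_recall):
    """Return the highest threshold whose recall on in-scope articles meets target_recall.

    scores: hypothetical classifier scores (higher means more likely in-scope).
    labels: 1 for in-scope, 0 for out-of-scope.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one in-scope example")
    # Number of in-scope articles that must score at or above the threshold.
    needed = max(math.ceil(target_recall * len(positives)), 1)
    return positives[needed - 1]


def review_reduction(scores, threshold):
    """Fraction of articles scoring below the threshold, i.e. not sent to manual review."""
    skipped = sum(1 for s in scores if s < threshold)
    return skipped / len(scores)


# Toy validation data, invented for illustration.
scores = [0.95, 0.90, 0.80, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

threshold = pick_threshold(scores, labels, target_recall=1.0)
reduction = review_reduction(scores, threshold)
print(f"threshold={threshold}, review reduction={reduction:.0%}")
```

On this toy data, keeping every in-scope article (recall of 1.0) puts the threshold at 0.4 and removes half of the articles from manual review. Lowering the target recall raises the threshold and the reduction, at the cost of missing some in-scope articles.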

Learning Objective: After participating in this session, the learner should be better able to:

1) Understand the current issues in selectively indexing scientific journals, and discuss the importance of developing automated approaches to determining whether an article is out-of-scope for MEDLINE.

2) Discuss the strengths of different approaches to automated selection of articles for indexing, and select an appropriate approach for maximizing accuracy or coverage.

3) Understand the importance of accuracy in labeling a scientific publication as out-of-scope for MEDLINE.


Alastair Rae (Presenter)
National Library of Medicine

Max Savery, National Library of Medicine
James Mork, National Library of Medicine
Dina Demner-Fushman, National Library of Medicine
