Many repositories and knowledge bases have been established to ease data sharing. Most of these solutions are domain-specific, and none of them proactively recommends datasets to researchers. Further, the number of datasets added to dataset repositories has increased exponentially over the past two decades. For example, an average of 34 datasets per day were added to the Gene Expression Omnibus (GEO) repository over the five years from 2014 to 2018. This gives only a glimpse of the growing number of datasets being made available online, since there are many other online data repositories as well. Naturally, it is challenging for a researcher to track these repositories for datasets of potential use [2]. The aim of this work is to proactively recommend datasets to researchers who could potentially reuse them, based on the researchers’ profiles. For the experiments, 101,279 datasets (title and summary) available as of April 10, 2019, were selected from GEO. Researchers’ publications (title and abstract) were collected from PubMed; to handle author disambiguation, the collected publications were verified against each researcher’s CV, obtained through a web portal, as shown in Figure 1.
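The matching step between a researcher's publications and dataset descriptions can be illustrated with a minimal sketch. The abstract does not state which similarity method is used, so the TF-IDF weighting and cosine similarity below are assumptions chosen for illustration, and the function names, dataset identifiers, and example texts are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Build one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Raw term count times a smoothed inverse document frequency.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(publications, datasets, top_k=3):
    """Rank datasets by similarity to the researcher's combined publications."""
    # Assumption: the researcher profile is the concatenation of all
    # publication texts (title and abstract).
    profile = " ".join(publications)
    texts = [f"{d['title']} {d['summary']}" for d in datasets]
    vectors = tfidf_vectors([profile] + texts)
    profile_vec, dataset_vecs = vectors[0], vectors[1:]
    scored = [(d["id"], cosine(profile_vec, vec))
              for d, vec in zip(datasets, dataset_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical researcher publications and dataset records (not real GEO data).
publications = [
    "Gene expression profiling of breast cancer tumors",
    "RNA sequencing of breast cancer cell lines",
]
datasets = [
    {"id": "DS-A", "title": "Breast cancer gene expression",
     "summary": "Expression profiling of breast tumors by RNA sequencing"},
    {"id": "DS-B", "title": "Soil microbiome survey",
     "summary": "Metagenomic sampling of agricultural soil bacteria"},
]

ranking = recommend(publications, datasets, top_k=2)
for dataset_id, score in ranking:
    print(dataset_id, round(score, 3))
```

In this toy example the breast cancer dataset ranks above the soil microbiome dataset because it shares far more vocabulary with the publication texts. A real system would likely need domain-aware text representations and a cutoff below which no dataset is recommended.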
Learning Objective: Formulate an approach to recommend datasets to researchers based on their publications.
Braja Patra (Presenter)
The University of Texas Health Science Center at Houston
Kirk Roberts, The University of Texas Health Science Center at Houston
Hulin Wu, The University of Texas Health Science Center at Houston