Enhancing PubMed: From a Medical Database and Beyond
An introduction to tools supporting PubMed
Introduction
PubMed is a free web-based literature search service developed and maintained by the National Center for Biotechnology Information (NCBI). It is part of NCBI’s Entrez retrieval system, which provides access to a diverse set of 38 databases. PubMed currently includes citations and abstracts for biomedical articles from over 5000 life science journals, dating back to 1948. Since its inception, PubMed has served as the primary tool for electronically searching and retrieving biomedical literature. Every day, users worldwide issue millions of queries, relying on it to stay up to date with the latest developments and to make discoveries in their respective fields.
In this post, we introduce three efforts by different research groups to make PubMed more useful: an experimental search platform, a knowledge graph, and a topic model.
PubMed Labs
To keep pace with the exponential growth of the biomedical literature, the NCBI at the National Library of Medicine has recently developed PubMed Labs, an experimental platform where users can test new features and tools and provide feedback. That feedback helps NCBI make more informed decisions about potential changes to the search quality and overall usability of PubMed.
PubMed Labs is implemented as a standalone service, separate from the production operation of PubMed, so that it does not disrupt the routine information seeking of current PubMed users. It has several features that distinguish it from PubMed. For example, given a free-text query, PubMed Labs sorts search results by Best Match so that the most pertinent articles appear first, whereas the default sort order in PubMed is Most Recent. PubMed Labs also has a more modern user interface that makes relevant content easier to discover.
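To make the difference between the two sort orders concrete, here is a minimal sketch that builds search URLs for NCBI's public E-utilities ESearch endpoint. It does not talk to PubMed Labs itself, and it only constructs the URL rather than sending a request. The helper name `build_search_url` is our own, and we are assuming that `sort=relevance` is the closest E-utilities counterpart of Best Match and `sort=pub_date` of a date-based ordering.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Public NCBI E-utilities ESearch endpoint (not the PubMed Labs service).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(query: str, sort: str = "relevance", retmax: int = 20) -> str:
    """Build an ESearch URL for a free-text PubMed query.

    Assumption: sort="relevance" requests a relevance-ranked ordering,
    while sort="pub_date" orders results by publication date.
    """
    params = {
        "db": "pubmed",      # search the PubMed database
        "term": query,       # the free-text query
        "sort": sort,        # ordering of the returned IDs
        "retmax": retmax,    # maximum number of IDs to return
        "retmode": "json",   # ask for a JSON response
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Same query, two orderings.
    print(build_search_url("breast cancer BRCA1"))
    print(build_search_url("breast cancer BRCA1", sort="pub_date"))
```

Fetching either URL with any HTTP client would return a list of PubMed IDs. Comparing the two lists for the same query is a quick way to see how much the ordering changes what a user reads first.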
PubMed Knowledge Graph
In healthcare and medicine, experts usually communicate using medical jargon, which is compiled and stored in computer-processable collections such as SNOMED CT, ICD-10, PubChem, and the Gene Ontology. These medical terms (e.g., genes, drugs, proteins, species, and mutations) are the backbone of quality healthcare.
Conventional text mining tools struggle to accurately interpret and extract information from the intricate, domain-specific language of the medical literature. The complexity and specificity of medical terms, and the many ways a single term can be expressed, pose significant challenges for traditional approaches. As a result, these tools may fail to return precise and comprehensive results, which makes literature searches in healthcare and medicine more cumbersome. Many studies have therefore been devoted to building open-access datasets that address these bio-entity recognition problems.
One way to build such a dataset is to integrate several sources into a single comprehensive one that captures bio-entities, disambiguated authors (that is, authors who share the same or similar names are told apart), funding, and affiliation information from the literature. Following this idea, a knowledge graph called the “PubMed Knowledge Graph” (PKG) was generated by extracting bio-entities from 29 million PubMed abstracts, disambiguating author names, integrating funding data through the National Institutes of Health (NIH) ExPORTER, collecting the affiliation history and educational background of authors from ORCID, and identifying fine-grained affiliation data from MapAffil.
As an open dataset, PKG contains rich information that is ready to use, which makes it easier to develop applications such as finding experts, searching bio-entities, analysing scholarly impact, and profiling scientists’ careers.
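To give a feel for how linking these sources supports applications such as expert finding, here is a toy sketch of a knowledge graph stored as (subject, relation, object) triples. Everything in it is hypothetical: the class `TinyKnowledgeGraph`, the relation names, and the records are made up for illustration and do not reflect PKG's actual schema or data.

```python
from collections import defaultdict


class TinyKnowledgeGraph:
    """A toy triple store of (subject, relation, object) facts."""

    def __init__(self):
        self.triples = set()
        self.by_subject = defaultdict(set)
        self.by_object = defaultdict(set)

    def add(self, subject, relation, obj):
        triple = (subject, relation, obj)
        self.triples.add(triple)
        self.by_subject[subject].add(triple)
        self.by_object[obj].add(triple)

    def subjects(self, relation, obj):
        """All subjects linked to `obj` by `relation`."""
        return sorted(s for s, r, _ in self.by_object[obj] if r == relation)

    def objects(self, subject, relation):
        """All objects linked from `subject` by `relation`."""
        return sorted(o for _, r, o in self.by_subject[subject] if r == relation)


def experts_on(graph, entity):
    """Rank authors by how many of their papers mention `entity`."""
    counts = defaultdict(int)
    for paper in graph.subjects("mentions", entity):
        for author in graph.objects(paper, "written_by"):
            counts[author] += 1
    # Most papers first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up records, for illustration only.
kg = TinyKnowledgeGraph()
kg.add("paper:1", "mentions", "gene:BRCA1")
kg.add("paper:1", "written_by", "author:A")
kg.add("paper:1", "funded_by", "grant:G1")
kg.add("paper:2", "mentions", "gene:BRCA1")
kg.add("paper:2", "written_by", "author:A")
kg.add("paper:2", "written_by", "author:B")
kg.add("paper:3", "mentions", "drug:aspirin")
kg.add("paper:3", "written_by", "author:B")
kg.add("author:A", "affiliated_with", "org:X")

print(experts_on(kg, "gene:BRCA1"))  # [('author:A', 2), ('author:B', 1)]
```

The expert query only works because bio-entity mentions and disambiguated authors live in the same graph. If two different people named "author:A" were merged, or one person were split into two nodes, the ranking would be wrong, which is why author disambiguation matters so much here.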
Bio-LDA
Statistical modelling techniques such as Latent Dirichlet Allocation (LDA) make it possible to automatically identify topics in large document collections and corpora. LDA models a text corpus as a collection of bags of words. It assumes that people write an article with several topics in mind, and that each topic is associated with its own probability distribution over a fixed set of words. A collection of documents can then be seen as generated from the same set of topics, with a different mixture of those topics for each document. While previous applications of LDA in the biomedical domain have yielded several benefits, few have considered extending the LDA model to include bio-terms.
The Bio-LDA model extends the classic LDA model by incorporating bio-terms as input variables. To train the model, a large number of bio-terms are extracted from the biomedical journal articles indexed in the PubMed database.
Experimental results show that Bio-LDA can automatically derive topics made up of related biological terms that map to clearly understandable biological themes. The topics it produces are also notably succinct in identifying the bio-terms associated with particular topic areas.
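The generative story behind classic LDA can be sketched in a few lines of Python. This is only an illustration of the assumption described above, not the Bio-LDA model and not a training procedure: the vocabulary, the two topic distributions, and the Dirichlet parameter `alpha` are made-up values, and the function names are our own.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, vocabulary, alpha, length, rng):
    """Generate one bag-of-words document under the LDA generative story."""
    # Each document gets its own mixture over the shared topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word, then a word from that topic.
        z = rng.choices(range(len(topics)), weights=theta, k=1)[0]
        w = rng.choices(vocabulary, weights=topics[z], k=1)[0]
        words.append(w)
    return theta, words


# Hypothetical vocabulary and two hand-written topics (word probabilities).
vocabulary = ["gene", "protein", "mutation", "drug", "dose", "trial"]
topics = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "molecular biology" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "clinical" topic
]

rng = random.Random(0)  # fixed seed so the run is repeatable
theta, words = generate_document(topics, vocabulary, alpha=[0.5, 0.5], length=8, rng=rng)
print("topic mixture:", theta)
print("document:", words)
```

Fitting LDA runs this story in reverse: given only the documents, it infers the topics and the per-document mixtures that most plausibly produced them.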
Conclusion
PubMed, as the primary tool for electronically searching and retrieving biomedical literature, can be made more intelligent in how it is searched, and its contents can also contribute to other fields of research. In this post, we discussed three directions that build on PubMed: an experimental search platform, a knowledge graph, and a topic model. We also pointed to promising avenues for further exploration of these approaches. Through continued investigation and refinement, we believe that work built on PubMed can open up exciting opportunities in the future.
References
- Fiorini, N., Canese, K., Bryzgunov, R., Radetska, I., Gindulyte, A., Latterner, M., Miller, V., Osipov, M., Kholodov, M., Starchenko, G., Kireev, E., & Lu, Z. (2018). PubMed Labs: An experimental platform for improving biomedical literature search (Version 1). arXiv. https://doi.org/10.48550/ARXIV.1806.04004
- Xu, J., Kim, S., Song, M., Jeong, M., Kim, D., Kang, J., Rousseau, J. F., Li, X., Xu, W., Torvik, V. I., Bu, Y., Chen, C., Ebeid, I. A., Li, D., & Ding, Y. (2020). Building a PubMed knowledge graph. In Scientific Data (Vol. 7, Issue 1). Springer Science and Business Media LLC. https://doi.org/10.1038/s41597-020-0543-2
- Wang, H., Ding, Y., Tang, J., Dong, X., He, B., Qiu, J., & Wild, D. J. (2011). Finding Complex Biological Relationships in Recent PubMed Articles Using Bio-LDA. In J. Langowski (Ed.), PLoS ONE (Vol. 6, Issue 3, p. e17243). Public Library of Science (PLoS). https://doi.org/10.1371/journal.pone.0017243