The Giant Hole in the Biomedical Thesaurus

By Rob Mitchum // October 7, 2014

The thesaurus is an essential tool for writers, a handy reference when they need to expand their vocabulary. But thesauri are also important for computational text analysis, where automated techniques need information about words with similar meanings to properly categorize text. In scientific fields filled with jargon, where synonyms may not be as rigorously documented as in other corners of the English language, the lack of a comprehensive thesaurus could hinder efforts to reveal new connections and discoveries in millions of published research papers.

A new paper in PLOS Computational Biology, co-authored by the CI’s Svetlozar Nestorov, James Evans, and Andrey Rzhetsky, examines the severity of this issue and proposes an ambitious solution, “not unlike the sequencing of the human genome in scale and importance.” By adapting a statistical framework used by ecologists to estimate missing species, the team (which also included David Blair and Kanix Wang of the Institute for Genomics and Systems Biology) found that more than 9 out of 10 biomedical synonyms are missing from existing references, limiting the progress of promising text analyses, such as those underway at Rzhetsky’s Conte Center for Computational Neuropsychiatric Genomics and Evans’ Knowledge Lab.

Natural language processing is the use of algorithms to extract structure and knowledge from free text. Some natural language processing methods don’t need synonyms — named-entity recognition (NER), which detects the presence of specific terms in a large body of text, does not need to know the meanings of the words it seeks. But another method, called named-entity normalization (NEN), needs to know when two or more words mean the same thing to find, for example, new off-label uses for drugs or novel genetic associations.
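The difference between the two methods can be made concrete with a minimal Python sketch. This is an illustration only, not the tooling used in the paper: the toy lexicon, the concept identifiers (C001, C002) and the function names are all hypothetical.

```python
import re

# Hypothetical toy lexicon: surface term -> concept identifier (IDs are made up).
LEXICON = {
    "heart attack": "C001",
    "myocardial infarction": "C001",
    "aspirin": "C002",
    "acetylsalicylic acid": "C002",
}

def recognize(text, terms):
    """NER-style step: report which known terms occur in the text."""
    lowered = text.lower()
    found = []
    for term in terms:
        # Whole-word match, so "mi" would not fire inside "mice", for example.
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.append(term)
    return sorted(found)

def normalize(mentions, lexicon):
    """NEN-style step: map each mention to its concept identifier."""
    return {mention: lexicon[mention] for mention in mentions}

doc_a = "Aspirin lowers the risk of a second heart attack."
doc_b = "Acetylsalicylic acid is given after myocardial infarction."

mentions_a = recognize(doc_a, LEXICON)
mentions_b = recognize(doc_b, LEXICON)
concepts_a = set(normalize(mentions_a, LEXICON).values())
concepts_b = set(normalize(mentions_b, LEXICON).values())

print(mentions_a, mentions_b)    # the two documents share no surface terms
print(concepts_a == concepts_b)  # yet they refer to the same concepts
```

Recognition alone treats the two sentences as unrelated, because they share no terms. Only the synonym-aware mapping reveals that both link the same drug to the same condition, and that mapping is only as good as the thesaurus behind it.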

At least, that’s the theory, but the researchers first set out to quantitatively measure the benefits of documenting synonyms. Using the Unified Medical Language System (UMLS) Metathesaurus, a publicly available biomedical thesaurus, the authors built dictionaries of “Diseases and Syndromes” and “Pharmacological Substances,” with and without synonyms, then measured the performance of text-mining algorithms trained with these dictionaries on medical and scientific text. Including the synonyms improved retrieval of relevant terms by roughly 30 percent, demonstrating their value for these techniques.
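A toy version of that with-and-without comparison might look like the sketch below. It swaps the trained text-mining algorithms for a plain dictionary lookup, and the mentions and dictionaries are invented for illustration, so the numbers do not reproduce the paper's 30 percent figure.

```python
# Hypothetical gold-standard mentions of one disease concept in a corpus.
gold_mentions = [
    "myocardial infarction",
    "heart attack",
    "heart attack",
    "MI",
    "myocardial infarction",
    "cardiac infarction",  # a synonym that neither dictionary documents
]

# Two hypothetical dictionaries: preferred name only, and with known synonyms.
preferred_only = {"myocardial infarction"}
with_synonyms = preferred_only | {"heart attack", "mi"}

def recall(mentions, dictionary):
    """Fraction of gold mentions that the dictionary can match."""
    hits = sum(1 for mention in mentions if mention.lower() in dictionary)
    return hits / len(mentions)

recall_without = recall(gold_mentions, preferred_only)
recall_with = recall(gold_mentions, with_synonyms)

print(f"recall without synonyms: {recall_without:.2f}")
print(f"recall with synonyms:    {recall_with:.2f}")
```

Even the synonym-enriched dictionary misses "cardiac infarction" here, a small-scale picture of the undocumented synonyms the rest of the study tries to count.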

That raised another question: if synonymy is important, just how comprehensive are currently available biomedical thesauri? A comparison of several such references found very little overlap in terminology, suggesting that a significant proportion of synonyms remain undocumented. To estimate just how many are missing, the researchers borrowed a model from an unlikely source: ecology.
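One simple way to quantify overlap between references is the Jaccard index: the number of shared entries divided by the number of entries in either set. The article does not say which overlap measure the authors used, so the sketch below is an assumption, and the three thesauri and their contents are hypothetical.

```python
from itertools import combinations

def pair(term_a, term_b):
    """Order-independent representation of a synonym relationship."""
    return frozenset((term_a, term_b))

# Hypothetical synonym relationships recorded by three thesauri.
thesauri = {
    "A": {pair("aspirin", "acetylsalicylic acid"),
          pair("heart attack", "myocardial infarction")},
    "B": {pair("acetylsalicylic acid", "aspirin"),
          pair("stroke", "cerebrovascular accident")},
    "C": {pair("flu", "influenza")},
}

def jaccard(a, b):
    """Shared entries divided by all entries found in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

overlaps = {
    (x, y): jaccard(thesauri[x], thesauri[y])
    for x, y in combinations(sorted(thesauri), 2)
}

for names, score in overlaps.items():
    print(names, round(score, 2))
```

Values near zero across most pairs of references would suggest that each one captures a different small slice of a much larger pool of synonyms.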

When ecologists attempt to take a census of the species in a particular ecosystem, they use statistics to estimate how many species went uncounted in their observations. The researchers adapted this model to look for missing synonyms, finding that the currently available thesauri miss at least 90 percent of potential synonymous relationships in biomedical text. For certain sub-topics, such as Pharmacological Substances, 99 of 100 synonyms are absent from current resources.
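The article does not describe the exact model the authors adapted. To show the general idea, the sketch below uses the bias-corrected Chao1 estimator, a standard species-richness estimator from ecology, as a stand-in. It infers how many items were never seen from how many were seen only once or twice. The sample of synonym sightings is hypothetical.

```python
from collections import Counter

def chao1(observations):
    """Bias-corrected Chao1 estimate of the total number of distinct items."""
    counts = Counter(observations)
    s_obs = len(counts)                               # distinct items observed
    f1 = sum(1 for c in counts.values() if c == 1)    # items seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)    # items seen exactly twice
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

# Hypothetical sample: each entry is one sighting of a synonym in text.
sightings = [
    "syn_a", "syn_a", "syn_a",
    "syn_b", "syn_b",
    "syn_c", "syn_c",
    "syn_d", "syn_e", "syn_f", "syn_g",
]

observed = len(set(sightings))
estimated_total = chao1(sightings)
missing_fraction = (estimated_total - observed) / estimated_total

print(observed, estimated_total, round(missing_fraction, 2))
```

The intuition carries over from species to synonyms: when many items show up only once in a sample, it is likely that many more were missed entirely.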

To fill these important gaps, the authors propose a broad new initiative to more comprehensively catalog biomedical terminology, akin to similar efforts in the genetics community to gather and curate functional gene networks. A combination of automated algorithms and user input could create perennially updated references that both cover more synonyms and evolve alongside changing trends in biomedicine.

“These lexical resources must move well beyond fixed dictionaries of manually curated annotations,” the authors write. “Instead, they should become ‘living’ databases, constantly evolving and expanding like search engines that index the economy of the changing web.”

But this is no small task, they warn, in terms of either the effort required or the potential size of the result. For instance, according to the missing synonyms model, the “complete” Pharmacologic Substances terminology alone should include some 2.5 million concepts and 8 million synonyms, ten times the number of definitions in the most recent edition of Merriam-Webster’s Collegiate Dictionary.

“The development of this technology is not unlike the sequencing of the human genome in scale and importance,” the authors write. “A vast library of linguistic relationships among an ever expanding collection of words and phrases would allow a quantum leap in machine reading, understanding and intelligence, with applications relevant not only to biomedicine but all fields of science and scholarship.”