Automatic Concept Indexing and Classification for Improved Retrieval in the Hazardous Substances Data Bank
Doszkocs, Tamas; Chang, Hua Florence; Aronson, Alan; Thomas, Phillip, National Library of Medicine
Wilder, Dean, MSD, Inc.
Zamora, Antonio, Consultant
Workshop on Operational Text Classification Systems, ACM/SIGIR 2001
New Orleans, Louisiana, USA
September 13, 2001
We are implementing an operational automatic indexing, classification and retrieval system for the Hazardous Substances Data Bank (HSDB) subset [1] of the TOXNET system [2] at the National Library of Medicine (NLM).
Our effort is a typical example of several of this Workshop’s focus areas: integrating automated indexing and classification systems with pre-existing software and organizational procedures; automatically assigning textual data to manually organized and maintained subject headings and classes; and leveraging diverse R&D efforts, tools and resources for increased retrieval effectiveness and ease of use in an interdisciplinary, heterogeneous operational search environment [3].
While the TOXNET retrieval system (http://toxnet.nlm.nih.gov/) is an advanced search system with proximity-based retrieval and ranked output capabilities, it does not currently employ any automatic phrase indexing techniques. Nor does it attempt to map text content to important NLM classification, taxonomy and ontology resources, such as Medical Subject Headings (MeSH) (http://www/pubs/factsheets/mesh.html), the Unified Medical Language System (UMLS) (http://www/pubs/factsheets/umls.html) and the UMLS Semantic Network (http://www/pubs/factsheets/umlssemn.html).
In the HSDB Automatic Indexing and Classification Project we are reusing and extending the algorithms and software developed as part of NLM’s Indexing Initiative [4][5][6]. We map appropriate HSDB text fields, e.g. Human Health Effects, to MeSH and classify the resulting MeSH indexing terms into the semantic hierarchies of the UMLS Semantic Network. Because the multidisciplinary scope of HSDB extends beyond the traditional boundaries of medicine, we also plan to use a natural language parser to augment the automatic MeSH indexing with natural language phrases. In addition, we are experimenting with novel user interface and display options that combine proximity-based ranked retrieval with the chemical and topical dimensions implicit in user queries.
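The two-step idea described above, mapping field text to controlled headings and then grouping those headings by semantic class, can be illustrated with a minimal sketch. This is not the Indexing Initiative software: it assumes a tiny hand-made vocabulary (the hypothetical TOY_VOCAB below, whose entries are illustrative and not the result of an actual MeSH or UMLS lookup) and uses a simple greedy longest-match over word n-grams in place of real concept mapping. The function names index_text and group_by_semantic_type are likewise invented for this example.

```python
import re

# Hypothetical toy vocabulary: phrase -> (heading, semantic class).
# Entries are hand-picked for illustration only; a real system would
# consult MeSH and the UMLS rather than a hard-coded table.
TOY_VOCAB = {
    "liver damage": ("Liver Diseases", "Disease or Syndrome"),
    "skin irritation": ("Dermatitis", "Disease or Syndrome"),
    "benzene": ("Benzene", "Organic Chemical"),
    "headache": ("Headache", "Sign or Symptom"),
}

# Longest phrase length (in words) that the matcher needs to try.
MAX_PHRASE_WORDS = max(len(phrase.split()) for phrase in TOY_VOCAB)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_text(text):
    """Greedy longest-match lookup of word n-grams against TOY_VOCAB.

    Returns a list of (matched phrase, heading, semantic class) tuples
    in the order the phrases occur in the text.
    """
    tokens = tokenize(text)
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_PHRASE_WORDS, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in TOY_VOCAB:
                found.append((phrase,) + TOY_VOCAB[phrase])
                i += n
                break
        else:
            # No vocabulary phrase starts at this token; move on.
            i += 1
    return found


def group_by_semantic_class(matches):
    """Group assigned headings under their semantic class, without duplicates."""
    groups = {}
    for _phrase, heading, sem_class in matches:
        headings = groups.setdefault(sem_class, [])
        if heading not in headings:
            headings.append(heading)
    return groups


if __name__ == "__main__":
    sample = "Chronic exposure to benzene may cause headache and liver damage."
    matches = index_text(sample)
    for sem_class, headings in group_by_semantic_class(matches).items():
        print(sem_class, "->", headings)
```

Grouping headings by semantic class is what would let a display separate the chemical dimension of a record from its topical dimension, for example listing substances apart from health effects. Phrases that fall outside the vocabulary are simply skipped in this sketch, which is the gap the planned natural language parser is meant to address.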