Automatic Concept Indexing and Classification for Improved Retrieval in the Hazardous Substances Data Bank

Doszkocs, Tamas; Chang, Hua Florence; Aronson, Alan; Thomas, Phillip

National Library of Medicine 

Wilder, Dean, MSD, Inc. 

Zamora, Antonio, Consultant 

Workshop on Operational Text Classification Systems, ACM/SIGIR 2001  
New Orleans, Louisiana, USA  
September 13, 2001 

 

We are implementing an operational automatic indexing, classification and retrieval system for the Hazardous Substances Data Bank (HSDB) subset [1] of the TOXNET system [2] at the National Library of Medicine (NLM).  

Our effort exemplifies several of this Workshop’s focus areas: integrating automated indexing and classification systems with pre-existing software and organizational procedures; automatically assigning textual data to manually organized and maintained subject headings and classes; and leveraging diverse R&D efforts, tools and resources to increase retrieval effectiveness and ease of use in an interdisciplinary, heterogeneous operational search environment [3].

While the TOXNET retrieval system (http://toxnet.nlm.nih.gov/) is an advanced search system with proximity-based retrieval and ranked output capabilities, it does not currently employ any automatic phrase indexing techniques. Nor does it attempt to map text content to important NLM classification, taxonomy and ontology resources, such as Medical Subject Headings (MeSH) http://www/pubs/factsheets/mesh.html, the Unified Medical Language System (UMLS) http://www/pubs/factsheets/umls.html and the UMLS Semantic Network http://www/pubs/factsheets/umlssemn.html.

In the HSDB Automatic Indexing and Classification Project we are reusing and extending the algorithms and software developed as part of NLM’s Indexing Initiative [4][5][6]. We map appropriate HSDB text fields, e.g. Human Health Effects, to MeSH and classify the resulting MeSH indexing terms into the semantic hierarchies of the UMLS Semantic Network. Because the multidisciplinary scope of the HSDB database extends beyond the traditional boundaries of medicine, we also plan to use a natural language parser to augment the automatic MeSH indexing with natural language phrases. In addition, we are experimenting with novel user interface and display options that combine proximity-based ranked retrieval with the chemical and topical dimensions implicit in user queries.
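
The two indexing steps described above (mapping field text to subject headings, then grouping those headings by semantic type) can be illustrated with a minimal Python sketch. It is not the Indexing Initiative software or MetaMap: it assumes a tiny hand-made lexicon and plain phrase matching, and the lexicon entries, the sample sentence and the function names index_text and group_by_semantic_type are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon: phrase -> (MeSH-style heading, semantic type).
# Illustrative entries only; they have not been verified against MeSH or the UMLS.
TOY_LEXICON = {
    "benzene": ("Benzene", "Organic Chemical"),
    "liver damage": ("Liver Diseases", "Disease or Syndrome"),
    "skin irritation": ("Dermatitis", "Disease or Syndrome"),
}


def index_text(text, lexicon=TOY_LEXICON):
    """Return (phrase, heading, semantic_type) for each lexicon phrase found in text."""
    # Lowercase and collapse whitespace so matching ignores case and line breaks.
    normalized = re.sub(r"\s+", " ", text.lower())
    hits = []
    # Iterate in alphabetical order so the output order is deterministic.
    for phrase in sorted(lexicon):
        # Word boundaries prevent matches inside longer words.
        if re.search(r"\b" + re.escape(phrase) + r"\b", normalized):
            heading, semantic_type = lexicon[phrase]
            hits.append((phrase, heading, semantic_type))
    return hits


def group_by_semantic_type(hits):
    """Group the assigned headings under their semantic type, without duplicates."""
    groups = defaultdict(list)
    for _phrase, heading, semantic_type in hits:
        if heading not in groups[semantic_type]:
            groups[semantic_type].append(heading)
    return dict(groups)


# Hypothetical text standing in for a Human Health Effects field.
sample = "Chronic exposure to benzene may cause liver damage and skin irritation."
hits = index_text(sample)
groups = group_by_semantic_type(hits)
for semantic_type, headings in groups.items():
    print(semantic_type, "->", ", ".join(headings))
```

A production system would replace the dictionary lookup with a full concept-mapping step and would have to handle lexical variants, ambiguity and candidate ranking, none of which this sketch attempts.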

References 

  1. Fonger, George Charles; “Hazardous Substances Data Bank (HSDB) as a source of environmental fate information on chemicals”, Toxicology 103 (1995) 137-145
  2. Wexler, P.; “TOXNET: An evolving web resource for toxicology and environmental health information”, Toxicology, Vol. 157, Nos. 1-2, 1-10, January 2001            
  3. Doszkocs, Tamas; Thomas, Phillip and Wilder, Dean, “Implementing Multiple Interfaces for a Diverse User Community in a Heterogeneous Web Database Environment”, LITA National Forum 2000, http://www.lita.org/forumY2K/Doszkocs/index.htm
  4. Nelson, Stuart J.; Aronson, Alan; Doszkocs, Tamas; Wilbur, John; Bodenreider, Olivier; Chang, Hua Florence; Mork, James; McCray, Alexa.
    Automated Assignment of Medical Subject Headings. 
    Poster presentation at: AMIA 1999 Annual Symp.; 1999 Nov 9; Washington, DC. http://www.amia.org/pubs/symposia/D005608.PDF      
  5. Aronson AR, Bodenreider O, Chang HF, Humphrey SM, Mork JG, Nelson SJ, Rindflesch TC and Wilbur WJ. The NLM Indexing Initiative. Proc AMIA Symp 2000(20 Suppl):17-21.
  6. Aronson AR. Effective Mapping of Biomedical Text to the UMLS Metathesaurus: The MetaMap Program. Proc AMIA Symp 2001, to appear.