Classification Technology at LexisNexis
SIGIR 2001 Workshop on Operational Text Classification

Mark Wasson
9443 Springboro Pike
Miamisburg, OH 45342
http://www.lexisnexis.com
mark.wasson@lexisnexis.com

In May 1990, a three-person development team and two data analysts put a new system into production at LexisNexis that would select and classify news documents by topic and add them to "specialized libraries", searchable collections of documents with a broad topic or theme in common, such as Europe Political and Business News or Insurance Law. Customers submitting their own topically related searches within an appropriate library usually saw much higher retrieval accuracy than if they searched the entire set of available documents.

This was the release of the Term-based Topic Identification System (TTI) which was based on a document classification system that I created in 1985 while a student at the University of Iowa. Shortly after its release, TTI would become the basis for the NEXIS Indexing R&D program. Through a series of system releases, first led by me and then later led by my colleague Mark Shewhart, the NEXIS Indexing program used the classification process to introduce controlled vocabulary term (CVT) indexing for more than 70,000 company, people, organization, place, news, business and general interest topics across data from more than 23,000 sources, including news publications, company reports, legislative documents and Web content. Accuracy rates for recall and precision both exceed 90% for most topics and sources for documents that contain major references, and the system processes tens of thousands of documents daily.

We have a high accuracy, high speed, large scale, multiple source classification system that is cost-effective to use and maintain, based on a classification algorithm that is fundamentally unchanged in the seventeen years since it was created. This talk will focus on the underlying algorithm.
Key components and design aspects in a topic definition (called a "topic request" in TTI and "concept definition" in NEXIS Indexing) include the following:

* Words, phrases and word concepts

All words and phrases are searchable in these tools, as are a handful of punctuation symbols. There are no noise words. A legal term like "insurance interest" is not the same as "insurance as to the interest" but they look the same to a search engine with "as", "to" and "the" as noise words. There is no automatic depluralization, morphological variant generation or thesaurus-based expansion. It is generally up to the person or tool creating a topic definition to specify all and only the terms and term variants they need. The exception? Some entity indexing support tools will automatically generate likely name variants.

A "word concept" is simply a set of terms that behave in essentially the same way with respect to some topic, i.e., they are functional equivalents. In our model, they are treated as the same term for frequency counting and term weighting purposes - it is a common feature or meaning rather than the actual form that is important. A "European Community" topic may have a Countries word concept that includes the names of all the relevant countries. If one article mentions Germany three times and a second article mentions Great Britain, Hungary and Greece once each, in both instances we would count three occurrences of the Countries word concept. The Countries word concept will also have a corresponding weight. Individual terms are not weighted in our model.

* Our use of frequency and weighting (well-known functions like tf-idf term weighting are nowhere to be found)

We use frequency and weighting, but it is at the word concept level rather than the term level. In TTI the topic definition builder can perform iterative trial and error testing to find appropriate word concept weights. The builder also has access to a tool that suggests word concept weights.
The tool is based on stepwise linear regression. Word concepts that look good when tested in isolation sometimes can do more harm than good when combined with other word concepts. In addition to suggesting the best weights, stepwise linear regression helps identify these problem word concepts.

In Company Indexing and other entity indexing releases, we determined the best weights for each word concept and hid those in the software to help automate definition building and optimize system performance.

We allow both positive and negative weights. Boolean AND NOT is often too strong, as the offending term may simply be used in passing or with another meaning. Negative weights are a lot weaker, but they can still prevent problem documents from being tagged.

Of course, we do have equivalents to Boolean OR and AND NOT. E.g., if "President Clinton" is in the headline, we can be pretty safe adding a corresponding controlled vocabulary index term to the document without any other evidence. Similarly, AND NOT can be used to exclude some publication sections from further processing. E.g., do not process the Entertainment section of a publication for articles about the agriculture industry.

* Headlines, leading text and "blocked" terms

When our topic definitions are applied to documents, we can count a term (word concept) more heavily in important parts of documents, such as in the headline, abstract, leading text, case name or company name fields. We can also prevent it from being counted when it is part of a larger, topically unrelated term. E.g., we do not want to count an occurrence of "French" towards a Europe-related topic if it is part of the phrase "French fries".

* Source-independent vs. source-dependent information

TTI was originally designed to support defining topics on a source-by-source basis in order to get its best accuracy.
However, with thousands of sources, that is not a practical approach, so users are able to build definitions that can apply across multiple sources. In NEXIS Indexing, we took the approach that a single topic definition should be able to apply across as many sources as possible.

However, not every source has the same structure. Headlines, leads and datelines are common in news, whereas case names, opinions and courts are common in case law documents. Topic definition and source definition information are maintained separately so that we can keep topic information independent of sources, while we can still exploit the source characteristics and differences across sources.

* Topic scores

With frequency and weighting, we can calculate topic scores, which we normalize on a 100-point scale and then compare to thresholds to determine whether or not the document is about the corresponding topic. If the document is about the topic, we insert an appropriate corresponding controlled vocabulary term (the primary CVT), any appropriate related controlled vocabulary terms (the secondary CVTs), and the normalized topic score. We actually have multiple thresholds to distinguish broad categories of "major reference" to topic, "strong passing reference" or "weak passing reference".

So if our company name classifier determines that the document is about Microsoft Corp., for example, we add a primary CVT (likely the standard form of the company name), some secondary CVTs (perhaps the ticker symbol, relevant industry codes, geographic information) and the topic score. Because a document may be about more than one topic, we will index it for any topic whose score is above the lowest threshold. We can also use topic scores near some threshold as a basis for sending documents off to an editorial staff for further verification.
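
As an illustration only, the counting and scoring ideas described so far (word concepts counted as a unit, positive and negative concept weights, heavier counting in headlines, blocked terms, a 100-point score and multiple thresholds) can be sketched in a few lines of Python. This is not the TTI or NEXIS Indexing code. The linear scoring formula, the normalization, and every name and number in it (WordConcept, HEADLINE_BOOST, SATURATION, the threshold values) are assumptions made for the example; the text above does not specify them.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

HEADLINE_BOOST = 3   # hypothetical: a headline hit counts three times
SATURATION = 30.0    # hypothetical: raw score that maps to 100 points
THRESHOLDS = [       # hypothetical cutoffs on the 100-point scale
    (80.0, "major reference"),
    (50.0, "strong passing reference"),
    (25.0, "weak passing reference"),
]


@dataclass
class WordConcept:
    """A set of functionally equivalent terms sharing one weight."""
    name: str
    terms: List[str]
    weight: float                      # may be negative
    blocked_by: List[str] = field(default_factory=list)


def _pattern(phrase: str) -> str:
    # Whole-word, case-insensitive match (text is lowercased first).
    return r"\b" + re.escape(phrase.lower()) + r"\b"


def count_concept(text: str, concept: WordConcept) -> int:
    """Count occurrences of any term in the concept.

    Occurrences inside a blocking phrase (e.g. "French fries")
    are removed before counting.
    """
    lowered = text.lower()
    for phrase in concept.blocked_by:
        lowered = re.sub(_pattern(phrase), " ", lowered)
    return sum(len(re.findall(_pattern(t), lowered)) for t in concept.terms)


def topic_score(headline: str, body: str,
                concepts: List[WordConcept]) -> float:
    """Weighted concept frequencies, clipped to a 0-100 scale (assumed)."""
    raw = 0.0
    for c in concepts:
        hits = (HEADLINE_BOOST * count_concept(headline, c)
                + count_concept(body, c))
        raw += c.weight * hits
    return max(0.0, min(100.0, 100.0 * raw / SATURATION))


def classify(score: float) -> Optional[str]:
    """Return the highest reference category reached, or None."""
    for cutoff, label in THRESHOLDS:
        if score >= cutoff:
            return label
    return None


countries = WordConcept(
    "Countries", ["Germany", "Great Britain", "Hungary", "Greece"], 2.0)
institutions = WordConcept("Institutions", ["European Commission"], 3.0)
sports = WordConcept("Sports", ["football"], -4.0)

score = topic_score(
    "Germany backs European Commission plan",
    "Germany and Greece agreed. Germany will host talks.",
    [countries, institutions, sports],
)
print(score, classify(score))
```

In this sketch the two articles from the Countries example (Germany three times, or Great Britain, Hungary and Greece once each) both yield a concept count of three, and a negatively weighted concept lowers the score without excluding the document outright, which is the softer alternative to Boolean AND NOT described earlier.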
* The trade-offs between manual support work, machine learning and other automatic support technology

TTI's use of stepwise linear regression is the closest thing we have to anything that resembles machine learning. TTI uses primarily a manual, iterative development process for creating topic definitions. Not including testing - which is necessary for all approaches - it can take an analyst 4-8 hours to build one topic definition.

For Named Entity indexing, most of the topic definition building has been automated - name variants can be easily generated, and characteristics like length and the presence or absence of cue words (e.g., "Corp") are used to determine word concept assignment. For the first 50,000 named entity topic definitions, the average labor cost to build the topic definitions was less than five staff minutes apiece, including manual intervention or override in more than a quarter of the automatically generated topic definitions.

NEXIS Indexing's Topical Indexing approach goes back to its TTI roots to a mostly iterative topic definition creation process - which averages about four staff hours per definition, again not counting final testing.

Numerous tests, from the earliest TTI prototypes in the 1980s through recent Topical Indexing tests for new topic definitions and sources, have routinely achieved recall and precision rates of over 90% for most topics and sources.

Today and for the foreseeable future, TTI and the NEXIS Indexing releases will continue to process tens of thousands of documents a day for each of tens of thousands of topics.

Acknowledgements

In addition to Mark Shewhart, numerous colleagues have helped make these tools a success, including Tom Kresin, Christi Wilson, Chris Anderson, Jill Sellers, Teresa Macgregor, Jeanne Roberts, Bobbi Ketring, Sharon Leigh, Chris White and a number of other software engineers and data analysts.

Related Papers

Leigh, S. (1991). "The Use of Natural Language Processing in the Development of Topic Specific Databases". Proceedings of the Twelfth National Online Meeting.

Wasson, M. (2000). "Large-scale Controlled Vocabulary Indexing for Named Entities". Proceedings of the ANLP-NAACL 2000 Conference.