Test Collections

TREC-AP

The TREC-AP text categorization test collection is derived from proprietary AP news data. See the detailed description below for how to get the CD-ROMs with the data.

Once you have the CD-ROMs, you may want to produce results that are comparable with published results (in particular those in the Lewis, Schapire, Callan, and Papka SIGIR '96 paper). To do that, you will need to use the same training and test sets, and the same definitions of the 20 binary categories. Here are those definitions:

List of ids of all training documents: training.origid (or compressed version training.origid.gz).
List of ids of all test documents: test.origid (or compressed version test.origid.gz).
Twenty lists of ids. Each list contains the ids of all positive instances of one of the 20 categories of documents used in the Lewis, Schapire, Callan, and Papka (LSCP) SIGIR '96 paper: cats20.origid (or compressed version cats20.origid.gz).

Entries in each of these files have the form

lewisid: trecid

where lewisid is the identifier that I used in my experiments, and trecid is the identifier that is actually present in the documents as received on the CD-ROMs. So the trecid's are what you'll need to use to pull the right documents out of the AP files on the CD-ROMs.
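As an illustration of reading these files, here is a minimal sketch in Python. It assumes only the one-entry-per-line `lewisid: trecid` form described above; the function names and the sample ids in the usage note are hypothetical. How the twenty lists are separated inside cats20.origid is not described here, so the sketch does not try to split that file into categories.

```python
import gzip
from typing import Dict, Iterable


def parse_id_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Map lewisid -> trecid from lines of the form 'lewisid: trecid'."""
    mapping: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        # Split at the first colon only; the rest of the line is the trecid.
        lewisid, sep, trecid = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'lewisid: trecid', got {line!r}")
        mapping[lewisid.strip()] = trecid.strip()
    return mapping


def load_id_map(path: str) -> Dict[str, str]:
    """Read a plain or gzip-compressed id file (chosen by the .gz suffix)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        return parse_id_lines(handle)
```

For example, `load_id_map("training.origid.gz")` would return a dictionary whose values are the trecids to look up on the CD-ROMs. With made-up input, `parse_id_lines(["1: AP880212-0001"])` gives `{"1": "AP880212-0001"}`.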

Detailed Description

The TREC-AP text categorization test collection is based on a subset of the AP newswire stories from the TREC/TIPSTER text retrieval test collection. A total of 242,918 AP stories from the years 1988 through 1990 are included in the TREC/TIPSTER data. However, in processing this data, we corrected some formatting anomalies in the stories and screened out certain internal editorial notes. We then selected only those stories that had exactly one <HEAD> field (i.e., the title) and exactly one <TEXT> field (i.e., the body of the article), and that met other well-formedness criteria. The result was a set of 209,783 AP stories that, combined with 20 category definitions, we call the TREC-AP text categorization test collection.
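The one selection criterion that is spelled out above (exactly one <HEAD> and one <TEXT> field) can be sketched as a simple check on the raw markup of a story. This is only an illustration, not the original processing code: the function name is hypothetical, and the formatting corrections and the other well-formedness criteria are not specified here, so they are not reproduced.

```python
def has_single_head_and_text(story: str) -> bool:
    """True if the raw story markup has exactly one <HEAD> and one <TEXT> opening tag."""
    return story.count("<HEAD>") == 1 and story.count("<TEXT>") == 1


# Story counts quoted in the description above.
TOTAL_AP_STORIES = 242_918   # AP stories in the TREC/TIPSTER data
TREC_AP_STORIES = 209_783    # stories kept in TREC-AP
stories_removed = TOTAL_AP_STORIES - TREC_AP_STORIES  # 33,135
```

The difference between the two counts is 33,135 stories, so anyone rebuilding the collection from the CD-ROMs should rely on the id lists rather than on re-deriving the filter.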

The 20 category definitions are based on the presence of certain substrings in the <FIRST> element of the article's header. We give the list of positive examples for each category above, along with the list of all 209,783 documents in the training and test sets.
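To make the idea of a substring-based category definition concrete, here is a hedged sketch. The actual substrings for the 20 categories are not listed on this page, so the mapping passed in is entirely a placeholder; the closing </FIRST> tag and case-sensitive matching are also assumptions. For comparable results, use the published lists of positive examples rather than a reimplementation like this one.

```python
import re
from typing import Dict, List, Set

# Assumes the <FIRST> element is closed by a matching </FIRST> tag.
FIRST_RE = re.compile(r"<FIRST>(.*?)</FIRST>", re.DOTALL)


def categories_for(story: str, substrings: Dict[str, List[str]]) -> Set[str]:
    """Return the categories whose substrings occur in the story's <FIRST> element.

    `substrings` maps a category name to candidate substrings; both are
    hypothetical placeholders, not the real TREC-AP definitions.
    """
    match = FIRST_RE.search(story)
    if match is None:
        return set()  # no <FIRST> element, so no category can match
    first = match.group(1)
    return {
        category
        for category, needles in substrings.items()
        if any(needle in first for needle in needles)  # case-sensitive (assumption)
    }
```

A call such as `categories_for(story_text, {"placeholder-category": ["EXAMPLE"]})` returns the set of placeholder categories whose substrings appear in that story's <FIRST> element.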

Note that results on some or all of the same 20 categories, but using a different, proprietary set of AP documents, have also been published (Lewis & Gale, 1994; Cohen, 1995; Cohen & Singer, 1996; Lewis, 1995b). These results are not comparable with results on TREC-AP. Of papers in 1996 and earlier, only the Lewis, Schapire, Callan, and Papka results in SIGIR '96 are considered "TREC-AP" results. (I'd be interested in knowing of results subsequent to 1996 that use the TREC-AP data.)

The documents in the TREC-AP collection appear on the TIPSTER Information Retrieval Text Research Collection CD-ROMs, Volumes 1 to 3, March 1994 revision. The CD-ROMs are available to TREC participants and are also distributed by the Linguistic Data Consortium. For information on the TREC-AP data preparation, contact Dave Lewis.
