Test Collections (Under Construction)

Test collections are standard data sets used to measure the effectiveness of information retrieval systems.  Most were originally developed to support research on IR, but practitioners often find them useful as well.  Here are a few widely used ones:

Reuters-21578 (and Reuters-22173): The most widely used text categorization test collection.

RCV1 (Reuters Corpus Volume 1): A large, high-quality, recently released collection of news stories.  Likely to become the new standard benchmark in text categorization research.

TREC-AP: A text categorization task based on the Associated Press articles used in the NIST TREC evaluations.


Return to home page for David D. Lewis