Reuters-21578 is currently the most widely used test collection for text categorization research, though it is likely to be superseded over the next few years by RCV1. The data was originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd. in the course of developing the CONSTRUE text categorization system. Further details, including discussion of previous versions of the collection (e.g. Reuters-22173), are available in the README file.
The collection is available here as a gzipped tar archive (8.2 MB; 28.0 MB uncompressed). The UCI KDD archive also has an entry for the collection, including a copy. The version at UCI is identical, and I encourage you to get the UCI copy if available to save bandwidth at this site. Previous locations of the collection (no longer active) were http://www.research.att.com/~lewis/reuters21578.html and ftp://canberra.cs.umass.edu/pub/reuters.
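For readers who prefer to unpack the gzipped tar archive programmatically, the following is a minimal sketch using Python's standard `tarfile` module. The function name `extract_archive` and any file or directory names passed to it are assumptions made for illustration; they are not taken from this page or from the README.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a gzipped tar archive into dest_dir and return the member names.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        members = archive.getmembers()
        for member in members:
            # Refuse entries that would be written outside the destination.
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(dest, members=members)
    return [member.name for member in members]
```

Assuming the download was saved locally as `reuters21578.tar.gz` (an assumed file name), a call such as `extract_archive("reuters21578.tar.gz", "reuters21578")` would unpack it into a `reuters21578` directory and return the list of files it contained.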
Various researchers have prepared data files useful for work with Reuters-21578. Contact me if you would like me to host such resources here; I am happy to if their disk space requirements are modest. Currently the only such resource available here is a PROLOG fact base about countries contributed by Ronen Feldman.
Return to home page for David D. Lewis