DMOZ Web Collection


This collection is based on the Russian-language section of the DMOZ catalog and is used as a training set for Web page classification.

The collection consists of sites from the second-level DMOZ categories (starting from World -> Russian) that do not contain explicit copyright notices. To keep the collection size reasonable, no more than 500 pages from each Web site were included. The pages were selected by a breadth-first traversal of each Web site's structure graph, starting from the home page.
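The page selection described above can be illustrated with a minimal sketch. This is not the crawler that was actually used to build the collection. The function name crawl_site and the get_links callable (which maps a URL to the absolute URLs linked from that page) are hypothetical, and treating "same host name" as "same Web site" is an assumption.

```python
from collections import deque
from urllib.parse import urlparse

MAX_PAGES_PER_SITE = 500  # per-site cap stated in the collection description


def crawl_site(home_url, get_links, max_pages=MAX_PAGES_PER_SITE):
    """Return up to max_pages URLs of one site in breadth-first order.

    get_links is a hypothetical callable: URL -> iterable of absolute URLs
    found on that page. Only links on the same host are followed (assumption).
    """
    site = urlparse(home_url).netloc
    visited = {home_url}          # URLs already queued, to avoid repeats
    queue = deque([home_url])     # FIFO queue gives breadth-first order
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        pages.append(url)
        for link in get_links(url):
            if urlparse(link).netloc == site and link not in visited:
                visited.add(link)
                queue.append(link)
    return pages
```

Because the queue is first-in, first-out, pages closer to the home page are taken before deeper ones, so the 500-page cap cuts off the deepest parts of large sites first.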

Dataset Parameters
  • Number of pages: 300 000
  • Number of Web sites: 2 100
  • Encoding: cp1251 (documents in other encodings are considered garbage)
  • Usage: as a training set
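
Since the documents are stored in cp1251, they have to be decoded before processing. The sketch below shows one way to do that. The on-disk layout of the collection is not described here, so the function takes raw bytes, and the name decode_page is hypothetical. Note that decoding alone does not reveal whether a document was originally written in another encoding, because almost any byte sequence is decodable as cp1251.

```python
def decode_page(raw: bytes) -> str:
    """Decode one document of the collection from cp1251 to a Python string.

    Bytes that have no cp1251 mapping are replaced with U+FFFD instead of
    raising an exception.
    """
    return raw.decode("cp1251", errors="replace")


# Usage: the bytes below are the word "Привет" encoded in cp1251.
text = decode_page(b"\xcf\xf0\xe8\xe2\xe5\xf2")
```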
Rights to Use

The copyright holders are the authors of the Web pages. Web sites that prohibit copying of their content were not included in the collection.

This collection is distributed by the program committee only to those who wish to take part in the Web site or Web page classification tracks. To get access to the collection, you must sign the usage agreement.

Tracks in Which the Collection Was Used
  • Classification of Web sites
  • Classification of Web pages