ROMIP: Russian Information Retrieval Evaluation Seminar

DMOZ Web Collection

Description

This collection is based on the Russian-language section of the dmoz.org catalog and is used as a training set for Web page classification.

The collection consists of sites listed in the second-level DMOZ categories (counting from World -> Russian) that do not carry explicit copyright notices. To keep the collection size reasonable, no more than 500 pages were included from each Web site. Pages were selected by a breadth-first traversal of each site's link graph, starting from the home page.
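The page-selection rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the crawler actually used to build the collection: the function name `bfs_site_pages`, the `get_links` callback (which stands in for fetching a page and extracting its links), and the rule that a page belongs to the site when its host matches the home page's host are all hypothetical.

```python
from collections import deque
from urllib.parse import urlparse


def bfs_site_pages(home_url, get_links, max_pages=500):
    """Return up to max_pages URLs of one site in breadth-first order.

    get_links(url) is a caller-supplied function (an assumption of this
    sketch) that returns the URLs linked from the given page.
    """
    host = urlparse(home_url).netloc  # assumed definition of "same site"
    seen = {home_url}                 # URLs already queued
    queue = deque([home_url])
    collected = []

    while queue and len(collected) < max_pages:
        url = queue.popleft()
        collected.append(url)
        for link in get_links(url):
            # Stay inside the site and never queue a URL twice.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return collected
```

Because the queue is first-in, first-out, pages closer to the home page are taken first, so the cap cuts off the deepest pages of a large site rather than an arbitrary subset.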

Dataset Parameters
  • Number of pages: 300 000
  • Number of Web sites: 2 100
  • Encoding: cp1251 (documents in other encodings are considered garbage)
  • Usage: as a training set
Rights to Use

The copyright holders are the authors of the Web pages. Web sites which prohibit copying of their content were not included in the collection.

The program committee distributes this collection only to participants in the Web site classification or Web page classification tracks. To get access to the collection, you must sign the usage agreement.

Tracks in Which the Collection Was Used
  • Classification of Web sites
  • Classification of Web pages