ROMIP: Russian Information Retrieval Evaluation Seminar

Narod.ru Web Collection

Description

The collection is a pseudorandom sample of about 3% of the web sites hosted by narod.ru, the Russian national free hosting provider. Non-HTML documents and pages built with the standard templates provided by narod.ru were excluded from the collection. The collection amounts to about 0.12-0.30% of the whole Russian segment of the Web.

Dataset Parameters
  • Size of HTML data: 7+ GB
  • Number of pages: 728 000+
  • Number of web sites: 22 000
  • Encoding: cp1251 (documents in other encodings are treated as garbage)
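
The encoding parameter above can be illustrated with a minimal Python sketch. It assumes the raw bytes of a document have already been read from the collection; the helper name `decode_cp1251` is hypothetical and is not part of the ROMIP distribution. The sketch shows why documents stored in another encoding end up as garbage when read as cp1251.

```python
def decode_cp1251(raw: bytes) -> str:
    """Decode document bytes as cp1251 (hypothetical helper, not a ROMIP tool).

    errors="replace" substitutes U+FFFD for the few byte values that cp1251
    leaves undefined, so decoding never raises an exception.
    """
    return raw.decode("cp1251", errors="replace")


sample = "Привет"  # illustrative Russian text, not taken from the collection

# Bytes that really are cp1251 round-trip cleanly.
good = decode_cp1251(sample.encode("cp1251"))

# Bytes in another encoding (here UTF-8) decode without error,
# but the result is unreadable text rather than the original string.
garbled = decode_cp1251(sample.encode("utf-8"))

print(good == sample)     # the cp1251 document is recovered
print(garbled == sample)  # the UTF-8 document is not
```

Because nearly every byte value is defined in cp1251, a wrong encoding usually does not cause a decoding error; it silently produces unreadable text, which matches the description of such documents as garbage.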

Rights to Use

Rights to use the Narod.ru collection are granted by Yandex, the owner of the collection. To get access to the collection you must sign the usage agreement.

Data Format

The collection is distributed as XML files in a fixed format. The files are split into two groups: narod.* and narod_training.*. Files in the second group contain the documents that were used as the training set in the Web page classification track.
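
As a rough sketch of how the two file groups could be told apart by name, the following Python snippet matches file names against the two patterns given above. The function name `split_collection_files` and the example file names are assumptions made for illustration; the actual file names in the distribution may differ.

```python
from fnmatch import fnmatchcase


def split_collection_files(names):
    """Split file names into (main, training, other) lists by name pattern.

    Hypothetical helper: only the patterns narod.* and narod_training.*
    come from the collection description.
    """
    main, training, other = [], [], []
    for name in names:
        if fnmatchcase(name, "narod_training.*"):
            training.append(name)  # training set for Web page classification
        elif fnmatchcase(name, "narod.*"):
            main.append(name)      # the main part of the collection
        else:
            other.append(name)     # anything that matches neither pattern
    return main, training, other


# Example file names are invented for this sketch.
example_names = ["narod.1.xml", "narod_training.1.xml", "narod.2.xml", "README"]
main_files, training_files, other_files = split_collection_files(example_names)
print(main_files, training_files, other_files)
```

The pattern narod.* requires a dot directly after "narod", so it does not match the narod_training.* files and the two groups stay disjoint.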

Tracks in Which the Collection Was Used
  • Ad hoc search in a Web collection
    • 2003
    • 2004
    • 2005
    • 2006
  • Ad hoc search in a mixed collection
  • Similar documents search
  • Classification of Web sites
    • 2003
    • 2004
    • 2005
    • 2006
  • Classification of Web pages
  • Facts extraction
    • 2004
  • Question answering
  • Query-biased summarization