Test collections

Narod.ru Web Collection

Description

The collection consists of a pseudorandom selection of about 3% of the web sites hosted in Russia by the national free hosting provider narod.ru. Non-HTML documents and pages built with the standard templates provided by narod.ru were excluded from the collection. The collection amounts to about 0.12-0.30% of the whole Russian segment of the Web.

Dataset Parameters

Size of HTML data: 7+ GB
Number of pages: 728 000+
Number of web sites: 22 000
Encoding: cp1251 (documents in other encodings are treated as garbage)
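
Because every document is expected to be in cp1251, a reader of the collection can decode page bytes with that codec unconditionally. The following is a minimal Python sketch, not part of the collection's tooling; the function name decode_page and the sample string are assumptions made for illustration. It also shows why pages in other encodings end up as garbage: bytes written in UTF-8 still decode under cp1251, but the result is unreadable.

```python
def decode_page(raw: bytes) -> str:
    """Decode raw page bytes as cp1251 (hypothetical helper, not from the collection)."""
    # errors="replace" substitutes U+FFFD for a byte that is undefined in cp1251
    # instead of raising UnicodeDecodeError.
    return raw.decode("cp1251", errors="replace")


sample = "Привет"  # illustrative text, not taken from the collection

# A page that really is cp1251 round-trips to the original text.
ok = decode_page(sample.encode("cp1251"))

# A page stored in another encoding (here UTF-8) decodes without error,
# but each two-byte UTF-8 character becomes two unrelated cp1251 characters.
garbled = decode_page(sample.encode("utf-8"))

print(ok)
print(garbled)
```

Decoding never signals that the source encoding was wrong, so any filtering of non-cp1251 pages has to rely on a separate check that the source does not describe.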

Rights to Use

Rights to use the Narod.ru collection are granted by Yandex, the owner of the collection. To obtain access to the collection, you must sign the usage agreement.

Data Format

The collection is distributed as XML files in a specific format. These files are split into two groups: narod.* and narod_training.*. Files in the second group contain the documents that were used as the training set in the Web page classification track.
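
The two file groups can be told apart by name alone. The sketch below, in Python, separates a list of file names using the two patterns given above; the function name split_collection_files and the example file names are hypothetical, since the source does not list the actual files or their extensions.

```python
from fnmatch import fnmatchcase


def split_collection_files(filenames):
    """Split file names into (main, training) groups by the documented patterns."""
    main, training = [], []
    for name in filenames:
        # fnmatchcase compares case-sensitively on every platform.
        if fnmatchcase(name, "narod_training.*"):
            training.append(name)
        elif fnmatchcase(name, "narod.*"):
            main.append(name)
        # Anything else is not part of either group and is skipped.
    return main, training


# Hypothetical file names, used only to exercise the patterns.
example = ["narod.1.xml", "narod_training.1.xml", "readme.txt"]
main_files, training_files = split_collection_files(example)
print(main_files)
print(training_files)
```

The pattern narod.* requires a dot directly after "narod", so it does not match the narod_training.* files and the two groups never overlap.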

Tracks in Which the Collection Was Used

Ad hoc search in a Web collection
- 2003
- 2004
- 2005
- 2006
Ad hoc search in a mixed collection
- 2005
- 2006
Similar documents search
- 2005
- 2006
- 2007
Classification of Web sites
- 2003
- 2004
- 2005
- 2006
Classification of Web pages
- 2005
- 2006
Facts extraction
- 2004
Question answering
- 2006
Query-biased summarization
- 2005
- 2006
- 2007