ROMIP: Russian Information Retrieval Evaluation Seminar


 

BY.web Collection

Description

The BY.web collection (provided by Yandex) is a subset of pages from the .by domain that were present in the Yandex index in May 2007. For each known site in the .by domain, the collection contains all pages no deeper than 3 links from the start page.
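The depth limit can be illustrated with a breadth-first traversal. This is only a minimal sketch, not the crawler Yandex used: the function name `pages_within_depth`, the toy link graph, and the convention that the start page has depth 0 are all assumptions made for illustration.

```python
from collections import deque


def pages_within_depth(links, start, max_depth=3):
    """Return {page: depth} for pages reachable from `start`
    in at most `max_depth` link hops (start page = depth 0, an assumption)."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # do not follow links from pages already at the limit
        for target in links.get(page, ()):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


# Hypothetical site: each key maps a page to the pages it links to.
toy_links = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/a/1": ["/a/1/x"],
    "/a/1/x": ["/too-deep"],
}
reachable = pages_within_depth(toy_links, "/")
```

In this toy graph, `/too-deep` is four links away from the start page, so it is left out of the result.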

Dataset Parameters
  • Size of HTML data: 8 GB
  • Encoding: cp1251 (documents in other encodings are treated as garbage)
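Reading a document in this encoding might look like the sketch below. The helper name `decode_cp1251` is hypothetical, and the sketch assumes the raw page bytes are already available; how they are stored inside the distributed files is not described here. Because cp1251 is a single-byte encoding, almost any byte sequence decodes without error, so a successful decode alone does not show that a document was really written in cp1251.

```python
def decode_cp1251(raw: bytes) -> str:
    """Decode page bytes as cp1251 (Windows-1251).

    Bytes that are undefined in cp1251 are replaced with U+FFFD
    instead of raising an exception.
    """
    return raw.decode("cp1251", errors="replace")


# Round trip on a short Cyrillic sample (illustrative data only).
sample = "Беларусь".encode("cp1251")
text = decode_cp1251(sample)
```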
Features

  • About 25% of links lead to pages within the collection.
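A figure like this could be computed as the share of outgoing links whose target is itself a page in the collection. The sketch below uses a hypothetical `internal_link_share` function and toy data; it is not the procedure used to obtain the reported value.

```python
def internal_link_share(links, collection):
    """Fraction of links whose target page belongs to `collection`.

    `links` maps each source page to the list of pages it links to.
    """
    total = 0
    internal = 0
    for targets in links.values():
        for target in targets:
            total += 1
            if target in collection:
                internal += 1
    return internal / total if total else 0.0


# Toy data: one of the four links points to a page inside the collection.
collection = {"a.by/", "b.by/"}
links = {"a.by/": ["b.by/", "x.com/", "y.ru/", "z.org/"]}
share = internal_link_share(links, collection)
```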

Rights to Use

Rights to use the BY.web collection are granted by Yandex, the owner of the collection. To get access to the collection, you must sign the usage agreement.

Data Format

The collection is distributed as XML files in a specific format.

Tracks in Which the Collection Was Used
  • Ad hoc search in a Web collection
  • Ad hoc search in a mixed collection
  • Classification of Web sites
  • Classification of Web pages