ROMIP: Russian Information Retrieval Evaluation Seminar


BY.web Collection

Description

The BY.web collection (provided by Yandex) is a subset of pages from the .by domain that were present in the Yandex index in May 2007. For each known site in the .by domain, the collection contains all pages no deeper than 3 links from the start page.
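The depth limit can be illustrated with a breadth-first traversal. This is only a minimal sketch of one way to select pages within 3 links of a start page; it is not a description of how Yandex built the collection. The function name `pages_within_depth` and the `links` mapping (page to outgoing links) are hypothetical.

```python
from collections import deque


def pages_within_depth(start, links, max_depth=3):
    """Return {page: depth} for pages reachable from `start` in at most `max_depth` links.

    `links` is a hypothetical mapping from a page to the pages it links to.
    """
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            # Pages at the depth limit are kept, but their links are not followed.
            continue
        for target in links.get(page, ()):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth
```

Breadth-first order matters here: it guarantees that each page is recorded with its shortest distance from the start page, so a page reachable by both a short and a long path is not excluded by mistake.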

Dataset Parameters
  • Size of HTML data: 8 GB
  • Encoding: cp1251 (documents in other encodings are treated as garbage)
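When reading documents from the collection, the encoding rule above suggests decoding the raw bytes as cp1251 and setting aside anything that fails. The sketch below uses Python's built-in `cp1251` codec. The helper name `decode_cp1251` is hypothetical, and strict decoding is only a weak check (most byte sequences are valid cp1251), so it should not be taken as the procedure used to prepare the collection.

```python
from typing import Optional


def decode_cp1251(raw: bytes) -> Optional[str]:
    """Decode `raw` as cp1251; return None if it contains bytes undefined in cp1251."""
    try:
        return raw.decode("cp1251")
    except UnicodeDecodeError:
        # Assumption: undecodable documents are skipped as garbage.
        return None
```

A document in another encoding will often still decode without an error and simply produce unreadable text, so a real pipeline would likely need an additional language or character-frequency check.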
Features

  • About 25% of links lead to pages within the collection.
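A figure of this kind can be computed as the fraction of extracted link targets that are themselves documents in the collection. The following is a rough sketch under that assumption; the function name `in_collection_link_share` and both arguments are hypothetical, and the source does not say how links were counted or normalized.

```python
def in_collection_link_share(link_targets, collection_urls):
    """Return the fraction of `link_targets` that point to URLs in `collection_urls`."""
    targets = list(link_targets)
    if not targets:
        return 0.0
    known = set(collection_urls)
    # Each link occurrence is counted once, including repeated targets.
    inside = sum(1 for url in targets if url in known)
    return inside / len(targets)
```

For example, four links of which one points to a collected page give a share of 0.25.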

Rights to Use

Open sourced by Yandex.

Data Format

The collection is distributed as XML files in a specific format.

Tracks in Which the Collection Was Used
  • Ad hoc search in a Web collection
  • Ad hoc search in a mixed collection
  • Classification of Web-sites
  • Classification of Web pages