BY.web Collection


The BY.web collection (provided by Yandex) is a subset of pages from the .by domain that were present in the Yandex index in May 2007. For each known site in the .by domain, the collection contains all pages that are no more than 3 links away from the site's start page.
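
The depth limit can be pictured as a breadth-first traversal that stops following links after 3 hops. The sketch below is only an illustration on an in-memory link graph; how the pages were actually selected is not described here, and the names `pages_within_depth`, `links`, and `max_depth` are hypothetical.

```python
from collections import deque


def pages_within_depth(links, start, max_depth=3):
    """Return {page: depth} for pages reachable from `start`
    in at most `max_depth` link hops (breadth-first search).

    `links` is an assumed in-memory mapping: page -> iterable of linked pages.
    """
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, ()):
            if target not in depth:  # first visit gives the shortest distance
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


# Toy chain of pages: a -> b -> c -> d -> e
toy_links = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"]}
selected = pages_within_depth(toy_links, "a")
```

In the toy chain, page "e" is 4 links from the start page and is therefore left out.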

Dataset Parameters
  • Size of HTML data: 8 GB
  • Encoding: cp1251 (documents in other encodings are treated as garbage)
  • About 25% of links lead to pages within the collection.
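
Since the documents are stored in cp1251, they usually need to be decoded before further processing. A minimal sketch in Python, assuming the raw bytes of a document have already been extracted from the distribution files (the function names `decode_cp1251` and `to_utf8` are hypothetical):

```python
def decode_cp1251(raw: bytes) -> str:
    """Decode raw document bytes as cp1251 (Windows-1251).

    Byte 0x98 is undefined in cp1251; errors="replace" maps it to
    U+FFFD instead of raising UnicodeDecodeError.
    """
    return raw.decode("cp1251", errors="replace")


def to_utf8(raw: bytes) -> bytes:
    """Re-encode a cp1251 document as UTF-8."""
    return decode_cp1251(raw).encode("utf-8")


sample = "Беларусь".encode("cp1251")  # stand-in for document bytes
text = decode_cp1251(sample)
```

Note that almost any byte sequence decodes under cp1251 without error, so a successful decode does not show that a document was really written in cp1251. Documents in other encodings will simply come out as unreadable text.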

Rights to Use

Rights to use the BY.web collection are granted by Yandex, the owner of the collection. To get access to the collection, you must sign the usage agreement.

Data Format

The collection is distributed as XML files in a specific format.

Tracks in Which the Collection Was Used
  • Ad hoc search in a Web collection
  • Ad hoc search in a mixed collection
  • Classification of Web sites
  • Classification of Web pages