The BY.web collection (provided by Yandex)
is a subset of pages from the .by domain that were present in the Yandex
index in May 2007. The collection contains all pages (no more than 3 links deep from the start page) for
each known site in the .by domain.
Dataset Parameters
Size of HTML data: 8 GB
Encoding: cp1251 (documents in other encodings are treated as garbage)
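As a minimal sketch of reading a document from the collection, the raw bytes can be decoded as cp1251 with Python's built-in codec. The helper name decode_cp1251 and the sample string are assumptions made for illustration and are not part of the collection. A successful decode does not prove that a document was really written in cp1251, because almost every byte value maps to some character in that code page.

```python
def decode_cp1251(raw: bytes) -> str:
    """Decode raw document bytes as cp1251.

    Bytes that have no mapping in cp1251 are replaced with U+FFFD
    instead of raising an exception.
    """
    return raw.decode("cp1251", errors="replace")


# Hypothetical sample: a Cyrillic word encoded the way the collection stores text.
sample = "Беларусь".encode("cp1251")
text = decode_cp1251(sample)
print(text)
```

Using errors="replace" keeps processing going on damaged input; switch to errors="strict" if a failed decode should be reported instead.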
Features
About 25% of links lead to pages within the collection.
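The figure above is the share of link targets that are themselves pages in the collection. The sketch below shows how such a share could be computed on toy data. The function name internal_link_share and all URLs are hypothetical, and this is not necessarily how the published figure was obtained.

```python
def internal_link_share(links, collection_urls):
    """Return the fraction of link targets that are pages in the collection."""
    links = list(links)
    if not links:
        return 0.0
    known = set(collection_urls)
    hits = sum(1 for target in links if target in known)
    return hits / len(links)


# Toy data (hypothetical URLs): one of four links stays inside the collection.
collection = {"http://example.by/", "http://example.by/a"}
outgoing = [
    "http://example.by/a",
    "http://other.example.com/",
    "http://other.example.org/x",
    "http://elsewhere.example.net/",
]
share = internal_link_share(outgoing, collection)
print(f"{share:.0%}")
```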
Rights to Use
Rights to use the BY.web collection are granted by
Yandex, the owner of the collection.
To get access to the collection, you must sign the
usage agreement (in Russian).
Data Format
The collection is distributed as XML files in a specific format.