BY.web collection (provided by )
is a subset of pages from the .by domain that were present in the Yandex
index in May 2007. The collection contains all pages (no deeper than 3 links from the start page) for
each known site in the .by domain.
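The source does not say how the 3-link depth limit was applied. One plausible reading is a breadth-first traversal from each site's start page that stops following links at depth 3. The sketch below illustrates that reading on a toy link graph; the function name `pages_within_depth`, the `links` mapping, and the sample paths are all hypothetical and not part of the collection.

```python
from collections import deque


def pages_within_depth(start, links, max_depth=3):
    """Return {page: depth} for pages reachable from `start`
    by following at most `max_depth` links (breadth-first)."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, ()):
            if target not in depth:  # first visit gives the shortest depth
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


# Hypothetical toy site: a simple chain of pages.
toy_links = {"/": ["/a"], "/a": ["/b"], "/b": ["/c"], "/c": ["/d"]}
selected = pages_within_depth("/", toy_links)
```

Under this reading, the page `/d` is left out because it sits 4 links away from the start page.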
Dataset Parameters
Size of HTML data: 8 GB
Encoding: cp1251 (documents in other encodings are treated as garbage)
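When reading the documents, the raw bytes need to be decoded as cp1251. The following is a minimal sketch, assuming the page content is already available as a `bytes` object; the helper name `decode_cp1251` is hypothetical. How the collection itself identified documents in other encodings is not described in the source, and strict decoding is only a rough filter, since almost every byte value is valid in cp1251.

```python
def decode_cp1251(raw: bytes):
    """Return the text if `raw` decodes as cp1251, otherwise None.

    Note: cp1251 leaves only byte 0x98 undefined, so most text in other
    encodings decodes without error and simply comes out as mojibake.
    """
    try:
        return raw.decode("cp1251")
    except UnicodeDecodeError:
        return None


sample = "Привет".encode("cp1251")
text = decode_cp1251(sample)
rejected = decode_cp1251(b"\x98")  # 0x98 has no mapping in cp1251
```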
Features
About 25% of links lead to pages within the collection.
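A figure of this kind can be computed as the share of outgoing links whose target is itself a page of the collection. The sketch below assumes the link graph has already been extracted into a mapping from page URL to the list of link targets; the function name `internal_link_share` and the sample URLs are hypothetical, and the source does not state how the 25% estimate was actually obtained.

```python
def internal_link_share(links, collection_urls):
    """Fraction of all links whose target is a page in the collection."""
    total = 0
    internal = 0
    for targets in links.values():
        for target in targets:
            total += 1
            if target in collection_urls:
                internal += 1
    return internal / total if total else 0.0


# Hypothetical toy data: one page with four outgoing links,
# only one of which points to a page inside the collection.
collection = {"http://a.by/", "http://a.by/x"}
toy_links = {
    "http://a.by/": [
        "http://a.by/x",
        "http://ext.com/",
        "http://ext.org/",
        "http://ext.net/",
    ],
}
share = internal_link_share(toy_links, collection)
```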
Rights to Use
Rights to use the BY.web collection are granted by
, the owner of the collection.
To get access to the collection you must sign the
usage agreement (in Russian).
Data Format
The collection is distributed as XML files in a specific format.