The KM.ru collection is a copy of the www.km.ru
multiportal (about 90% of the total content of www.km.ru as of May 2007).
It consists of documents from 57 sites.
Dataset Parameters
Size of HTML data: 13.7 GB
Number of pages: 3 010 455
Number of web sites: 57
Encoding: cp1251 (documents in other encodings are treated as garbage)
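Since only cp1251 documents count as valid, a consumer of the collection may
want to filter out pages in other encodings before indexing. The sketch below
is one possible heuristic, not part of the collection tooling: the function
name looks_like_cp1251 and the 0.8 threshold are assumptions made for
illustration. It relies on the fact that almost any byte sequence decodes as
cp1251, so a successful decode alone proves little; instead it checks that the
non-ASCII characters are mostly Russian letters.

```python
def looks_like_cp1251(raw: bytes, min_cyrillic_share: float = 0.8) -> bool:
    """Heuristically decide whether raw bytes are Russian text in cp1251.

    Hypothetical helper: the threshold is an assumed value, not a documented one.
    """
    try:
        text = raw.decode("cp1251")
    except UnicodeDecodeError:
        # Very few byte values are unassigned in cp1251, so this rarely fires.
        return False

    non_ascii = [ch for ch in text if ord(ch) > 127]
    if not non_ascii:
        # Pure ASCII is valid in cp1251 as well.
        return True

    # Count the basic Russian alphabet (U+0410..U+044F) plus the letter yo.
    cyrillic = sum(
        1 for ch in non_ascii
        if "\u0410" <= ch <= "\u044f" or ch in "\u0401\u0451"
    )
    return cyrillic / len(non_ascii) >= min_cyrillic_share
```

UTF-8 encoded Russian text decoded as cp1251 yields many characters outside
the basic Russian alphabet, so it tends to fall below the threshold. Short or
mixed-language pages can be misclassified, so the threshold would need tuning
on real data.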
Features
The collection contains multiple duplicate copies of some documents, for
example the original version of a document, its print version, and an
archive copy.
Many documents contain multiple informational blocks, and the content of
these blocks is often not closely related to the rest of the page (e.g. a
list of headlines of other articles with links to those articles).
Many sites included in the collection have a complex link structure that can
be used to improve the quality of search results. Note that the quality of
the link structure may differ significantly between sites.
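The duplicate copies mentioned above (original, print version, archive copy)
usually share most of their visible text but differ in markup and navigation
blocks, so byte-level comparison will miss them. One common way to flag such
pages is word shingling with Jaccard similarity. The sketch below is only an
illustration under assumptions: the function names, the shingle size of 3 and
the 0.8 threshold are hypothetical choices, not something prescribed by the
collection.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes; script/style content is not filtered in this sketch."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(html: str) -> str:
    """Return the concatenated text nodes of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(html_a: str, html_b: str, threshold: float = 0.8) -> bool:
    """Flag two pages as near-duplicates if their shingle sets overlap enough."""
    sim = jaccard(shingles(visible_text(html_a)), shingles(visible_text(html_b)))
    return sim >= threshold
```

Comparing every pair of pages this way is quadratic, so for a collection of
about three million pages a fingerprinting scheme such as MinHash would be
needed on top of the same shingle sets.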
Rights to Use
Rights to use the KM.ru web collection are granted by the "KM-online"
company, the owner of the collection.
To get access to the collection, you must sign the
usage agreement (in Russian).
Data Format
The collection is distributed as XML files in a specific format.