KM.ru Web Collection
RIRES: Russian Information Retrieval Evaluation Seminar

Description

The KM.ru collection is a copy of the www.km.ru multiportal (about 90% of the total content of www.km.ru as of May 2007). It consists of documents from 57 sites.

Dataset Parameters
  • Size of HTML data: 13.7 GB
  • Number of pages: 3 010 455
  • Number of web sites: 57
  • Encoding: cp1251 (documents in other encodings are treated as garbage)
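As an illustration of the encoding note, here is a minimal Python sketch of how a consumer of the collection might decode page bytes and set aside documents that are not cp1251. The function name `decode_cp1251` is hypothetical, and the check is only a heuristic: it rejects bytes that are valid non-ASCII UTF-8 or that contain a byte undefined in cp1251, but it cannot recognise other single-byte Cyrillic encodings such as KOI8-R.

```python
from typing import Optional


def decode_cp1251(raw: bytes) -> Optional[str]:
    """Return text decoded as cp1251, or None if the bytes look like garbage.

    Heuristic only (an assumption, not part of the collection's tooling).
    """
    if not raw.isascii():
        try:
            raw.decode("utf-8")
            # Valid non-ASCII UTF-8 is very unlikely to be cp1251 text.
            return None
        except UnicodeDecodeError:
            pass
    try:
        # Strict decoding fails on bytes that cp1251 leaves undefined.
        return raw.decode("cp1251")
    except UnicodeDecodeError:
        return None
```

A caller would skip any document for which the function returns `None` and index the rest.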

Features

  • The collection contains multiple duplicate copies of some documents, for example the original version of a document, a printer-friendly version, and an archive copy.
  • Many documents contain multiple informational blocks, and the content of these blocks is often not closely related to the rest of the page (e.g. a list of headlines of other articles with links to those articles).
  • Many sites included in the collection have a complex link structure that can be used to improve the quality of search results. Note that the quality of the link structure may differ significantly between sites.
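Because the duplicate copies mentioned above (original, print version, archive copy) usually share their main text but differ in surrounding markup and navigation, one common way to spot them is word-shingle overlap. The sketch below is only an illustration and is not described by the collection's authors. The function names, the shingle size `k = 3`, and any similarity threshold are assumptions.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of a text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts: use the whole word sequence as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the shingle sets of two texts, in [0, 1]."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

Pairs of pages whose extracted text scores above some chosen threshold could then be treated as copies of one document. The right threshold depends on how much boilerplate each page carries.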

Rights to Use

Rights to use the KM.ru web collection are granted by the "KM-online" company, the owner of the collection. To get access to the collection, you must sign the usage agreement (in Russian).

Data Format

The collection is distributed as XML files in a specific format.

Tracks in Which the Collection Was Used
  • Ad hoc search in a Web collection
  • Ad hoc search in a mixed collection