Web Collection
RIRES: Russian Information Retrieval Evaluation Seminar

The collection consists of a pseudorandom selection of about 3% of the web sites hosted in Russia by a national free hosting provider. Non-HTML documents and pages built with the provider's standard templates were excluded from the collection. The collection amounts to about 0.12-0.30% of the whole Russian segment of the Web.

Dataset Parameters
  • Size of HTML data: 7+ GB
  • Number of pages: 728 000+
  • Number of web sites: 22 000
  • Encoding: cp1251 (documents in other encodings are treated as garbage)
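
Since every document is expected to be in cp1251, a consumer that works in UTF-8 has to transcode first. A minimal sketch in Python, assuming the raw bytes of one document are already in hand; the function name `cp1251_to_utf8` is hypothetical and is not part of the collection's tooling:

```python
def cp1251_to_utf8(raw: bytes) -> bytes:
    """Transcode one document from cp1251 to UTF-8 (hypothetical helper)."""
    # Nearly every byte value maps to a character in cp1251, so a strict
    # decode rarely fails. errors="replace" substitutes U+FFFD for any
    # unmapped byte instead of raising.
    text = raw.decode("cp1251", errors="replace")
    return text.encode("utf-8")


sample = "Привет".encode("cp1251")
print(cp1251_to_utf8(sample).decode("utf-8"))
```

Because the decode almost never fails, this step does not by itself detect documents in other encodings; those simply come out as unreadable text.
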
Rights to Use

Rights to use the collection are granted by Yandex, the owner of the collection. To get access to the collection you must sign the usage agreement (in Russian).

Data Format

The collection is distributed as XML files of a fixed format. These files are split into two groups: narod.* and narod_training.*. Files in the second group contain the documents that were used as the training set in the Web page classification track.
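
The two groups can be told apart by file name prefix alone. A minimal sketch in Python, assuming only the narod.* and narod_training.* naming described above; the function name and the sample file names are hypothetical, and the XML structure inside the files is not touched:

```python
def group_collection_files(names):
    """Split file names into (main, training) lists by prefix."""
    main, training = [], []
    for name in names:
        if name.startswith("narod_training."):
            training.append(name)
        elif name.startswith("narod."):
            main.append(name)
        # Anything else is not part of the collection and is ignored.
    return main, training


# Hypothetical file names; the real suffixes are not specified here.
main, training = group_collection_files(
    ["narod.1.xml", "narod_training.1.xml", "README"]
)
print(main, training)
```
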

Tracks in Which the Collection Was Used
  • Ad hoc search in a Web collection
    • 2003
    • 2004
    • 2005
    • 2006
  • Ad hoc search in a mixed collection
  • Similar documents search
  • Classification of Web sites
    • 2003
    • 2004
    • 2005
    • 2006
  • Classification of Web pages
  • Facts extraction
    • 2004
  • Question answering
  • Query-biased summarization