RIRES: Russian Information Retrieval Evaluation Seminar

 Call for participation 
 General principles 
 Test collections 
 Relevance tables 


The First Russian Information Retrieval Evaluation Seminar

The first seminar was organized in 2003, with the final workshop attached to the Russian Conference on Digital Libraries (St. Petersburg, October 2003). We received nine applications for participation, but only seven teams were able to complete the tasks on schedule. Among the '03 participants were several important industry representatives, including two major players in the Russian web search market. Participation from academia was lower, probably because research prototypes were not ready for the scale of the tasks and the deadlines were tight.

ROMIP'2003 had two tracks: "adhoc" retrieval and web-site classification, both using a 7 GB+ subset of the domain.

Queries for the "adhoc" track were selected from the daily log of Yandex, a popular Russian web search engine. To prevent fine-tuning of results, participants were asked to run 15,000 queries and, for each query, submit the first 100 results to the organizing committee. The queries for evaluation (about 50) were selected only after all participants had submitted their results.

For the evaluation of results we used a TREC-like pooling mechanism. However, our evaluation procedure had several significant differences:

  • We collected multiple assessment judgments per query/document pair (at least two) to improve the recall approximation and to decrease the influence of subjectivity.
  • To minimize discrepancies between the information needs reconstructed by different assessors, we used an "extended" version of the search problem specification. The extended version includes a native-language description of the expected results and was prepared while the queries to be evaluated were selected. Its purpose was to clarify the query and to minimize the number of possible interpretations.
  • Evaluation of a query pool was shared between three assessors, each of whom provided judgments for 70% of the query-document pairs. This way we collected more information about the assessors and could therefore use more sophisticated approaches for deducing the final judgments.
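The pooling and assessor-assignment scheme above can be sketched as follows. This is a minimal illustration, not the actual ROMIP tooling: the function names (`build_pool`, `assign_pairs`) and the specific way of reaching ~70% coverage per assessor (every pair goes to two of the three assessors, and a random ~10% of pairs additionally go to the third) are assumptions consistent with the numbers in the text.

```python
import itertools
import random

def build_pool(runs, depth=100):
    """Merge the top-`depth` results of every submitted run into one
    judging pool per query (duplicates collapsed), TREC-style.
    `runs` is a list of dicts mapping query id -> ranked doc ids."""
    pool = {}
    for run in runs:
        for query_id, ranked_docs in run.items():
            pool.setdefault(query_id, set()).update(ranked_docs[:depth])
    return pool

def assign_pairs(pairs, assessors=("A", "B", "C"), triple_share=0.1):
    """Hypothetical assignment scheme: give every query-document pair
    to two of the three assessors (round-robin over the three possible
    duos), plus a random ~10% of pairs to the remaining assessor, so
    each assessor sees roughly 70% of all pairs and every pair gets
    at least two judgments."""
    duos = list(itertools.combinations(assessors, 2))
    workload = {a: [] for a in assessors}
    for i, pair in enumerate(pairs):
        chosen = set(duos[i % len(duos)])
        if random.random() < triple_share:
            chosen = set(assessors)  # a third, tie-breaking judgment
        for a in chosen:
            workload[a].append(pair)
    return workload
```

With three assessors, two judgments per pair already give each assessor a deterministic 2/3 of the pool; the small share of triple-judged pairs lifts that to about the 70% stated above while supplying extra data for resolving disagreements.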

The training set for the classification track was based on an existing web catalog of sites. We selected about 170 categories from the second level of the hierarchy; each selected category had at least 5 sample sites. Participants were asked to assign a list of at most 5 categories to each of the 22,000 web sites in the collection. At the evaluation stage, all assignments from 17 selected categories were judged by at least two assessors.
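The text does not specify how judged assignments were scored, but one natural measure over the judged subset can be sketched as below. The function name and the precision-style metric are hypothetical, not the committee's actual scoring procedure.

```python
def assignment_precision(submitted, confirmed, judged_categories):
    """Hypothetical per-run score: of the (site, category) assignments
    that fall into the judged categories, the fraction that assessors
    confirmed as correct.

    `submitted` maps site id -> list of up to 5 category ids,
    `confirmed` is the set of (site, category) pairs judged correct,
    `judged_categories` is the set of categories assessors covered."""
    judged = [(site, cat)
              for site, cats in submitted.items()
              for cat in cats
              if cat in judged_categories]
    if not judged:
        return 0.0
    return sum(pair in confirmed for pair in judged) / len(judged)
```

Restricting the denominator to judged categories matters: with only 17 of ~170 categories judged, assignments outside that subset carry no relevance information and should not count against a run.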

Contact us: romip[AT]