Overview

Russian Information Retrieval Evaluation Seminar

The Initiative

The Russian information retrieval evaluation initiative was launched in 2002 to increase communication within, and to support, the community of researchers (from both academia and industry) working on text retrieval for Russian-language collections, by providing the infrastructure necessary for evaluating information retrieval methodologies. In particular, a series of Russian Information Retrieval Evaluation Seminars (ROMIP seminars) is planned to be held yearly.

In many respects ROMIP seminars are similar to other international information retrieval evaluation events such as TREC, CLEF, and NTCIR. Several reasons motivated the launch of a new one:

the absence of publicly available Russian test collections;
relatively low interest in creating Russian-language tracks or collections within the existing evaluation initiatives (as far as we know, only CLEF'2003 had a Russian document collection, and it was rather small);
the low rate of participation of Russian research groups in the existing evaluation initiatives.

Like TREC, ROMIP is cyclical and is overseen by a program committee consisting of representatives from academia and industry. Given a collection and a set of tasks, participants run their own systems on the data and submit the results to the organizing committee. The collected results are independently judged, and the cycle ends with a workshop for sharing experience and discussing future plans.

However, we do not copy TREC tasks and methodology exactly. Instead, we adapt them to our circumstances and combine them with other recent approaches to information retrieval evaluation.

The First Seminar

The first seminar was organized in 2003, with the final workshop attached to the Russian Conference on Digital Libraries (St. Petersburg, October 2003). We received nine applications for participation, but only seven teams were able to complete the tasks on schedule. The RIRES'03 participants included several important industry representatives, among them two major players in the Russian web search market. Participation from academia was lower, probably because research prototypes were not ready for the scale of the tasks and because the deadlines were tight.

ROMIP'2003 had two tracks, "adhoc" retrieval and Web-site classification, using a 7+ GB subset of the narod.ru domain.

Queries for the "adhoc" track were selected from the daily log of the popular Russian Web search engine Yandex (www.yandex.ru). To prevent fine-tuning of results, participants were asked to run 15,000 queries and, for each query, to submit the first 100 results to the organizing committee. The queries used for evaluation (about 50) were selected only after all participants had submitted their results.
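The submission protocol above can be illustrated with a minimal sketch. It assumes a run is held in memory as a mapping from a query identifier to a ranked list of document identifiers; that layout and the name restrict_run are hypothetical and are not part of the ROMIP submission format.

```python
def restrict_run(run, evaluation_queries, max_results=100):
    """Reduce a submitted run to what is actually evaluated.

    run: dict mapping query id -> ranked list of document ids
         (hypothetical in-memory layout, not the ROMIP file format).
    evaluation_queries: set of query ids chosen after all submissions arrived.
    max_results: number of top results kept per query (100 in the text above).
    """
    return {
        query_id: ranked_docs[:max_results]
        for query_id, ranked_docs in run.items()
        if query_id in evaluation_queries
    }
```

Because participants cannot tell which of the submitted queries will end up in evaluation_queries, tuning a system for the evaluated subset is impractical.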

To evaluate the results we used a TREC-like pooling mechanism. However, our evaluation procedure differed in several significant ways:

We collected multiple assessor judgments (at least two) per query/document pair to improve the recall approximation and reduce the influence of subjectivity.
To minimize discrepancies between the information needs that different assessors reconstruct from the same query, we used an "extended" version of the search problem specification. The extended version includes a native language description of the expected results and was prepared when the evaluation queries were selected. Its purpose was to clarify the query and minimize the number of possible interpretations.
Evaluation of each query pool was shared among three assessors, each of whom provided judgments for 70% of the query-document pairs. This way we collect more information about the assessors and can therefore use more sophisticated approaches for deducing the final judgments.
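The pooling and assessment scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the in-memory data layout, the rule that every tenth pair goes to all three assessors (one way to reach a 70% share per assessor with at least two judgments per pair), and the "all"/"any" aggregation rules are hypothetical. The actual ROMIP assignment and aggregation procedures are not specified here.

```python
from itertools import combinations, cycle


def build_pool(runs, depth=100):
    """TREC-like pooling: union of the top-`depth` documents of every run, per query."""
    pool = {}
    for run in runs:
        for query_id, ranked_docs in run.items():
            pool.setdefault(query_id, set()).update(ranked_docs[:depth])
    return pool


def assign_pairs(pool, assessors=("A", "B", "C"), triple_every=10):
    """Assign each (query, document) pair to at least two assessors.

    Assumption: most pairs go to two of the three assessors in rotation,
    and every `triple_every`-th pair goes to all three. With the default
    of 10, each assessor judges roughly 70% of the pairs.
    """
    two_of_three = cycle(combinations(assessors, 2))
    pairs = [(q, d) for q in sorted(pool) for d in sorted(pool[q])]
    assignment = {}
    for index, pair in enumerate(pairs, start=1):
        if index % triple_every == 0:
            assignment[pair] = tuple(assessors)
        else:
            assignment[pair] = next(two_of_three)
    return assignment


def aggregate(judgments, mode="all"):
    """Deduce one final judgment per pair from several assessor votes.

    judgments: dict mapping (query, document) -> {assessor: bool}.
    mode "all" marks a pair relevant only if every assessor agreed;
    mode "any" marks it relevant if at least one assessor did.
    Both rules are illustrative, not the ROMIP procedure.
    """
    if mode not in ("all", "any"):
        raise ValueError("mode must be 'all' or 'any'")
    combine = all if mode == "all" else any
    return {pair: combine(votes.values()) for pair, votes in judgments.items()}
```

Overlapping assignments of this kind make it possible to compare assessors on the pairs they share, which is the extra information that more elaborate aggregation rules would rely on.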

The training set for the classification track was based on the existing Web catalog of narod.ru sites. We selected about 170 categories from the second level of the hierarchy. Each selected category had at least 5 samples. Participants were asked to assign a list of at most 5 categories to each of the 22,000 web sites in the collection. At the evaluation stage, all assignments for 17 selected categories were judged by at least two assessors.
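As an illustration of the classification task, the sketch below checks the limit of 5 categories per site and computes a per-category precision over the evaluated categories. The metric, the data layout, and the names check_submission and precision_by_category are assumptions made for this example. The measures actually used in the track are not stated above.

```python
from collections import defaultdict

MAX_CATEGORIES = 5  # limit on categories per site, taken from the task description


def check_submission(assignments, max_categories=MAX_CATEGORIES):
    """Return the ids of sites that were given more categories than allowed.

    assignments: dict mapping site id -> list of category names
    (hypothetical layout).
    """
    return [site for site, cats in assignments.items() if len(cats) > max_categories]


def precision_by_category(assignments, judged_correct, evaluated_categories):
    """Share of a system's assignments that assessors accepted, per category.

    judged_correct: set of (site, category) pairs accepted by the assessors.
    evaluated_categories: the subset of categories that was judged.
    Categories the system never assigned are omitted from the result.
    """
    assigned = defaultdict(int)
    correct = defaultdict(int)
    for site, cats in assignments.items():
        for cat in cats:
            if cat in evaluated_categories:
                assigned[cat] += 1
                if (site, cat) in judged_correct:
                    correct[cat] += 1
    return {cat: correct[cat] / assigned[cat] for cat in assigned}
```

Only precision is computed here because judging just the submitted assignments does not reveal which sites a system missed. Estimating recall would need additional assumptions.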