ROMIP: Russian Information Retrieval Evaluation Seminar


General principles

ROMIP follows a cyclic approach. Within each annual cycle, one or several projects are selected, based on the interest of participants, from a set of proposed projects for creating test collections. The selected projects are carried out; once a cycle is complete, new projects are selected in light of the experience gained and the current priorities of the participants.

The seminar is structured as a collection of tracks: sections devoted to particular projects, each with a fixed task and fixed principles of evaluation.

The main principle of ROMIP is that the evaluation tasks and the principles for conducting the evaluation are defined jointly with the participants. The organizing committee only coordinates the tracks.

Another basic principle of ROMIP is that evaluation results are used exclusively for research purposes; using them for marketing or commercial purposes without the prior agreement of the participant concerned is prohibited.

Structure of the annual cycle

  1. Preparatory stage.
    At this stage, the list of participants is compiled, and the list of tasks and the methodology for creating test collections and for evaluation are defined. Data formats, data exchange procedures and the official evaluation metrics are agreed upon. The schedule is fixed.

    To participate in the seminar, a participant must fill in an application, which is reviewed by the organizing committee, pay the participation fee (which covers the initial expenses of creating and distributing the data collection), and sign the necessary agreements (licenses).

    Each participant is assigned an anonym (a label, for example a colour, that is not directly associated with information about the participant), which is used for anonymous evaluation and publication of results. The correspondence between anonyms and participants is available only to a limited subset of the organizing committee.

  2. Preparation of test collections.
    The organizing committee compiles test collections of data and tasks and distributes them to the participants. Depending on the source of the data, the participant may be required to sign an agreement not to redistribute the data and to restrict the use of the collection.

  3. Experimental runs of search engines.
    Each participant runs the query tasks independently on their own equipment. When submitting the results (the responses obtained), the participant must use the previously assigned anonym (for example, the anonym may be used as a login/password for an FTP server) and meet the agreed deadlines and submission formats.

  4. Evaluation of the obtained results.
    The organizing committee arranges the evaluation of the submitted results (for the most part with the involvement of independent experts). The specific evaluation methodology depends on the task in question and is agreed upon during the preparatory stage. All evaluation results will be available to all participants; however, participants will be referred to by their anonyms.

    Participants will have the opportunity to help at the evaluation stage. The specific procedure for their participation in evaluation is currently being formalized.

  5. Analysis of the obtained results, preparation for publication.
    Participants are expected to analyse the results of their own search engine independently and to prepare a paper describing the general principles of their approach and the observed results. Participants are not required to reveal their identity or all the implementation details (this is left to the participant); it is sufficient to describe in general terms which well-known methods were used and what distinguishes the approach from others. However, providing more detailed information about systems, results and problems is encouraged.

  6. Seminar meeting.
    The prepared papers will be presented at the seminar meeting and later published in the proceedings. To popularize the seminar and stimulate research in information retrieval in Russia, it is intended to hold the seminar jointly with a Russian conference on a related topic.
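
The anonym mechanism from the preparatory stage can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the colour list, the function names and the random assignment are hypothetical, and the text above does not specify how the organizing committee actually assigns anonyms.

```python
import random

# Hypothetical pool of anonyms; the text mentions colours only as an example.
COLOURS = ["red", "green", "blue", "yellow", "orange", "violet", "white", "black"]

def assign_anonyms(participants, anonyms=COLOURS, seed=None):
    """Map each participant to a distinct, randomly chosen anonym."""
    if len(set(participants)) != len(participants):
        raise ValueError("participant names must be unique")
    if len(participants) > len(anonyms):
        raise ValueError("not enough anonyms for all participants")
    rng = random.Random(seed)
    chosen = rng.sample(anonyms, len(participants))  # sampling without replacement
    return dict(zip(participants, chosen))

def anonymize_results(results_by_participant, mapping):
    """Re-key results by anonym so that published tables carry no real names."""
    return {mapping[name]: result for name, result in results_by_participant.items()}
```

In such a scheme, the mapping returned by `assign_anonyms` would be the only place where names and anonyms meet, so it would be kept by the limited subset of the organizing committee mentioned above.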

Principles of evaluation

Specific evaluation procedures differ across information retrieval tasks and are defined for each track, but a number of general basic principles can nevertheless be outlined:

  • Equality of systems.
    Where possible, the evaluation procedure must guarantee that systems are treated equally when results are evaluated. For instance, it is desirable to avoid any solution that may cause the number of evaluated documents to be distributed unequally across systems.

  • Anonymity of the source of the result.
    At the evaluation stage, the source of each result must remain anonymous: those who evaluate the results must not know which specific system(s) produced a result. This is necessary both for the anonymity of the final evaluation results and for the objectivity of the evaluation.

  • Selective evaluation.
    Selective evaluation is dictated above all by the limited resources available for evaluation (available experts, time and financial constraints, etc.).

    Selective evaluation also makes it possible to increase the scale of the tasks carried out by the systems while keeping evaluation expenses at the same acceptable level.

  • Use of established approaches.
    It is preferable to use established evaluation methodologies, since this increases confidence in the results obtained.

  • Independence of the evaluation procedure from the system output.
    If individual elements of the system output are evaluated, the evaluation result should not depend on the position of the element in the original output.

    For instance, if the system output under evaluation is a list of documents, the documents should not be presented to the expert for relevance assessment in the same order as they appear in the system output.

  • Involvement of participants in result evaluation.
    Involving participants increases the amount of expert evaluation available without increasing expenses. However, unlike independent experts, participants are interested parties, so the risk of obtaining unreliable results rises. The evaluation procedure should provide methods for detecting and eliminating such problems.
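
Several of these principles (equality of systems, anonymity of the source, selective evaluation, independence from output order) can be illustrated together with a minimal Python sketch based on pooling, a commonly used evaluation technique. This is an assumption made for illustration: the text above does not state that ROMIP uses this exact procedure, and the function names and the `depth` parameter are hypothetical.

```python
import random

def build_pool(runs, depth):
    """Collect the top `depth` documents from every run for one query.

    `runs` maps an anonym to that system's ranked list of document ids.
    Every system contributes the same number of top-ranked documents,
    and only the pooled documents are assessed (selective evaluation).
    """
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool

def assessor_order(pool, seed=None):
    """Return the pooled documents in random order, with no system labels."""
    docs = sorted(pool)  # fixed starting order, independent of any run
    random.Random(seed).shuffle(docs)
    return docs
```

Because the pool is a set of document ids, the assessor sees neither the system that returned a document nor its original rank.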

Track selection

The choice of tracks for the next seminar will be based on the interest of participants and on the feasibility of organizing the tracks. More formally, track selection consists of the following steps:
  • A set of "possibly realizable" tracks is formed.
    A "possible" track is any track that corresponds to the seminar topic. The set of possible tracks is open, and any interested participant can propose a track for general discussion.

    Tracks that lack some of the required information are accepted for discussion; however, to obtain the status of "realizable", a track must have a full description as well as evidence that the necessary resources (data, expert time, etc.) are available.

  • Each track is put to an open vote.
    The aim of the vote is to determine the interest of each participant in each of the possible tracks (one can vote for participation in several tracks).

  • The tracks with the largest number of votes are selected.
    The selection is designed to maximize the benefit (the interest of the participants) within the limited resources available (financial and time constraints on evaluation).
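
The selection steps above can be sketched in Python as a simple greedy procedure. This is one possible reading, not the procedure ROMIP prescribes: the dictionary keys (`name`, `votes`, `cost`, `realizable`), the single numeric `budget` and the greedy strategy are all assumptions introduced for illustration.

```python
def select_tracks(tracks, budget):
    """Pick realizable tracks in order of votes while the budget allows.

    `tracks` is a list of dicts with hypothetical keys:
    name, votes, cost (estimated evaluation expense), realizable (bool).
    """
    candidates = [t for t in tracks if t["realizable"]]
    candidates.sort(key=lambda t: t["votes"], reverse=True)
    selected, remaining = [], budget
    for track in candidates:
        # Skip a track that does not fit; a cheaper one further down may still fit.
        if track["cost"] <= remaining:
            selected.append(track["name"])
            remaining -= track["cost"]
    return selected
```

A greedy choice by votes is easy to explain to participants, but it does not necessarily maximize total interest under the budget; that would require treating the selection as a knapsack-style optimization.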

A track description includes answers to the following questions:

  • Which task's solution methods is the track designed to evaluate?
  • What data set is proposed? (Specify its characteristics: size, legal status, heterogeneity...)
  • What will the tasks be? How many? How will they be formulated (taken from logs, created artificially...)?
  • In what format should the systems' output be submitted?
  • How will the evaluation procedure be organized? How much manual work will be required, and what are the estimated evaluation expenses?
  • What evaluation measures can be used?
  • What justifies the soundness of the figures obtained and of the conclusions drawn from them about the superiority of one method over another? (Methodological aspects.) For example:
    • Stability of results with respect to the number of tasks
    • Stability of results with respect to the evaluation procedure (the order of evaluation or other factors related to expert work)
    • Protection against falsification of results by participants