ROMIP: Russian Information Retrieval Evaluation Seminar



Legal Documents Collection 2007


This collection was created and provided by Kodeks in 2007.

It consists of documents from the legislation of the Russian Federation, Moscow, and St. Petersburg as of the second week of December 2006. The collection contains HTML documents and is much more uniform than the Web collections.


  • The title of each document is placed in the title field of the document content.
  • Documents are formatted with styles, which are not included.
  • Hx tags are not used in the text of documents.
    (To detect headers, look for P tags whose class attribute has the value "headertext".)
  • A unique feature of this collection is the availability of multiple editions of the same document. Multiple editions are stored as multiple content tags. The date attribute of a content tag indicates when that edition was added. The initial (first) edition has no date attribute.
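Since headers are marked only by the class attribute of P tags, they can be pulled out with a small HTML parser. The following is a minimal sketch using Python's standard `html.parser` module; the class name `HeaderExtractor` and the function `extract_headers` are hypothetical names chosen for illustration, and the sketch assumes that a header paragraph ends at its closing P tag or at the next opening P tag.

```python
from html.parser import HTMLParser


class HeaderExtractor(HTMLParser):
    """Collects the text of <p class="headertext"> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.headers = []
        self._buf = None  # text chunks while inside a header paragraph, else None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <P> arrives as "p".
        if tag == "p":
            self._flush()  # an unclosed <p> ends when the next <p> starts
            classes = (dict(attrs).get("class") or "").split()
            if "headertext" in classes:
                self._buf = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def _flush(self):
        # Store the collected header text with whitespace normalized.
        if self._buf is not None:
            text = " ".join("".join(self._buf).split())
            if text:
                self.headers.append(text)
            self._buf = None

    def close(self):
        super().close()
        self._flush()


def extract_headers(html):
    """Return the header texts found in one HTML document, in order."""
    parser = HeaderExtractor()
    parser.feed(html)
    parser.close()
    return parser.headers
```

Calling `extract_headers` on the HTML of one edition returns the header strings in document order; paragraphs without the "headertext" class are ignored.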

Dataset Parameters
  • Size of HTML data (bz2 archives): 1.6 GB
  • Number of pages: 300 000
  • Encoding: cp1251
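
Given the bz2 compression and cp1251 encoding listed above, the data can be read without unpacking it to disk first. This is a minimal sketch that assumes each archive is a plain bz2-compressed text file (not, for example, a tar archive inside bz2); the function name `iter_lines` is hypothetical.

```python
import bz2


def iter_lines(path):
    """Yield decoded text lines from a bz2-compressed cp1251 file.

    Assumption: the archive is a single bz2 stream of text, not a tarball.
    Bytes that are undefined in cp1251 are replaced rather than raising.
    """
    with bz2.open(path, mode="rt", encoding="cp1251", errors="replace") as fh:
        for line in fh:
            yield line.rstrip("\n")
```

Reading line by line keeps memory use low, which matters for archives of this size.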

Rights to Use

Kodeks, the owner of the collection, has granted ROMIP the rights to use it. To get access to the collection, you must sign the usage agreement.

Data Format

The collection is distributed as XML files in a specific format.
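
The exact schema is not reproduced here, so the following is only a sketch of how the editions of one document might be separated. It relies on the `content` tags and their `date` attribute described earlier; the wrapper element `document` in the usage example, the function name `split_editions`, and the date format are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def split_editions(document_xml):
    """Split one document into its initial edition and its dated editions.

    Returns (initial, dated), where `initial` is the text of the content tag
    without a date attribute (or None if absent) and `dated` is a list of
    (date, text) pairs in file order. Dates are kept as raw strings because
    their format is not specified here.
    """
    root = ET.fromstring(document_xml)
    initial = None
    dated = []
    for content in root.iter("content"):
        date = content.get("date")
        text = "".join(content.itertext())
        if date is None:
            initial = text
        else:
            dated.append((date, text))
    return initial, dated
```

For example, with a hypothetical wrapper element:

```python
sample = (
    '<document>'
    '<content>first edition</content>'
    '<content date="2005-01-01">second edition</content>'
    '</document>'
)
initial, dated = split_editions(sample)
```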

Tracks in Which the Collection Was Used
  • Ad hoc search in a collection of legal documents
  • Ad hoc search in a mixed collection
  • Classification of legal documents