ROMIP: Russian Information Retrieval Evaluation Seminar


Simple format of documents in test collections

Documents in ROMIP collections are stored as XML.

The following is stored for each document:
  • identifier (the URL for Web collections)
  • content (without any modifications)
  • collection identifier (name and date of creation)

A single XML file usually contains multiple documents, which reduces the number of files in the test collection.

Document content is encoded in Base64 so that the original markup and other details are preserved.
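
To illustrate the Base64 step, here is a minimal Java sketch of encoding raw document bytes and decoding them again. The class name ContentCodec and its methods are hypothetical and are not part of any ROMIP tool. The sketch assumes the base64 text may be wrapped or indented inside the XML, so it uses the MIME decoder from java.util.Base64, which skips line breaks and other characters outside the base64 alphabet.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper, not part of the ROMIP distribution.
class ContentCodec {
    // Encode raw document bytes for storage as base64 text.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Decode base64 text back to the original bytes.
    // The MIME decoder ignores line breaks and indentation,
    // which a basic decoder would reject.
    static byte[] decode(String base64Text) {
        return Base64.getMimeDecoder().decode(base64Text);
    }

    public static void main(String[] args) {
        byte[] original = "<html><body>Hello</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        String stored = encode(original);
        // Simulate the indentation that surrounds the text in an XML file.
        byte[] restored = decode("\n    " + stored + "\n  ");
        System.out.println("round trip ok: " + Arrays.equals(original, restored));
    }
}
```

The decoded result is a byte array rather than a string. The character encoding of the original document is not recorded in the fields listed above, so a consumer has to determine it separately before converting the bytes to text.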

A sample XML file in the ROMIP format is shown below:

<?xml version="1.0"?>
<romip:dataset xmlns:romip="http://www.romip.ru/data/common">

<collection>
   <collectionID>Name of data set</collectionID>
   <date>Date of creation 
        (shows time when documents were modified for the last time)</date>
</collection>

<document>
  <docID>identifier (URL in case of Web collections)</docID>
  <docURL>full original URL for the document (optional tag)</docURL>  
  <content encoding="base64">
    content in base64
  </content>
</document>

<document>
  ... next document ...
</document>

...

</romip:dataset>
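
As an illustration of how a file in this layout could be read, here is a minimal Java sketch using the standard StAX streaming API. It is not the parser distributed by ROMIP. The class RomipReader and the holder class Doc are hypothetical names. The sketch assumes that element names match the sample above and that each docID, docURL, and content element contains only text.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical reader, not the ROMIP parser.
class RomipReader {
    // Hypothetical holder for one <document> element.
    static final class Doc {
        String docId;
        String docUrl;   // stays null when the optional <docURL> tag is absent
        byte[] content;  // decoded original bytes
    }

    static List<Doc> read(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The format needs no DTDs or external entities, so disable them.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader xml = factory.createXMLStreamReader(in);

        List<Doc> docs = new ArrayList<>();
        Doc current = null;
        while (xml.hasNext()) {
            int event = xml.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = xml.getLocalName();
                if (name.equals("document")) {
                    current = new Doc();
                } else if (current != null && name.equals("docID")) {
                    current.docId = xml.getElementText().trim();
                } else if (current != null && name.equals("docURL")) {
                    current.docUrl = xml.getElementText().trim();
                } else if (current != null && name.equals("content")) {
                    // The MIME decoder skips line breaks and indentation.
                    current.content =
                            Base64.getMimeDecoder().decode(xml.getElementText());
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && xml.getLocalName().equals("document")) {
                docs.add(current);
                current = null;
            }
        }
        xml.close();
        return docs;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<?xml version=\"1.0\"?>"
                + "<romip:dataset xmlns:romip=\"http://www.romip.ru/data/common\">"
                + "<document><docID>http://example.com/a</docID>"
                + "<content encoding=\"base64\">\n  aGVsbG8=\n</content></document>"
                + "</romip:dataset>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        for (Doc d : read(in)) {
            System.out.println(d.docId + " -> " + d.content.length + " bytes");
        }
    }
}
```

A streaming reader is used here because one file usually holds many documents. A DOM parser would also work but loads the whole file into memory. The sketch skips the collection element. Reading collectionID and date would follow the same pattern.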

Standard parser

We offer a simple Java-based parser that can be extended to convert the data into the format used by your system. The parser is provided "as is"; feel free to change it.