Test collections

Simple format of documents in test collections

Documents in ROMIP collections are stored in XML format.

For each document, the following is stored:

identifier (URL in case of Web collections)
content (without any modifications)
collection identifier (name and date of creation)

One XML file usually contains multiple documents, to reduce the number of files in the test collection.

The content of each document is Base64-encoded, so that the original markup (and any other characters that could conflict with the XML wrapper) is preserved unchanged.
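To illustrate why Base64 is used, here is a minimal Python sketch (not part of the ROMIP tools) that wraps raw bytes in a `document` element and reads them back. The element names `document`, `docID`, and `content` follow the sample in this section; the helper name `make_document_element`, the example URL, and the Windows-1251 test page are assumptions made for illustration only.

```python
import base64
import xml.etree.ElementTree as ET


def make_document_element(doc_id, raw_content):
    """Build a <document> element holding raw bytes as Base64 text (hypothetical helper)."""
    doc = ET.Element("document")
    ET.SubElement(doc, "docID").text = doc_id
    content = ET.SubElement(doc, "content", {"encoding": "base64"})
    content.text = base64.b64encode(raw_content).decode("ascii")
    return doc


# Raw page bytes: HTML markup in a legacy encoding (assumed example data).
raw = "<html><body><b>Привет</b></body></html>".encode("cp1251")

element = make_document_element("http://example.com/page.html", raw)
serialized = ET.tostring(element, encoding="unicode")

# Decoding the Base64 text gives back exactly the original bytes.
restored = base64.b64decode(ET.fromstring(serialized).findtext("content"))
```

Because only Base64 characters appear inside `content`, neither the HTML tags nor the byte encoding of the page can interfere with the surrounding XML.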

A sample XML file in the ROMIP format is shown below:

<?xml version="1.0"?>
<romip:dataset xmlns:romip="http://www.romip.ru/data/common">

<collection>
   <collectionID>Name of data set</collectionID>
   <date>Date of creation 
        (shows when the documents were last modified)</date>
</collection>

<document>
  <docID>identifier (URL in case of Web collections)</docID>
  <docURL>full original URL for the document (optional tag)</docURL>  
  <content encoding="base64">
    content in base64
  </content>
</document>

<document>
  ... next document ...
</document>

...

</romip:dataset>
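Reading such a file can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the parser distributed by ROMIP: the function name `iter_romip_documents` and the `SAMPLE` data are hypothetical, and the code assumes that only the root element carries the `romip:` prefix, as in the sample above.

```python
import base64
import xml.etree.ElementTree as ET


def iter_romip_documents(xml_text):
    """Yield (doc_id, doc_url, raw_bytes) for each <document> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    for doc in root.iter("document"):
        doc_id = (doc.findtext("docID") or "").strip()
        doc_url = doc.findtext("docURL")  # optional tag, may be absent
        encoded = doc.findtext("content") or ""
        # Remove indentation and line breaks before decoding the Base64 text.
        raw = base64.b64decode("".join(encoded.split()))
        yield doc_id, (doc_url.strip() if doc_url else None), raw


# Assumed example data: one tiny collection with a single document.
SAMPLE = (
    '<?xml version="1.0"?>'
    '<romip:dataset xmlns:romip="http://www.romip.ru/data/common">'
    "<collection><collectionID>demo</collectionID><date>2005-01-01</date></collection>"
    "<document><docID>http://example.com/a</docID>"
    '<content encoding="base64">\n    '
    + base64.b64encode(b"<html><b>hi</b></html>").decode("ascii")
    + "\n  </content></document>"
    "</romip:dataset>"
)

for doc_id, doc_url, raw in iter_romip_documents(SAMPLE):
    print(doc_id, doc_url, len(raw))
```

The sketch loads the whole file into memory; for large collection files, a streaming approach such as `xml.etree.ElementTree.iterparse` may be more suitable.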

Standard parser

We offer a simple Java-based parser that can be extended to convert the data into the format used by your system. The parser is provided "as is"; feel free to change it.