Simple format of documents in test collections
Documents in ROMIP collections are stored as XML. For each document, the following is stored: an identifier, the original URL (optional), and the content.
One XML file usually contains multiple documents, to reduce the number of files in the test collection.
The content of each document is Base64-encoded to preserve the original markup and other details.
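As a rough illustration of the Base64 step, the following Java sketch decodes the text of a content element back into the original bytes. The class name RomipContentCodec and its methods are hypothetical names chosen for this example. The format description does not say which character set the original documents use, so the caller has to supply one; that choice is an assumption of the sketch, not part of the format.

```java
import java.nio.charset.Charset;
import java.util.Base64;

// Hypothetical helper for illustration; not part of any ROMIP distribution.
class RomipContentCodec {

    // Decodes Base64 text into the original document bytes.
    // The MIME decoder skips line breaks and other characters outside the
    // Base64 alphabet, so wrapped or space-padded element text is accepted.
    static byte[] decode(String base64Text) {
        return Base64.getMimeDecoder().decode(base64Text);
    }

    // Decodes Base64 text and interprets the bytes in the given charset.
    // Assumption: the caller knows the charset of the original document.
    static String decodeToString(String base64Text, Charset charset) {
        return new String(decode(base64Text), charset);
    }

    // Encodes raw bytes as a single-line Base64 string (no line wrapping).
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }
}
```

Because the decoder works on bytes, HTML markup, binary data, and any text encoding survive the round trip unchanged; only the final conversion to a String depends on the charset.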
A sample document in the ROMIP format is below (XML file):
<?xml version="1.0"?>
<romip:dataset xmlns:romip="http://www.romip.ru/data/common">
  <collection>
    <collectionID>Name of data set</collectionID>
    <date>Date of creation (shows time when documents were modified for the last time)</date>
  </collection>
  <document>
    <docID>identifier (URL in case of Web collections)</docID>
    <docURL>full original URL for the document (optional tag)</docURL>
    <content encoding="base64"> content in base64 </content>
  </document>
  <document> ... next document ... </document>
  ...
</romip:dataset>
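To show how the structure above might be read, here is a minimal Java sketch that uses the standard DOM API. It is independent of any parser ROMIP distributes; the names RomipReaderSketch, Doc, read, and childText are hypothetical. It assumes the element names from the sample (document, docID, docURL, content) and treats docURL as optional.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical reader for illustration; element names are taken from the sample.
class RomipReaderSketch {

    // One <document> entry: identifier, optional URL, decoded content bytes.
    static final class Doc {
        final String id;
        final String url;      // null when the optional <docURL> tag is absent
        final byte[] content;  // original bytes after Base64 decoding

        Doc(String id, String url, byte[] content) {
            this.id = id;
            this.url = url;
            this.content = content;
        }
    }

    // Reads every <document> element from one collection file.
    static List<Doc> read(InputStream in) {
        Document xml;
        try {
            xml = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Cannot parse collection file", e);
        }
        NodeList nodes = xml.getElementsByTagName("document");
        List<Doc> docs = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element element = (Element) nodes.item(i);
            String base64 = childText(element, "content");
            // The MIME decoder ignores line breaks inside the encoded text.
            byte[] content = base64 == null
                    ? new byte[0]
                    : Base64.getMimeDecoder().decode(base64);
            docs.add(new Doc(childText(element, "docID"),
                             childText(element, "docURL"),
                             content));
        }
        return docs;
    }

    // Returns the trimmed text of the first child with this tag, or null if none.
    static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }
}
```

DOM loads the whole file into memory, which keeps the sketch short. Since one file usually holds many documents, a streaming API such as StAX may be a better fit for large files.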
We offer a simple Java-based parser that can be extended to convert the data to the format used by your system. The parser is provided "as is"; feel free to change it.