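Simple format of documents in test collections

Documents in ROMIP collections are kept in XML form. For each document the following is stored: an identifier (the URL in the case of Web collections), optionally the full original URL, and the content of the document.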
One XML file usually contains multiple documents, to reduce the number of files in the test collection. The content of each document is encoded in Base64 to preserve the original markup, etc. A sample document in the ROMIP format (an XML file) is shown below:

    <?xml version="1.0"?>
    <romip:dataset xmlns:romip="http://www.romip.ru/data/common">
      <collection>
        <collectionID>Name of data set</collectionID>
        <date>Date of creation (shows time when documents were modified for the last time)</date>
      </collection>
      <document>
        <docID>identifier (URL in case of Web collections)</docID>
        <docURL>full original URL for the document (optional tag)</docURL>
        <content encoding="base64">
          content in base64
        </content>
      </document>
      <document>
        ... next document ...
      </document>
      ...
    </romip:dataset>

Standard parser

We offer a simple Java-based parser that can be extended to convert the data to the format used by your system. The parser is provided "as is"; feel free to change it.
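For readers who want to see how the format can be read, here is a minimal sketch using only standard Java library classes. It is not the parser distributed by ROMIP. The names RomipSketch, Doc, parse and childText are hypothetical and were chosen for illustration. The sketch assumes the element names shown in the sample (document, docID, docURL, content) and returns the decoded content as raw bytes, because the character encoding of the original documents is not stated here.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch, not the official ROMIP parser.
class RomipSketch {

    /** One <document> entry: identifier, optional URL (null if absent), decoded bytes. */
    static final class Doc {
        final String id;
        final String url;
        final byte[] content;

        Doc(String id, String url, byte[] content) {
            this.id = id;
            this.url = url;
            this.content = content;
        }
    }

    /** Reads every <document> element from one ROMIP XML stream. */
    static List<Doc> parse(InputStream in) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document xml = builder.parse(in);
        NodeList nodes = xml.getElementsByTagName("document");
        List<Doc> docs = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            String id = childText(e, "docID");
            String url = childText(e, "docURL"); // optional tag, may be null
            String encoded = childText(e, "content");
            // The MIME decoder skips line breaks and other non-Base64 characters.
            byte[] content = Base64.getMimeDecoder().decode(encoded == null ? "" : encoded);
            docs.add(new Doc(id, url, content));
        }
        return docs;
    }

    /** Returns the trimmed text of the first descendant with this tag, or null. */
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java RomipSketch <file.xml>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            for (Doc d : parse(in)) {
                System.out.println(d.id + "\t" + d.content.length + " bytes");
            }
        }
    }
}
```

This sketch loads the whole XML file into memory through the DOM API, which keeps the code short. For very large collection files a streaming approach (for example javax.xml.stream) would probably be a better fit.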