Simple format of documents in test collections

Documents in ROMIP collections are stored in XML form. For each document the following is stored: an identifier (docID), optionally the full original URL (docURL), and the document content.
One XML file usually contains multiple documents, which reduces the number of files in the test collection. The content of each document is Base64-encoded so that the original markup, etc. is preserved. A sample document in the ROMIP format (an XML file) is shown below:
<?xml version="1.0"?>
<romip:dataset xmlns:romip="http://www.romip.ru/data/common">
<collection>
<collectionID>Name of data set</collectionID>
<date>Date of creation
(shows time when documents were modified for the last time)</date>
</collection>
<document>
<docID>identifier (URL in case of Web collections)</docID>
<docURL>full original URL for the document (optional tag)</docURL>
<content encoding="base64">
content in base64
</content>
</document>
<document>
... next document ...
</document>
...
</romip:dataset>
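To illustrate how a file in this layout can be read, here is a minimal Python sketch. It is not the ROMIP parser; it only assumes the element names shown in the sample above. The function name `iter_documents`, the collection name, the document identifier, and the URL are hypothetical and used for illustration only. The decoded content is returned as raw bytes, because the character encoding of the original documents is not stated here.

```python
import base64
import xml.etree.ElementTree as ET

# Build a tiny in-memory sample that follows the layout shown above.
# All values (collection name, date, docID, URL, content) are made up.
ENCODED = base64.b64encode(b"<html>hello</html>").decode("ascii")

SAMPLE = f"""<?xml version="1.0"?>
<romip:dataset xmlns:romip="http://www.romip.ru/data/common">
<collection>
<collectionID>demo</collectionID>
<date>2006-01-01</date>
</collection>
<document>
<docID>http://example.org/a.html</docID>
<docURL>http://example.org/a.html</docURL>
<content encoding="base64">
{ENCODED}
</content>
</document>
</romip:dataset>
"""


def iter_documents(xml_text):
    """Yield (doc_id, doc_url, raw_bytes) for every <document> element."""
    root = ET.fromstring(xml_text)
    # Only the root element carries the romip: prefix in the sample,
    # so the child elements are looked up without a namespace.
    for doc in root.iter("document"):
        doc_id = (doc.findtext("docID") or "").strip()
        doc_url = doc.findtext("docURL")  # optional tag, may be absent
        if doc_url is not None:
            doc_url = doc_url.strip()
        payload = doc.findtext("content") or ""
        # Remove line breaks and indentation around the Base64 text.
        payload = "".join(payload.split())
        yield doc_id, doc_url, base64.b64decode(payload)


if __name__ == "__main__":
    for doc_id, doc_url, raw in iter_documents(SAMPLE):
        print(doc_id, doc_url, len(raw), "bytes")
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`, and for very large files a streaming approach such as `ET.iterparse` may be preferable, since the sketch above loads the whole file into memory.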
Standard parser

We offer a simple Java-based parser that can be extended to convert the data to the format used by your system. The parser is provided "as is"; feel free to change it.