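Format of documents in news collection

Documents in ROMIP collections are stored in XML. For each news item the following is stored: a document ID, the document URL, the title, the news agency name, a timestamp (date and time of day), and the content.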
One XML file usually contains multiple documents, which reduces the number of files in the collection. The content and title of the source document are stored in Base64. A sample document in the ROMIP format (XML file) is shown below:
<?xml version="1.1"?>
<romip:dataset xmlns:romip="http://www.romip.ru/data/common" collectionId="ROMIP-2006-News">
<header>
<version>1.1</version>
<license type="yandex" uri="http://romip.ru/license/yandex.html"/>
<collection-description>
This is ROMIP news collection....
</collection-description>
</header>
<document>
<docID>040404-27793</docID>
<docURL>document URL (base64)</docURL>
<subject encoding="base64">title of news (base64)</subject>
<agency>news agency name (base64)</agency>
<timestamp>
<date>20040402</date>
<daytime>50493</daytime>
</timestamp>
<content encoding="base64">
content (base64)
</content>
</document>
<document>
... next document ...
</document>
...
</romip:dataset>
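
As a rough illustration of how a file in this format might be read, the following Python sketch walks over the `<document>` elements and decodes the Base64 fields. It is only a sketch under stated assumptions: the names `decode_b64` and `read_documents` and the `text_encoding` parameter are hypothetical and not part of ROMIP. The character encoding of the decoded text is not stated here, so UTF-8 is just an assumed default. The unit of `daytime` is not stated either, so the value is kept as a plain integer.

```python
import base64
import xml.etree.ElementTree as ET
from datetime import datetime


def decode_b64(text, text_encoding="utf-8"):
    """Decode a Base64 field into a string.

    Whitespace and line breaks around or inside the payload are dropped first.
    text_encoding is an assumption; the real collection may use another charset.
    """
    if text is None:
        return ""
    raw = base64.b64decode("".join(text.split()))
    return raw.decode(text_encoding, errors="replace")


def read_documents(xml_text, text_encoding="utf-8"):
    """Yield one dict per <document> element of a ROMIP-style dataset.

    Only the root element carries the romip: prefix in the sample, so the
    child elements are looked up without a namespace.
    """
    root = ET.fromstring(xml_text)
    for doc in root.iter("document"):
        date_text = doc.findtext("timestamp/date", "").strip()
        daytime_text = doc.findtext("timestamp/daytime", "").strip()
        yield {
            "doc_id": doc.findtext("docID", "").strip(),
            "url": decode_b64(doc.findtext("docURL"), text_encoding),
            "subject": decode_b64(doc.findtext("subject"), text_encoding),
            "agency": decode_b64(doc.findtext("agency"), text_encoding),
            # The sample date 20040402 is read as YYYYMMDD.
            "date": datetime.strptime(date_text, "%Y%m%d").date() if date_text else None,
            # Unit not stated in the format description; kept as a raw integer.
            "daytime": int(daytime_text) if daytime_text else None,
            "content": decode_b64(doc.findtext("content"), text_encoding),
        }
```

Calling `list(read_documents(xml_text))` on the text of one collection file would give one dictionary per news item. For large files, an incremental parser such as `xml.etree.ElementTree.iterparse` may be a better fit than loading the whole file at once.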