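Format of documents in the news collection

Documents in ROMIP collections are kept in XML form. For each news item the following is stored: the document ID, the document URL, the title, the news agency name, the timestamp (date and daytime), and the content.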
One XML file usually contains multiple documents, to reduce the number of files in the collection. The content and title of the source document are stored in BASE64; the sample below marks the document URL and the news agency name as base64 as well. A sample document in the ROMIP format is shown below (XML file):

    <?xml version="1.1"?>
    <romip:dataset xmlns:romip="http://www.romip.ru/data/common" collectionId="ROMIP-2006-News">
      <header>
        <version>1.1</version>
        <license type="yandex" uri="http://romip.ru/license/yandex.html"/>
        <collection-description>
          This is ROMIP news collection....
        </collection-description>
      </header>
      <document>
        <docID>040404-27793</docID>
        <docURL>document URL (base64)</docURL>
        <subject encoding="base64">title of news (base64)</subject>
        <agency>news agency name (base64)</agency>
        <timestamp>
          <date>20040402</date>
          <daytime>50493</daytime>
        </timestamp>
        <content encoding="base64">content (base64)</content>
      </document>
      <document>
        ... next document ...
      </document>
      ...
    </romip:dataset>
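As an illustration of how a file in this layout might be read, here is a minimal Python sketch that walks the `document` elements shown in the sample above and decodes the base64 fields. It is not part of the ROMIP distribution. The names `parse_romip`, `decode_b64`, and `SAMPLE` are hypothetical, the field values in `SAMPLE` are made up, and the `text_encoding` parameter (UTF-8 by default) is an assumption, since the character encoding of the decoded bytes is not stated here. The inline sample omits the XML declaration and the license element for brevity.

```python
import base64
import xml.etree.ElementTree as ET


def decode_b64(element, text_encoding="utf-8"):
    """Decode the base64 text of an element; return '' if it is missing or empty."""
    if element is None or not element.text:
        return ""
    raw = base64.b64decode(element.text.strip())
    # Assumption: the decoded bytes are text in `text_encoding`.
    return raw.decode(text_encoding, errors="replace")


def parse_romip(xml_text, text_encoding="utf-8"):
    """Return one dict per <document> element found under the dataset root."""
    root = ET.fromstring(xml_text)
    docs = []
    # Only the root carries the romip: prefix; its children are unqualified.
    for doc in root.findall("document"):
        docs.append({
            "doc_id": (doc.findtext("docID") or "").strip(),
            "url": decode_b64(doc.find("docURL"), text_encoding),
            "title": decode_b64(doc.find("subject"), text_encoding),
            "agency": decode_b64(doc.find("agency"), text_encoding),
            "date": (doc.findtext("timestamp/date") or "").strip(),
            "daytime": (doc.findtext("timestamp/daytime") or "").strip(),
            "content": decode_b64(doc.find("content"), text_encoding),
        })
    return docs


def _b64(text):
    """Helper used only to build the made-up sample below."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Made-up field values, encoded on the fly so the sample stays readable.
SAMPLE = f"""<romip:dataset xmlns:romip="http://www.romip.ru/data/common" collectionId="ROMIP-2006-News">
  <header>
    <version>1.1</version>
  </header>
  <document>
    <docID>040404-27793</docID>
    <docURL>{_b64("http://example.com/news/1")}</docURL>
    <subject encoding="base64">{_b64("Example title")}</subject>
    <agency>{_b64("Example Agency")}</agency>
    <timestamp>
      <date>20040402</date>
      <daytime>50493</daytime>
    </timestamp>
    <content encoding="base64">{_b64("Example body text.")}</content>
  </document>
</romip:dataset>"""

docs = parse_romip(SAMPLE)
for d in docs:
    print(d["doc_id"], d["date"], d["title"])
```

For a real collection file, a different `text_encoding` may be needed if the decoded titles come out garbled.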