Format of documents in news collection
RIRES: Russian Information Retrieval Evaluation Seminar



Documents in ROMIP collections are stored in XML format.

For each news article, the following is stored:
  • identifier (string)
  • headline (title of the news article)
  • source:
    • news agency
    • URL of news article on the Web
  • time of publication
  • content (without any changes)
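
The field list above can be pictured as a simple record. The following is only a sketch: the class name `NewsDocument` and all attribute names are hypothetical and are not defined by the ROMIP format, and the values other than the identifier are placeholders.

```python
from dataclasses import dataclass


@dataclass
class NewsDocument:
    """One news article; names here are illustrative, not part of the format."""
    doc_id: str      # identifier (string)
    title: str       # headline of the news article
    agency: str      # source: news agency
    url: str         # source: URL of the news article on the Web
    published: str   # time of publication, kept as raw text in this sketch
    content: bytes   # content, stored without any changes


# Placeholder values; only the identifier is taken from the sample document.
example = NewsDocument(
    doc_id="040404-27793",
    title="(title)",
    agency="(agency)",
    url="(url)",
    published="20040402 50493",
    content=b"(raw content)",
)
```

Keeping `content` as bytes reflects the "without any changes" rule: the record does not commit to a character set until the caller knows which one applies.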

A single XML file usually contains multiple documents, which reduces the number of files in the collection.

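The content and title of the source document are stored Base64-encoded. In the sample below, the document URL and the agency name are also shown as Base64 values.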
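
As a rough illustration of handling a Base64 field, the sketch below decodes one value with Python's standard `base64` module. The helper name `decode_field` is hypothetical, and the UTF-8 charset is an assumption: this page does not say which character set the decoded bytes use.

```python
import base64


def decode_field(encoded: str, charset: str = "utf-8") -> str:
    """Decode a Base64 field to text. The charset is an assumption."""
    # With the default validate=False, characters outside the Base64
    # alphabet (such as line breaks and indentation) are discarded.
    raw = base64.b64decode(encoded)
    return raw.decode(charset)


# Round trip with a made-up headline to show the idea.
title = "Sample headline"
encoded = base64.b64encode(title.encode("utf-8")).decode("ascii")
decoded = decode_field("\n  " + encoded + "\n")
```

If the decoded bytes turn out to use a different character set, only the `charset` argument needs to change.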

A sample document in the ROMIP format is below (XML file):

<?xml version="1.1"?>
<romip:dataset xmlns:romip="http://www.romip.ru/data/common" collectionId="ROMIP-2006-News">

<header>
 <version>1.1</version>
 <license type="yandex" uri="http://romip.ru/license/yandex.html"/>
 <collection-description>
      This is ROMIP news collection....
 </collection-description>
</header>

<document>
  <docID>040404-27793</docID>
  <docURL>document URL (base64)</docURL>
  <subject encoding="base64"> title of news (base64)</subject>
  <agency>news agency name (base64)</agency>
  <timestamp>
     <date>20040402</date>
     <daytime>50493</daytime>
  </timestamp>
  <content encoding="base64"> 
      content (base64)
  </content>
</document>

<document>
  ... next document ...
</document>
...

</romip:dataset>
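
A file in this layout could be read roughly as follows. This is a minimal sketch under several assumptions, none of which this page confirms: decoded text is UTF-8, `<daytime>` is the number of seconds since midnight, and `<docURL>` and `<agency>` hold Base64 values as the sample suggests. The function names and the inline test data (URL, headline, agency, body) are made up, and the inline sample omits the XML declaration and the `<header>` element for brevity.

```python
import base64
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta


def b64(text):
    """Encode made-up sample text as Base64 (UTF-8 assumed)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def unb64(node, charset="utf-8"):
    """Decode the Base64 text of an element; whitespace is ignored."""
    return base64.b64decode(node.text or "").decode(charset)


def parse_documents(xml_text):
    """Return one dict per <document> element in a dataset string."""
    root = ET.fromstring(xml_text)
    docs = []
    # Only the root carries the romip: prefix, so children have no namespace.
    for doc in root.iter("document"):
        day = datetime.strptime(doc.findtext("timestamp/date").strip(), "%Y%m%d")
        # Assumption: <daytime> counts seconds since midnight.
        seconds = int(doc.findtext("timestamp/daytime"))
        docs.append({
            "id": doc.findtext("docID").strip(),
            "url": unb64(doc.find("docURL")),
            "title": unb64(doc.find("subject")),
            "agency": unb64(doc.find("agency")),
            "published": day + timedelta(seconds=seconds),
            "content": unb64(doc.find("content")),
        })
    return docs


xml_text = """<romip:dataset xmlns:romip="http://www.romip.ru/data/common" collectionId="ROMIP-2006-News">
<document>
  <docID>040404-27793</docID>
  <docURL>{url}</docURL>
  <subject encoding="base64">{subject}</subject>
  <agency>{agency}</agency>
  <timestamp>
     <date>20040402</date>
     <daytime>50493</daytime>
  </timestamp>
  <content encoding="base64">
      {content}
  </content>
</document>
</romip:dataset>""".format(
    url=b64("http://example.com/news/1"),
    subject=b64("Sample headline"),
    agency=b64("Sample agency"),
    content=b64("Sample body text."),
)

docs = parse_documents(xml_text)
```

Under the seconds-since-midnight assumption, the sample values `20040402` and `50493` correspond to 2 April 2004, 14:01:33. For large files, a streaming parser such as `ET.iterparse` may be a better fit than loading the whole dataset at once.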