Format of documents in news collection
RIRES: Russian Information Retrieval Evaluation Seminar



Documents in ROMIP collections are stored in XML format.

For each news article, the following is stored:
  • identifier (string)
  • header (of news article)
  • source:
    • news agency
    • URL of news article on the Web
  • time of publication
  • content (without any changes)

A single XML file usually contains multiple documents, which reduces the number of files in the collection.

The content and title of the source document are stored in Base64 encoding.
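
As an illustration of reading such a field, here is a minimal Python sketch that turns a Base64 payload back into text. The function name decode_field and its charset parameter are hypothetical. The character encoding of the underlying bytes is not stated in this description, so UTF-8 is used here only as an assumption and should be replaced with whatever the collection actually uses.

```python
import base64


def decode_field(encoded: str, charset: str = "utf-8") -> str:
    """Decode one Base64-encoded text field into a Python string.

    `charset` is an assumption: the format description does not say
    which character encoding the decoded bytes use.
    """
    # Strip the whitespace that may surround the payload inside an element.
    raw = base64.b64decode(encoded.strip())
    return raw.decode(charset)


# Round trip with a made-up title, only to show the call.
sample = base64.b64encode("Новости дня".encode("utf-8")).decode("ascii")
print(decode_field(sample))
```

Keeping the charset as a parameter means the same helper can be reused if the title and the content turn out to use a different encoding than assumed.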

A sample document in the ROMIP format (XML file) is shown below:

<?xml version="1.1"?>
<romip:dataset xmlns:romip="" collectionId="ROMIP-2006-News">

 <license type="yandex" uri=""/>
      This is ROMIP news collection....

  <docURL> document URL (base64)</docURL>
  <subject encoding="base64"> title of news (base64)</subject>
  <agency>news agency name (base64)</agency>
  <content encoding="base64"> 
      content (base64)

  ... next document ...
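
The following Python sketch shows one way the fields named in the sample could be read from a single record. It rests on several assumptions that are not confirmed by this description: the record is wrapped in a hypothetical document element, the fragment is well-formed, all four fields are Base64-encoded as the sample's placeholders suggest, and the decoded bytes are UTF-8. The function name read_record and the test values are made up for illustration.

```python
import base64
import xml.etree.ElementTree as ET


def _b64(text: str) -> str:
    """Helper for building the illustrative fragment below."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Hypothetical, well-formed stand-in for one record. The <document>
# wrapper and all values are assumptions, not part of the format text.
FRAGMENT = f"""<document>
  <docURL>{_b64("http://example.com/news/1")}</docURL>
  <subject encoding="base64">{_b64("Sample title")}</subject>
  <agency>{_b64("Sample agency")}</agency>
  <content encoding="base64">
      {_b64("Sample body text")}
  </content>
</document>"""

# Element names taken from the sample document.
FIELDS = ("docURL", "subject", "agency", "content")


def read_record(xml_text: str, charset: str = "utf-8") -> dict:
    """Return the decoded text of each known field found in one record."""
    root = ET.fromstring(xml_text)
    record = {}
    for name in FIELDS:
        element = root.find(name)
        if element is not None and element.text:
            # Whitespace around the payload is stripped before decoding.
            raw = base64.b64decode(element.text.strip())
            record[name] = raw.decode(charset)
    return record


print(read_record(FRAGMENT))
```

A real collection file holds many records inside the romip:dataset root, so a full reader would iterate over the records rather than parse one fragment at a time.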