This collection was created and provided by Kodeks in 2007.
It consists of documents from the
legislation of the Russian Federation, Moscow and St. Petersburg as of
the second week of December 2006. The collection contains HTML
documents and, unlike the Web collections, is much more uniform.
Features:
The title of each document is placed in the title field of the document content.
Document formatting relies on styles, which are not included in the collection.
Hx tags are not used in the document text. (To detect headers,
analyze the P tags
whose class attribute has the value "headertext".)
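The header rule above can be sketched with Python's standard html.parser module. This is a minimal illustration, not part of the collection's tooling: the names HeaderExtractor and extract_headers are hypothetical, and the sketch assumes the document has already been decoded to a Python string.

```python
from html.parser import HTMLParser


class HeaderExtractor(HTMLParser):
    """Collect the text of <p class="headertext"> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.headers = []    # finished header strings
        self._buffer = None  # text chunks of the header being read, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <P> and <p> both arrive as "p".
        if tag != "p":
            return
        self._flush()  # a new <p> implicitly closes an unclosed one
        classes = (dict(attrs).get("class") or "").split()
        if "headertext" in classes:
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()

    def _flush(self):
        # Store the pending header, if any, with whitespace normalized.
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.headers.append(text)
            self._buffer = None


def extract_headers(html):
    """Return the header texts of one document, in document order."""
    parser = HeaderExtractor()
    parser.feed(html)
    parser.close()
    return parser.headers
```

Text inside nested inline tags is kept as part of the header, and paragraphs with any other class value are ignored.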
Dataset Parameters
Size of HTML data (bz2 archives): 1.6 GB
Number of pages: 300 000
Encoding: cp1251
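Given the bz2 compression and the cp1251 encoding listed above, reading the data in Python might look like the sketch below. It assumes each archive is a plain bz2 stream of cp1251 text; the internal layout of the archives is not described here, and the function names decode_document and read_archive are hypothetical.

```python
import bz2


def decode_document(compressed: bytes) -> str:
    """Decompress a bz2 byte string and decode it as cp1251 text."""
    raw = bz2.decompress(compressed)
    return raw.decode("cp1251")


def read_archive(path: str) -> str:
    """Read a whole bz2 file as cp1251 text (assumes one stream per file)."""
    with bz2.open(path, mode="rt", encoding="cp1251") as handle:
        return handle.read()
```

For archives too large to hold in memory, iterating over the handle returned by bz2.open line by line is likely a better fit than read().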
Rights to Use
Usage rights are granted to ROMIP by Kodeks, the owner of the
collection. To get access to the collection, you must sign the usage
agreement (in Russian).
Data Format
The collection is distributed as XML files of a specific format.