Open a WARC file with Python

I am trying to open a WARC file in Python using the toolkit at the following link: http://warc.readthedocs.org/en/latest/

When opening a file:

import warc
f = warc.open("00.warc.gz")

Everything works, and the object f is:

<warc.warc.WARCFile instance at 0x1151d34d0>

However, when I try to read everything in the file using:

for record in f:
    print record['WARC-Target-URI'], record['Content-Length']

The following error appears:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xxx/anaconda/lib/python2.7/site-packages/warc/warc.py", line 390, in __iter__
    record = self.read_record()
  File "/Users/xxx/anaconda/lib/python2.7/site-packages/warc/warc.py", line 373, in read_record
    header = self.read_header(fileobj)
  File "/Users/xxx/anaconda/lib/python2.7/site-packages/warc/warc.py", line 331, in read_header
    raise IOError("Bad version line: %r" % version_line)
IOError: Bad version line: 'WARC/0.18\n'

Is this because the version of my WARC file is not supported by the warc tool I'm using, or is it something else?



2 answers


The ClueWeb09 dataset is distributed in WARC 0.18 format. However, it has several problems: some records are malformed.

The most common problem is an extra newline in the WARC header. There are also a few cases of other malformed headers.

Also, it doesn't use the standard \r\n end-of-line markers, which is actually your problem.
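To illustrate the line-ending point, here is a minimal sketch of a strict and a lenient version-line check. It is not the warc library's actual code: the names STRICT_VERSION, LENIENT_VERSION, and parse_version are hypothetical, and the patterns are assumptions chosen to reproduce the behaviour seen in the traceback.

```python
import re

# Strict pattern: the version line must end with CRLF (\r\n).
STRICT_VERSION = re.compile(r'^WARC/(\d+\.\d+)\r\n$')
# Lenient pattern (an assumption for illustration): a bare LF is accepted too.
LENIENT_VERSION = re.compile(r'^WARC/(\d+\.\d+)\r?\n$')

def parse_version(line, lenient=False):
    """Return the version string, or raise IOError for a bad version line."""
    pattern = LENIENT_VERSION if lenient else STRICT_VERSION
    match = pattern.match(line)
    if match is None:
        raise IOError("Bad version line: %r" % line)
    return match.group(1)
```

With the strict check, parse_version('WARC/0.18\n') raises the same kind of IOError as in the question, while parse_version('WARC/0.18\n', lenient=True) returns '0.18'. So the version number itself is not the problem; the missing \r is.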



The warc-clueweb library can handle this. It is a Python library written specifically for working with ClueWeb09 WARC files. According to its documentation:

Only minor changes were made to the original library. The original warc library's documentation still applies.



Yes, thanks to @eyelash for explaining this issue.

Indeed, some of the records in ClueWeb09 are malformed. However, both the official warc library and the warc-clueweb fork recommended above have some problems.



The fork is unable to process the ClueWeb12 dataset, and it can also skip 1-2 documents in each .warc.gz file.

So I changed a small amount of code to support both the ClueWeb09 and ClueWeb12 datasets. Here is my repo, which has been tested on 100 billion pages. My warc tools are forked and modified from the warc-clueweb library and the official repo.
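One rough way to check whether a reader skips records in a given file is to count the version lines directly in the decompressed stream and compare that with the number of records the reader yields. The sketch below assumes every record starts with a line that is exactly WARC/0.18; count_version_lines is a hypothetical helper, not part of any of the libraries mentioned here.

```python
import gzip

def count_version_lines(path, version=b"WARC/0.18"):
    """Count lines in a gzip-compressed file that consist only of the version string."""
    count = 0
    with gzip.open(path, "rb") as stream:
        for line in stream:
            # Accept both \r\n and bare \n line endings.
            if line.rstrip(b"\r\n") == version:
                count += 1
    return count
```

Comparing count_version_lines("00.warc.gz") with the number of records obtained by iterating over warc.open("00.warc.gz") gives a quick sanity check. It is only an estimate, since a payload line could in principle consist of the same string.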
