Store large text corpus in Python

I am trying to build a large text corpus from a Wikipedia dump. I represent each article as a document object (sketched below) consisting of:

  • source text: string
  • preprocessed text: a list of tuples, where each tuple contains a stemmed word and its position in the original text
  • additional information such as title and author
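
For concreteness, here is a minimal sketch of what I mean by a document object (the class and field names are just my own illustration):

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class Document:
        doc_id: str        # identifier, e.g. the Wikipedia page id
        source_text: str   # original article text
        preprocessed: List[Tuple[str, int]] = field(default_factory=list)  # (stem, position) pairs
        title: str = ""
        author: str = ""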

I am looking for an efficient way to save these objects to disk. The following operations need to be supported:

  • adding a new document
  • accessing a document by its identifier
  • iteration over all documents

There is no need to delete a document once it has been added.
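
So the store only has to provide something like the following interface (a rough sketch with hypothetical names, just to pin down the requirements; Document is the class sketched above):

    from typing import Iterator

    class CorpusStore:
        """Append-only store for Document objects."""

        def add(self, doc: Document) -> None:
            """Persist a new document."""
            raise NotImplementedError

        def get(self, doc_id: str) -> Document:
            """Load a single document by its identifier."""
            raise NotImplementedError

        def __iter__(self) -> Iterator[Document]:
            """Iterate over all stored documents."""
            raise NotImplementedError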

I could imagine the following methods:

  • Serializing each article into a separate file, for example with pickle (see the sketch after this list): the downside here is probably the large number of calls to the operating system.
  • Storing all documents in one XML file, or blocks of documents in several XML files: saving the list that represents a preprocessed document in XML adds a lot of overhead, and I suspect reading the list back from XML is rather slow.
  • Using an existing package for storing a corpus: I found the Corpora package, which seems very fast and efficient, but it only supports storing lines and a header with metadata. Simply putting the preprocessed text into the header makes it incredibly slow.
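
To make the first option concrete, a naive pickle-per-document layout might look roughly like this (directory layout and function names are made up for the example; Document is the class sketched above):

    import os
    import pickle

    def save_document(doc, corpus_dir="corpus"):
        """Write one document to its own pickle file."""
        os.makedirs(corpus_dir, exist_ok=True)
        path = os.path.join(corpus_dir, f"{doc.doc_id}.pkl")
        with open(path, "wb") as f:
            pickle.dump(doc, f, protocol=pickle.HIGHEST_PROTOCOL)

    def load_document(doc_id, corpus_dir="corpus"):
        """Access a single document by identifier."""
        with open(os.path.join(corpus_dir, f"{doc_id}.pkl"), "rb") as f:
            return pickle.load(f)

    def iter_documents(corpus_dir="corpus"):
        """Iterate over the whole corpus; every document means another open/read/close."""
        for name in sorted(os.listdir(corpus_dir)):
            if name.endswith(".pkl"):
                with open(os.path.join(corpus_dir, name), "rb") as f:
                    yield pickle.load(f)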

What would be a good way to do this? Is there perhaps a package for this purpose that I haven't found yet?
