Store large text corpus in Python

I am trying to build a large text corpus from a Wikipedia dump. I represent each article as a document object (sketched below) consisting of:

  • source text: string
  • preprocessed text: a list of tuples, where each tuple contains a stemmed word and its position in the original text
  • additional information such as title and author
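
For concreteness, here is a minimal sketch of what I mean by a document object (the class and field names are just my own illustration):

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class Document:
        doc_id: str        # identifier, e.g. the Wikipedia page id
        source_text: str   # original article text
        preprocessed: List[Tuple[str, int]] = field(default_factory=list)  # (stem, position) pairs
        title: str = ""
        author: str = ""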

I am looking for an efficient way to save these objects to disk. The following operations need to be supported:

  • adding a new document
  • accessing a document by its identifier
  • iteration over all documents

There is no need to delete a document once it has been added.
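
So the store only has to provide something like the following interface (a rough sketch with hypothetical names, just to pin down the requirements; Document is the class sketched above):

    from typing import Iterator

    class CorpusStore:
        """Append-only store for Document objects."""

        def add(self, doc: Document) -> None:
            """Persist a new document."""
            raise NotImplementedError

        def get(self, doc_id: str) -> Document:
            """Load a single document by its identifier."""
            raise NotImplementedError

        def __iter__(self) -> Iterator[Document]:
            """Iterate over all stored documents."""
            raise NotImplementedError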

I could imagine the following methods:

  • Serializing each article into a separate file, for example with pickle (see the sketch after this list): the downside here is probably the large number of calls to the operating system.
  • Storing all documents in one XML file, or blocks of documents in several XML files: saving the list that represents a preprocessed document in XML adds a lot of overhead, and I suspect reading the list back from XML is rather slow.
  • Using an existing package for storing a corpus: I found the Corpora package, which seems very fast and efficient, but it only supports storing lines and a header with metadata. Simply putting the preprocessed text into the header makes it incredibly slow.
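
To make the first option concrete, a naive pickle-per-document layout might look roughly like this (directory layout and function names are made up for the example; Document is the class sketched above):

    import os
    import pickle

    def save_document(doc, corpus_dir="corpus"):
        """Write one document to its own pickle file."""
        os.makedirs(corpus_dir, exist_ok=True)
        path = os.path.join(corpus_dir, f"{doc.doc_id}.pkl")
        with open(path, "wb") as f:
            pickle.dump(doc, f, protocol=pickle.HIGHEST_PROTOCOL)

    def load_document(doc_id, corpus_dir="corpus"):
        """Access a single document by identifier."""
        with open(os.path.join(corpus_dir, f"{doc_id}.pkl"), "rb") as f:
            return pickle.load(f)

    def iter_documents(corpus_dir="corpus"):
        """Iterate over the whole corpus; every document means another open/read/close."""
        for name in sorted(os.listdir(corpus_dir)):
            if name.endswith(".pkl"):
                with open(os.path.join(corpus_dir, name), "rb") as f:
                    yield pickle.load(f)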

What would be a good way to do this? Is there perhaps a package for this purpose that I haven't found yet?
