How best to distribute Python packages with _large_ data dependencies

I am working on a new Python package that depends on many fairly large (> 20MB each) data files. In particular, the library expects the data files to be in a data/ directory at runtime.

I currently place them in the "data" directory as part of the distribution package and have configured a script to install these files on the user's system via python install. This works for now, but it looks like it will prevent me from uploading the distribution to PyPI, given that the tarball will probably exceed a few hundred MB.

Alternatively, I would like to host the files on a remote site, to be kind to PyPI, and have them downloaded and installed automatically. Is this possible using existing Python distribution methods? If so, could you please describe how to do it or give an example? If it is not possible, what is the best way to work around the problem?

Any insight you could offer would be greatly appreciated.



1 answer

NLTK has a similar situation with the distribution of its corpora. On my Linux distribution the data comes in a separate package, so I did some research by installing it with setuptools on Windows.

If you try to use a corpus that you don't have installed, NLTK prompts you to run its downloader function. Internally, it uses LazyCorpusLoader as a stand-in for the corpus objects that need data, and loads the data only when it is actually needed.
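The stand-in idea can be sketched roughly as follows. This is not NLTK's actual code, just a minimal illustration of the pattern; the class name LazyResource and the loader callable are hypothetical.

```python
class LazyResource:
    """Stand-in object that defers loading data until it is first used.

    Hypothetical illustration of the lazy-loading pattern, not NLTK's
    LazyCorpusLoader.
    """

    def __init__(self, name, loader):
        self._name = name          # identifier of the data set
        self._loader = loader      # callable(name) -> the real data object
        self._obj = None
        self._loaded = False

    def _load(self):
        # Load once, on first use; turn a missing file into a helpful error.
        if not self._loaded:
            try:
                self._obj = self._loader(self._name)
            except FileNotFoundError as exc:
                raise LookupError(
                    f"Resource {self._name!r} not found; "
                    "please download the data files first."
                ) from exc
            self._loaded = True
        return self._obj

    def __getattr__(self, attr):
        # __getattr__ only runs for attributes not found the normal way.
        # Private names are never forwarded, which avoids recursion.
        if attr.startswith("_"):
            raise AttributeError(attr)
        return getattr(self._load(), attr)
```

Importing the package then stays cheap, and a user who never touches a given data set never needs to have it on disk.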

Much like sys.path, it searches several predefined paths so that users can put the data wherever they want. You can also modify that list of search paths to add your own data location.
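A similar search-path lookup for your own package might look something like the sketch below. Everything named here is an assumption for illustration: the package name mypkg, the MYPKG_DATA environment variable, the default directories, and the find and fetch helpers. The fetch helper shows one way the files could be pulled from a remote host on demand instead of being shipped in the tarball.

```python
import os
import shutil
import urllib.request
from pathlib import Path

# Hypothetical search path: directories from the MYPKG_DATA environment
# variable come first, followed by a few conventional locations.
data_path = [
    Path(p) for p in os.environ.get("MYPKG_DATA", "").split(os.pathsep) if p
] + [
    Path.home() / "mypkg_data",
    Path("/usr/share/mypkg_data"),
    Path("/usr/local/share/mypkg_data"),
]


def find(name, paths=None):
    """Return the first existing file called `name` on the search path."""
    for directory in (data_path if paths is None else paths):
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    raise LookupError(f"Data file {name!r} not found on the search path.")


def fetch(name, base_url, target_dir):
    """Download `name` from `base_url` into `target_dir`; return its path."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / name
    url = f"{base_url.rstrip('/')}/{name}"
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination
```

Users can then add their own location with data_path.append(...), and the package can call fetch only when find raises LookupError.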


