Can Conda be used as a "virtualenv" for Hadoop Streaming Job (in Python)?
We currently use Luigi, MRJob and other frameworks to run Hadoop streaming jobs in Python. We can already ship a virtualenv along with the job, so no specific Python dependencies need to be installed on the nodes (see the article). I was wondering if anyone has done something similar with the Anaconda / conda package manager.
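For readers unfamiliar with Hadoop Streaming, the Python side of such a job is just a script that reads stdin and writes tab-separated pairs to stdout, so any interpreter shipped with the job (from a virtualenv or a conda env) can run it. A minimal word-count mapper might look like this (the script layout is illustrative, not from the question):

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper: reads text lines from stdin and emits
# "word<TAB>1" pairs on stdout, the format the streaming framework expects.
import sys

def map_lines(lines):
    """Yield a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1

if __name__ == "__main__":
    for word, count in map_lines(sys.stdin):
        sys.stdout.write("%s\t%d\n" % (word, count))
```

The point of shipping an environment with the job is that the `#!/usr/bin/env python` (or an explicit path into the unpacked archive) resolves to your interpreter rather than whatever the node has installed.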
P.S. I also know about Conda Cluster, but it looks like a heavier / more complex solution (and it's behind a paywall).
I don't know of a way to package a conda environment into a tar/zip, deploy it on another box, and have it ready to use as in the example you mention; that may not be possible at all. At least not without Anaconda installed on all the worker nodes, and even then you may have issues moving environments between different OSes.
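For reference, the naive packaging alluded to above is just tarring the environment directory. Here is a sketch (paths and names are illustrative); as the answer notes, the resulting archive is generally not relocatable, because conda environments contain scripts with hard-coded absolute paths:

```python
import os
import tarfile

def pack_env(env_path, out_path):
    """Tar/gzip an environment directory so it can be shipped to another box.

    Caveat: a conda env packed this way usually breaks when unpacked at a
    different absolute path or on a different OS, which is why this naive
    approach falls short of a real deployment solution.
    """
    with tarfile.open(out_path, "w:gz") as tar:
        # Store the env under its own directory name inside the archive.
        tar.add(env_path, arcname=os.path.basename(env_path))
    return out_path
```

This is enough to move the bytes around (e.g. via Hadoop's distributed cache), but not enough to make the interpreter inside work reliably on an arbitrary node.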
Anaconda Cluster was built to address this issue (disclaimer: I am an Anaconda Cluster developer), but it takes a more sophisticated approach: basically, we use a configuration management system (Salt) to install Anaconda on all the cluster nodes and to manage the conda environments.
We use a configuration management system because we also deploy the Hadoop stack (Spark and friends) and need to target large clusters, but really, if you only need to deploy Anaconda and don't have many nodes, you should be able to do it easily with Fabric (which Anaconda Cluster also uses in some parts) and run it from a regular laptop.
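A hedged sketch of what the Fabric route could look like: `build_bootstrap_cmds` is a hypothetical helper (the installer URL, prefix and env name are illustrative, not taken from Anaconda Cluster); with Fabric you would execute each command on every host via `run()` inside a task.

```python
def build_bootstrap_cmds(prefix="$HOME/miniconda", env_name="jobenv"):
    """Return the shell commands a Fabric task would run on each node to
    install Miniconda and create a conda environment for the jobs.
    Everything here is an assumption for illustration purposes."""
    installer = "Miniconda-latest-Linux-x86_64.sh"
    return [
        # Fetch and run the Miniconda installer non-interactively.
        "wget -q https://repo.continuum.io/miniconda/%s -O %s" % (installer, installer),
        "bash %s -b -p %s" % (installer, prefix),
        # Create the environment the streaming jobs will use.
        "%s/bin/conda create -y -n %s python numpy" % (prefix, env_name),
    ]

# With Fabric you would wrap this in a task that calls run(cmd) for each
# command, and `fab` would apply it across the hosts listed in env.hosts.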
If you are interested, the Anaconda Cluster docs are here: http://continuumio.github.io/conda-cluster/