Statistical Calculation Workflow Suggestions

Question


Note: I decided to ask this here rather than on stats.stackexchange.com because it is about software workflow tools, not any specific statistical method. I felt that people more intimately familiar with the actual software packages would be able to help more. I am specifically trying to avoid the generic answer I get from scientists, which is that you should always use R or Matlab and then rework things when it comes to big data.

I am about to start a large project that involves a lot of data mining (mainly via SQL), a lot of quick-and-dirty basic statistics (general linear models, covariance estimation, etc.), a fair amount of more advanced techniques (Bayesian methods, advanced samplers, nonparametric methods), a strong need to scale computations across multiple processors, and the need to produce good plots.
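As a rough illustration of the "SQL plus quick-and-dirty statistics" part of this workload, here is a minimal sketch using only Python's standard library. The `samples` table, its columns `x` and `y`, and the helper names are all hypothetical and made up for the example; a real project would likely use NumPy or a dedicated statistics package instead of hand-written sums.

```python
import sqlite3


def fetch_xy(conn):
    """Pull two numeric columns from the (hypothetical) samples table."""
    rows = conn.execute("SELECT x, y FROM samples ORDER BY id").fetchall()
    xs = [row[0] for row in rows]
    ys = [row[1] for row in rows]
    return xs, ys


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Unbiased sample covariance (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    slope = covariance(xs, ys) / covariance(xs, xs)
    intercept = mean(ys) - slope * mean(xs)
    return slope, intercept


# Toy in-memory database standing in for the real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO samples (x, y) VALUES (?, ?)",
    [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
)

xs, ys = fetch_xy(conn)
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The pure-Python loops in `covariance` are exactly the kind of code that becomes slow on large datasets, which is the part I would want to move to something faster.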

I am currently pretty good with Python and the related scientific tools (NumPy, scikits, matplotlib, and even PyCUDA and MPI for parallel work), though I have never used SQL before. However, I often find that the methods I need are relatively slow in Python and don't scale well as datasets get large. I know only a little C / C++ and nothing at all about Boost.Python or Cython.

I know a lot of statisticians use R, but I've also heard that R is just a tiny step away from something like Matlab, which tends to be slow and is burdened with weirdly defined built-in functions.

My question is: what is a good workflow / toolkit for doing this kind of statistical work? What tools should I consider when I want to take some Python code I wrote and make it faster or better, either by moving it to another language or by wrapping C++ libraries for use from Python? Is Boost.Python something that would let me implement advanced math algorithms in C++ and then use them from Python? Is this worth considering for statistical work, or is Boost.Python too thin on statistical functionality?

I've also seen rpy2, which lets you access pretty much all of R from Python. Is it fast enough to use on big data?

Any other suggestions for statistical workflow would be great!


python r statistics boost-python

ely 17 Mar 12 at 0:09
