Recommend a language or tool for managing large datasets
I have a large dataset (1 GB of compressed plain text).
I am currently rewriting the dataset based on information in the data itself, for example:
- Turn 2009-10-16 into "Friday"
- Count how many times an event happened and how long it lasted
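For illustration only, here is a minimal Python sketch of the two transformations described above, assuming one ISO date per line (the input format and function names are my own invention, not from the question):

```python
import sys
from collections import Counter
from datetime import datetime

def weekday_of(date_str):
    """Map an ISO date string to its day name, e.g. '2009-10-16' -> 'Friday'."""
    return datetime.strptime(date_str, "%Y-%m-%d").strftime("%A")

def tally_weekdays(lines):
    """Count how many events fall on each weekday."""
    return Counter(weekday_of(line.strip()) for line in lines if line.strip())

if __name__ == "__main__":
    # Stream stdin so a 1 GB file is never held in memory at once.
    for day, n in tally_weekdays(sys.stdin).items():
        print(day, n)
```

Measuring "how long they lasted" would work the same way, accumulating time deltas per key instead of counts.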
I am currently doing all of this in Java. I am wondering if anyone knows of a tool or language that was actually designed for this kind of work. It is possible in Java, but I end up writing a lot of boilerplate code.
I use Python for this kind of thing at work all the time. The scripts are quick to write, since Python is dead easy to learn and has excellent documentation for the core language and its libraries. Python combined with the command line makes my job easy.
In your case, since everything is in a single file, I'd write a script and just do:
zcat big_file.dat.gz | my_script.py
Or, you can use the Python libraries to handle compressed files if you don't like working on the command line.
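As a sketch of that library route, Python's standard gzip module can stream the compressed file line by line (the file name here is just the one from the pipeline above):

```python
import gzip

def read_gzip_lines(path):
    """Stream lines from a gzipped text file without decompressing it to disk."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:          # reads incrementally; the whole file never sits in memory
            yield line.rstrip("\n")

# Example usage:
# for line in read_gzip_lines("big_file.dat.gz"):
#     ...  # your per-line rewrite goes here
```

This keeps memory use flat regardless of how large the dataset is, which matters at 1 GB compressed.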
As others have also mentioned, Perl is just as good and will do the trick.
I would suggest using AWK. The first line of the Wikipedia entry says it all.
AWK is a programming language designed to process textual data, either in files or data streams.