Recommend a language or tool for managing large datasets

I have a large dataset (1 GB of compressed plain text).

I am currently rewriting the dataset based on information contained in the data itself, for example:

  • Turning a date such as 2009-10-16 into the day of the week (Friday)
  • Counting how many times an event happened and how long each occurrence lasted

I am currently doing all this in Java. I am wondering if anyone knows of a tool or language that was actually designed for this type of work. It is possible in Java, but I end up writing a lot of boilerplate code.



7 replies


I ended up using Scala for this. I find it powerful enough for the job, and I can easily integrate it with my Java code.





Perl is the answer. It was created for processing text data.





A detailed discussion of manipulating large datasets of string data can be found here. It covers additional languages and their specific benefits, as well as Unix/Linux shell scripts as an alternative.





I use Python for this type of thing at work all the time. The scripts are straightforward to write, since Python is easy to learn and has excellent documentation for the core language and its libraries. Python, combined with the command line, makes this kind of job easy.

In your case, with everything in a single file, I would write a script and just run:

zcat big_file.dat.gz | my_script.py
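
For illustration, my_script.py could be a small stdin filter that does the kind of rewriting described in the question. This is only a sketch; it assumes each record is one line beginning with an ISO date, which the question does not actually confirm.

#!/usr/bin/env python
# Hypothetical sketch of my_script.py: reads the uncompressed lines fed in
# by zcat and rewrites a leading ISO date into the day of the week.
# Assumes records look like "2009-10-16 <rest of record>".
import sys
from datetime import datetime

for line in sys.stdin:
    date_str, _, rest = line.rstrip("\n").partition(" ")
    try:
        day = datetime.strptime(date_str, "%Y-%m-%d").strftime("%A")
    except ValueError:
        print(line, end="")  # not a dated record; pass it through unchanged
        continue
    print(day, rest)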

Or you can use Python's standard libraries to handle compressed files directly, if you prefer not to work on the command line.
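
For instance, the gzip module from the standard library can read the file without zcat (again just a sketch, reusing the hypothetical file name from above):

import gzip

# Open the compressed file directly; mode "rt" yields decoded text lines.
with gzip.open("big_file.dat.gz", "rt") as f:
    for line in f:
        print(line, end="")  # stand-in for the per-line rewriting above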

As others have also mentioned, Perl is just as good; either will do the trick.



Depending on how the data is structured, maybe the focus should not be on the language but on storage: is this something you could feed into a database and let the database do the hard work?
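
A minimal sketch of that idea, using Python's built-in sqlite3 module. The file name, schema, and record layout (a day plus a duration per line) are all assumptions made up for illustration:

import sqlite3

conn = sqlite3.connect("events.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (day TEXT, duration REAL)")

# Load the decompressed records once; assumes lines like "2009-10-16 42.5".
with open("big_file.dat") as f:
    rows = (line.split() for line in f)
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
conn.commit()

# Let the database do the aggregation: occurrences and total duration per day.
for day, count, total in conn.execute(
        "SELECT day, COUNT(*), SUM(duration) FROM events GROUP BY day"):
    print(day, count, total)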



I would suggest using AWK. The first line of the Wikipedia entry says it all:

AWK is a programming language designed to process textual data, either in files or data streams.







