Recommend a language or tool for managing large datasets

I have a large dataset (1 GB of compressed plain text).

I am currently rewriting the dataset based on information contained in the data itself, for example:

  • Turning a date such as 2009-10-16 into the day of the week (Friday)
  • Counting how many times an event happened and how long each occurrence lasted

I am currently doing all this in Java. I am wondering if anyone knows of a tool or language that was actually designed for this type of work. It is possible in Java, but I end up writing a lot of boilerplate code.



7 replies


I ended up using Scala for this. I find it powerful enough for the job, and I can easily integrate it with my Java code.





Perl is the answer. It was created for processing text data.





A detailed discussion of manipulating large datasets of string data can be found here. It covers additional languages and their specific benefits, as well as Unix/Linux shell scripts as an alternative.





I use Python for this type of thing at work all the time. The scripts are straightforward to write, since Python is easy to learn and has excellent documentation for the core language and its libraries. Python, combined with the command line, makes this kind of job easy.

In your case, with everything in a single file, I would write a script and just run:

zcat big_file.dat.gz | my_script.py
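
For illustration, my_script.py could be a small stdin filter that does the kind of rewriting described in the question. This is only a sketch; it assumes each record is one line beginning with an ISO date, which the question does not actually confirm.

#!/usr/bin/env python
# Hypothetical sketch of my_script.py: reads the uncompressed lines fed in
# by zcat and rewrites a leading ISO date into the day of the week.
# Assumes records look like "2009-10-16 <rest of record>".
import sys
from datetime import datetime

for line in sys.stdin:
    date_str, _, rest = line.rstrip("\n").partition(" ")
    try:
        day = datetime.strptime(date_str, "%Y-%m-%d").strftime("%A")
    except ValueError:
        print(line, end="")  # not a dated record; pass it through unchanged
        continue
    print(day, rest)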

Or you can use Python's standard libraries to handle compressed files directly, if you prefer not to work on the command line.
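
For instance, the gzip module from the standard library can read the file without zcat (again just a sketch, reusing the hypothetical file name from above):

import gzip

# Open the compressed file directly; mode "rt" yields decoded text lines.
with gzip.open("big_file.dat.gz", "rt") as f:
    for line in f:
        print(line, end="")  # stand-in for the per-line rewriting above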

As others have also mentioned, Perl is just as good; either will do the trick.



Depending on how the data is structured, maybe the focus should not be on the language but on storage: is this something you could feed into a database and let the database do the hard work?
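
A minimal sketch of that idea, using Python's built-in sqlite3 module. The file name, schema, and record layout (a day plus a duration per line) are all assumptions made up for illustration:

import sqlite3

conn = sqlite3.connect("events.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (day TEXT, duration REAL)")

# Load the decompressed records once; assumes lines like "2009-10-16 42.5".
with open("big_file.dat") as f:
    rows = (line.split() for line in f)
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
conn.commit()

# Let the database do the aggregation: occurrences and total duration per day.
for day, count, total in conn.execute(
        "SELECT day, COUNT(*), SUM(duration) FROM events GROUP BY day"):
    print(day, count, total)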



I would suggest using AWK. The first line of the Wikipedia entry says it all:

AWK is a programming language designed to process textual data, either in files or data streams.







