How to filter and sort multiple gb of data?
I have a 36gb large file that contains about 600 lines in this structure:
LogID,SensorID,ValueNumeric,ValueString,DateAdded,VariableName,Name
1215220724,1182,1,0000,,2016-01-04 12:56:57.7770000,Scan_Frequency,MIC_2
I'm only interested in date, value, variable name and stream (name). The problem is that there are many duplicate records and that the data is not ordered by date.
My current solution is that I go through the first 100,000 rows, read the name of the variables (of which 833 of them are), and create a table in the DB for each one. For the primary key, I am using date (I cut off seconds and milliseconds), so my DB will not have duplicates. I know this is not the best string for primary key.
Then I read the file again and entered this data into the tables, but it was slowing down. My estimate is that in the end I should have at least 10 times the number of rows.
Someone has a better idea how to read such a large file and sort it by date and remove duplicates. It would be sufficient to save the data in 5 minute intervals.
source to share
I would use Elasticsearch + Logstash solution (they are free and work great with default installations). Logstash is designed to collect data from multiple sources (including CSV files) and Elasticsearch is a NoSql database that does an amazing job for both indexing documents and querying them.
See this question and answer for a starting point, and here is the documentation .
source to share
Your database will offer a tool to import the csv files directly. This is most likely much faster than using JDBC. What's more, the chances are high, it also offers a tool to remove duplicates that you specify during import. Once you have the data in the database, it will take care of sorting the data for you. Just to give you an example: if you've been using MySQL, there is a MySQL import utility mysqlimport that also offers the option to remove duplicates during import using - replace .
source to share