Only accepting certain values from a list in Perl

I will first describe what I have and then the problem.

I have a text file that is structured as such

----------- Start of file-----
<!-->
name,name2,ignore,name4,jojobjim,name3,name6,name9,pop
-->
<csv counter="1">
1,2,3,1,6,8,2,8,2,
2,6,5,1,5,8,7,7,9,
1,4,3,1,2,8,9,3,4,
4,1,6,1,5,6,5,2,9
</csv>
-------- END OF FILE-----------


I also have a Perl program that has a map:

my %column_mapping = (
    "name"  => 'name',
    "name1" => 'name_1',
    "name2" => 'name_2',
    "name3" => 'name_3',
    "name4" => 'name_4',
    "name5" => 'name_5',
    "name6" => 'name_6',
    "name7" => 'name_7',
    "name9" => 'name_9',
);


My dynamic insert statement (assuming I am connected to a database, and @headers is my array of header names like test1, test2, etc.):

my $sql = sprintf 'INSERT INTO tablename ( %s ) VALUES ( %s )',
    join( ',', map { $column_mapping{$_} } @headers ),
    join( ',', ('?') x scalar @headers ); 

my $sth = $dbh->prepare($sql);
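For example, if @headers contained only mapped names such as name, name2, and name4, this would produce INSERT INTO tablename ( name,name_2,name_4 ) VALUES ( ?,?,? ). Any header that is missing from the map comes back from the lookup as undef, which is exactly the problem described next.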


Now for the problem I have: I need a way to only insert the headers and values that are in the map. There are multiple names in the data file (as in the example above) that are not in the map; is there a way I can ignore them, along with the numbers associated with them in the CSV section?

Basically, take a subset of the CSV to turn it into:

name,name2,name4,name3,name6,name9,
1,2,1,8,2,8,
2,6,1,8,7,7,
1,4,1,8,9,3,
4,1,1,6,5,2,


so that my insert statement only inserts the ones in the map. The data file is always different and out of order, and each file will contain an unknown number of these extra columns.

Ideally this needs to be efficient, as the script will go through thousands of files, and each file contains millions of CSV lines with hundreds of columns.

The file is plain readable text, not a true CSV file, so I'm not sure whether the CSV libraries can work in this scenario or not.



2 answers


Typically, you put the set of valid indices in a list and then use array slices after that.



# Build the list of column indices whose headers appear in the map.
my @valid = grep { defined $column_mapping{ $headers[$_] } } 0 .. $#headers;

...

my $sql = sprintf 'INSERT INTO tablename ( %s ) VALUES ( %s )',
    join( ',', map { $column_mapping{$_} } @headers[@valid] ),
    join( ',', ('?') x scalar @valid );
my $sth = $dbh->prepare($sql);

...

# Slice each data row with the same indices so the values line up
# with the mapped columns.
chomp( my $line = <INPUT> );
my @row = split /,/, $line;
$sth->execute( @row[@valid] );

...

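For completeness, here is a rough sketch of how those pieces might fit around the elided parts, assuming the header line is the one inside the <!--> ... --> block and the data rows sit between <csv ...> and </csv> ($file and $dbh are placeholders for your path and database handle):

open my $fh, '<', $file or die "Can't open $file: $!";

my ( @headers, @valid, $sth, $in_csv );
while ( my $line = <$fh> ) {
    chomp $line;

    # Take the first comma-separated, non-markup line as the header row.
    if ( !@headers && $line =~ /,/ && $line !~ /^</ ) {
        @headers = split /,/, $line;
        @valid   = grep { defined $column_mapping{ $headers[$_] } } 0 .. $#headers;
        my $sql  = sprintf 'INSERT INTO tablename ( %s ) VALUES ( %s )',
            join( ',', map { $column_mapping{$_} } @headers[@valid] ),
            join( ',', ('?') x scalar @valid );
        $sth = $dbh->prepare($sql);
        next;
    }

    # Track whether we are inside the <csv> block.
    if    ( $line =~ /^<csv/ )   { $in_csv = 1; next; }
    elsif ( $line =~ m{^</csv} ) { $in_csv = 0; next; }
    next unless $in_csv && $sth;

    my @row = split /,/, $line;
    $sth->execute( @row[@valid] );
}
close $fh;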



Since these are roughly four different questions in one, I'm going to take a higher-level approach to the broader set of problems and leave the programming details to you (or you can ask new questions about them).

I would change the data format as soon as possible. Mixing CSV columns into an XML file is weird and inefficient, as I'm sure you know. Use a CSV file for bulk data. Use an XML file for complex metadata.

Having the headers in an XML comment is even worse; now you're parsing comments, and comments are supposed to be ignored. If you must keep this mixed XML/CSV format, put the headers into a proper XML tag. Otherwise, what's the point of using XML?
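For example, something like this (a hypothetical tag name; the point is that real markup gets parsed, while comments get thrown away):

<headers>name,name2,ignore,name4,jojobjim,name3,name6,name9,pop</headers>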

Since you are going to parse a large file, use a SAX parser for the XML. Unlike the more traditional DOM parsers, which must parse the entire document before doing anything, a SAX parser processes the document as it reads the file. This will save tons of memory. I leave the SAX processing as an exercise; start with XML::SAX::Intro.
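A minimal sketch of such a handler, assuming the file were wrapped in well-formed XML (the <!--> header block in the sample is not valid XML as shown):

package CsvExtractor;
use strict;
use warnings;
use base 'XML::SAX::Base';

# Collect character data only while inside the <csv> element.
sub start_element {
    my ( $self, $el ) = @_;
    $self->{in_csv} = 1 if $el->{Name} eq 'csv';
}

sub end_element {
    my ( $self, $el ) = @_;
    $self->{in_csv} = 0 if $el->{Name} eq 'csv';
}

sub characters {
    my ( $self, $data ) = @_;
    $self->{csv_text} .= $data->{Data} if $self->{in_csv};
}

package main;
use XML::SAX::ParserFactory;

my $handler = CsvExtractor->new;
my $parser  = XML::SAX::ParserFactory->parser( Handler => $handler );
$parser->parse_uri($file);    # $file is a placeholder for your path

# $handler->{csv_text} now holds the raw CSV body for the next step.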



In the SAX parser, extract the data from the <csv> tag and hand it to a CSV parser. Text::CSV_XS is a good choice. It is efficient, and it has solved all the CSV parsing problems you are likely to run into.

Once you have a Text::CSV_XS object, call getline_hr in a loop to get the rows as hashes, apply your mapping, and insert into your database. @mob's solution is fine, but I would use SQL::Abstract to generate the SQL rather than doing it by hand. This will protect against both SQL injection attacks and more mundane things like headers containing SQL metacharacters and reserved words.
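A rough sketch of that loop, assuming @headers has already been captured from the header line and $csv_text holds the extracted CSV body:

use Text::CSV_XS;
use SQL::Abstract;

open my $fh, '<', \$csv_text or die $!;    # filehandle over the in-memory CSV

my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1 } );
$csv->column_names(@headers);              # lets getline_hr key rows by header

my $sqla = SQL::Abstract->new;

while ( my $row = $csv->getline_hr($fh) ) {
    # Keep only the mapped columns, renaming them as we go.
    my %mapped = map  { $column_mapping{$_} => $row->{$_} }
                 grep { exists $column_mapping{$_} }
                 keys %$row;

    my ( $stmt, @bind ) = $sqla->insert( 'tablename', \%mapped );
    $dbh->do( $stmt, undef, @bind );
}

Generating a statement per row keeps the sketch short, but for millions of rows you would prepare the statement once per file, since the column set does not change within a file.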

It is important to separate the processing of the parsed data from the parsing itself. I'm pretty sure this ugly data format will change, either for the worse or the better, and you don't want to tie your code to it.
