How do I extract columns from fixed width format in Perl?
I am writing a Perl script that runs through a text file and captures various data items, such as:
1253592000
1253678400 86400 6183.000000
1253764800 86400 4486.000000
1253851200 36.000000 86400 10669.000000
1253937600 0.000000 86400 9126.000000
1254024000 0.000000 86400 2930.000000
1254110400 0.000000 86400 2895.000000
1254196800 0.000000 8828.000000
I can read every line of this text file with no problem.
I also have a working regex to capture each of these fields. Once I have a line in a variable, i.e. $line, how can I grab each of these fields and put them in their own variables, even if they have different delimiters?
This example shows how to parse a line either with whitespace as the delimiter (split) or with a fixed-column layout (unpack). With unpack, if you use uppercase format letters (A10, etc.), trailing whitespace is removed for you. Note: as brian d foy points out, the split approach does not work when fields are missing (for example, the second row of data), because the position information of each field is lost; unpack is the way to go here, unless we are misunderstanding your data.
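use strict;
use warnings;

while (my $line = <DATA>) {
    chomp $line;
    # Whitespace-delimited: simple, but a blank column shifts later fields left.
    my @fields_whitespace = split /\s+/, $line;
    # Fixed columns: lowercase 'a' keeps padding, uppercase 'A' strips trailing spaces.
    my @fields_fixed = unpack('a10 a10 a12 a28', $line);
}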
__DATA__
1253592000
1253678400 86400 6183.000000
1253764800 86400 4486.000000
1253851200 36.000000 86400 10669.000000
1253937600 0.000000 86400 9126.000000
1254024000 0.000000 86400 2930.000000
1254110400 0.000000 86400 2895.000000
1254196800 0.000000 8828.000000
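To make the difference between split and unpack concrete, here is a minimal sketch. The column widths (10, 12 and 12 characters) and the sample line are assumptions made up for illustration; they are not the asker's real layout. The line is built with sprintf so that the middle column is blank.

```perl
use strict;
use warnings;

# Hypothetical layout (not the asker's real widths): three left-aligned
# columns of 10, 12 and 12 characters, with the middle field left blank.
my $line = sprintf('%-10s%-12s%-12s', '1253678400', '', '86400');

# split ' ' collapses the blank column, so 86400 moves up to index 1.
my @by_split = split ' ', $line;

# 'A' strips trailing spaces from each field; 'a' returns the field as-is.
my @stripped = unpack('A10 A12 A12', $line);
my @raw      = unpack('a10 a12 a12', $line);

printf "split:  %d fields\n", scalar @by_split;
printf "unpack: %d fields, third is '%s'\n", scalar @stripped, $stripped[2];
printf "raw third field is '%s'\n", $raw[2];
```

With unpack the blank column comes back as an empty string in its own slot, so every field keeps its position.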
Use my module DataExtract::FixedWidth. It is the most full-featured and well-tested module for working with fixed-width columns in Perl. If that's not fast enough, you can pass in an unpack_string and eliminate the need for heuristic boundary detection.
#!/usr/bin/env perl
use strict;
use warnings;
use DataExtract::FixedWidth;
use feature ':5.10';

my @rows = <DATA>;
my $de = DataExtract::FixedWidth->new({
    heuristic    => \@rows
    , header_row => undef
});
say join('|', @{$de->parse($_)}) for @rows;
--alternatively if you want header info--
my @rows = <DATA>;
my $de = DataExtract::FixedWidth->new({
    heuristic    => \@rows
    , header_row => undef
    , cols       => [qw/timestamp field2 period field4/]
});
use Data::Dumper;
warn Dumper $de->parse_hash($_) for @rows;
__DATA__
1253592000
1253678400 86400 6183.000000
1253764800 86400 4486.000000
1253851200 36.000000 86400 10669.000000
1253937600 0.000000 86400 9126.000000
1254024000 0.000000 86400 2930.000000
1254110400 0.000000 86400 2895.000000
1254196800 0.000000 8828.000000
I'm not sure about the column names and formatting, but you can adapt this recipe as you like using Text::FixedWidth:
use strict;
use warnings;
use Text::FixedWidth;

my $fw = Text::FixedWidth->new;
$fw->set_attributes(
    qw(
        timestamp undef %10s
        field2    undef %10s
        period    undef %12s
        field4    undef %28s
    )
);
while (<DATA>) {
    $fw->parse( string => $_ );
    print $fw->get_timestamp . "\n";
}
__DATA__
1253592000
1253678400 86400 6183.000000
1253764800 86400 4486.000000
1253851200 36.000000 86400 10669.000000
1253937600 0.000000 86400 9126.000000
1254024000 0.000000 86400 2930.000000
1254110400 0.000000 86400 2895.000000
1254196800 0.000000 8828.000000
You can split the line. It looks like your delimiter is just whitespace? You could do something along these lines:
@line = split(" ", $line);
This splits on any run of whitespace. You can then do bounds checking and access each field via $line[0], $line[1], etc.
Split can also accept a regular expression rather than a string as a delimiter.
@line = split(/\s+/, $line);
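This does much the same thing, with one difference: split(" ", ...) also discards leading whitespace, whereas /\s+/ produces an empty first field when the line starts with whitespace.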
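As a small illustration of the two forms of split and of the bounds check mentioned above, here is a sketch using a made-up input line that starts with whitespace (the sample line is an assumption, not taken from the question's file):

```perl
use strict;
use warnings;

# Made-up input line with leading whitespace, for illustration only.
my $line = "  1253678400   86400 6183.000000";

# The special-case string ' ' skips leading whitespace entirely.
my @special = split ' ', $line;

# A /\s+/ pattern treats the leading whitespace as a separator,
# so the first element is an empty string.
my @regex = split /\s+/, $line;

# Bounds check before indexing into the result.
if (@special >= 3) {
    print "third field: $special[2]\n";
}
print scalar(@regex), " fields from the regex form\n";
```

Either form still loses track of which column a value came from when a field is blank, so this only suits data where every field is always present.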
Fixed-width extraction can be done like this:
my @cols;
my %header;
$header{field1} = 0;   # char position of first char in field
$header{field2} = 12;
$header{field3} = 15;
while (my $line = <IN>) {
    chomp $line;
    # substr takes (string, offset, length); field2 ends where field3 begins
    print substr($line, $header{field2}, $header{field3} - $header{field2}), "\n";   # value of field2
}
My Perl is very rusty, so treat this as a sketch rather than tested code, but that is the essence of it.
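The substr idea can be extended to pull every field into a hash. The sketch below assumes a hypothetical layout table (%layout, mapping a field name to an offset and a length) and a sample line built with sprintf; neither comes from the question's real file.

```perl
use strict;
use warnings;

# Hypothetical layout: field name => [offset, length]. These widths are
# assumptions for illustration, not the real column positions.
my %layout = (
    timestamp => [ 0,  10 ],
    field2    => [ 10, 12 ],
    period    => [ 22, 12 ],
);

# Sample line padded to match the assumed layout.
my $line = sprintf('%-10s%-12s%-12s', '1253851200', '36.000000', '86400');

my %record;
for my $name (sort keys %layout) {
    my ($offset, $length) = @{ $layout{$name} };
    # Guard against short lines: substr warns if the offset is past the end.
    my $value = $offset < length($line) ? substr($line, $offset, $length) : '';
    $value =~ s/^\s+|\s+$//g;    # trim the padding on both sides
    $record{$name} = $value;
}

print "$_ = '$record{$_}'\n" for sort keys %record;
```

Keeping offsets and lengths in one table means a layout change only touches the table, not the loop.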