How do I extract columns from fixed width format in Perl?

I am writing a Perl script to run and capture various data items such as:

1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000 
1253851200  36.000000      86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000

      

I can grab every line of this text file with no problem.

I have a working regex to capture each of these fields. Once I have the line in a variable, i.e. $line, how can I grab each of these fields and put them in their own variables, even if they have different delimiters?

+2




6 answers


This example shows how to parse a line either with whitespace as the delimiter (split) or with a fixed-column layout (unpack). With unpack, if you use upper case (A10, etc.), trailing whitespace will be removed for you. Note: as brian d foy points out, the split approach is not suitable when fields can be missing (for example, the second row of data), since the position information of the fields is lost; unpack is the way to go here, unless we are misunderstanding your data.



use strict;
use warnings;

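while (my $line = <DATA>){
    chomp $line;
    # Whitespace-delimited: column positions are lost when a field is blank.
    my @fields_whitespace = split m'\s+', $line;
    # Fixed-width: columns of 10, 10, 12 and 28 characters.
    # Lower-case 'a' keeps the padding; upper-case 'A' strips trailing spaces.
    my @fields_fixed = unpack('a10 a10 a12 a28', $line);
}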

__DATA__
1253592000                                                  
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200 36.000000       86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000
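
To make the difference concrete, here is a minimal sketch that parses one row with a blank second column both ways and then copies the columns into individual variables. It assumes the 10/10/12/28 layout from the template above; the row is built with sprintf so the padding is exact, and the variable names are illustrative rather than taken from the question.

```perl
use strict;
use warnings;

# Build a row in the assumed 10/10/12/28 layout with a blank second column.
my $line = sprintf '%-10s%10s%12s%28s', '1253678400', '', '86400', '6183.000000';

# split collapses the blank column, so 86400 ends up in slot 1.
my @by_space = split ' ', $line;

# unpack keeps all four columns. 'A' strips trailing spaces only,
# so right-aligned values still carry their leading padding.
my @by_width = unpack 'A10 A10 A12 A28', $line;

# Trim the leading padding and give each column its own variable.
my ($timestamp, $field2, $period, $field4) =
    map { my $v = $_; $v =~ s/^\s+//; $v } @by_width;

print scalar(@by_space), " fields from split\n";     # 3 fields from split
print scalar(@by_width), " fields from unpack\n";    # 4 fields from unpack
print "timestamp=$timestamp field2='$field2' period=$period field4=$field4\n";
```

With split, the caller cannot tell which column was blank; with unpack, the blank column comes back as an empty string in its own slot.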

      

+13




Use my module DataExtract::FixedWidth. It is the most fully featured and well tested module for working with fixed-width columns in Perl. If that's not fast enough, you can pass in an unpack_string and eliminate the need for heuristic boundary detection.



#!/usr/bin/env perl
use strict;
use warnings;
use DataExtract::FixedWidth;
use feature ':5.10';

my @rows = <DATA>;
my $de = DataExtract::FixedWidth->new({
  heuristic => \@rows
  , header_row => undef
});

say join ('|',  @{$de->parse($_)}) for @rows;

# --alternatively, if you want header info, replace the block above with--

my @rows = <DATA>;
my $de = DataExtract::FixedWidth->new({
  heuristic => \@rows
  , header_row => undef
  , cols => [qw/timestamp field2 period field4/]
});

use Data::Dumper;
warn Dumper $de->parse_hash($_) for @rows;

__DATA__
1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200  36.000000      86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000

      

+3




I'm not sure about the column names and formatting, but you should be able to adapt this recipe to your liking using Text::FixedWidth:

use strict;
use warnings;
use Text::FixedWidth;

my $fw = Text::FixedWidth->new;
$fw->set_attributes(
    qw(
        timestamp undef  %10s
        field2    undef  %10s
        period    undef  %12s
        field4    undef  %28s
        )
);

while (<DATA>) {
    $fw->parse( string => $_ );
    print $fw->get_timestamp . "\n";
}

__DATA__
1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200 36.000000       86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000

      

0




You can split the line. It looks like your delimiter is just whitespace? You could do something like:

@line = split(" ", $line);

      

This will split on all whitespace. You can then do bounds checking and access each field via $line[0], $line[1], etc.

Split can also accept a regular expression rather than a string as a delimiter.

@line = split(/\s+/, $line);

      

This does much the same thing, except that it produces an empty leading field when the line starts with whitespace.
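
A small sketch of how the two forms differ on a line that begins with padding; the sample string is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input that starts with two spaces.
my $line = "  0.000000  86400";

# A single-space string is split's special case: leading whitespace is skipped.
my @special = split ' ', $line;      # ("0.000000", "86400")

# A regex delimiter treats the leading whitespace as a separator,
# so the list starts with an empty string.
my @regex = split /\s+/, $line;      # ("", "0.000000", "86400")

print scalar(@special), "\n";    # 2
print scalar(@regex), "\n";      # 3
```

Neither form preserves column positions, so both misplace values when a field in the middle of the row is blank.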

-1




If all fields have the same fixed width and are padded with spaces, you can use the following split:

@array = split / {1,N}/, $line;

      

where N is the width of a field. This yields an empty element for each blank field.
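
A quick sketch of how this pattern behaves, assuming a hypothetical layout of three left-justified columns that are each 4 characters wide (so N is 4). It suggests the trick only works while every value is shorter than its column.

```perl
use strict;
use warnings;

# Hypothetical rows: three left-justified columns, 4 characters each.
my $blank_middle = sprintf '%-4s%-4s%-4s', 'ab',   '', 'cd';   # "ab      cd  "
my $full_first   = sprintf '%-4s%-4s%-4s', 'abcd', '', 'cd';   # "abcd    cd  "

# The six spaces after "ab" are consumed as a run of 4 plus a run of 2,
# which leaves an empty element for the blank column.
my @kept = split / {1,4}/, $blank_middle;    # ("ab", "", "cd")

# When the first value fills its column, the only spaces left are the
# blank column itself, and they are consumed as a single delimiter.
my @lost = split / {1,4}/, $full_first;      # ("abcd", "cd")

print scalar(@kept), "\n";    # 3
print scalar(@lost), "\n";    # 2
```

Because the second row silently drops its blank column, a position-based approach such as unpack or substr is safer for data like the sample in the question.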

-1




Fixed-width delimiting can be done like this:

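my @cols;
my %header;
$header{field1} = 0;    # char position of first char in field
$header{field2} = 12;
$header{field3} = 15;

# IN is assumed to be an already opened filehandle
while (my $line = <IN>) {
    chomp $line;
    # substr takes (string, offset, length), so the length of field2 is
    # the distance between its start and the start of field3
    print substr($line, $header{field2}, $header{field3} - $header{field2}), "\n";    # value of field2
}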

      

My Perl is very rusty, so treat this as a sketch rather than tested code, but that is the essence of it.
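
For a self-contained version of the same idea, here is a minimal sketch that stores an offset and a width per column and extracts each one with substr. The layout is an assumption borrowed from the 10/10/12/28 unpack template in an earlier answer, and the column names and the parse_fixed helper are hypothetical.

```perl
use strict;
use warnings;

# Assumed layout: [ name, start offset, width ]. Names are placeholders.
my @layout = (
    [ timestamp => 0,  10 ],
    [ field2    => 10, 10 ],
    [ period    => 20, 12 ],
    [ field4    => 32, 28 ],
);

# Hypothetical helper: returns a hash reference of column name to value.
sub parse_fixed {
    my ($line) = @_;
    my %row;
    for my $col (@layout) {
        my ($name, $offset, $width) = @$col;
        # A short line may end before this column starts; treat it as blank.
        my $value = $offset < length($line) ? substr($line, $offset, $width) : '';
        $value =~ s/^\s+|\s+$//g;    # trim padding on both sides
        $row{$name} = $value;
    }
    return \%row;
}

my $full  = parse_fixed(sprintf '%-10s%10s%12s%28s',
                        '1253937600', '0.000000', '86400', '9126.000000');
my $short = parse_fixed('1253592000');

print "full:  period=$full->{period} field4=$full->{field4}\n";
print "short: timestamp=$short->{timestamp} period='$short->{period}'\n";
```

Checking the offset against the line length avoids the "substr outside of string" warning on rows such as the first one in the sample, which contains only a timestamp.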

-2

