Perl unpack ("A4 / A *") length + byte syntax in regex form

Question

As perlpacktut points out, you can use an unpack template of the form X/Y* to first read the length of a byte stream and then read exactly that many bytes. However, I am struggling to find something similar inside a regex for, say, plain ASCII numbers and strings. For example, a Bencoded string looks like this:

[length]:[bytes]
4:spam
4:spam10:green eggs
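
For comparison, the unpack behaviour mentioned above can be sketched as follows. This is only an illustration: it assumes a fixed-width, 4-digit ASCII length prefix (which is not the Bencode format), and unpack_length_prefixed is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: each record is assumed to be a 4-digit ASCII length
# followed by exactly that many characters of data.
sub unpack_length_prefixed {
    my ($data) = @_;
    # 'A4/A*' reads 4 characters as the length, then that many characters;
    # the surrounding ( ... )* repeats the pair until the input runs out.
    return unpack '(A4/A*)*', $data;
}

print join( '|', unpack_length_prefixed('0004spam0010green eggs') ), "\n";
```

Bencode uses a variable-width decimal length ended by a colon, which a plain unpack template of this kind cannot express directly, hence the question.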

I remember being able to pull this off once before, but only by using (?{}), and I don't have that code handy right now. Can this be done without (?{}) (which is super experimental), using one of the new 5.10 captures / backreferences?

The obvious expressions don't work:

/(\d+)\:(.{\1})/g
/(\d+)\:(.{\g-1})/g

regex perl unpack backreference string-length

Brendan Byrd 16 Mar '12 at 1:53

2 answers

No, I don't think this is possible without using (??{ ... }), which would look like this:

/(\d++):((??{".{$^N}"}))/sg
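
As a rough sketch of how that pattern could be used in a loop (the subroutine name extract_strings_regex is hypothetical, and the input is assumed to be well formed):

```perl
use strict;
use warnings;

# Hypothetical helper: collect every length-prefixed string in $input
# using the postponed subexpression (??{ ... }) from the pattern above.
sub extract_strings_regex {
    my ($input) = @_;
    my @found;
    # (\d++) captures the length possessively, so the digits are never given
    # back on backtracking; $^N is the text of the most recently closed group,
    # so the postponed block builds a sub-pattern such as ".{4}" at match time.
    while ( $input =~ /(\d++):((??{ ".{$^N}" }))/sg ) {
        push @found, $2;
    }
    return @found;
}

print join( '|', extract_strings_regex('4:spam10:green eggs') ), "\n";
```

Note that the pattern is not anchored with \G, so on malformed input it would scan ahead to the next digits-and-colon sequence rather than stop.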

Qtax 16 Mar '12 at 3:22

brian d foy · Accepted Answer · 2012-03-16T10:13:00+0000

Do it with the /g flag and the \G regex anchor, but in scalar context. Perl then remembers the position in the string immediately after the last pattern match (or the start of the string for the first match), so you can walk along the string. Match the length, skip the colon, and then use substr to pick out the right number of characters. You can actually assign to pos, so update it to account for the characters you just extracted. redo until there are no more matches:

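use v5.10.1;

LINE: while( my $line = <DATA> ) {
    chomp( $line );
    {
        say $line;
        # \G anchors at pos($line); move on to the next line when nothing matches
        next LINE unless $line =~ m/\G(\d+):/g;  # scalar /g!
        say "\t1. pos is ", pos($line);
        # pos($line) now points just past the colon, at the start of the string
        my( $length, $string ) = ( $1, substr $line, pos($line), $1 );
        pos($line) += $length;  # skip over the characters just extracted
        say "\t2. pos is ", pos($line);
        print "\tFound length $length with [$string]\n";
        redo;  # run the bare block again from the new position
    }
}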

__END__
4:spam6:Roscoe
6:Buster10:green eggs
4:abcd5:123:44:Mimi

Notice the edge case in the last line of input. That 3: is part of a string, not the start of a new record. My output is:

4:spam6:Roscoe
    1. pos is 2
    2. pos is 6
    Found length 4 with [spam]
4:spam6:Roscoe
    1. pos is 8
    2. pos is 14
    Found length 6 with [Roscoe]
4:spam6:Roscoe
6:Buster10:green eggs
    1. pos is 2
    2. pos is 8
    Found length 6 with [Buster]
6:Buster10:green eggs
    1. pos is 11
    2. pos is 21
    Found length 10 with [green eggs]
6:Buster10:green eggs
4:abcd5:123:44:Mimi
    1. pos is 2
    2. pos is 6
    Found length 4 with [abcd]
4:abcd5:123:44:Mimi
    1. pos is 8
    2. pos is 13
    Found length 5 with [123:4]
4:abcd5:123:44:Mimi
    1. pos is 15
    2. pos is 19
    Found length 4 with [Mimi]
4:abcd5:123:44:Mimi
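
The same walk could also be wrapped in a subroutine that returns the strings instead of printing them. This is a minimal sketch: split_length_prefixed is a hypothetical name, and stopping on a truncated record is an assumption about what a caller would want.

```perl
use strict;
use warnings;

# Hypothetical helper: walk $line with \G and pos(), returning each string.
sub split_length_prefixed {
    my ($line) = @_;
    my @strings;
    pos($line) = 0;
    # Scalar-context /g match; /c keeps pos() in place when the match fails.
    while ( $line =~ m/\G(\d+):/gc ) {
        my $length = $1;
        my $start  = pos($line);    # first character after the colon
        # Assumption: stop when the declared length runs past the end of input.
        last if $start + $length > length $line;
        push @strings, substr( $line, $start, $length );
        pos($line) = $start + $length;    # skip over the string just taken
    }
    return @strings;
}

print join( '|', split_length_prefixed('4:abcd5:123:44:Mimi') ), "\n";
# prints: abcd|123:4|Mimi
```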

I thought there might be a module for this, and there is: Bencode. It does just what I did, which means I did a lot of work for nothing. Always check CPAN first. Even if you don't use the module, you can look at its solution :)
