Why can't LWP::UserAgent fetch this page completely?

It only outputs a few lines from the beginning.

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
my $response = $ua->get('http://www.eurogamer.net/articles/df-hardware-wii-u-graphics-power-finally-revealed');
print $response->decoded_content;

      



2 answers


I made the following modification:

my $response = $ua->get( 'http://www.eurogamer.net/articles/df-hardware-wii-u-graphics-power-finally-revealed' );
print $response->headers->as_string;    # "say" would need "use feature 'say'"

      

And I saw this:

Cache-Control: max-age=60s
Connection: close
Date: Wed, 06 Feb 2013 23:51:15 GMT
Via: 1.1 varnish
Age: 0
Server: Apache
Vary: Accept-Encoding
Content-Length: 50519
Content-Type: text/html; charset=ISO-8859-1
Client-Aborted: die
Client-Date: Wed, 06 Feb 2013 23:50:50 GMT
Client-Peer: 94.198.83.18:80
Client-Response-Num: 1
X-Died: Illegal field name 'X-Meta-Twitter:card' at .../HTML/HeadParser.pm line 207.
X-Varnish: 630361704
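The Client-Aborted and X-Died headers in this dump are the sign that the transfer was cut short on the client side. A minimal sketch of a check for them follows. The helper name abort_reason is hypothetical, and a hand-built HTTP::Response stands in for a real fetch so the snippet runs offline:

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: returns the reason the download was aborted,
# or undef if the response shows no sign of an aborted transfer.
sub abort_reason {
    my ($response) = @_;
    return unless defined $response->header('Client-Aborted');
    return $response->header('X-Died') // 'aborted (no X-Died header)';
}

# Stand-in for the response shown above; a real one would come from $ua->get(...).
my $response = HTTP::Response->new(
    200, 'OK',
    [
        'Content-Length' => 50519,
        'Client-Aborted' => 'die',
        'X-Died'         => "Illegal field name 'X-Meta-Twitter:card'",
    ],
    'only the first few lines',
);

if (defined(my $reason = abort_reason($response))) {
    warn "Response is incomplete: $reason\n";
}
```

A check like this after every request makes a truncated body visible, since is_success alone still reports a 200 response here.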

      



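It appears that HTML::HeadParser doesn't like the tag <meta name="twitter:card" content="summary" /> on line 27. According to the X-Died header, it died there.

It seems to translate any meta tag that has a name attribute into a header named "X-Meta-\u$attr->{name}". It then tries to store the value of the content attribute as the value of that X-Meta header. Because this name contains a colon, the resulting header name, X-Meta-Twitter:card, is rejected as an illegal field name. The relevant code looks like this (starting at line 194 of HTML/HeadParser.pm):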

if ($tag eq 'meta') {
    my $key = $attr->{'http-equiv'};
    if (!defined($key) || !length($key)) {
        if ($attr->{name}) {
            $key = "X-Meta-\u$attr->{name}"; # <-- Here the little trick
        } elsif ($attr->{charset}) { # HTML 5 <meta charset="...">
            $key = "X-Meta-Charset";
            $self->{header}->push_header($key => $attr->{charset});
            return;
        } else {
            return;
        }
    }
    $self->{'header'}->push_header($key => $attr->{content});
}
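The failure can be reproduced without fetching anything. The sketch below builds the header name the same way the excerpt does and passes it straight to HTTP::Headers. The variable names are illustrative, and it assumes the parser's header object behaves like a plain HTTP::Headers instance:

```perl
use strict;
use warnings;
use HTTP::Headers;

my %attr = (name => 'twitter:card', content => 'summary');

# Same construction as in the excerpt: \u upper-cases the first letter of the name.
my $key = "X-Meta-\u$attr{name}";    # "X-Meta-Twitter:card"

my $headers = HTTP::Headers->new;

# push_header croaks on a field name with an embedded colon, so trap it.
my $stored = eval { $headers->push_header($key => $attr{content}); 1 };

print $stored ? "stored $key\n" : "push_header died: $@";
```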

      

I put a modified copy of this module in a directory on my PERL5LIB path. In that copy I wrapped the push_header step in an eval block, and the page then loaded completely.
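The patched code itself isn't shown, so the following is only a guess at its shape. It is a stand-alone sketch in which a hypothetical push_meta_header helper plays the role of the patched step, skipping any meta tag whose name can't be used as a header field. The meta values are made up for illustration:

```perl
use strict;
use warnings;
use HTTP::Headers;

# Hypothetical stand-in for the patched step in HTML::HeadParser:
# try to store the header, and report failure instead of dying.
sub push_meta_header {
    my ($headers, $key, $value) = @_;
    return eval { $headers->push_header($key => $value); 1 } ? 1 : 0;
}

my $headers = HTTP::Headers->new;
my @meta = (
    [ 'X-Meta-Description'  => 'An example page' ],
    [ 'X-Meta-Twitter:card' => 'summary' ],    # the colon makes this name illegal
    [ 'X-Meta-Keywords'     => 'perl, lwp' ],
);

for my $pair (@meta) {
    my ($key, $value) = @$pair;
    push_meta_header($headers, $key, $value)
        or warn "skipped $key\n";
}

print $headers->as_string;
```

The point of the eval is that one bad tag costs only that one header, rather than aborting the whole download.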



I had exactly the same problem ...

I decided to turn off the parse_head option, which controls whether LWP::UserAgent runs HTML::HeadParser on the response.



    $self->{ua}->parse_head(0);

      

I know it's not a good idea to turn this feature off, but I would rather be able to fetch the whole page than have perfectly decoded documents.
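In a stand-alone script like the one in the question, the same setting would go on $ua before the request. A minimal sketch, assuming a hypothetical fetch_page wrapper; the request is only made when a URL is passed on the command line:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical wrapper: fetch a URL with <head> parsing switched off,
# so HTML::HeadParser never sees the page's meta tags.
sub fetch_page {
    my ($url) = @_;
    my $ua = LWP::UserAgent->new;
    $ua->parse_head(0);

    my $response = $ua->get($url);
    die 'Request failed: ' . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

print fetch_page($ARGV[0]) if @ARGV;
```

With parsing off, values from the page's <head> (for example an http-equiv Content-Type) are no longer copied into the response headers, which is the trade-off mentioned above.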
