How can I efficiently extract HTML content using Perl?

I am writing a crawler in Perl that has to extract the contents of web pages that reside on the same server. I am currently using the HTML::Extract module to do the job, but I found the module a bit slow, so I looked into its source code and found out that it does not use a connection cache for LWP::UserAgent.

My last resort is to grab the source code of HTML::Extract and modify it to use a cache, but I really want to avoid that if I can. Does anyone know of another module that can do the same job better? I basically just need to grab all of the text inside the <body> element, with the HTML tags removed.

+2




4 answers


I use pQuery for my web scraping, but I have also heard good things about Web::Scraper.
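For the body-text case you describe, a minimal Web::Scraper sketch might look like this (the URL is just a placeholder):

    use Web::Scraper;
    use URI;

    # Grab the text content of <body>, with the HTML tags stripped.
    my $body_text = scraper {
        process 'body', text => 'TEXT';
    };

    my $result = $body_text->scrape( URI->new('http://example.com/page.html') );
    print $result->{text}, "\n";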

Both of these modules, along with others, have come up in SO answers to questions similar to yours:



+4




HTML::Extract's functionality looks very basic and uninteresting. If the modules that draegfun mentioned don't interest you, you could do everything that HTML::Extract does with LWP::UserAgent and HTML::TreeBuilder yourself, without requiring much code at all, and then you would be free to handle caching on your own terms.
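A rough sketch of that approach, with LWP's keep_alive option turned on to get the connection caching that the question says HTML::Extract skips (the URL is a placeholder):

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::TreeBuilder;

    # keep_alive => 1 makes LWP::UserAgent reuse connections via LWP::ConnCache.
    my $ua = LWP::UserAgent->new( keep_alive => 1 );

    my $response = $ua->get('http://example.com/page.html');
    die 'Fetch failed: ', $response->status_line unless $response->is_success;

    # Parse the page and pull out the text of <body>, tags removed.
    my $tree = HTML::TreeBuilder->new_from_content( $response->decoded_content );
    my $body = $tree->look_down( _tag => 'body' );
    print $body ? $body->as_text : '', "\n";
    $tree->delete;    # HTML::TreeBuilder trees need explicit cleanup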



+1




I use Web::Scraper for my scraping needs. It is very good at extracting data, and because you can call ->scrape($html, $originating_uri), it is also very easy to cache the result.
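For example, here is a sketch of that pattern, fetching the HTML yourself and handing it to ->scrape() so the caching stays under your control; the in-memory hash is just a stand-in for whatever cache you actually use:

    use Web::Scraper;
    use URI;
    use LWP::UserAgent;

    my %cache;                      # naive per-run cache, keyed by URL
    my $ua = LWP::UserAgent->new;

    # Fetch (and cache) the raw HTML yourself.
    sub fetch_html {
        my ($url) = @_;
        $cache{$url} //= $ua->get($url)->decoded_content;
        return $cache{$url};
    }

    my $body_text = scraper { process 'body', text => 'TEXT' };

    my $url    = 'http://example.com/page.html';
    my $result = $body_text->scrape( fetch_html($url), URI->new($url) );
    print $result->{text}, "\n";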

0




Do you need to do this in real time? How does the inefficiency actually affect you? Are you doing the task sequentially, so that you have to extract one page before you can move on to the next? And why do you want to avoid a cache?

Can your crawler download the pages and pass them off to something else? Perhaps your crawler could even work in parallel, or in some sort of distributed fashion.

0








