Parse html code to find the field

Question


The page http://www.elseptimoarte.net/ has a search box. If I enter "batman", for example, it gives me some search results, each with its own URL: http://www.elseptimoarte.net/busquedas.html?cx=003284578463992023034%3Alraatm7pya0&cof=FORID%3A11&ie=ISO-8859-1&oe=ISO-8859-1&q=batman#978

I would like to parse the HTML to get the URLs of the first result links, for example: www.elseptimoarte.net/peliculas/batman-begins-1266.html

The problem is that I am using curl (in bash), but when I run curl -L -s http://www.elseptimoarte.net/busquedas.html?cx=003284578463992023034%3Alraatm7pya0&cof=FORID%3A11&ie=ISO-8859-1&oe=ISO-8859-1&q=batman#978 the output does not contain any of the links.

Any help?

Thanks a lot, and sorry for my English!


parsing

pepe Dec 31. '09 at 16:02


6 answers

Greg · Answer 1 · 2008-12-31T16:16:53+0000

You are not getting the links with cURL because the page uses JavaScript to load this data.

Using Firebug I found the real URL here - pretty monstrous!

Parker · Answer 2 · 2008-12-31T16:34:05+0000

This may not be exactly what you are looking for, but it gives me the same answer as your example. Perhaps you can customize it according to your needs:

From bash enter:

$ wget -U 'Mozilla/5.0' -O - 'http://www.google.com/search?q=batman+site%3Awww.elseptimoarte.net' | sed 's/</\
</g' | sed -n '/href="http:\/\/www\.elseptimoarte\.net/p'

The "</g'" part must start on a new line. Do not include the prompt ($). Someone more familiar with sed can probably do better than me. You can replace the "batman" query string and/or the site URL strings as needed.

The following was my output:

<a href="http://www.elseptimoarte.net/peliculas/batman-begins-1266.html" class=l>
<a href="http://www.elseptimoarte.net/peliculas/batman:-the-dark-knight-30.html" class=l>El Caballero Oscuro (2008) - El Séptimo Arte
<a href="http://www.elseptimoarte.net/-batman-3--y-sus-rumores-4960.html" class=l>&#39;
<a href="http://www.elseptimoarte.net/esp--15-17-ago--batman-es-lider-y-triunfadora-aunque-no-bate-record-4285.html" class=l>(Esp. 15-17 Ago.) 
<a href="http://www.elseptimoarte.net/peliculas/batman-gotham-knight-1849.html" class=l>
<a href="http://www.elseptimoarte.net/cine-articulo541.html" class=l>Se ponen en marcha las secuelas de &#39;
<a href="http://www.elseptimoarte.net/trailers-de-buena-calidad-para--indiana--e--batman--3751.html" class=l>Tráilers en buena calidad de &#39;Indiana&#39; y &#39;
<a href="http://www.elseptimoarte.net/usa-8-10-ago--impresionante--batman-sigue-lider-por-4%C2%AA-semana-consecutiva-4245.html" class=l>(USA 8-10 Ago.) Impresionante. 
<a href="http://www.elseptimoarte.net/usa-25-27-jul--increible--batman-en-su-segunda-semana-logra-75-millones-4169.html" class=l>(USA 25-27 Jul.) Increíble. 
<a href="http://www.elseptimoarte.net/cine-articulo1498.html" class=l>¿Aparecerá Catwoman en &#39;

Parker · Answer 3 · 2008-12-31T16:08:12+0000

I'll give you a more verbose command line answer in a second, but by the way, have you considered using Yahoo Pipes? It's a little more than a proof of concept, but it has everything you need.

Parker · Answer 4 · 2008-12-31T23:48:54+0000

Pepe,

Here's a command you can use to get what you want:

$ wget -U 'Mozilla/5.0' -O - 'http://www.google.com/search?q=batman+site%3Awww.elseptimoarte.net' | sed 's/</\
</g' | sed -n 's/<a href="\(http:\/\/www\.elseptimoarte\.net[^"]*\).*$/\1/gp' > myfile.txt

This is a slight change to the command above. The backslash must be the last character on its line, with no trailing spaces. The command writes one URL per line, but it would not be difficult to change it to give your exact output.
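To see what the two sed stages do without touching the network, here is a minimal sketch that runs the same pipeline on a small, made-up HTML sample. The function name extract_links and the sample markup are assumptions for illustration only; the real search page will look different.

```shell
# Hypothetical helper: read HTML on stdin, print elseptimoarte.net link URLs.
# Stage 1 puts every tag at the start of its own line (backslash + newline
# in the replacement inserts a line break before each "<").
# Stage 2 keeps only lines that open with a matching <a href="..."> and
# replaces the whole line with the captured URL.
extract_links() {
  sed 's/</\
</g' | sed -n 's/<a href="\(http:\/\/www\.elseptimoarte\.net[^"]*\).*$/\1/p'
}

# Made-up sample input; only two of the three links point at the site.
sample='<html><body><a href="http://www.elseptimoarte.net/peliculas/batman-begins-1266.html" class=l>Batman Begins</a> <a href="http://example.com/other.html">other</a><a href="http://www.elseptimoarte.net/peliculas/batman-gotham-knight-1849.html" class=l>Gotham Knight</a></body></html>'

printf '%s\n' "$sample" | extract_links
```

Note that the line holding </g' has to start in the first column: any indentation there would become part of the replacement text and the second sed would no longer see "<a" right after the line break.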

Parker · Answer 5 · 2009-01-01T09:18:38+0000

curl and wget each have many uses. I'm sure people have their preferences, but I tend to reach for wget first, as it can automatically follow links to a specified depth and is generally a little more versatile with ordinary text web pages, while I use curl when I need a less common protocol or need to interact with form data.

You can use curl if you prefer, although I think wget is more appropriate here. In the above command, just replace "wget" with "curl" and "-U" with "-A". Omit '-O -' (I believe curl writes to stdout by default; if it does not on your machine, use the appropriate flag) and leave everything else the same. You should get the same result.
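As a rough sketch of that substitution, the curl version might look like the following. The function names fetch_page and search_links are made up for this example, and whether the search page still returns markup that the sed expressions match is an assumption.

```shell
# Hypothetical helper: fetch a URL with curl, sending a browser-like User-Agent.
# -s hides the progress meter and -A sets the User-Agent (wget's -U).
# curl writes the response body to stdout by default, so '-O -' is dropped.
fetch_page() {
  curl -s -A 'Mozilla/5.0' "$1"
}

# Same sed pipeline as the wget command, fed by curl instead.
# $1 is the search term (assumed to be already URL-safe, e.g. "batman").
search_links() {
  fetch_page "http://www.google.com/search?q=$1+site%3Awww.elseptimoarte.net" | sed 's/</\
</g' | sed -n 's/<a href="\(http:\/\/www\.elseptimoarte\.net[^"]*\).*$/\1/p'
}

# Usage (needs network access, so it is not run here):
#   search_links batman > myfile.txt
```

Keeping the download in its own function makes it easy to swap wget back in, or to feed the pipeline a saved page while adjusting the sed expressions.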

chakrit · Answer 6 · 2009-01-01T10:04:33+0000

There is Watir for Ruby.

And if you are using .NET (C#/VB), you can use WatiN, which is an awesome browser automation tool.

It's kind of a testing framework with tools to manipulate the browser DOM and poke around, but I believe you can also use those outside of the testing context.
