Parse HTML code to find the field
I have this page: http://www.elseptimoarte.net/ . The page has a search box. If I search for, say, "batman", it gives me some results, each with its own URL: http://www.elseptimoarte.net/busquedas.html?cx=003284578463992023034%3Alraatm7pya0&cof=FORID%3A11&ie=ISO-8859-1&oe=ISO-8859-1&q=batman#978
I would like to parse the HTML code to get the URLs of the first links. Example: www.elseptimoarte.net/peliculas/batman-begins-1266.html
The problem is that I am using curl (in bash), but when I run curl -L -s http://www.elseptimoarte.net/busquedas.html?cx=003284578463992023034%3Alraatm7pya0&cof=FORID%3A11&ie=ISO-8859-1&oe=ISO-8859-1&q=batman#978 it doesn't give me any links.
Any help?
Thanks a lot, and sorry for my English!
This may not be exactly what you are looking for, but it gives me the same answer as your example. Perhaps you can customize it according to your needs:
From bash enter:
$ wget -U 'Mozilla/5.0' -O - 'http://www.google.com/search?q=batman+site%3Awww.elseptimoarte.net' | sed 's/</\
</g' | sed -n '/href="http:\/\/www\.elseptimoarte\.net/p'
"</ g" starts a new line. Do not include the prompt ($). Someone more familiar with sed can do better than me. You can replace the "batman" query string and / or duplicate site url strings as per your need.
This was the output I got:
<a href="http://www.elseptimoarte.net/peliculas/batman-begins-1266.html" class=l>
<a href="http://www.elseptimoarte.net/peliculas/batman:-the-dark-knight-30.html" class=l>El Caballero Oscuro (2008) - El Séptimo Arte
<a href="http://www.elseptimoarte.net/-batman-3--y-sus-rumores-4960.html" class=l>'
<a href="http://www.elseptimoarte.net/esp--15-17-ago--batman-es-lider-y-triunfadora-aunque-no-bate-record-4285.html" class=l>(Esp. 15-17 Ago.)
<a href="http://www.elseptimoarte.net/peliculas/batman-gotham-knight-1849.html" class=l>
<a href="http://www.elseptimoarte.net/cine-articulo541.html" class=l>Se ponen en marcha las secuelas de '
<a href="http://www.elseptimoarte.net/trailers-de-buena-calidad-para--indiana--e--batman--3751.html" class=l>Tráilers en buena calidad de 'Indiana' y '
<a href="http://www.elseptimoarte.net/usa-8-10-ago--impresionante--batman-sigue-lider-por-4%C2%AA-semana-consecutiva-4245.html" class=l>(USA 8-10 Ago.) Impresionante.
<a href="http://www.elseptimoarte.net/usa-25-27-jul--increible--batman-en-su-segunda-semana-logra-75-millones-4169.html" class=l>(USA 25-27 Jul.) Increíble.
<a href="http://www.elseptimoarte.net/cine-articulo1498.html" class=l>¿Aparecerá Catwoman en '
Pep,
Here is a command that gets closer to what you want:
$ wget -U 'Mozilla/5.0' -O - 'http://www.google.com/search?q=batman+site%3Awww.elseptimoarte.net' | sed 's/</\
</g' | sed -n 's/<a href="\(http:\/\/www\.elseptimoarte\.net[^"]*\).*$/\1/gp' > myfile.txt
This is a slight modification of the command above. It prints each URL on its own line, and it would not be hard to tweak it to give your exact result.
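For instance, to match the exact format in the question (no leading http://), one more sed step could be appended. The sample input below is one of the URLs the pipeline already produces:

```shell
# One URL as produced by the pipeline above, used here as sample input.
url='http://www.elseptimoarte.net/peliculas/batman-begins-1266.html'

# Strip the scheme so the output matches the "www.elseptimoarte.net/..." form from the question.
printf '%s\n' "$url" | sed 's|^http://||'
```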
curl and wget both have many uses. I'm sure people have their preferences, but I tend to reach for wget first, since it can follow links automatically to a specified depth and is generally a little more versatile with ordinary text web pages, while I use curl when I need a less common protocol or have to interact with form data.
You can use curl if you prefer, although I think wget is more appropriate here. In the command above, just replace "wget" with "curl" and "-U" with "-A". Omit "-O -" (curl writes to stdout by default) and leave everything else the same. You should get the same result.
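Spelled out, that substitution would give the command below. Since the live Google fetch can't be reproduced here, a made-up sample line stands in for the curl output so the sed stages can be followed:

```shell
# The curl form of the pipeline (note -A instead of -U, and no '-O -'):
#   curl -A 'Mozilla/5.0' 'http://www.google.com/search?q=batman+site%3Awww.elseptimoarte.net' | ...
# Below, a made-up sample line replaces the network fetch for illustration.
html='<div><a href="http://www.elseptimoarte.net/peliculas/batman-begins-1266.html" class=l>x</a></div>'

# First sed puts each tag on its own line; second sed keeps only the site URLs.
printf '%s\n' "$html" | sed 's/</\
</g' | sed -n 's/<a href="\(http:\/\/www\.elseptimoarte\.net[^"]*\).*$/\1/gp'
```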