View "Page Source" shows different HTML than cURL

First of all, my problem is different from this: Difference between cURL and a web browser?

I visit http://www.walmart.com/search/browse-ng.do?cat_id=1115193_1071967 in Chrome and then view the page source, which contains elements like this:

<a class="js-product-title" href="/ip/Tide-Simply-Clean-Fresh-Refreshing-Breeze-Liquid-Laundry-Detergent-138-fl-oz/33963161">

However, I cannot find this element in the output when I fetch the same URL from the command line:

curl "http://www.walmart.com/search/browse-ng.do?cat_id=1115193_1071967" > local.html

      

Does anyone know the reason for the difference? I am using a selector in Python to parse the page.

+3




3 answers


A browser executes JavaScript, which in turn can modify the document. curl just gives you the raw response and nothing else.



If you disable JavaScript in your browser and refresh the page, you will see that it looks different.
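One way to check this for a given page is to look for the element directly in the raw HTML that curl saved. The following is only a rough sketch: it assumes the response was saved as local.html (as in the question), and the names ClassLinkFinder and find_links_with_class are made up for this example.

```python
from html.parser import HTMLParser


class ClassLinkFinder(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # A class attribute can hold several space-separated class names.
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes:
            self.hrefs.append(attr_map.get("href"))


def find_links_with_class(html_text, css_class):
    """Return the hrefs of all <a> tags in html_text that have css_class."""
    finder = ClassLinkFinder(css_class)
    finder.feed(html_text)
    return finder.hrefs
```

Reading local.html into a string and calling find_links_with_class(text, "js-product-title") returns an empty list when the raw response does not contain those links. In that case they were presumably added by JavaScript afterwards, or the server sent curl a different page.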

+6




In addition to the JavaScript execution explained in the other answer, your browser sends a lot more information when it requests the page, and the server may be reacting to that.

  • Open Chrome, press F12, go to the "Network" tab.
  • Load the page you want.
  • Look for the first item requested (it should have a document icon, with the address shown below it; you can also sort by "Timeline" to find it).
  • Right click on the item, select "Copy as cURL"

Paste this into a text editor and compare what your browser sent to get the page with the simple curl command you ran.



curl "http://stackoverflow.com/questions/25333342/viewing-page-source-shows-different-html-than-curl" -H "Accept-Encoding: gzip,deflate,sdch" -H "Accept-Language: en-US,en;q=0.8" -H "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8" -H "Referer: http://stackoverflow.com/questions?page=2&sort=newest" -H "Cookie: <cookies redacted because lulz>" -H "Connection: keep-alive" -H "Cache-Control: max-age=0" --compressed

      

Things like the language header, the user agent (more or less the browser and OS in use), and in some cases even whether a compressed response was requested can all cause the server to generate the page differently. This could be a perfectly normal reaction (such as serving browser-specific HTML only to that browser) or part of higher-level A/B testing of new projects or features. Chances are the content returned to you for a URL differs from what someone else gets, or even from what you get with a different browser or tool.
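Since the asker is working in Python, here is a rough sketch of reusing the browser's headers from there. It assumes the command was copied in the bash-style quoted form shown above; headers_from_curl and build_request are hypothetical helper names, not part of any library.

```python
import shlex
import urllib.request


def headers_from_curl(command):
    """Extract the -H/--header values of a copied curl command into a dict."""
    tokens = shlex.split(command)
    headers = {}
    # Pair every token with its successor so each flag sees its argument.
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-H", "--header") and ":" in value:
            name, _, content = value.partition(":")
            headers[name.strip()] = content.strip()
    return headers


def build_request(url, headers):
    """Build (but do not send) a request that carries the given headers."""
    # Accept-Encoding is dropped because urllib does not decompress
    # gzip/deflate responses on its own.
    kept = {k: v for k, v in headers.items() if k.lower() != "accept-encoding"}
    return urllib.request.Request(url, headers=kept)
```

Passing the result of build_request to urllib.request.urlopen sends the request. Comparing that body with the one from a plain request to the same URL should show whether the headers are what makes the difference.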

I should also point out that what you see on the page is not what "View Source" shows. The source is what was sent to your browser for rendering. What you actually see on the page is the result after rendering and running JavaScript. Most browsers offer some sort of "Inspect" feature in the right-click menu; I suggest you look at the page with it and compare that with what View Source displays. It will change your perspective on how the web works.

+3




I don't know whether you have found your answer yet, but I have a possible solution. The difference could be because the server is returning a redirect such as a 301. The code below is plain C (libcurl), so adapt it as needed.

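curl_easy_setopt(curl, CURLOPT_NOPROGRESS, 0L);         // show the progress meter
curl_easy_setopt(curl, CURLOPT_VERBOSE, 1L);            // print request/response details to see what is happening
curl_easy_setopt(curl, CURLOPT_USERAGENT, curlversion); // curlversion: your own variable holding the user-agent string
curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);     // optional: follow redirects; toggle to compare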

      

Try the last option both on and off, and compare curl's output with what the browser shows in each case.

You can also get verbose output from the command line:

curl -v http://myurl > page.html

      

See the difference. This should help.
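The same redirect check can be sketched in Python with the standard library's http.client, which does not follow redirects on its own. The function names below are made up for this example, and the set of status codes is just the common redirect codes, not an exhaustive list.

```python
import http.client
from urllib.parse import urlsplit

# Common HTTP redirect status codes (an assumption for this sketch).
REDIRECT_CODES = {301, 302, 303, 307, 308}


def describe_first_response(status, location):
    """Summarise one response: is it a redirect, and where does it point?"""
    if status in REDIRECT_CODES and location:
        return f"{status} redirect to {location}"
    return f"{status} (no redirect)"


def fetch_without_following(url, timeout=10):
    """Request url once, without following redirects, and describe the reply."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        response = conn.getresponse()
        return describe_first_response(response.status,
                                       response.getheader("Location"))
    finally:
        conn.close()
```

If fetch_without_following("http://myurl") reports a redirect, a plain curl call without -L saved only the redirect response, while the browser went on to load the target page.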

+1

