How can I simulate a web browser so that the site serves me the correct HTML source?

I'm trying to scrape a website, and it seems to serve me fake HTML when I fetch it with the WebClient.DownloadData() method.

Is there a way for me to "trick" the website into thinking that I am a browser?

Edit:

Adding this header still doesn't fix the problem:

Client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");

Is there anything else I can try? :)

Edit 2:

If it helps at all: I am trying to download the source of a ThePirateBay search results page.

This URL: http://thepiratebay.org/search/documentary/0/7/200

As you can see in a browser, the page source contains what is needed: seed information for each torrent, etc. But when I use the DownloadData() method, I get random torrent results, nothing related to what I am searching for.


5 answers


Try adding a custom User-Agent header so that the site thinks you are one of the major browsers (IE, Firefox, etc.):



client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");


I may have missed something, but the following code worked without issue:



using System;
using System.Globalization;
using System.Net;
using System.Text.RegularExpressions;

Regex torrents = new Regex(
    @"<tr>[\s\S]*?<td><a href=""(?<link>.*?)"".*?>" + 
    @"(?<name>.*?)</a>[\s\S]*?<td><a href=""(?<torrent>.*?)""[\s\S]*?>" + 
    @"(?<size>\d+\.?\d*)&nbsp;(?<unit>.)iB</td>");
Uri url = new Uri("http://thepiratebay.org/search/documentary/0/7/200");

WebClient client = new WebClient();
string html = client.DownloadString(url);
//string html = Encoding.Default.GetString(client.DownloadData(url));

foreach (Match torrent in torrents.Matches(html))
{
    Console.WriteLine("{0} ({1:0.00}{2}b)", 
        torrent.Groups["name"].Value, 
        // Parse with the invariant culture so the "." decimal separator
        // works regardless of the machine's locale.
        Double.Parse(torrent.Groups["size"].Value, CultureInfo.InvariantCulture), 
        torrent.Groups["unit"].Value);
    Console.WriteLine("\t{0}", 
        new Uri(url, torrent.Groups["link"].Value).LocalPath);
    Console.WriteLine("\t{0}",
        new Uri(torrent.Groups["torrent"].Value).LocalPath);
}



HTTP is a text-based protocol that is very human-readable. Connect to the site using telnet and type the HTTP request in by hand; that gives you complete control over the User-Agent string and every other header, and it is very simple.

Once you have it working manually, you can add the same functionality to your application with some basic socket programming.

More details: http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol

I would post links to the RFC and Wikipedia page on the user-agent string, but I just joined.
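As a sketch of the manual approach described above (the names BuildRequest and Fetch are my own helpers, not a library API, and this assumes the site serves plain HTTP on port 80 with an uncompressed, unchunked response; a real client would need to handle those cases):

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

class RawHttp
{
    // Build the same request you would type into a telnet session:
    // request line, headers, then a blank line to end the headers.
    public static string BuildRequest(string host, string path, string userAgent)
    {
        var sb = new StringBuilder();
        sb.Append("GET " + path + " HTTP/1.1\r\n");
        sb.Append("Host: " + host + "\r\n");
        sb.Append("User-Agent: " + userAgent + "\r\n");
        sb.Append("Connection: close\r\n");
        sb.Append("\r\n");
        return sb.ToString();
    }

    // Send the request over a raw TCP socket and return the entire
    // response: status line, headers, and body.
    public static string Fetch(string host, string path, string userAgent)
    {
        using (var client = new TcpClient(host, 80))
        using (var stream = client.GetStream())
        {
            byte[] request = Encoding.ASCII.GetBytes(BuildRequest(host, path, userAgent));
            stream.Write(request, 0, request.Length);
            using (var reader = new StreamReader(stream, Encoding.ASCII))
                return reader.ReadToEnd();
        }
    }
}
```

The string Fetch returns includes the response headers; you would strip everything up to the first blank line to get just the HTML.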



Set the User-Agent header on the WebClient before making the request:

WebClient client = new WebClient();
client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");



Try printing out the WebClient's headers; maybe something odd is in there by default, or something a real browser always sends is missing, which tells the site you are not a browser.
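A minimal sketch of that check. Note that a fresh WebClient's header collection starts out empty, so (at least in .NET Framework) no User-Agent is sent at all unless you add one, which is itself a giveaway to the site:

```csharp
using System;
using System.Net;

class HeaderDump
{
    static void Main()
    {
        WebClient client = new WebClient();
        client.Headers.Add("user-agent",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");

        // Headers only contains what you have added yourself, so this
        // mostly confirms what your own code set before the request.
        foreach (string key in client.Headers.AllKeys)
            Console.WriteLine(key + ": " + client.Headers[key]);
    }
}
```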
