How can I simulate a web browser so that the site serves me the correct HTML source?

I'm trying to scrape a website, and it seems to serve me fake HTML when I fetch it with the WebClient.DownloadData() method.

Is there a way for me to "trick" the website into thinking that I am a browser?

Edit:

Adding this header still doesn't fix the problem:

Client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");

Is there anything else I can try? :)

Edit 2:

If it helps at all: I am trying to download the source of a ThePirateBay search results page.

This URL: http://thepiratebay.org/search/documentary/0/7/200

As you can see in a browser, the page source contains what is needed: seed information for each torrent, etc. But when I use the DownloadData() method, I get random torrent results, nothing related to what I am searching for.


5 answers


Try adding a custom User-Agent header so that the site thinks you are one of the major browsers (IE, Firefox, etc.):



client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");


I may have missed something, but the following code worked without issue:



using System;
using System.Globalization;
using System.Net;
using System.Text.RegularExpressions;

Regex torrents = new Regex(
    @"<tr>[\s\S]*?<td><a href=""(?<link>.*?)"".*?>" + 
    @"(?<name>.*?)</a>[\s\S]*?<td><a href=""(?<torrent>.*?)""[\s\S]*?>" + 
    @"(?<size>\d+\.?\d*)&nbsp;(?<unit>.)iB</td>");
Uri url = new Uri("http://thepiratebay.org/search/documentary/0/7/200");

WebClient client = new WebClient();
string html = client.DownloadString(url);
//string html = Encoding.Default.GetString(client.DownloadData(url));

foreach (Match torrent in torrents.Matches(html))
{
    Console.WriteLine("{0} ({1:0.00}{2}b)", 
        torrent.Groups["name"].Value, 
        // Parse with the invariant culture so the "." decimal separator
        // works regardless of the machine's locale.
        Double.Parse(torrent.Groups["size"].Value, CultureInfo.InvariantCulture), 
        torrent.Groups["unit"].Value);
    Console.WriteLine("\t{0}", 
        new Uri(url, torrent.Groups["link"].Value).LocalPath);
    Console.WriteLine("\t{0}",
        new Uri(torrent.Groups["torrent"].Value).LocalPath);
}



HTTP is a text-based protocol that is very human-readable. Connect to the site using telnet and type the HTTP request in by hand; that gives you complete control over the User-Agent string and every other header, and it is very simple.

Once you have it working manually, you can add the same functionality to your application with some basic socket programming.

More details: http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol

I would post links to the RFC and Wikipedia page on the user-agent string, but I just joined.
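As a sketch of the manual approach described above (the names BuildRequest and Fetch are my own helpers, not a library API, and this assumes the site serves plain HTTP on port 80 with an uncompressed, unchunked response; a real client would need to handle those cases):

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

class RawHttp
{
    // Build the same request you would type into a telnet session:
    // request line, headers, then a blank line to end the headers.
    public static string BuildRequest(string host, string path, string userAgent)
    {
        var sb = new StringBuilder();
        sb.Append("GET " + path + " HTTP/1.1\r\n");
        sb.Append("Host: " + host + "\r\n");
        sb.Append("User-Agent: " + userAgent + "\r\n");
        sb.Append("Connection: close\r\n");
        sb.Append("\r\n");
        return sb.ToString();
    }

    // Send the request over a raw TCP socket and return the entire
    // response: status line, headers, and body.
    public static string Fetch(string host, string path, string userAgent)
    {
        using (var client = new TcpClient(host, 80))
        using (var stream = client.GetStream())
        {
            byte[] request = Encoding.ASCII.GetBytes(BuildRequest(host, path, userAgent));
            stream.Write(request, 0, request.Length);
            using (var reader = new StreamReader(stream, Encoding.ASCII))
                return reader.ReadToEnd();
        }
    }
}
```

The string Fetch returns includes the response headers; you would strip everything up to the first blank line to get just the HTML.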



Set the User-Agent header on the WebClient before making the request:

WebClient client = new WebClient();
client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");



Try printing out the WebClient's headers; maybe something odd is in there by default, or something a real browser always sends is missing, which tells the site you are not a browser.
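A minimal sketch of that check. Note that a fresh WebClient's header collection starts out empty, so (at least in .NET Framework) no User-Agent is sent at all unless you add one, which is itself a giveaway to the site:

```csharp
using System;
using System.Net;

class HeaderDump
{
    static void Main()
    {
        WebClient client = new WebClient();
        client.Headers.Add("user-agent",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");

        // Headers only contains what you have added yourself, so this
        // mostly confirms what your own code set before the request.
        foreach (string key in client.Headers.AllKeys)
            Console.WriteLine(key + ": " + client.Headers[key]);
    }
}
```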
