How can I simulate a web browser so that the site serves me the correct HTML source?
I'm trying to scrape a website, and it seems to feed me fake HTML when I use the WebClient.DownloadData() method.
Is there a way for me to "trick" the website into thinking that I am a browser?
Edit:
Adding this header still doesn't fix the problem:
Client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
Is there anything else I can try? :)
Edit 2:
If it helps at all, I am trying to download the source of a ThePirateBay search page.
This URL: http://thepiratebay.org/search/documentary/0/7/200
As you can see in a browser, the page source shows what I need (seed information for movies, etc.). But when I use the DownloadData() method, I get random torrent results, nothing related to what I am looking for.
I may have missed something, but the following code worked without issue:
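// Requires: using System; using System.Globalization; using System.Net;
//           using System.Text.RegularExpressions;

// Matches one result row: detail link, name, torrent link, and size with its unit.
Regex torrents = new Regex(
    @"<tr>[\s\S]*?<td><a href=""(?<link>.*?)"".*?>" +
    @"(?<name>.*?)</a>[\s\S]*?<td><a href=""(?<torrent>.*?)""[\s\S]*?>" +
    @"(?<size>\d+\.?\d*) (?<unit>.)iB</td>");

Uri url = new Uri("http://thepiratebay.org/search/documentary/0/7/200");
WebClient client = new WebClient();
string html = client.DownloadString(url);
// Equivalent using DownloadData, as in the question (needs using System.Text):
//string html = Encoding.Default.GetString(client.DownloadData(url));

foreach (Match torrent in torrents.Matches(html))
{
    // Parse the size with the invariant culture so "1.5" is read the same on every locale.
    Console.WriteLine("{0} ({1:0.00}{2}b)",
        torrent.Groups["name"].Value,
        Double.Parse(torrent.Groups["size"].Value, CultureInfo.InvariantCulture),
        torrent.Groups["unit"].Value);
    // The detail link is relative, so resolve it against the search URL.
    Console.WriteLine("\t{0}",
        new Uri(url, torrent.Groups["link"].Value).LocalPath);
    Console.WriteLine("\t{0}",
        new Uri(torrent.Groups["torrent"].Value).LocalPath);
}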
HTTP is a text-based protocol that is very human readable! Connect to the site using telnet and type in the HTTP requests by hand. This gives you complete control over the user-agent string and every other header you send. It's also very simple.
Once a hand-typed request returns the page you expect, you can send the same request from your application with very simple socket programming.
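As a rough illustration of the socket approach, here is a minimal sketch that sends a hand-built GET request over a plain TCP connection. The class and method names (RawHttpSketch, BuildRequest, Fetch) are made up for this example, the host, path and user-agent string are simply the ones from the question, and it assumes the site still answers plain HTTP on port 80.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

// Hypothetical helper for illustration; not part of any library.
static class RawHttpSketch
{
    // Builds the same text you would type into a telnet session.
    public static string BuildRequest(string host, string path, string userAgent)
    {
        return "GET " + path + " HTTP/1.1\r\n" +
               "Host: " + host + "\r\n" +
               "User-Agent: " + userAgent + "\r\n" +
               "Accept: text/html\r\n" +
               "Connection: close\r\n" +   // ask the server to close when done
               "\r\n";                     // blank line ends the header section
    }

    // Sends the request over a plain TCP socket and returns the raw response
    // (status line, headers and body) exactly as the server sent it.
    public static string Fetch(string host, int port, string path, string userAgent)
    {
        using (TcpClient tcp = new TcpClient(host, port))
        using (NetworkStream stream = tcp.GetStream())
        {
            byte[] request = Encoding.ASCII.GetBytes(BuildRequest(host, path, userAgent));
            stream.Write(request, 0, request.Length);

            // Read until the server closes the connection.
            using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }
    }

    static void Main()
    {
        // Assumption: host, path and user agent are taken from the question.
        string response = Fetch("thepiratebay.org", 80, "/search/documentary/0/7/200",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
        Console.WriteLine(response);
    }
}
```

The response is printed raw, so the body may arrive in chunked transfer encoding and would need decoding before you parse it. Comparing this output with what the browser shows is a quick way to see which header, if any, changes the page you get back.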
More details: http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
I would post links to the RFC and Wikipedia page on the user-agent string, but I just joined.