How many Java HttpURLConnections should I open at the same time?

I am writing a multithreaded Java crawler. From what I understand of the web, when a user loads a webpage, the browser requests the first document (such as index.html) and, as the HTML comes in, finds the other resources to include (images, CSS, JS) and requests them concurrently.

My crawler only requests the original document. For some reason, I can't get it to scrape more than 2-5 pages every 5 seconds. I spawn a new thread for every HttpURLConnection I make. It seems like I should be able to scrape at least 20-40 pages per second. If I try to spin up 100 threads, I get I/O exceptions like crazy. Any ideas what's going on?

+2




4 answers


It would be nice to see your code, since you might be doing something slightly wrong that breaks your crawler. As a general rule of thumb, though, asynchronous I/O is far superior to the blocking I/O that HttpURLConnection offers. Asynchronous I/O allows all the processing to be handled in a single thread, while the actual I/O is done by the operating system in due time.



For a good implementation of the HTTP protocol over asynchronous I/O, take a look at Apache HttpCore. See an example of such a client here.
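As a rough illustration of the non-blocking style described above, here is a minimal sketch. It does not use Apache HttpCore; it swaps in the JDK's own `java.net.http.HttpClient` (Java 11+), which manages its own internal threads rather than running strictly on one. The class name `AsyncFetcher`, the method `fetchStatuses`, and the timeout values are assumptions made up for this example.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Hypothetical helper: starts all requests without blocking, then waits for them.
class AsyncFetcher {
    static List<Integer> fetchStatuses(List<URI> uris) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5)) // assumed value
                .build();

        // sendAsync returns immediately; no thread is parked per connection here.
        List<CompletableFuture<Integer>> pending = uris.stream()
                .map(uri -> HttpRequest.newBuilder(uri)
                        .timeout(Duration.ofSeconds(10)) // assumed value
                        .GET()
                        .build())
                .map(request -> client
                        .sendAsync(request, HttpResponse.BodyHandlers.ofString())
                        .thenApply(response -> response.statusCode()))
                .collect(Collectors.toList());

        // Only now block, once every request is already in flight.
        return pending.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }
}
```

A real crawler would process each body in the `thenApply` stage instead of returning only the status code, and would handle failed futures rather than letting `join` throw.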

+1




Details about what kind of IOExceptions you are getting would be handy. There are several possibilities to consider.



  • Running out of open file descriptors (too many sockets).
  • Failed connections because too many connections were opened to a specific server.
  • Pulling down too much data before processing any of it. Assuming blocking I/O, if you make 100 requests to 100 different servers, you suddenly get a flood of data coming back to you. HTTP GET requests are small; the responses, maybe not. You can effectively DDoS yourself.
  • You made a silly mistake in your code :)
0




The best number of threads or HttpURLConnections depends on many factors.

  • If you are crawling an external website that you do not own, you should use only one thread, with a delay between requests. Otherwise, the website might interpret the traffic as a DoS attack. In the meantime, it may be wise to crawl other websites.
  • If this is your own site without DoS detection, it depends on network latency. If the web server is on your local network, it can be useful to use twice as many threads as CPU cores. If the web server is on the Internet, it might be useful to use a few more threads. But 100 threads is too many; this could knock your web server out. How many workers does the web server have?
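The advice above (a small, fixed number of workers plus a pause between requests) could be sketched roughly as follows. The class `BoundedCrawler`, its `crawl` method, and the `fetcher` callback are hypothetical names for this example; the right thread count and delay depend on the server being crawled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: caps concurrency with a fixed pool instead of one thread per URL.
class BoundedCrawler {
    static <T> List<T> crawl(List<String> urls, int threads, long delayMillis,
                             Function<String, T> fetcher)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(pool.submit(() -> {
                    T result = fetcher.apply(url);
                    // Pause before this worker picks up its next URL.
                    Thread.sleep(delayMillis);
                    return result;
                }));
            }
            // Collect results in the same order the URLs were given.
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

With `threads` set to 1 this behaves like the single-threaded, delayed crawl suggested for external sites; for your own server, the value can be raised gradually while watching for errors.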
0




Oh, and I hope you are calling close() on the input streams you get from the connections. They do get closed in the connection's finalizer anyway, but that could easily be a few seconds later. I ran into this problem myself, so maybe this will help you.
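One way to make sure the stream is always closed is try-with-resources. The sketch below assumes a hypothetical class `StreamClosingFetcher`; the timeout values are illustrative only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper showing deterministic stream cleanup.
class StreamClosingFetcher {
    /** Reads the whole stream and closes it, even if reading throws. */
    static byte[] readAndClose(InputStream source) throws IOException {
        try (InputStream in = source) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Fetches a URL with HttpURLConnection and releases the stream promptly. */
    static byte[] fetch(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5_000);  // assumed value, in milliseconds
        conn.setReadTimeout(10_000);    // assumed value, in milliseconds
        return readAndClose(conn.getInputStream());
    }
}
```

Closing the stream right after reading frees the socket (or returns it to the keep-alive pool) immediately instead of waiting for garbage collection, which matters when many connections are open at once.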

0

