Fast, efficient handling of HTTP requests across multiple threads in .NET.

I have been working with .NET since its inception and have been doing parallel programming long before that; however, I cannot explain this phenomenon. The code below runs on a production system and does its job for the most part; I am asking mainly to get a better understanding of what is happening.

I am passing 10 URLs for parallel processing like this:

    public static void ProcessInParellel(IEnumerable<ArchivedStatus> statuses, 
                                         StatusRepository statusRepository, 
                                         WaitCallback callback, 
                                         TimeSpan timeout)
    {
        List<ManualResetEventSlim> manualEvents = new List<ManualResetEventSlim>(statuses.Count());

        try
        {
            foreach (ArchivedStatus status in statuses)
            {
                manualEvents.Add(new ManualResetEventSlim(false));
                ThreadPool.QueueUserWorkItem(callback,
                                             new State(status, manualEvents[manualEvents.Count - 1], statusRepository));
            }

            if (!(WaitHandle.WaitAll((from m in manualEvents select m.WaitHandle).ToArray(), timeout, false))) 
                throw ThreadPoolTimeoutException(timeout);
        }
        finally
        {
            Dispose(manualEvents);
        }
    }
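
For context, the call site (inside SyncAllUsers, which shows up in the stack trace below) boils down to something like this; the exact names here are illustrative:

    // Illustrative call only: roughly 10 statuses, ProcessEntry as the callback, 60-second overall timeout.
    ProcessInParellel(statuses, statusRepository, ProcessEntry, TimeSpan.FromSeconds(60));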


The callback looks something like this:

    public static void ProcessEntry(object state)
    {
        State stateInfo = state as State;

        try
        {
            using (new LogTimer(new TimeSpan(0, 0, 6)))
            {
               GetFinalDestinationForUrl(<someUrl>);
            }
        }
        catch (System.IO.IOException) { }
        catch (Exception ex)
        {

        }
        finally
        {
            if (stateInfo.ManualEvent != null)
                stateInfo.ManualEvent.Set();
        }
    }


Each of the callbacks looks at the URL and follows a series of redirects (AllowAutoRedirect is intentionally set to false to handle cookies):

    public static string GetFinalDestinationForUrl(string url, string cookie)
    {
        if (!urlsToIgnore.IsMatch(url))
        {
            HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
            request.AllowAutoRedirect = false;
            request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
            request.Method = "GET";
            request.KeepAlive = false;
            request.Pipelined = false;
            request.Timeout = 5000;

            if (!string.IsNullOrEmpty(cookie))
                request.Headers.Add("cookie", cookie);

            try
            {
                string html = null, location = null, setCookie = null;

                using (WebResponse response = request.GetResponse())
                using (Stream stream = response.GetResponseStream())
                using (StreamReader reader = new StreamReader(stream))
                {
                    html = reader.ReadToEnd();
                    location = response.Headers["Location"];
                    setCookie = response.Headers[System.Net.HttpResponseHeader.SetCookie];
                }

                if (null != location)
                    return GetFinalDestinationForUrl(GetAbsoluteUrlFromLocationHeader(url, location),
                                                    (!string.IsNullOrEmpty(cookie) ? cookie + ";" : string.Empty) + setCookie);



                return CleanUrl(url);
            }
            catch (Exception ex)
            {
                if (AttemptRetry(ex, url))
                    throw;
            }
        }

        return ProcessedEntryFlag;
    }


I have a fairly accurate StopWatch-based timer (the LogTimer above) around the recursive GetFinalDestinationForUrl call, with a threshold of 6 seconds, and the callbacks usually complete within that time.
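
LogTimer is a small helper of ours; a minimal sketch of the idea (not the actual implementation, which isn't shown here) would be:

    // Minimal sketch only; the real LogTimer is not shown in the question and has
    // more overloads (e.g. one taking a name and a URL, used further below).
    public sealed class LogTimer : IDisposable
    {
        private readonly System.Diagnostics.Stopwatch watch = System.Diagnostics.Stopwatch.StartNew();
        private readonly TimeSpan threshold;

        public LogTimer(TimeSpan threshold)
        {
            this.threshold = threshold;
        }

        public void Dispose()
        {
            watch.Stop();
            if (watch.Elapsed > threshold)
                System.Diagnostics.Trace.WriteLine(
                    string.Format("Operation exceeded threshold: {0} (threshold {1})", watch.Elapsed, threshold));
        }
    }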

However, WaitAll, even with a generous timeout (0, 0, 60) for 10 work items, still times out regularly.

The exception prints something like:

System.Exception: Not all streams returned in 60 seconds: Max. Worker: 32767, Max. I/O: 1000, Available Worker: 32764, Available I/O: 1000 at Work.Threading.ProcessInParellel(IEnumerable`1 statuses, StatusRepository statusRepository, WaitCallback callback, TimeSpan timeout) at Work.UrlExpanderWorker.SyncAllUsers()

This runs on .NET 4, with maxConnections set to 100 for all URLs.
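
(For what it's worth, assuming the limit above is the connectionManagement setting in config, its programmatic counterpart can be set like this, before any requests are issued:)

    // Affects only ServicePoints created after this line, so set it early;
    // this is the code counterpart of the maxconnection config entry.
    System.Net.ServicePointManager.DefaultConnectionLimit = 100;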

My only theory is that a synchronous HttpWebRequest call can block for longer than its specified timeout; that is the only reasonable explanation I can come up with. The question is why, and what is the best way to enforce a real timeout on this operation?

Yes, I know that each recursive call specifies a timeout of 5 seconds and that it may take multiple calls to process a given URL, but I almost never see the StopWatch warnings. For every 20-30 WaitAll timeout errors I see, there is maybe one message indicating that a given callback took more than 6 seconds. If the problem were that the 10 work items cumulatively need more than 60 seconds, I would expect to see at least a 1:1 ratio (if not higher) between the two kinds of messages.

UPDATE (March 30, 2012):

I can confirm that the network calls themselves do not honor the timeouts under certain circumstances:

            Uri uri = new Uri(url);
            HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(uri);
            request.AllowAutoRedirect = false;
            request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
            request.Method = "GET";
            request.KeepAlive = false;
            request.Pipelined = false;
            request.Timeout = 7000;
            request.CookieContainer = cookies;

            try
            {
                string html = null, location = null;

                using (new LogTimer("GetFinalDestinationForUrl", url, new TimeSpan(0, 0, 10)))
                    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
                    using (Stream stream = response.GetResponseStream())
                    using (StreamReader reader = new StreamReader(stream))
                    {
                        html = reader.ReadToEnd();
                        location = response.Headers["Location"];
                        cookies = Combine(cookies, response.Cookies);

                        if (response.ContentLength > 150000 && !response.ContentType.ContainsIgnoreCase("text/html"))
                            log.Warn(string.Format("Large request ({0} bytes, {1}) detected at {2} on level {3}.", response.ContentLength, response.ContentType, url, level));
                    }
            }
            catch (Exception ex)
            {
                // Assumed continuation: the excerpt above ends before the exception handling.
                log.Warn(string.Format("Request for {0} failed: {1}", url, ex.Message));
            }


This code regularly logs requests that took 5-6 minutes, and they were not flagged as large (over 150,000 bytes). And I am not talking about an isolated server here or there; these are assorted (well-known) media sites.

What exactly is going on here, and how can I ensure that these calls return within a reasonable time frame?
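
A possibly relevant detail: HttpWebRequest.Timeout is documented to cover only GetResponse() and GetRequestStream(), while reads on the response stream fall under ReadWriteTimeout, which defaults to 5 minutes; that lines up suspiciously well with the 5-6 minute entries above. A sketch of the kind of guard I could add (the hard Abort() timer is my own idea here, not something in the production code):

    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.AllowAutoRedirect = false;
    request.Timeout = 5000;           // applies to GetResponse()/GetRequestStream() only
    request.ReadWriteTimeout = 5000;  // applies to reads on the response stream (default: 300,000 ms)

    // Last-resort hard stop: a blocked GetResponse()/Read() then fails with a WebException (RequestCanceled).
    using (System.Threading.Timer hardStop =
               new System.Threading.Timer(_ => request.Abort(), null, 15000, System.Threading.Timeout.Infinite))
    using (WebResponse response = request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream))
    {
        string html = reader.ReadToEnd();
    }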



2 answers


I agree with Aliostad. I don't see any glaring problems with the code. Do you have any locks that are causing these work items to be serialized? I don't see anything on the surface, but it's worth double-checking if your code is more complex than what you posted. You will want to add logging code to capture the times when these HTTP requests start and finish. Hopefully that will give you some more hints.

On an unrelated note, I usually avoid WaitHandle.WaitAll. It has some limitations, such as a maximum of 64 handles, and it does not work on an STA thread. I use this pattern instead:



    using (var finished = new CountdownEvent(1))
    {
      foreach (var item in workitems)
      {
        var capture = item;
        finished.AddCount();
        ThreadPool.QueueUserWorkItem(
          _ =>
          {
            try
            {
              ProcessWorkItem(capture);
            }
            finally
            {
              finished.Signal();
            }
          });
      }
      finished.Signal();
      if (!finished.Wait(timeout))
      {
        throw new ThreadPoolTimeoutException(timeout);
      }
    }
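
The initial count of 1, together with the extra Signal() after the loop, keeps Wait from returning early if the first work items complete before the rest have been queued. It also sidesteps the 64-handle limit entirely, since there is only one event to wait on.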




I have gone through your entire code and, as far as I can tell, I see no problem with it.

So the problem seems to lie elsewhere. To track it down, I suggest the following:

Write trace, debug, or console output at the beginning and at the end of GetFinalDestinationForUrl, and include the URL in the trace. This will help you pin down the problem: whether HttpWebRequest is ignoring your 5-second timeout, or .NET is not honoring your 100 concurrent connections. Update your question with the result and I'll review it again.
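
As an example of the kind of tracing I mean (names and format are just illustrative):

    // Illustrative only: bracket GetFinalDestinationForUrl with timestamped trace lines
    // that include the thread id and the URL.
    Trace.WriteLine(string.Format("{0:o} [{1}] BEGIN {2}",
        DateTime.UtcNow, Thread.CurrentThread.ManagedThreadId, url));
    // ... existing body of GetFinalDestinationForUrl ...
    Trace.WriteLine(string.Format("{0:o} [{1}] END   {2}",
        DateTime.UtcNow, Thread.CurrentThread.ManagedThreadId, url));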


UPDATE



I've reviewed your new findings. Well done isolating the problem: your update now confirms that the timeouts are not being honored.

This now looks like a Microsoft support issue and is worth raising with them, unless others can spot a problem in the details. (It would be worth asking Eric Lippert and Jon Skeet to read this question.)

In my personal experience, even when I sent them code to reproduce an issue and they reproduced it, I got no answer. But that was BizTalk; this is the .NET Framework itself, so I think you will probably get a better response.


My rough theory

I also have a rough feeling that under high load, with a lot of context switching, the waiting thread does not get scheduled for much longer than expected, so it does not get a chance to time out and abort all those threads. Another theory is that threads that are busy with their I/O take longer to interrupt and do not respond to the interruption. As I said, this is rough; proving or solving it is well beyond my competence.
