Windows: Overlapped I/O vs I/O Completion Ports, Real-World Performance Numbers

So, I was looking into overlapped socket I/O for a server application I am building, and I keep seeing comments from people saying "never use hEvent" or "I/O completion ports will be faster", etc., but no one ever says WHY not to use hEvent, and no one ever provides any real data or numbers showing that completion ports are faster, or how much faster. hEvent with WaitForMultipleObjects() is better suited to my application, so if the difference in speed is negligible I would rather use it, but I don't want to do that without any real data telling me how big a sacrifice I would be making. I've googled and googled and googled and can't find any benchmarks or articles or ANYTHING comparing the two strategies, other than a few StackOverflow answers saying "don't do this" without giving a reason.

Can anyone provide me with some real information or numbers on the practical, real-world difference between using hEvent and completion ports?

+3




4 answers


This answer comes from Harry Johnston's comment on the question; with a little research I found a few details that make WaitForMultipleObjects a terrifying thing.

The maximum number of objects you can wait on is 64 (MAXIMUM_WAIT_OBJECTS). This alone makes the scalability of the WFMO approach practically non-existent. But looking further, I found this thread: https://groups.google.com/forum/#!topic/comp.os.ms-windows.programmer.win32/okwnsYetF6g



In NT terms, to set up a wait, a wait block must be allocated for each object; each wait block is queued to the object being waited on and then chained onto the thread. When any of those objects is signaled, all the wait blocks must be dequeued, deallocated, and released back to the pool. All of this happens at DISPATCH_LEVEL, and everything except the pool allocate and free happens with the dispatcher lock held.

(WFMO with fAll == TRUE is even MORE expensive. Every time ANY of the objects is signaled, all the others must be checked. All of this, you guessed it, is done at DISPATCH_LEVEL with the dispatcher spinlock held.)

Holding the dispatcher lock like this blocks preemption and thread scheduling across the entire system, even with multiple cores. That is a terrifying and compelling reason never to use WFMO for anything, ever, if you are waiting on more than three objects (a thread has three wait blocks pre-allocated and can avoid much of this overhead if you wait on 3 or fewer).

+1




For maximum performance, you should use I/O completion ports. The number of sockets is not limited. The other similar APIs will only serve around 1024 sockets, and performance degrades quickly, along with higher CPU consumption.

https://msdn.microsoft.com/en-us/library/windows/desktop/aa365198(v=vs.85).aspx

You can also check out this great presentation on asynchronous I/O, which I think anyone contemplating writing large client-server applications should watch.



Asynchronous C++ by Stephen Simpson [ACCU 2017]: https://www.youtube.com/watch?v=Z8tbjyZFAVQ

In this presentation, you will find a complete description and comparison of the available technologies, as well as test results. Well worth the time.

Limiting WaitForMultipleObjects() to 64 handles makes it impractical for handling anything with more than a few I/O threads.

+2




If you use the SO_RCVBUF and SO_SNDBUF options to set the TCP stack's send and receive buffers to zero, you are basically instructing the TCP stack to do I/O directly using the buffer specified in your I/O call. So, in addition to the non-blocking advantage of overlapped sockets, another advantage is better performance, because you save a copy between the TCP stack's buffer and the user buffer for each I/O call. But you need to make sure you do not access the user buffer after it has been submitted for the overlapped operation and before the operation has completed.

This is why overlapped I/O is faster: even if you choose to poll, polling overlapped operations is faster than polling non-blocking sockets, because the network driver fills the buffer your code supplied directly; otherwise the data would first have to be copied out of the driver's buffer, and you would have to wait before you could use the received data.

Use completion ports to avoid polling. Polling can be effective if you are sure a socket has received data, but if you do not know when, or which socket you should check (as in a chat service), completion ports do an excellent job of handling many sockets at the same time: you take action only when data is ready.

I am replying to this question 3 weeks late; I hope you can still use it. Otherwise, I have added another possibility in a separate answer.

+1




I want to give a good example of when you should use polled overlapped I/O.

Imagine a League of Legends-style game server: the server must use UDP to push the current position and current animation state to a dumb client, while the rest, such as kills and deaths, must be sent and received over TCP, so we are sure it will be received.

Server hardware is far from cheap, so we want each server to serve many people at once; we therefore run many game instances, each serving 10 people, on a single machine. Completion ports / overlapped completion are the best solution when you do not know when, or on which socket, data will arrive, and you can never predict when one player will attack another, so the TCP side must use completion ports.

Positions over UDP are a different case, because we know positions need to be updated as fast as the server can handle.

So how do we deal with this? Imagine a thread that loops over 100 game instances, polling their overlapped operations. Polling works here because we know new data will be arriving all the time: if a client has not sent anything yet, we skip it, poll the other 99 instances, then come back and see whether that client has sent something now. The reason this model is better is that with the completion model we could never be sure all clients receive the same priority.

0








