Fatal error on live servers

I am writing some client/server software and I am facing the following design problem. I usually use the VERIFY macro very liberally: if something is wrong on the user's machine, I want the software to fail and log the error so that it can be fixed. I've never been a fan of ignoring errors.

However, I am now writing a server. If the server dies, many clients go down with it, so the server should die as rarely as possible. So I don't know how to deal with conditions that I would otherwise treat as fatal errors.

For example, I receive a network packet from a user who is not logged in. That should never happen, but I have enough experience to know that "impossible" errors do occur from time to time. So I'm pretty sure that if I treat these cases as fatal errors, the server WILL crash eventually. On the other hand, I could log the error, ignore it, and continue, but I'm afraid that some errors might go unnoticed this way.

What would you do in a situation like this?

1 answer


If you can recover from the error, then by definition it was not fatal. I don't see the benefit of crashing when you can log the error and continue execution; the most important thing is that the error gets logged. If you can recover and carry on as usual, that is the best course.
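To make "log and continue" concrete, here is a minimal sketch in Python of a non-fatal counterpart to VERIFY. The names `soft_verify` and `handle_packet`, and the convention that a session of `None` means "not logged in", are assumptions made up for this illustration; they are not taken from the question.

```python
import logging

log = logging.getLogger("server")

def soft_verify(condition, message, *args):
    """Log an error (with a stack trace) if condition is false, but never abort.

    Returns the truth value of condition so the caller can bail out early.
    """
    if not condition:
        log.error("VERIFY failed: " + message, *args, stack_info=True)
    return bool(condition)

def handle_packet(session, packet):
    # Hypothetical model: session is None when the sender is not logged in.
    if not soft_verify(session is not None,
                       "packet from unauthenticated peer: %r", packet):
        return False  # drop this packet, keep serving everyone else
    session.append(packet)  # stand-in for the real packet processing
    return True
```

The point of returning the condition is that the call site still has to decide what "recovering" means (here, dropping the packet), so the error is logged once and handled explicitly instead of being silently swallowed.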

You should also set up a notification system (server monitoring) that, depending on the error level, notifies you with varying degrees of urgency, so that you pick up anything critical as soon as possible. There are widely used monitoring systems for servers, such as Nagios and Munin. Take a look at what they do and see whether you can borrow ideas from them or integrate them into your system.
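As a rough sketch of urgency that depends on the error level, the standard `logging` module lets you attach a handler that only sees records at or above a threshold. The `AlertHandler` class, the "email"/"page" labels, and the `notify` callback are hypothetical; in a real setup the callback might feed an external monitor, which is not shown here.

```python
import logging

class AlertHandler(logging.Handler):
    """Forwards records at or above its level to a notify callback."""

    def __init__(self, notify, level=logging.ERROR):
        super().__init__(level)
        self.notify = notify  # hypothetical callback: notify(urgency, text)

    def emit(self, record):
        # CRITICAL wakes someone up; plain errors can wait for an email.
        urgency = "page" if record.levelno >= logging.CRITICAL else "email"
        self.notify(urgency, self.format(record))

alerts = []  # stand-in for a real notification channel
monitor_log = logging.getLogger("server.monitor")
monitor_log.propagate = False
monitor_log.setLevel(logging.INFO)
monitor_log.addHandler(
    AlertHandler(lambda urgency, text: alerts.append((urgency, text))))

monitor_log.info("client connected")                  # below threshold, no alert
monitor_log.error("packet from unauthenticated peer") # queued as "email"
monitor_log.critical("database unreachable")          # queued as "page"
```

This way the "impossible" conditions from the question are logged at error level and reach a human, without the server having to die to get attention.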



Regardless, you should make sure that client instances are as isolated from each other as possible. A single client's thread should never bring the entire server down (at least in theory).
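One way to picture that isolation, as a simplified single-threaded sketch: wrap each client's processing in its own error boundary, so that an unexpected exception ends only that client's session. `serve_client` and `strict_handler` are hypothetical names, and a real server would run this per thread or per task.

```python
import logging

iso_log = logging.getLogger("server.clients")

def serve_client(client_id, packets, handler):
    """Process one client's packets; a failure ends only this client's session."""
    try:
        for packet in packets:
            handler(client_id, packet)
        return True
    except Exception:
        # Log with traceback, then drop this client instead of crashing.
        iso_log.exception("client %s failed; dropping only this client", client_id)
        return False

def strict_handler(client_id, packet):
    # Hypothetical handler that treats an empty packet as a bug.
    if not packet:
        raise ValueError("empty packet")

incoming = {"a": [b"ok"], "b": [b""], "c": [b"ok", b"ok"]}
results = {cid: serve_client(cid, pkts, strict_handler)
           for cid, pkts in incoming.items()}
```

Client "b" fails and is dropped, while "a" and "c" are served normally.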
