High availability network interruption resilience

obones · February 16, 2023, 4:04pm

Hello,

I have a setup where a server accepts connections from a variety of clients and, in its normal course of operation, one call from a client triggers events towards other clients.
The clients receiving those events must reply to the server to confirm they have received them, and in parallel they may start their own processing, which can consume all cores of the CPU where they run. During these heavy usage times, they should continue to receive other events and react to them properly by sending the “ack” message.
While this works in most hardware configurations, we are facing some issues when the clients using all CPU cores are on High Availability virtual machines that get moved to other hosts because of the CPU usage.
When this happens, there might be some networking glitches and the client is no longer capable of communicating with its server. This means that the server no longer sees the “acks” it expects from its clients and all hell breaks loose. The only way we have to restore functionality is to kill the clients and restart them, which proves to be quite complicated when they are running on machines outside of our end users’ control. Our clients’ IT managers are not too happy having to do that kind of maintenance and, despite us telling them to “pin” the machines to a host, they insist on us being resilient to such a situation.

This is thus the subject of this message: Do you have any suggestions as to how to handle such a case properly?
We are using RO SDK 10.0.0.1489 with a few hot fixes, but it’s still quite old. Do you know if there were changes in the latest versions that could address the situation described above?

I would appreciate any hints or suggestions here, as I’m running a bit out of ideas.

EvgenyK · February 17, 2023, 7:43am

Hi,

Can you provide more details, please?

Your server platform (.NET or Delphi)
What EventRepository is used (InMemory, Olympia, etc.)
What channel is used (http, tcp, supertcp, superhttp)
how do you handle

obones · February 17, 2023, 9:22am

Ah yes, sorry for the missing details, here they are:

All made using Delphi 10.4
InMemory event repository
TROSynapseSuperTCPServer and TROSynapseSuperTCPChannel all communicating with a TROBinMessage
Clients notify the server of having received an event via dedicated service methods available on the server:

  CoClientRepliesService.
    Create(IROmessageCloneable(FReplyMessage).Clone, FChannel).
      ReplyEmptyResult(FId, RequestId);

The Clone is here because lots of replies can happen in parallel, and it was discovered (via other discussions here) that a message is not thread-safe when used to communicate with a service.

EvgenyK · February 17, 2023, 10:15am

Hi,

This was fixed some time ago, so it is now thread-safe: Clone is called internally automatically.

I can recommend trying Olympia and its SessionManager/EventRepository, because it has built-in protection for events: events are removed from the queue once confirmation of receipt arrives from the client side. As a result, you shouldn’t need to call the ClientRepliesService service, which will reduce CPU usage on the server side.
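
To illustrate the idea of keeping an event queued until the client confirms it, here is a minimal sketch in plain Delphi. It is not Olympia's implementation and it uses no RO SDK types: `TPendingEvents`, `Publish`, `Confirm` and `PendingCount` are hypothetical names made up for this example, and the sketch is single-threaded (a real server would need locking around the dictionary).

```delphi
program PendingAckSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Generics.Collections;

type
  // Hypothetical helper for illustration only; not part of RO SDK or Olympia.
  TPendingEvents = class
  private
    FPending: TDictionary<Integer, string>; // event id -> payload
  public
    constructor Create;
    destructor Destroy; override;
    // Store the event until the client confirms it.
    procedure Publish(AEventId: Integer; const APayload: string);
    // Drop the event once the client has confirmed it.
    // Returns False if the id was unknown or already confirmed.
    function Confirm(AEventId: Integer): Boolean;
    function PendingCount: Integer;
  end;

constructor TPendingEvents.Create;
begin
  inherited Create;
  FPending := TDictionary<Integer, string>.Create;
end;

destructor TPendingEvents.Destroy;
begin
  FPending.Free;
  inherited;
end;

procedure TPendingEvents.Publish(AEventId: Integer; const APayload: string);
begin
  FPending.AddOrSetValue(AEventId, APayload);
end;

function TPendingEvents.Confirm(AEventId: Integer): Boolean;
begin
  Result := FPending.ContainsKey(AEventId);
  if Result then
    FPending.Remove(AEventId);
end;

function TPendingEvents.PendingCount: Integer;
begin
  Result := FPending.Count;
end;

var
  Events: TPendingEvents;
begin
  Events := TPendingEvents.Create;
  try
    Events.Publish(1, 'first');
    Events.Publish(2, 'second');
    // The client confirms event 1, then the network glitches before
    // it can confirm event 2.
    Events.Confirm(1);
    // Event 2 is still pending, so it can be delivered again once the
    // client reconnects.
    Writeln('Pending events: ', Events.PendingCount);
  finally
    Events.Free;
  end;
end.
```

The point of the design is that an unconfirmed event is never lost on the server side: a client that drops off the network for a while simply finds its unconfirmed events still waiting when it comes back.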

obones · February 22, 2023, 8:08am

I’ll try Olympia, but there is one thing that the example above did not show: most event replies have a useful payload associated with them in the form of a series of arguments.
I thus believe I won’t be able to get rid of the Reply methods, but maybe Olympia will add more reliability in this respect.

EvgenyK · February 22, 2023, 9:07am

Hi,

OlympiaEventManager supports IROValidatedSessionsChangesListener.
As a result, the client notifies the server that the specified event was received.

Try testing with the latest version (.1555).