Service recovery with event repositories skipping message acknowledgements

Hi guys,

I’m trying to improve Olympia’s catastrophe-recovery functionality, and while doing so I noticed a case where the message acknowledgement (EventSucceeded on IROValidatedSessionsChangesListener) is never triggered, so events never get a confirmation of acknowledgement.

Following the code, the Callback method in the uROBaseSuperTCPServer unit only sends acknowledgements when the TROEventData is of the named type (TRONamedEventData). This works in most cases, but when a client is disconnected by a server crash and messages are stuck waiting to be delivered, once the server comes back up and the client automatically reconnects, it will pick up any stored events from the repository (in TROSCServerWorker.Connected) but will recreate them as regular TROEventData, because the original message id/event id is no longer attached to them.

The events are then sent out to be processed, which works correctly, but the acknowledgement back to the repositories is skipped, making any sort of message-delivery confirmation useless in catastrophic events.

The line below, because fid is empty, is the one that forces the acknowledgement to be skipped:

if Supports(fOwner.fEventRepository, IROValidatedSessionsChangesListener, asv) and
   not IsEqualGUID(fid, EmptyGUID) then
  asv.EventSucceeded(fClientGuid, fid);
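
For what it’s worth, here is a minimal self-contained model of the two paths as I read them. The two class names match the SDK, but the EventId field, the Dispatch body, and the console output are stand-ins of my own, not the actual uROBaseSuperTCPServer code:

program EventAckSketch;
{$APPTYPE CONSOLE}
uses
  SysUtils;

type
  TROEventData = class          // stand-in for the SDK base class
  end;

  TRONamedEventData = class(TROEventData)
  public
    EventId: TGUID;             // stand-in: the original message/event id
  end;

const
  EmptyGUID: TGUID = '{00000000-0000-0000-0000-000000000000}';

procedure Dispatch(evt: TROEventData);
var
  fid: TGUID;
begin
  // the id is only recoverable from named events
  if evt is TRONamedEventData then
    fid := TRONamedEventData(evt).EventId
  else
    fid := EmptyGUID;

  Writeln('event delivered');   // delivery itself works on both paths

  // the acknowledgement is gated on a non-empty fid, mirroring the
  // check quoted above
  if not IsEqualGUID(fid, EmptyGUID) then
    Writeln('EventSucceeded called for ', GUIDToString(fid))
  else
    Writeln('EventSucceeded skipped');
end;

var
  named: TRONamedEventData;
  plain: TROEventData;
begin
  // normal delivery: the event carries its id and gets acknowledged
  named := TRONamedEventData.Create;
  CreateGUID(named.EventId);
  Dispatch(named);

  // crash recovery: Connected recreates the stored event as a plain
  // TROEventData, so the id is gone and the acknowledgement never fires
  plain := TROEventData.Create;
  Dispatch(plain);

  named.Free;
  plain.Free;
end.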

There is presumably a reason for the special treatment of TRONamedEventData vs TROEventData, but I lack the context to figure it out.

Any assistance on this would be appreciated.

We’ll probably need some sort of extra Olympia call that returns a structure pairing the proper message id with the event data, to be used instead of the typical GetEventData in reconnection cases, so events can be dispatched (and acknowledged) properly.
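
Something along these lines is what I have in mind. This is purely a hypothetical shape, not an existing Olympia API; the record, the interface, and the GetEventDataWithIds name are all made up for illustration, and TStream is just a placeholder for whatever GetEventData actually returns:

unit OlympiaReconnectSketch;

interface

uses
  Classes;

type
  // hypothetical pairing of a stored event payload with its original id,
  // so the reconnect path can acknowledge the delivery afterwards
  TROStoredEvent = record
    EventId: TGUID;   // the original message/event id kept by the repository
    Data: TStream;    // placeholder for the payload GetEventData returns today
  end;
  TROStoredEvents = array of TROStoredEvent;

  // hypothetical companion to GetEventData for reconnection cases
  IROEventStoreEx = interface
    ['{D3E0C6B1-5A7F-4B8E-9C21-0F4A6D2E8B53}']
    function GetEventDataWithIds(const aSessionId: TGUID): TROStoredEvents;
  end;

implementation

end.

TROSCServerWorker.Connected could then dispatch each Data and, on success, call EventSucceeded with the matching EventId instead of losing the id.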

I don’t see an easy workaround for this one, especially because it’s part of the base class TROEventRepository and a change there could have wider implications.

Logged as bugs://D19191.

The issue becomes more evident when you have two instances of the same service running on different machines through Olympia, with clients connected to each service. If one goes down, all messages sent by the active one are queued, and when the other service comes back up, all its clients start getting the messages; the first time around, all is good. But if the service goes down again, the clients will receive not only the newly missed messages but also the ones from the first time it went down.

So every time it recovers, unless the messages have expired or the sessions are killed very frequently, you can have thousands of messages sent out as duplicates; you can imagine the chaos that caused in our services.
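
To put numbers on it, here is a toy model in plain Delphi (nothing RemObjects-specific, and the 1000-messages-per-outage figure is invented): because delivered-but-unacknowledged events are never removed from the repository, every recovery re-sends the whole backlog on top of the new messages.

program DuplicateGrowthSketch;
{$APPTYPE CONSOLE}
uses
  SysUtils;

const
  PerOutage = 1000;   // messages queued while the service is down (invented figure)
var
  stored: Integer;    // events sitting in the repository for one client
  outage: Integer;
begin
  stored := 0;
  for outage := 1 to 3 do
  begin
    Inc(stored, PerOutage);  // new messages queued during this outage
    Writeln(Format('recovery %d: %d events delivered to each client', [outage, stored]));
    // with working acknowledgements, stored would drop back to 0 here;
    // without them, everything stays queued and is re-sent next time
  end;
end.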

bugs://D19191 was closed as fixed.