EventRepository is locked for all the time an event is dispatched

Hello,

I’m using a TROSynapseSuperTCPServer server paired with a TROBinMessage. I also have a TROInMemorySessionManager plugged into a TROInMemoryEventRepository that uses its own TROBinMessage for communication.
This works quite well, with dozens of clients at a time all sending requests to the server. The server also sends events on a regular basis to some of its connected clients via calls like this:

GetNewEventSinkWriter.SendPingEvent(SessionId, Parameter);

However, it happens that I hit a complete stall in the communication between the clients and the server, in that no new request can be made for a while.
I have a “health check” method that detects such a situation and gives me a complete stack trace for all threads running at the time it happens.
What I have noticed is that most threads are stuck in TRORemoteDataModule.DoOnActivate which waits for TROCustomSessionManager.FindSession to return after it has been able to acquire its lock on fCritical

I say most threads, because there is one thread that is stuck on waiting on something else, namely the fCritical section of the event repository. This thread has the following call stack:

System.SyncObjs                 TCriticalSection.Acquire
uROEventRepository     749   +2 TROInMemoryEventRepository.DoRemoveSession
uROEventRepository     497   +6 TROEventRepository.RemoveSession
uROEventRepository     505   +1 TROEventRepository.RemoveSession
uROEventRepository     917  +13 TROInMemoryEventRepository.SessionsChangedNotification
uROSessions            996   +2 TROCustomSessionManager.DoNotifySessionsChangesListener
uROSessions            801  +20 TROCustomSessionManager.DeleteSession
uRORemoteDataModule    331  +12 TRORemoteDataModule.DoOnDeactivate

Looking at the source for TROCustomSessionManager.DeleteSession it is clear that this is the thread that holds SessionManager.fCritical and thus preventing anyone else from obtaining it.

But this is not enough to explain what is going on, because that thread is waiting for EventRepository.fCritical and so there must be another thread holding onto it.
And sure enough, there is one, that is right in the middle of dispatching the event presented above. That thread has the following call stack:

System.SyncObjs                     THandleObject.WaitFor
uROBaseSuperTcpConnection 1142   +4 TROBaseSuperChannelWorker.WaitForAck
uROBaseSuperTCPServer      232  +30 TSendEvent.Callback
uROBaseSuperTCPServer      457   +6 TROSCServerWorker.DispatchEvent
uROEventRepository        1130   +3 TIROActiveEventServerList.DispatchEvent
uROEventRepository         716  +37 TROInMemoryEventRepository.DoStoreEventData
uROEventRepository         532   +2 TROEventRepository.StoreEventData
uROEventRepository         538   +1 TROEventRepository.StoreEventData
uROBaseSuperTCPServer      247  +45 TSendEvent.Callback
uROBaseSuperTCPServer      457   +6 TROSCServerWorker.DispatchEvent
uROEventRepository        1130   +3 TIROActiveEventServerList.DispatchEvent
uROEventRepository         716  +37 TROInMemoryEventRepository.DoStoreEventData
uROEventRepository         532   +2 TROEventRepository.StoreEventData
MyServerLibrary_Invk      1801  +10 TMyEventSink_Writer.SendPingEvent

The call to TROInMemoryEventRepository.DoStoreEventData is done while holding EventRepository.fCritical and because it goes all the way to TROBaseSuperChannelWorker.WaitForAck, the critical section is held for the duration of ROServer.AckWaitTimeout which by default is 10 seconds.
However, as can be seen in the call stack, there is a retry done on this because TSendEvent.Callback went through its except part, most likely because a first timeout of 10 seconds already occurred.

I don’t know why the Ack never arrives but the fact that it’s holding onto EventRepository.fCritical which in turn leads another thread to hold onto Session.fCritical is a sure way to block every client of services that are using the session manager.
Also, I’m worried that there is an infinite recursion between TSendEvent.Callback and TROInMemoryEventRepository.DoStoreEventData as they call each other via the except part and I could not see any stop condition in there. But I may have missed it, this code is not that clear to me.

The call stacks above are from SDK version 10.0.0.1489 but looking at the what’s new for later versions, I don’t think any change happened in this area.
I know you will ask me for a test application that reproduces the issue, and I’m still trying to create one, but in the meantime, does this ring a bell to you?
Do you have any suggestions as to what I could try to mitigate the lengthy call to WaitForAck? I already reduced ROServer.AckWaitTimeout to 4 seconds, but it does not feel robust.

One thing that just came to my mind is a small change inside TROCustomSessionManager.DeleteSession, namely going from this:

      end;

      DoNotifySessionsChangesListener(aSessionID, saDelete, nil);
      if Assigned(fOnSessionDeleted) then fOnSessionDeleted(aSessionID, IsExpired);
    finally
      fCritical.Leave;
    end;

to this

      end;
    finally
      fCritical.Leave;
    end;

    DoNotifySessionsChangesListener(aSessionID, saDelete, nil);
    if Assigned(fOnSessionDeleted) then fOnSessionDeleted(aSessionID, IsExpired);

This way, the notification to the listeners are done outside the code that holds onto SessionManager.fCritical which avoids the entire “wait lock” situation.
What do you think of this?

Hi,

we can do this change.

can you test it and confirm that this change doesn’t cause any issue and your application works a bit stable, pls ?

I will indeed, and my unit tests are saying it does not have any impact. Running at the client’s site is a bit more difficult and time consuming, I’ll post back when I hear from them.

Can you confirm that your own unit/regression tests are running ok with this change?

One thing that I did not notice in my threads call stacks is that one of them is locked on SessionManager.fCritical like many others but not from TRORemoteDataModule.DoOnActivate but rather from TROSCServerWorker.Connected which calls my OnConnected event handler set on the server instance.
Its code is as follows:

procedure TServerDataModule.ROServerClientConnected(aChannel: IROTransport; const aGuid: TGUID);
var
  Session: TROSession;
  TransportObject: TObject;
  Worker: TROSCServerWorker;
begin
  TransportObject := aChannel.GetTransportObject;
  Worker := TransportObject as TROSCServerWorker;

  Session := ROInMemorySessionManager.FindSession(Worker.ClientID);
  if Assigned(Session) then
  begin
    // do some job with session values
  end;
end;

So clearly, there is a call to FindSession that is blocking and that should be fixed by the suggested change.
But with the current state of things, this prevents the TROSCServerWorker instance from processing any pending packets, among which there might be the ACK that the event sink writer is waiting for.

I thus believe that there should be a warning in the documentation saying that no “lengthy” and/or “session related” work should be done in the OnConnected event handler, for it might prevent network traffic from being properly processed.

Hi,

you can set channel.SynchonizeEvents property to True. in this case code won’t wait events

SynchronizeEvents is only available on channels, here I’m talking about the OnClientConnected event on TROBaseSuperTCPServer which is called directly by TROSCServerWorker.Connected and which is definitely not synchronized.

Hi,

You are right - I’ve mixed client and server-side.
I’ll review possibility to replace TCriticalSection with TMultiReadExclusiveWriteSynchronizer