I’m using a TROSynapseSuperTCPServer server paired with a TROBinMessage. I also have a TROInMemorySessionManager plugged into a TROInMemoryEventRepository that uses its own TROBinMessage for communication.
This works quite well, with dozens of clients at a time all sending requests to the server. The server also sends events to some of its connected clients on a regular basis, through the generated event writer (the TMyEventSink_Writer.SendPingEvent call that appears in the second call stack below).
However, I occasionally hit a complete stall in the communication between the clients and the server: no new request can be made for a while.
I have a “health check” method that detects such a situation and gives me a complete stack trace for all threads running at the time it happens.
What I have noticed is that most threads are stuck in TRORemoteDataModule.DoOnActivate, waiting for TROCustomSessionManager.FindSession to return; FindSession itself is blocked trying to acquire the session manager’s fCritical critical section.
I say most threads because one thread is stuck waiting on something else: the event repository’s fCritical critical section. This thread has the following call stack:
System.SyncObjs               TCriticalSection.Acquire
uROEventRepository   749 +2   TROInMemoryEventRepository.DoRemoveSession
uROEventRepository   497 +6   TROEventRepository.RemoveSession
uROEventRepository   505 +1   TROEventRepository.RemoveSession
uROEventRepository   917 +13  TROInMemoryEventRepository.SessionsChangedNotification
uROSessions          996 +2   TROCustomSessionManager.DoNotifySessionsChangesListener
uROSessions          801 +20  TROCustomSessionManager.DeleteSession
uRORemoteDataModule  331 +12  TRORemoteDataModule.DoOnDeactivate
Looking at the source for TROCustomSessionManager.DeleteSession, it is clear that this is the thread holding SessionManager.fCritical, and therefore the one preventing everyone else from obtaining it.
But this alone does not explain what is going on: that thread is itself waiting for EventRepository.fCritical, so there must be yet another thread holding that one. And sure enough, there is one, right in the middle of dispatching the event mentioned above. That thread has the following call stack:
System.SyncObjs                     THandleObject.WaitFor
uROBaseSuperTcpConnection 1142 +4   TROBaseSuperChannelWorker.WaitForAck
uROBaseSuperTCPServer      232 +30  TSendEvent.Callback
uROBaseSuperTCPServer      457 +6   TROSCServerWorker.DispatchEvent
uROEventRepository        1130 +3   TIROActiveEventServerList.DispatchEvent
uROEventRepository         716 +37  TROInMemoryEventRepository.DoStoreEventData
uROEventRepository         532 +2   TROEventRepository.StoreEventData
uROEventRepository         538 +1   TROEventRepository.StoreEventData
uROBaseSuperTCPServer      247 +45  TSendEvent.Callback
uROBaseSuperTCPServer      457 +6   TROSCServerWorker.DispatchEvent
uROEventRepository        1130 +3   TIROActiveEventServerList.DispatchEvent
uROEventRepository         716 +37  TROInMemoryEventRepository.DoStoreEventData
uROEventRepository         532 +2   TROEventRepository.StoreEventData
MyServerLibrary_Invk      1801 +10  TMyEventSink_Writer.SendPingEvent
The call to TROInMemoryEventRepository.DoStoreEventData is made while holding EventRepository.fCritical, and because it goes all the way down to TROBaseSuperChannelWorker.WaitForAck, the critical section stays held for up to ROServer.AckWaitTimeout, which defaults to 10 seconds.
Moreover, as the call stack shows, a retry is in progress: TSendEvent.Callback went through its except part, most likely because a first 10-second timeout had already expired.
I don’t know why the Ack never arrives, but holding EventRepository.fCritical for that long, which in turn causes another thread to keep holding SessionManager.fCritical, is a sure way to block every client of every service that uses the session manager.
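To make the blocking chain concrete, here is a minimal, RTL-only sketch of the pattern I believe I am seeing. Every name in it (the locks, the event, the three procedures, the timings) is mine and only mimics the SDK call paths shown above; it is not SDK code:

program LockChainSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.SyncObjs;

var
  SessionLock: TCriticalSection;    // stands in for SessionManager.fCritical
  RepositoryLock: TCriticalSection; // stands in for EventRepository.fCritical
  AckArrived: TEvent;               // stands in for the ack WaitForAck is waiting on

// like TSendEvent.Callback -> WaitForAck: waits for the ack while still
// holding the repository lock
procedure EventDispatcher;
begin
  RepositoryLock.Acquire;
  try
    AckArrived.WaitFor(10000);      // plays the role of AckWaitTimeout
  finally
    RepositoryLock.Release;
  end;
end;

// like DeleteSession -> SessionsChangedNotification -> DoRemoveSession:
// takes the session lock, then blocks on the repository lock
procedure SessionCleanup;
begin
  SessionLock.Acquire;
  try
    RepositoryLock.Acquire;
    try
      { remove the session's queued events }
    finally
      RepositoryLock.Release;
    end;
  finally
    SessionLock.Release;
  end;
end;

// like DoOnActivate -> FindSession: every incoming request queues here
procedure IncomingRequest;
begin
  SessionLock.Acquire;
  try
    { look up the session }
  finally
    SessionLock.Release;
  end;
end;

begin
  SessionLock := TCriticalSection.Create;
  RepositoryLock := TCriticalSection.Create;
  AckArrived := TEvent.Create(nil, True, False, '');
  try
    TThread.CreateAnonymousThread(procedure begin EventDispatcher end).Start;
    Sleep(100);
    TThread.CreateAnonymousThread(procedure begin SessionCleanup end).Start;
    Sleep(100);
    IncomingRequest;                // stalls until the 10-second wait expires
    Writeln('requests resume only after the ack timeout');
  finally
    AckArrived.Free;
    RepositoryLock.Free;
    SessionLock.Free;
  end;
end.

Run as is, the IncomingRequest call in the main thread only returns once the 10-second wait in EventDispatcher has expired, which matches the stalls I observe.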
Also, I’m worried that there may be unbounded recursion between TSendEvent.Callback and TROInMemoryEventRepository.DoStoreEventData: they call each other back via the except part, and I could not see any stop condition in there. But I may have missed it; this code is not that clear to me.
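Schematically, this is the loop I think I am reading out of the second call stack. The procedure names mirror the SDK frames, but the bodies are purely my own reconstruction, not the SDK source, and the depth cap is artificial so that the sketch terminates:

program RetryLoopSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Depth: Integer = 0;

procedure StoreEventData; forward;

// stands in for TSendEvent.Callback / TROSCServerWorker.DispatchEvent
procedure DispatchEvent;
begin
  Inc(Depth);
  Writeln('dispatch attempt ', Depth);
  try
    // stands in for WaitForAck never seeing the ack and timing out
    raise Exception.Create('ack never arrived');
  except
    // the except-part retry I am worried about: it goes back through
    // StoreEventData, which dispatches the same event again
    if Depth < 3 then          // artificial cap so this sketch terminates;
      StoreEventData;          // I could not spot the equivalent stop condition in the SDK
  end;
end;

// stands in for TROEventRepository.StoreEventData / DoStoreEventData
procedure StoreEventData;
begin
  DispatchEvent;
end;

begin
  StoreEventData;
end.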
The call stacks above are from SDK version 10.0.0.1489, but looking at the “what’s new” notes for later versions, I don’t think anything has changed in this area.
I know you will ask me for a test application that reproduces the issue, and I’m still trying to create one, but in the meantime, does this ring a bell to you?
Do you have any suggestions as to what I could try to mitigate the lengthy call to WaitForAck? I already reduced ROServer.AckWaitTimeout to 4 seconds, but it does not feel robust.