I’m using a TROSynapseSuperTCPServer server paired with a TROBinMessage. I also have a TROInMemorySessionManager plugged into a TROInMemoryEventRepository that uses its own TROBinMessage for communication.
This works quite well, with dozens of clients at a time all sending requests to the server. The server also sends events to some of its connected clients on a regular basis, through the generated event writer (the TMyEventSink_Writer.SendPingEvent call that appears in the second call stack below).
However, I occasionally hit a complete stall in the communication between the clients and the server: no new request can be made for a while.
I have a “health check” method that detects such a situation and gives me a complete stack trace for all threads running at the time it happens.
What I have noticed is that most threads are stuck in TRORemoteDataModule.DoOnActivate, waiting for TROCustomSessionManager.FindSession to return; FindSession itself is blocked trying to acquire the session manager’s fCritical critical section.
I say most threads because one thread is stuck waiting on something else: the event repository’s fCritical critical section. This thread has the following call stack:
System.SyncObjs               TCriticalSection.Acquire
uROEventRepository   749 +2   TROInMemoryEventRepository.DoRemoveSession
uROEventRepository   497 +6   TROEventRepository.RemoveSession
uROEventRepository   505 +1   TROEventRepository.RemoveSession
uROEventRepository   917 +13  TROInMemoryEventRepository.SessionsChangedNotification
uROSessions          996 +2   TROCustomSessionManager.DoNotifySessionsChangesListener
uROSessions          801 +20  TROCustomSessionManager.DeleteSession
uRORemoteDataModule  331 +12  TRORemoteDataModule.DoOnDeactivate
Looking at the source for TROCustomSessionManager.DeleteSession, it is clear that this is the thread holding SessionManager.fCritical, and therefore the one preventing everyone else from obtaining it.
But this alone does not explain what is going on: that thread is itself waiting for EventRepository.fCritical, so there must be yet another thread holding that one. And sure enough, there is one, right in the middle of dispatching the event mentioned above. That thread has the following call stack:
System.SyncObjs                     THandleObject.WaitFor
uROBaseSuperTcpConnection 1142 +4   TROBaseSuperChannelWorker.WaitForAck
uROBaseSuperTCPServer      232 +30  TSendEvent.Callback
uROBaseSuperTCPServer      457 +6   TROSCServerWorker.DispatchEvent
uROEventRepository        1130 +3   TIROActiveEventServerList.DispatchEvent
uROEventRepository         716 +37  TROInMemoryEventRepository.DoStoreEventData
uROEventRepository         532 +2   TROEventRepository.StoreEventData
uROEventRepository         538 +1   TROEventRepository.StoreEventData
uROBaseSuperTCPServer      247 +45  TSendEvent.Callback
uROBaseSuperTCPServer      457 +6   TROSCServerWorker.DispatchEvent
uROEventRepository        1130 +3   TIROActiveEventServerList.DispatchEvent
uROEventRepository         716 +37  TROInMemoryEventRepository.DoStoreEventData
uROEventRepository         532 +2   TROEventRepository.StoreEventData
MyServerLibrary_Invk      1801 +10  TMyEventSink_Writer.SendPingEvent
The call to TROInMemoryEventRepository.DoStoreEventData is made while holding EventRepository.fCritical, and because it goes all the way down to TROBaseSuperChannelWorker.WaitForAck, the critical section stays held for up to ROServer.AckWaitTimeout, which defaults to 10 seconds.
Moreover, as the call stack shows, a retry is in progress: TSendEvent.Callback went through its except part, most likely because a first 10-second timeout had already expired.
I don’t know why the Ack never arrives, but holding EventRepository.fCritical for that long, which in turn causes another thread to keep holding SessionManager.fCritical, is a sure way to block every client of every service that uses the session manager.
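To make the blocking chain concrete, here is a minimal, RTL-only sketch of the pattern I believe I am seeing. Every name in it (the locks, the event, the three procedures, the timings) is mine and only mimics the SDK call paths shown above; it is not SDK code:

program LockChainSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.SyncObjs;

var
  SessionLock: TCriticalSection;    // stands in for SessionManager.fCritical
  RepositoryLock: TCriticalSection; // stands in for EventRepository.fCritical
  AckArrived: TEvent;               // stands in for the ack WaitForAck is waiting on

// like TSendEvent.Callback -> WaitForAck: waits for the ack while still
// holding the repository lock
procedure EventDispatcher;
begin
  RepositoryLock.Acquire;
  try
    AckArrived.WaitFor(10000);      // plays the role of AckWaitTimeout
  finally
    RepositoryLock.Release;
  end;
end;

// like DeleteSession -> SessionsChangedNotification -> DoRemoveSession:
// takes the session lock, then blocks on the repository lock
procedure SessionCleanup;
begin
  SessionLock.Acquire;
  try
    RepositoryLock.Acquire;
    try
      { remove the session's queued events }
    finally
      RepositoryLock.Release;
    end;
  finally
    SessionLock.Release;
  end;
end;

// like DoOnActivate -> FindSession: every incoming request queues here
procedure IncomingRequest;
begin
  SessionLock.Acquire;
  try
    { look up the session }
  finally
    SessionLock.Release;
  end;
end;

begin
  SessionLock := TCriticalSection.Create;
  RepositoryLock := TCriticalSection.Create;
  AckArrived := TEvent.Create(nil, True, False, '');
  try
    TThread.CreateAnonymousThread(procedure begin EventDispatcher end).Start;
    Sleep(100);
    TThread.CreateAnonymousThread(procedure begin SessionCleanup end).Start;
    Sleep(100);
    IncomingRequest;                // stalls until the 10-second wait expires
    Writeln('requests resume only after the ack timeout');
  finally
    AckArrived.Free;
    RepositoryLock.Free;
    SessionLock.Free;
  end;
end.

Run as is, the IncomingRequest call in the main thread only returns once the 10-second wait in EventDispatcher has expired, which matches the stalls I observe.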
Also, I’m worried that there may be unbounded recursion between TSendEvent.Callback and TROInMemoryEventRepository.DoStoreEventData: they call each other back via the except part, and I could not see any stop condition in there. But I may have missed it; this code is not that clear to me.
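Schematically, this is the loop I think I am reading out of the second call stack. The procedure names mirror the SDK frames, but the bodies are purely my own reconstruction, not the SDK source, and the depth cap is artificial so that the sketch terminates:

program RetryLoopSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Depth: Integer = 0;

procedure StoreEventData; forward;

// stands in for TSendEvent.Callback / TROSCServerWorker.DispatchEvent
procedure DispatchEvent;
begin
  Inc(Depth);
  Writeln('dispatch attempt ', Depth);
  try
    // stands in for WaitForAck never seeing the ack and timing out
    raise Exception.Create('ack never arrived');
  except
    // the except-part retry I am worried about: it goes back through
    // StoreEventData, which dispatches the same event again
    if Depth < 3 then          // artificial cap so this sketch terminates;
      StoreEventData;          // I could not spot the equivalent stop condition in the SDK
  end;
end;

// stands in for TROEventRepository.StoreEventData / DoStoreEventData
procedure StoreEventData;
begin
  DispatchEvent;
end;

begin
  StoreEventData;
end.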
The call stacks above are from SDK version 10.0.0.1489, but looking at the “what’s new” notes for later versions, I don’t think anything has changed in this area.
I know you will ask me for a test application that reproduces the issue, and I’m still trying to create one, but in the meantime, does this ring a bell to you?
Do you have any suggestions as to what I could try to mitigate the lengthy call to WaitForAck? I already reduced ROServer.AckWaitTimeout to 4 seconds, but it does not feel robust.