EventRepository is locked for all the time an event is dispatched

obones · September 28, 2021, 3:06pm

Hello,

I’m using a TROSynapseSuperTCPServer server paired with a TROBinMessage. I also have a TROInMemorySessionManager plugged into a TROInMemoryEventRepository that uses its own TROBinMessage for communication.
This works quite well, with dozens of clients at a time all sending requests to the server. The server also sends events on a regular basis to some of its connected clients via calls like this:

GetNewEventSinkWriter.SendPingEvent(SessionId, Parameter);

However, it happens that I hit a complete stall in the communication between the clients and the server, in that no new request can be made for a while.
I have a “health check” method that detects such a situation and gives me a complete stack trace for all threads running at the time it happens.
What I have noticed is that most threads are stuck in TRORemoteDataModule.DoOnActivate which waits for TROCustomSessionManager.FindSession to return after it has been able to acquire its lock on fCritical

I say most threads, because there is one thread that is stuck on waiting on something else, namely the fCritical section of the event repository. This thread has the following call stack:

System.SyncObjs                 TCriticalSection.Acquire
uROEventRepository     749   +2 TROInMemoryEventRepository.DoRemoveSession
uROEventRepository     497   +6 TROEventRepository.RemoveSession
uROEventRepository     505   +1 TROEventRepository.RemoveSession
uROEventRepository     917  +13 TROInMemoryEventRepository.SessionsChangedNotification
uROSessions            996   +2 TROCustomSessionManager.DoNotifySessionsChangesListener
uROSessions            801  +20 TROCustomSessionManager.DeleteSession
uRORemoteDataModule    331  +12 TRORemoteDataModule.DoOnDeactivate

Looking at the source for TROCustomSessionManager.DeleteSession it is clear that this is the thread that holds SessionManager.fCritical and thus preventing anyone else from obtaining it.

But this is not enough to explain what is going on, because that thread is waiting for EventRepository.fCritical and so there must be another thread holding onto it.
And sure enough, there is one, that is right in the middle of dispatching the event presented above. That thread has the following call stack:

System.SyncObjs                     THandleObject.WaitFor
uROBaseSuperTcpConnection 1142   +4 TROBaseSuperChannelWorker.WaitForAck
uROBaseSuperTCPServer      232  +30 TSendEvent.Callback
uROBaseSuperTCPServer      457   +6 TROSCServerWorker.DispatchEvent
uROEventRepository        1130   +3 TIROActiveEventServerList.DispatchEvent
uROEventRepository         716  +37 TROInMemoryEventRepository.DoStoreEventData
uROEventRepository         532   +2 TROEventRepository.StoreEventData
uROEventRepository         538   +1 TROEventRepository.StoreEventData
uROBaseSuperTCPServer      247  +45 TSendEvent.Callback
uROBaseSuperTCPServer      457   +6 TROSCServerWorker.DispatchEvent
uROEventRepository        1130   +3 TIROActiveEventServerList.DispatchEvent
uROEventRepository         716  +37 TROInMemoryEventRepository.DoStoreEventData
uROEventRepository         532   +2 TROEventRepository.StoreEventData
MyServerLibrary_Invk      1801  +10 TMyEventSink_Writer.SendPingEvent

The call to TROInMemoryEventRepository.DoStoreEventData is done while holding EventRepository.fCritical and because it goes all the way to TROBaseSuperChannelWorker.WaitForAck, the critical section is held for the duration of ROServer.AckWaitTimeout which by default is 10 seconds.
However, as can be seen in the call stack, there is a retry done on this because TSendEvent.Callback went through its except part, most likely because a first timeout of 10 seconds already occurred.

I don’t know why the Ack never arrives but the fact that it’s holding onto EventRepository.fCritical which in turn leads another thread to hold onto Session.fCritical is a sure way to block every client of services that are using the session manager.
Also, I’m worried that there is an infinite recursion between TSendEvent.Callback and TROInMemoryEventRepository.DoStoreEventData as they call each other via the except part and I could not see any stop condition in there. But I may have missed it, this code is not that clear to me.

The call stacks above are from SDK version 10.0.0.1489 but looking at the what’s new for later versions, I don’t think any change happened in this area.
I know you will ask me for a test application that reproduces the issue, and I’m still trying to create one, but in the meantime, does this ring a bell to you?
Do you have any suggestions as to what I could try to mitigate the lengthy call to WaitForAck? I already reduced ROServer.AckWaitTimeout to 4 seconds, but it does not feel robust.

obones · September 28, 2021, 3:10pm

One thing that just came to my mind is a small change inside TROCustomSessionManager.DeleteSession, namely going from this:

      end;

      DoNotifySessionsChangesListener(aSessionID, saDelete, nil);
      if Assigned(fOnSessionDeleted) then fOnSessionDeleted(aSessionID, IsExpired);
    finally
      fCritical.Leave;
    end;

to this

      end;
    finally
      fCritical.Leave;
    end;

    DoNotifySessionsChangesListener(aSessionID, saDelete, nil);
    if Assigned(fOnSessionDeleted) then fOnSessionDeleted(aSessionID, IsExpired);

This way, the notification to the listeners are done outside the code that holds onto SessionManager.fCritical which avoids the entire “wait lock” situation.
What do you think of this?

EvgenyK · September 28, 2021, 4:04pm

Hi,

obones:

 end;
    finally
      fCritical.Leave;
    end;

    DoNotifySessionsChangesListener(aSessionID, saDelete, nil);
    if Assigned(fOnSessionDeleted) then fOnSessionDeleted(aSessionID, IsExpired);

we can do this change.

can you test it and confirm that this change doesn’t cause any issue and your application works a bit stable, pls ?

obones · September 29, 2021, 9:19am

I will indeed, and my unit tests are saying it does not have any impact. Running at the client’s site is a bit more difficult and time consuming, I’ll post back when I hear from them.

Can you confirm that your own unit/regression tests are running ok with this change?

obones · September 29, 2021, 9:31am

One thing that I did not notice in my threads call stacks is that one of them is locked on SessionManager.fCritical like many others but not from TRORemoteDataModule.DoOnActivate but rather from TROSCServerWorker.Connected which calls my OnConnected event handler set on the server instance.
Its code is as follows:

procedure TServerDataModule.ROServerClientConnected(aChannel: IROTransport; const aGuid: TGUID);
var
  Session: TROSession;
  TransportObject: TObject;
  Worker: TROSCServerWorker;
begin
  TransportObject := aChannel.GetTransportObject;
  Worker := TransportObject as TROSCServerWorker;

  Session := ROInMemorySessionManager.FindSession(Worker.ClientID);
  if Assigned(Session) then
  begin
    // do some job with session values
  end;
end;

So clearly, there is a call to FindSession that is blocking and that should be fixed by the suggested change.
But with the current state of things, this prevents the TROSCServerWorker instance from processing any pending packets, among which there might be the ACK that the event sink writer is waiting for.

I thus believe that there should be a warning in the documentation saying that no “lengthy” and/or “session related” work should be done in the OnConnected event handler, for it might prevent network traffic from being properly processed.

EvgenyK · September 29, 2021, 1:29pm

Hi,

you can set channel.SynchonizeEvents property to True. in this case code won’t wait events

obones · September 29, 2021, 1:48pm

SynchronizeEvents is only available on channels, here I’m talking about the OnClientConnected event on TROBaseSuperTCPServer which is called directly by TROSCServerWorker.Connected and which is definitely not synchronized.

EvgenyK · September 30, 2021, 3:32pm

Hi,

You are right - I’ve mixed client and server-side.
I’ll review possibility to replace TCriticalSection with TMultiReadExclusiveWriteSynchronizer

obones · November 8, 2021, 11:00am

The change in TROCustomSessionManager.DeleteSession has been working steadily for the past two months, we are happy with it.
However, it is not enough as there is another potential deadlock when the timer in TROCustomSessionManager elapses. At that point, it calls TROCustomSessionManager.ClearSessions that starts by locking itself (fCritical.Enter) and then calls DoClearSessions
In my case, this goes into TROInMemorySessionManager that then calls TROCustomSessionManager.DeleteSession with an already locked session manager. This means that the call to DoNotifySessionsChangesListener is made with a locked session manager, which is exactly what we tried to avoid by doing the change mentioned above.
And in my case, sure enough, the event repository is locked, waiting for an event reply to come back. But this reply comes back in a thread that waits for the session manager to be free, effectively leading to a deadlock.

Now, the event repository is locked because we set BlockingEvents to true back in 2011 when it seemed to be required. The situation might have changed, I’ll evaluate reverting it back to its default value.
But still, even if this is no longer blocking, it may happen that the event repository is locked while the session manager is also from ClearSessions which is why I suggest the following change:

procedure DoClearSessions(aSessionIds: TROSessionIDs; OnlyExpired : boolean); virtual; abstract;

procedure TROCustomSessionManager.ClearSessions(OnlyExpired : boolean);
var
  lRetry: Boolean;
begin
  CreateTimerByRequest;
  try
    fClearing := True;
    lRetry := True;
    while lRetry do begin
      lRetry := False;
      try
        DoClearSessions(GetAllSessions, OnlyExpired);
      except
        on e: Exception do begin
          if assigned(fOnException) then fOnException(EmptyGUID, e, lRetry);
          if not lRetry then raise;
        end;
      end;
    end;
  finally
    fClearing := False;
  end;
end;

procedure TROInMemorySessionManager.DoClearSessions(aSessionIds: TROSessionIDs; OnlyExpired: boolean);
var
  i : integer;
begin
  if OnlyExpired then begin
    for i := Low(aSessionIds) to High(aSessionIds) do
      if CheckSessionIsExpired(aSessionIds[i]) then  DeleteSession(aSessionIds[i], True);
  end
  else begin
    for i := Low(aSessionIds) to High(aSessionIds) do
      DeleteSession(aSessionIds[i], True);
  end;
end;

Basically, remove the lock in TROCustomSessionManager.ClearSessions and pass the list of session IDs to DoClearSessions so that it can work on them without being locked.

EvgenyK · November 9, 2021, 10:44am

Hi,

we can’t remove protection from TROCustomSessionManager.ClearSessions because this may lead to non-threadsafe situation in the DoClearSessions method in descendants.

it is possible to create a simple testcase that reproduces deadlock from the last post?

Edit: check PM

obones · November 9, 2021, 1:07pm

Yes, I had thought about that and I selfishly went the easiest way because we only ever use TROInMemorySessionManager
That being said, it could be possible to declare a protected fLockedClearSessions and set it to True by default. Then, in the derived classes that don’t need it, it is to be set to False in the constructor.
This way it is set to False for the In memory and DB derived classes without impacting any other derived class that we may not know of.

That’s proving quite difficult, it needs thousands of events being sent from a dozen of clients at the same time. I’m trying, but even at the client’s location, the deadlock only occurred twice in 30 days.

I will, thanks