Server lockups, running out of ideas

Perfect. That’s a classic deadlock problem. Clearly the code needs to be refactored so that all paths acquire these two locks in the same order. We’ll look at this ASAP. Thanx!
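
To illustrate the rule with a minimal sketch (hypothetical lock names, not the actual SDK code): if one thread takes LockA then LockB while another takes LockB then LockA, each ends up waiting on the other forever. Making every path take the locks in the same order removes the cycle:

uses
  SyncObjs;

var
  LockA, LockB: TCriticalSection; // hypothetical locks, created at startup

procedure UpdateSharedState;
begin
  // every code path acquires LockA first, then LockB, never the
  // reverse, so no cyclic wait can form
  LockA.Enter;
  try
    LockB.Enter;
    try
      // ... touch state guarded by both locks ...
    finally
      LockB.Leave;
    end;
  finally
    LockA.Leave;
  end;
end;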

—marc

Many thanks. I’m just amazed it wasn’t my code as it usually is :)


Please update uROEventRepository.pas as:

procedure TROInMemoryEventRepository.DoAddSession(aSessionID: TGUID;
  aActiveEventServer: IROActiveEventServer; const aEventSinkId: AnsiString);
var
  newsession: TROSessionReference;
  idx: Integer;
  s: string;
  temp: boolean;
begin
  if (csDestroying in ComponentState) then Exit;
  fCritical.Enter;
  try
    s := GUIDToString(aSessionID);
    if fSessionIDs.IndexOf(s) = -1 then begin
      // release the repository lock before calling into the session
      // manager, so both locks are never held at the same time
      fCritical.Leave;
      temp := SessionManager.FindSession(aSessionID, False) <> nil;
      // re-acquire before touching the list again
      fCritical.Enter;
      if temp then fSessionIDs.Add(s);
    end;

    idx := fSessionIDs.IndexOf(s);
....
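
(i.e. fCritical is released before calling into SessionManager.FindSession, so this code never holds the event repository’s lock and the session manager’s lock at the same time)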

Thanks, logged as bugs://71885

bugs://71885 got closed with status fixed.

Thanks, testing that now.

EDIT: Looks good, my usual reproduction steps can’t cause the deadlock any more. Going to do a new build of our software and get it out to the affected customers and see how it goes.


Just had another report of it happening again at a customer’s site. Unfortunately I removed the extra critical-section logging code in that build, as I didn’t think it was needed any longer and it was bloating the logs significantly, so I don’t have the detail to go on. I’ve rebuilt it with that logging back in, so if it happens again I should be able to trace the problem again.

The one difference I can spot from the logs I have is that, this time, it locked up after a service Deactivate event but before Destroy, which is later in the service instance lifecycle than the previous occurrence. It would seem that it’s hung in TRORemoteDataModule.DoOnDeactivate this time.

:(. Damn.

I think it’s definitely better: I can no longer reproduce it here with my test application. I suspect it’s going to be something very similar, another deadlock somewhere, just slightly different. Hopefully it’ll happen again fairly soon on the build with the critical-section logging in place, and I can trace through again and find it.

Have some more logs from another occurrence. From a cursory inspection it looks like it’s blocking within ClearSessions: it acquires the critical section but never releases it, which blocks everything else.

EDIT: I can’t quite follow the path. The ClearSessions call is from the thread monitoring session expiry. Most of the time there’s nothing to do, so ClearSessions runs straight through without doing anything, but sometimes it needs to expire sessions.

When it does so, it seems to be calling FindSession, but I can’t figure out why or where this happens. ClearSessions calls DoClearSessions, but this just iterates fSessionList, calling CheckSessionIsExpired on each session; if that returns True, it calls DeleteSession. The logs I have say that FindSession is being called within ClearSessions, and I can’t figure out why.
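
For reference, my mental model of that path, paraphrased and simplified from reading the source (not the verbatim SDK code, names approximate):

procedure ClearSessions;
begin
  fCritical.Enter;   // the critical section is held for the whole sweep
  try
    DoClearSessions; // walks fSessionList; where CheckSessionIsExpired
                     // returns True it calls DeleteSession
  finally
    fCritical.Leave;
  end;
end;

Nothing in that path calls FindSession, as far as I can see.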

In the case where it failed earlier, ClearSessions acquired the critical section, called FindSession twice, then DoStoreEventData, then FindSession again; after that there’s no sign of the thread again before the lockup. The last call to FindSession looks like it completed (it released the critical section, nested from the ClearSessions acquisition), but there’s no sign of execution returning to ClearSessions, nor of the critical section being released. This is obviously where the block is happening, but I can’t figure out why.

Seems the errant FindSession is my fault. I have a BeforeDeleteSession event handler defined, and it calls FindSession to retrieve some values from the session object. That explains the call, but I’m not sure yet whether this is what’s causing the problems: the FindSession call executes on the same thread that has already acquired the session manager’s critical section in ClearSessions, so it doesn’t block.
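
(For anyone following along: Delphi’s TCriticalSection on Windows wraps a recursive OS critical section, so the thread that already owns it can Enter it again without blocking, which is why the nested FindSession goes through. A minimal demonstration:)

program ReentrancyDemo;

{$APPTYPE CONSOLE}

uses
  SyncObjs;

var
  cs: TCriticalSection;
begin
  cs := TCriticalSection.Create;
  try
    cs.Enter; // first acquisition by this thread
    cs.Enter; // same thread re-enters without blocking
    cs.Leave; // each Enter needs a matching Leave
    cs.Leave;
  finally
    cs.Free;
  end;
end.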


That’s odd, yeah. I’d say let’s see if the issue happens again with that changed on your side of the code?

—marc

Yeah, I’m doing a few other things in that event handler too, so I may try disabling it and seeing if that cures the problem, if only as a test to confirm.


Well, so far we’ve not had any recurrence of this issue for several days.

I suspect the main issue has been fixed but there’s another, even rarer issue somewhere which caused the (thus far) one-off crash last week and which I can’t reproduce.

I think this is related to the code I have in the BeforeDeleteSession handler, but I can’t be sure. I can remove a fair bit of this code as it’s not actually needed, but for now I’m going to leave it there: I’ve put more debug logging in, so if the issue does arise again it may serve to prove this is indeed the problem. If it doesn’t occur again after a week or so, I’ll go ahead and remove this code and hope that we’ve solved it once and for all.

Regarding the BeforeDeleteSession handler, it seems you have to be very careful what you do inside it, because it executes within a critical section, so there’s the potential to cause deadlocks depending on what you’re doing.
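
To make that concrete, here’s the shape to avoid (hypothetical handler and lock names, not my actual code):

uses
  SyncObjs;

type
  TMyModule = class // hypothetical
  private
    fMyLock: TCriticalSection; // some other lock of our own
  public
    procedure HandleBeforeDeleteSession(aSessionID: TGUID);
  end;

procedure TMyModule.HandleBeforeDeleteSession(aSessionID: TGUID);
begin
  // this runs on the expiry thread while the session manager's
  // critical section is already held (ClearSessions -> DeleteSession)
  fMyLock.Enter; // danger: another thread that takes fMyLock first and
                 // then calls into the session manager will deadlock
                 // with us; the opposite-order pattern again
  try
    // ... read values out of the session, update our own state ...
  finally
    fMyLock.Leave;
  end;
end;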

The OnBeforeDeleteSession event is fired inside the critical section only when ClearSessions is called.
I will review that code

Ok thanks. As ClearSessions is called by a background thread according to the session manager’s SessionCheckInterval property, this could lead to some very subtle thread timing issues.
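
For anyone finding this later, the general shape I’m assuming for that background thread (a sketch, not the SDK’s actual implementation; the thread type, field names, and interval units are invented here):

uses
  Classes, SysUtils;

type
  TExpiryThread = class(TThread) // hypothetical sweeper thread
  protected
    procedure Execute; override;
  end;

procedure TExpiryThread.Execute;
begin
  while not Terminated do
  begin
    // SessionCheckInterval is assumed here to be in seconds;
    // Sleep takes milliseconds
    Sleep(SessionManager.SessionCheckInterval * 1000);
    // the sweep, and therefore any BeforeDeleteSession handler,
    // runs on this thread inside the manager's critical section
    SessionManager.ClearSessions;
  end;
end;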