Server lockups, running out of ideas

Having a nightmare of an issue and I’m fast running out of ideas. My server runs as a Windows service and ticks along quite happily most of the time.

Every now and then though, both at our offices and at customer sites, it’ll suddenly freeze up. It goes completely unresponsive to clients, as if there’s no server there at all. According to netstat, it’s still listening on the relevant ports but no connections can be made. No exceptions are thrown at all and it appears to be totally random. If I inspect the process in task manager, I can see the number of threads creeping up slowly ad infinitum. The only solution is to stop the server and kill the process and restart.

I simply can’t find what’s wrong or why this is happening, as I said there seems to be no pattern to it at all. I know I’m clutching at straws but does anyone have any idea what could be causing this or even experienced anything similar?

Hello

What RemObjects SDK version are you using?
What platform is used for your server and clients (i.e. Delphi, .NET, Xcode etc.)? Also, exactly which channels are used on the server and client sides?

I’m using 8.0.83.1137 and both client and server are Delphi.

For channels I’m using TROSynapseSuperTCPServer and TROIpHTTPServer.

I’m currently looking at putting in a load of debug info so I can try to track what the server is doing at the moment the problem happens.

That’d be a good next step, yes. See if the new requests still hit your service (and just don’t get back to the client), or not.

My current plan is to log all service creation/activation/destruction and I’m also trapping and logging all inbound messages on the TROBinMessage and TROSOAPMessage components so I should be able to see all client communication attempts.
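For reference, the lifecycle logging I’m wiring up looks roughly like this (a sketch only; `LogDebug` is a hypothetical thread-safe wrapper around my file logger, and the event signatures are indicative — they vary slightly between SDK versions):

```delphi
// Sketch: the same four handlers are hooked up on every service data module.
// LogDebug is a hypothetical thread-safe file logger, not part of the SDK.
procedure TMyService.RORemoteDataModuleCreate(Sender: TObject);
begin
  LogDebug(ClassName + ': Create');
end;

procedure TMyService.RORemoteDataModuleActivate(const aClientID: TGUID;
  aSession: TROSession);
begin
  LogDebug(ClassName + ': Activate, client ' + GUIDToString(aClientID));
end;

procedure TMyService.RORemoteDataModuleDeactivate(const aClientID: TGUID;
  aSession: TROSession);
begin
  LogDebug(ClassName + ': Deactivate, client ' + GUIDToString(aClientID));
end;

procedure TMyService.RORemoteDataModuleDestroy(Sender: TObject);
begin
  LogDebug(ClassName + ': Destroy');
end;
```

With all four in place, a healthy request should always log Create/Activate/Deactivate/Destroy as a matched set, so any request that stops partway stands out immediately.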

I’m also trying to get remote debugging working so that, when it happens on a machine within our own offices, I can connect my IDE to the running instance and try to work out what it’s doing. Remote debugging is flaky at best though.

Ok had a couple of incidents since adding some more logging. I’ve put debug output into the Create/Destroy/Activate/Deactivate handlers on all my services.

What I’m seeing is that suddenly all services start reporting only Create & Activate and never Deactivate or Destroy. Which service it is seems random at the moment; I can’t see any pattern.

Obviously this locks up the whole server as every client request just creates another service thread and never returns.

Can you add a bit more logging?
Add it to your service methods to be sure that they were started and finished as expected.
It will allow us to narrow down the issue.

I think I can see a bit of a pattern. Once it happens, any instantiated services which require a session only ever call Create, never Activate, Deactivate or Destroy. Those which don’t require a session will call Create and Activate, but not Deactivate or Destroy. It’s like the latter get one step further.

I’m wondering if it’s a session issue, i.e. somehow the session storage system is falling over because of something else I’m doing somewhere and this causes everything to start locking up when it tries accessing sessions.

Clutching at straws I know.

Sounds like it could be a deadlock in the session system, or, more likely (since it also affects those services without sessions, just differently), in the activator code that calls into the session management. I’ll let Eugene review this further, as he’s more familiar with that code. What session manager are you using? The standard InMemory one?

Yeah just the standard one.

It’s highly possible/likely that this is something that’s my fault, i.e. a bug in my code somewhere is destabilising the session manager in some way, I just can’t figure out how.

I do have some code which creates and destroys sessions manually as I often need to use a service from within the server itself rather than in response to a client request so I create a session for this purpose and destroy it afterward. I’m not sure if maybe this is causing some weird issues. What I’m doing here is creating a new session and assigning it to an LDA instance before doing my data access, which instantiates the service. When my data module is destroyed (and the LDA along with it), I delete the session.
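To show the shape of that manual session handling, here’s a rough sketch (illustrative only: `fLDA` and `ServerDataModule` are my own names, and I’m showing the `CreateSession`/`DeleteSession` calls as I understand the session manager API, not quoting real code):

```delphi
// Sketch: give an internal (non-client) use of a service its own session.
var
  lSessionID: TGUID;
begin
  CreateGUID(lSessionID);
  // Register a fresh session with the server's session manager.
  ServerDataModule.SessionManager.CreateSession(lSessionID);
  // The LDA's calls now run under this session when they hit the service.
  fLDA.ClientID := lSessionID;
  // ... internal data access via fLDA, which instantiates the service ...
end;

// Later, when the owning data module (and the LDA with it) is destroyed:
ServerDataModule.SessionManager.DeleteSession(fLDA.ClientID);
```

The point is that session deletion here happens on my own schedule, driven by data-module destruction, rather than by the normal client request lifecycle.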

This has always seemed to work fine, although I’ve noticed one little oddity - in such cases the aClientID GUID passed to the Deactivate service event handler is always an empty GUID, suggesting the session has been destroyed by the time the service instance is deactivated. Not sure if this is the case or whether it’s something to be concerned about. Could be totally unrelated of course, I just noticed it whilst trying to track down the main issue.

Can you send your project (or simple testcase that reproduces this case) to support@ ?
It will allow us to detect problems in the SDK or in your code.

Sorry but I can’t - I can’t even reproduce it here, it happens sporadically and totally randomly on customer sites and only under load, when the customer is making active use of the server. I’d love to be able to reproduce it or distill it down to a simple test case but that’s just not possible so I’m having to add code to log debug events to a file which the customer can then send me when they have an occurrence.

Trying to think about this logically and looking through the RO code.

Taking the cases where the service instance is constructed but never activated, the construction is performed by the aFactory.CreateInstance call in TROInvoker.CustomHandleMessage in uROServer. This obviously succeeds.

BeforeInvoke will then be called which in turn calls objactivation.OnActivate which calls TRORemoteDataModule._OnActivate which then calls TRORemoteDataModule.DoOnActivate.

Now the fOnActivate event handler is fired immediately after the DoOnActivate call so, as this event is never fired, I think it’s fairly safe to assume that DoOnActivate is never returning.

Looking through the TRORemoteDataModule.DoOnActivate code, for services which do not require a session, no session manager will be assigned and thus nothing is done. This would make sense, as such services do fire the OnActivate event.

For services which do require a session, fSessionManager.FindSession will be called. I’m suspecting that it’s this call which is blocking and I’m wondering about this code:

  fCritical.Enter;
  try
    result := DoFindSession(aSessionID, aUpdateTime);
  finally
    fCritical.Leave;
  end;

The key thing for me is that no errors are thrown, execution simply blocks with any instantiated services never getting beyond creation or activation. This suggests some kind of block or deadlock and I’m wondering whether the code above is the problem, or more specifically the critical section. If this critical section were to somehow become acquired by a thread but never released, it would surely create the symptoms I’m seeing, as the session manager would effectively become totally blocked with any calls to FindSession (or any other function that needs to acquire the critical section) blocking forever.

Not sure what you make of that analysis or whether you agree or disagree but is there any possible way you can see for that critical section to become blocked?
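To illustrate the shape of the failure I’m suspecting (nothing RO-specific, just a toy with the standard `SyncObjs` critical section): if one thread enters a critical section and never leaves it, every subsequent `Enter` simply waits forever, with no exception and no timeout:

```delphi
uses
  SyncObjs;

var
  cs: TCriticalSection;  // shared between threads, created once at startup

// Thread A: takes the lock, then blocks on something else inside it.
procedure ThreadABody;
begin
  cs.Enter;
  // ... blocks here indefinitely (e.g. waiting on some other lock) ...
  // cs.Leave is never reached, so the critical section stays acquired.
end;

// Thread B, and every client request thread after it: hangs silently.
procedure ThreadBBody;
begin
  cs.Enter;  // waits forever - no error is raised, the thread just stalls
  try
    // never gets here
  finally
    cs.Leave;
  end;
end;
```

That matches the observed symptoms exactly: no exceptions anywhere, thread count creeping up as each new request parks on the same lock.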

Could you add two log messages around the DoFindSession? This way, we’ll know for sure if it’s this that hangs.

eg

Log('before');
fCritical.Enter;
try
  Log('in critical section');
  result := DoFindSession(aSessionID, aUpdateTime);
  Log('after DoFind');
finally
  Log('in finally');
  fCritical.Leave;
  Log('out of critical section');
end;
Log('after');

Yep going to do a build with that in and get it to the affected customers to see what I can find.

I just managed to reproduce it here with the additional logging in place.

From the logs, it’s definitely this critical section in uROSessions. It looks like DeleteSession is stuck somehow, having acquired the critical section, and this is then blocking every other FindSession call (and any others that use the CS obviously).

Will see if I can reproduce it again and track it down further.

It’s freezing inside DeleteSession at the call to DoNotifySessionsChangesListener.

It’s obviously something very subtle as I have to really hammer it with multiple instances of a test app calling several server functions repeatedly as fast as they can. Even then it sometimes takes a while to fail.
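The test app is nothing clever — essentially several concurrent copies of a loop like this (a sketch; `IMyService`, `CoMyService` and `DoSomething` are stand-ins for my real generated proxy and service methods):

```delphi
// Sketch: hammer the server from many concurrent client threads/instances.
type
  THammerThread = class(TThread)
  protected
    procedure Execute; override;
  end;

procedure THammerThread.Execute;
var
  svc: IMyService;  // generated proxy interface (stand-in name)
begin
  while not Terminated do
  begin
    // Hypothetical proxy factory: fresh proxy per call, over the shared
    // message and channel, invoking any of the server functions repeatedly.
    svc := CoMyService.Create(fMessage, fChannel);
    svc.DoSomething;
  end;
end;
```

Running half a dozen of these in parallel, against services that create and delete sessions, eventually trips the hang — though even then it can take a while.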

Cool, that should give us a starting point for a code review to narrow this down.

Cool. I’m still trying to track it down further.

My current guess is that the critical sections in each of TROCustomSessionManager and TROInMemoryEventRepository are deadlocking each other in some circumstances. I think there are two threads, each of which has acquired one of these critical sections and then tries to acquire the other.

It might not be relevant but I’ve noticed that the call to DoNotifySessionsChangesListener inside TROCustomSessionManager.DeleteSession is the only time this is called within the session manager’s critical section - all other calls to it are outside.

Ok here’s what seems to be happening, in chronological order:

  1. Thread A calls TROCustomSessionManager.DeleteSession
  2. Thread A acquires the session manager’s critical section
  3. Thread A calls TROInMemorySessionManager.DoDeleteSession
  4. Whilst thread A is executing DoDeleteSession, thread B calls TROEventRepository.AddSession
  5. Thread B enters TROInMemoryEventRepository.DoAddSession
  6. Thread B acquires the event repository’s critical section
  7. Thread B calls TROCustomSessionManager.FindSession
  8. Thread B tries to acquire the session manager’s critical section but blocks because thread A has it
  9. Thread A returns from TROInMemorySessionManager.DoDeleteSession
  10. Thread A calls TROCustomSessionManager.DoNotifySessionsChangesListener
  11. Thread A calls TROInMemoryEventRepository.SessionsChangedNotification
  12. Thread A calls TROEventRepository.RemoveSession
  13. Thread A calls TROInMemoryEventRepository.DoRemoveSession
  14. Thread A tries to acquire the event repository’s critical section but blocks because thread B has it

At this point we’re deadlocked. Thread A has acquired the session manager’s critical section but is waiting on the event repository’s. Thread B has acquired the event repository’s critical section but is waiting on the session manager’s.

This is obviously very dependent on the exact timing of when these threads run, which is why it’s sporadic and hard to reproduce.
Nonetheless, when it does occur, it locks the entire server solid as any further attempts to access the session manager or event repository just hang.
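Given that ordering, one possible fix suggests itself (an untested sketch of the idea only, not the actual RO source, and assuming the listener notification doesn’t genuinely need the session list held; I’ve also omitted DoNotifySessionsChangesListener’s parameters for brevity): restructure DeleteSession so the notification fires after the critical section is released, matching every other call site:

```delphi
// Sketch of the idea only, not the actual RO source: perform the listener
// notification after fCritical is released, as at every other call site.
procedure TROCustomSessionManager.DeleteSession(const aSessionID: TGUID);
begin
  fCritical.Enter;
  try
    DoDeleteSession(aSessionID);  // mutate the session list under the lock
  finally
    fCritical.Leave;              // release BEFORE notifying listeners
  end;
  // Outside the critical section: the event repository can now take its own
  // lock without the session manager's lock being held, breaking the cycle.
  DoNotifySessionsChangesListener;
end;
```

This restores a single, consistent lock ordering — no thread ever holds the session manager’s critical section while waiting on the event repository’s — which is the standard way to make this class of deadlock impossible.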

Hope this diagnosis is of some help to you. I also hope you can find a quick solution as this is causing us huge problems at some customer sites.