SessionManager OnSessionExpired -> System.Collections.Generic.Queue -> System.OutOfMemoryException

Hi,

I upgraded a customer a few days ago. On that day and also twice on a subsequent day, I’ve seen an error that has never triggered previously in several years of our software running on the customers setup.

Prior to the upgrade, the RemObjects SDK for .NET version was 8.2.29.1153. After the upgrade, they are using 9.5.111.1397. The customer is probably our heaviest user with hundreds of sessions coming and going during any one day. We never actively restart our .NET server process, except for system patching etc.

On the first occasion the error triggered, I saw our .NET server heaving consuming CPU. Memory usage looked to be heading towards 10GB. On that occasion I had been alerted before the process fell over and I actively killed it, so there was no event recorded i.e. I’m assuming this was the same error.

Yesterday, the error occurred twice within 6 hours, resulting in our service not running for a period of time. These I did not see until they had crashed and where the event log did record the event.

Application: oxy003.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.OutOfMemoryException
Stack:
   at System.Collections.Generic.Queue`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].SetCapacity(Int32)
   at System.Collections.Generic.Queue`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].Enqueue(System.__Canon)
   at RemObjects.SDK.Server.MemoryMessageQueue.QueueMessage(System.IO.Stream)
   at RemObjects.SDK.Server.MessageQueueManager.QueueMessage(System.IO.Stream, RemObjects.SDK.Server.IMessageQueue)
   at RemObjects.SDK.Server.MessageQueueManager.QueueMessage(System.IO.Stream, System.Guid)
   at RemObjects.SDK.Server.MessageQueueManager.QueueMessage(RemObjects.SDK.IMessage, System.Guid)
   at RemObjects.SDK.Server.EventSinkManager.RemObjects.SDK.Server.ICustomMessageQueueHandler.DispatchEvent(RemObjects.SDK.IMessage, System.Type, RemObjects.SDK.Server.IEventTargets)
   at OXY003.ServerEvents_EventSinkProxy.onLogout(System.Guid, System.Guid)
   at OXY003.Engine.sessionManager_OnSessionExpired(System.Object, RemObjects.SDK.Server.SessionEventArgs)
   at RemObjects.SDK.Server.MemorySessionManager.ExpireSessions()
   at RemObjects.SDK.Server.SessionManager.ExpireTimerHandler(System.Object)
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.TimerQueueTimer.CallCallback()
   at System.Threading.TimerQueueTimer.Fire()
   at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()

The above would seem to fit with the first incident I witnessed i.e. high memory consumption to the point of failure.

We do have code in the server side OnSessionExpired.

method Engine.sessionManager_OnSessionExpired(sender: System.Object; e: RemObjects.SDK.Server.SessionEventArgs);
var	lCompanyGuid	: System.Guid;
	FormGuid 		: System.Guid;
	iLoop			: Integer;
	rRow			: System.Data.DataRow;
	Sessions		: IList<Guid>;
	lSessions		: Array of Guid;
 begin
	Sessions := self.sessionManager.GetAllSessions;
	Array.Resize(var lSessions, Sessions.Count);
	Sessions.CopyTo(lSessions,0);
  	var Targets : RecipientListEventTargets := new RecipientListEventTargets(lSessions);

	var ev2: IServerEvents := eventSinkManager1.GetEventSink(e.Session.SessionID,typeOf(IServerEvents), Targets) as IServerEvents;

	lCompanyGuid	:= new Guid(e.Session['CompanyGuid'].ToString);
	FormGuid		:= new Guid(e.Session['FormGuid'].ToString);
	ev2.onLogout(lCompanyGuid,FormGuid);

	for iLoop := ServerActivity.MessageDataTable.Rows.Count-1 downto 0 do
	begin
		rRow := ServerActivity.MessageDataTable.Rows[iLoop];
		case (rRow['CompanyGuid'].ToString = lCompanyGuid.ToString) and (rRow['FormGuid'].ToString = FormGuid.ToString) of
			true : 	begin
						rRow.Delete;
						break;
					end;
		end;
	end;
end;

The above code has not changed in several years and I’ve never seen the application fail with the above exception on any customer installation.

I’ve tested locally today with low timeout settings (30 seconds compared to the production setting of 10800) on the session manager. The code still appears to perform as it has always done, triggering the onLogout event to pass back to clients, notifying them of an expired session (used in a monitoring screen to view the sessions still active).

One workaround would be to remove the above code, but I’d rather not resort to that without understanding why we now have an issue. However, this is a key customer and so I do need to implement some remedy fairly quickly.

Any thoughts on what might be happening here ? Changes in the RemObjects framwork that might now be exposing flaws in our code ?

Hello

Unfortunately I do not see any changes that would cause this (at least in SessionManager or EventSinkManager code).

Which channel is used for client <-> server communications?
Is it possible to reproduce the issue locally on your development host or to attach to the running instance using Visual Studio to check out what exactly is consuming memory there?

Unrelated: If the ServerActivity.MessageDataTable here is a System.Data.DataTable instance then I would wrap the loop over it into a BeginLoadData / EndLoadData method calls. This would give you some performance improvement here.

Hi Anton,

Thanks for the reply.

We use SuperTCPServerChannel for communication, which again hasn’t changed for a long time. Server side in the latest version, there are no real changes at all, save for the RemObjects update. The modifications are all client side functionality - just extra screens that follow our standard framework pattern.

Unfortunately I can’t connect to the running instance on the production server with VS. The issue occurred twice yesterday, about six hours apart. The first time, an operator was asked to reboot the server. The second, I was contacted and I just restarted our .NET service. When the issue does happen again, it will also need to be restarted right away due to it being a rather key process for the customer i.e. I wouldn’t have time to investigate even if I could.

I’m happy to continue probing this locally, but haven’t had any success this morning. Ultimately I could take that logout notification code out, with the hope / expectation that the issue will go away. However, I’d rather not lose that functionality and of course I wouldn’t understand the underlying issue or be sure we’d fully removed the problem.

Given the exception, I’m assuming that there is an object that is continually trying to expand its memory up to the point of failure. Would that object be shared across different messages i.e. messages other than that generated in the OnLogout ? It seems unlikely that other messages are building up that memory usage with just the session expiry one tipping it over the edge. I say that as in both log events. it was the session expiry that triggered the failure. If another source of messages built up the object memory in the first place, it seems 99.99% likely that it would have been one of those messages that sent it over the edge and not the session expiry. So logically to me , it would be the session expiry somehow generating that memory usage and not something else ?

Could you try 2 things:

  1. Disable the expiry code altogether to see if it is actually the cause.

  2. Try to comment out only the line to see if there are any changes

    ev2.onLogout(lCompanyGuid,FormGuid);

  3. Then uncomment the line above and try to change this line

    var Targets : RecipientListEventTargets := new RecipientListEventTargets(lSessions);
    

to

  var Targets : RecipientListEventTargets := new RecipientListEventTargets(lSessions.Where(s -> s <> e.Session.SessionID).ToArray());

This is not very effective code, but it is worth to try it here

TIA

Thanks Anton.

I’ll try those suggestions over the coming days. As this is a bit trial and error, I’ll probably have to leave each change in place for a number of days to gauge its effect (so far today we’ve not seen the issue, so it’s not like we could tell immediately whether something has had an effect).

A couple of questions though :-

1). Would you agree that the onSessionExpired is the source of the error ? I’m convinced it should be, but your wording on the “disable expiry cod” option 1 sounds a little sceptical ?

2). I assume the linq change for the Targets is to prevent an attempt being made to send the expired session a notification of the expiry ? If so, should / would that cause a material difference in the older and newer RemObjects framework ?

It is always better to double-check. I’ve seen once a OutOfMemory exception being raised due to a .NET Framework bug (yes, it has bugs to), introduced by a Windows update installed at the same time as the SDK server update.

Again, it is a thing we need to check. SuperTCP has a buffer for guaranteed event delivery. So sending event targeted for an session being expired theoretically could cause this event to be put in the buffer. Still I cannot say this for sure - I’l log an issue to go over the code to double-check for it. Still it will rely on your feedback as well.

Thanks, logged as bugs://83317

Thanks Anton,

Just to update the thread, I’ve now created a modified version of the server application that can be started with a three way switch ( no change from current operation, send expiry notifications to all but the expired session and no sending of expiry notifications).

When I get user agreement, I can switch over to running that new version - I’m initially planning on using the option of notifying of expiry, but not to the expired session.

Since the last post we have seen this error once more on the same customer (earlier today). I happened to be logged onto their server whilst it was in progress and visibly it was consuming high CPU and memory usage was going up rapidly before eventually failing. Would that potentially fit with the same expiry message being accumulated in a buffer for “guaranteed delivery by a super TCP Channel” ?

I have also confirmed that we’ve seen this on one more customer installation. Just one example of the error, that occurred about 3 weeks ago and no issues since. This customer is at least one order of magnitude smaller in terms of sessions than the customer where we’ve now seen the problem 4 times over a couple of weeks.

Can you tell me how much sessions are in use on the larger customer (like dozens, hundreds or thousands)?

The larger customer probably peaks around 100-150 sessions, not that large in the scheme of things.

Hi Anton.

Latest update. I put modified code live last night on the large customer experiencing the issue with an option enabled that would NOT send the expiry notification back to the expired client. That code was live for about 11 hours before it hit a problem.

Below is the current version of the event. The self.SessionExpiry is my command line option for determining what to do with the expiry. For the failure this morning, I had it running with SessionExpiry.NotifyAllButExpired. I could not let the application continue until absolute failure as it was generating support calls from users - I had to terminate and restart. However, the CPU usage was high and memory getting close to 20GB (see below)

method Engine.sessionManager_OnSessionExpired(sender: System.Object; e: RemObjects.SDK.Server.SessionEventArgs);
var		lCompanyGUID	: System.Guid;
		FormGuid 		: System.Guid;
		iLoop			: Integer;
		rRow			: System.Data.DataRow;
		Sessions		: IList<Guid>;
		lSessions		: Array of Guid;
 begin
	try
		lCompanyGUID	:= new Guid(e.Session['CompanyGUID'].ToString);
		FormGuid		:= new Guid(e.Session['FormGuid'].ToString);

		Sessions := self.sessionManager.GetAllSessions;
		Array.Resize(var lSessions, Sessions.Count);
		Sessions.CopyTo(lSessions,0);
		
  		var Targets : RecipientListEventTargets;
		case Self.SessionExpiry of
			SessionExpiry.Standard 				: Targets := new RecipientListEventTargets(lSessions);
			SessionExpiry.NotifyAllButExpired 	: Targets := new RecipientListEventTargets(lSessions.Where(s -> s <> e.Session.SessionID).ToArray());
		end;
		
		case Self.SessionExpiry of
			SessionExpiry.Standard,
			SessionExpiry.NotifyAllButExpired : 
			begin
				var ev2: IServerEvents := eventSinkManager1.GetEventSink(e.Session.SessionID,typeOf(IServerEvents), Targets) as IServerEvents;
				ev2.onLogout(lCompanyGUID,FormGuid);
			end;
			SessionExpiry.NotificationOff : ;
		end;

		for iLoop := ServerActivity.MessageDataTable.Rows.Count-1 downto 0 do
		begin
			rRow := ServerActivity.MessageDataTable.Rows[iLoop];
			case (rRow['CompanyGUID'].ToString = lCompanyGUID.ToString) and (rRow['FormGuid'].ToString = FormGuid.ToString) of
				true : 	begin
								rRow.Delete;
								break;
							end;
			end;
		end;
	except on Ex:Exception do
		begin
			Logger.LogMessage(Ex.Message,EventLogEntryType.Error);
		end;
	end;
end;

The exception handling I added may give some clue, as this error was written to the event log by the Logger.LogMessage

sessionManager_OnSessionExpired : Array dimensions exceeded supported range.

As this server only has no more than 200 sessions, why would I be seeing that error (I’m assuming the range error would be an upper bound and not below the supported range). That is suggestive that the Array.Resize(var lSessions, Sessions.Count); is the triggering error, with the implication that the number of sessions is too high ?

I currently have the code running in live with the SessionExpiry.NotificationOff , which I had hoped would not trigger the issue. Given the array error though, I’m sceptical of that now as the rest of the code in the handler is still live.

My next fallback will be to either put in an option to skip all code in the OnSessionExpired, or simply not trigger it in the first place. I’d rather not have to do either though.

Hmm, so the code was running with normal memory consumption for 11 hours and then it suddenly started to consume enormous amounts of RAM or I misread you?

(A side note - If you don’t want to expose some info in this public thread you can write a mail to support@ where all conversations are kept private)

I have a bad feeling that we are looking in the wrong place. The worst thing regarding OutOfMemory exceptions is that they can be raised in a code COMPLETELY different from the one that actually caused the memory leak. So it is not necessary the event handler we are looking at.

What you need to do is
0.Download this: https://www.microsoft.com/en-us/download/confirmation.aspx?id=16273

1.Monitor memory / CPU consumption of the running server process for some time.
IF the average CPU and memory consumption are slowly raising over the time then yes, we have a memory leak
IF they say more or less the same and then “BOOM! everything is bad” then this is just one offending operation

2.Either way once you see that memory starts to rise start the CLR profiler, attach to the process and press “Show Heap” button

3.You will see something like this: https://i.imgur.com/UwERkqq.png

4.IMPORTANT. Check that the Details level is set to 1 and that Finalizer node is reasonable small: https://i.imgur.com/ZNhg9IS.png

5.Set details to like 10 and check what exactly consumes memory

6.Dbl-click the suspicious node and check detailed info: https://i.imgur.com/92W2Okm.png

7.Check details of the object consuming memory. F.e. right-click it and select ‘Show Individual Instances’ to find out what allocated and still holds these instances

Also when you open that graph you’ll see a path to the allocation log file in the window caption. Please send me this file via support@ so I’ll be able to help with its analysis.

Thanks Anton. I’ll work my way through that. It may be difficult to get that information as it is a live system that needs to be running. As soon as the problem strikes, users are “keen” to get it back as soon as possible.

We’re in “BOOM” territory on the memory usage. Prior to the RemObjects update, I’d never seen the issue. After that, we saw it 24 hours after install, then twice in 8 hours 2 days later, then again 3 days after that. Culminating in this morning, 11 or so hours after restarting with the modified code. I fairly sure (99%) there is no other code running around the point of failure i.e. I’m going to continue focusing on the expired sessions.

As time allows, I’ll get into the CLR Profiler. At the moment, the code is running with no expiry notifications being sent to clients, but with the event handler still being called. Hopefully that will run for X days without issue. If the failure happens at some point, then my next step would probably be to disable the handler altogether, retry and see how long that survives.

Memory leak would be a steady memory consumption increase, especially in the suspected scenario.

It is either a deadlock in the server channel code (tbh that is not that likely) or something else.

Still, I’ll wait for the results of your experiment. Also CLR Profiler dump of the hanged process seems to be the only way to really understand what is going wrong.

Well, that didn’t last long. It just went again, even when sending no expiry notifications to any client. I think it behaved as I was worried it would do - the array code in there still tripping it up, regardless of sending any notifications through the event sync. It’s now running with the next option in line, simply exiting the onSessionExpired without doing anything.

method Engine.sessionManager_OnSessionExpired(sender: System.Object; e: RemObjects.SDK.Server.SessionEventArgs);
var		lCompanyGuid	: System.Guid;
		FormGuid 		: System.Guid;
		iLoop			: Integer;
		rRow			: System.Data.DataRow;
		Sessions		: IList<Guid>;
		lSessions		: Array of Guid;
 begin
	try
		case Self.SessionExpiry of
			SessionExpiry.SkipAllHandling : exit;
		end;

The only option beyond that is to disable the OnSessionExpired.

Given we’re now down to one option remaining, I’ll bump the CLRProfiler up my list.

Most probably the session handler fails just because something else already consumed all the memory. CLR Profiler should help to find out what exactly did that.

Hi Anton - I’ve just sent a support email with links to a CLRProfiler log file and some supporting information.

Yeah, thanks. Analyzing it right now. I’ve sent you a response with 2 additional questions