Olympia server logging?

estebanp · March 18, 2020, 11:22pm

Hi guys,

Once again our Olympia was consuming crazy amounts of cpu and memory.
No call backs were being processed and all clients were connecting and disconnecting to it constantly.

We had to restart Olympia and forcibly our other 32 services because sessions were already lost since the last time Olympia lost connectivity.

No need to explain the disruption of such a restart. But, because it is a production environment and I can not take the time to take a memory Dump or profile the app, we need to go back up, so cant experiment much. Do you guys have some sort of logging on your Olympia?

I cant imagine such an important part of the SDK not having logs. Catching errors or issues on it must be quite difficult without them. If you guys do, is there a way to enable them so I can pass over the logs to you guys so they can be analyzed?

Thank you in advance.

antonk · March 19, 2020, 10:51am

Logging won’t give anything in the case when thing go really wrong. Mostly because if we can log some really bad thing then this means that we know that that bad thing can happen so we’d better add some recovery code for it.

As far as I understand this started to happen recently. What was changed prior to that (host configuration, Olympia version etc)? Also what is the EXACT Olympia version used there?

estebanp · March 19, 2020, 1:54pm

Hi Anton,

Thank you for answering.

Interesting, now days you just don’t log in case of errors, you instrumentalized method calls, you set points of control to determine the execution flow, internal counts and method calls latency looking for system degradation, thread exhaustion and more.

So, although I completely, totally and entirely disagree on the logging topic (reason why “distributed logging” (splunk, elastic, etc) is such a big thing these days) I got my answer. Thank you.

Also, the change was an update to the latest version. We needed a way to recover all subscriptions after a shutdown/restart of Olympia, so the update was a must. I added db lock errors and Olympia version used on a previous post referring to the same issue a few days back, but for reference, this is the version again: 10.0.0.1463.

thank you

antonk · March 20, 2020, 10:24am

I currently investigate ETW logging support in core RO/DA components, but there’s no ready solution for that.
Guess Olympia would be a good starting point (some general stats on session / events activity). What do you think?

Just trying to gather all the info in one place.

What was the previous version used? Does the CPU spike correlate with that event registration issue you mention in the other thread?

Need these answers to better understand what could go wrong there.

estebanp · March 20, 2020, 8:08pm

Today on one of our Azure dev environments we got this. Cpu was fine, but memory kept growing uncontrolled.

Logs showed exceptions coming from the message broker aka Olympia and clients eventually start getting Queue is full messages on service clients connected directly to Olympia. Which makes sense cause every request updates Olympia last update and when it doesnt work, everything else brakes.

It probably stopped working a while ago.

The previous version used was 9.4.109.1377. That version was bad on the fact that if it restarts, all subscriptions will be lost and all ours services will be left hanging without an easy way to recover them.

antonk · March 20, 2020, 8:18pm

Do you have exception name/message/stacktrace of these exceptions?

estebanp · March 20, 2020, 8:52pm

Hi Anton,

I only have Exception EROTimeout in module xx at xxx. Timeout waiting for response. for specific session GUIDs from the OnException messageBroker(Olympia). Sorry, no full stack trace from there.

This happened every time a request was sent to the client, probably due going and attempting to update last usage on Olympia.

antonk · March 20, 2020, 10:16pm

Yes. Every service call means that the session LastUsed timestamp is updated. Need to investigate if this can be caused by deadlocks database-side (that would cause service method to pause its execution and eventually timeout, causing the EROTimeout error client service-side).

Btw what connection pool size did you configure for Olympia db connection?

estebanp · March 21, 2020, 2:50am

Development environment:

{
“ServiceName”: “Dev.Olympia”,
“Port”: 9000,
“EventTimeout”: 1800,
“SessionTimeout”: 1800,
“PoolSize”: 10,
“InMemoryMode”: false
}

Production
{
“ServiceName”: “Prod.Olympia 10.0.0”,
“Port”: 9000,
“EventTimeout”: 3600,
“SessionTimeout”: 3600,
“PoolSize”: 20,
“InMemoryMode”: false
}

estebanp · March 23, 2020, 10:02pm

We are having the same issues right now.

Current memory usage of Olympia sits at 72 MB and growing . But, looking at the db we have:

5 rows on the Session table.
0 rows on the EventUserData
7 rows on EventUser
8 rows on EventSubscriptions
10 rows on EventData

I do not store much on the user sessions and even then, we don’t add data to them. We only stored profile and logging data once they login into the system.

So whatever is happening its on memory.

Right now we just get

23-03-2020 14:59:37.513 [EXCEPT] [ThreadId: 18524] [PID: 17772] MessageBroker exception for session {CBF55E29-2FE5-4496-A9DD-E9F9EB2077C5}.: Exception EROSuperChannelException in module GX.Cerberus.exe at 0000000000A6CED5.
No connection available.

On the service that controls sessions. If I restart messagebroker(Olympia), i’m forced to restart all services. (which I have to in any case cause no service can be authenticated right now).

This is on dev, so i’ll hear about it but not on the scale when it happens in production.

antonk · March 24, 2020, 12:31am

Were you able to capture the CLRProfiler dump?

estebanp · March 24, 2020, 4:45pm

So i managed to installed the app on the dev server. Attach it to the process, saw garbage collection, heap, options but saw no option to download dump. Care to provide instructions? Ill try to do it next time it happens.

antonk · March 24, 2020, 11:33pm

Sorry, my bad.

Start the CLRProfiler app. Target CLR version should be V4 Desktop CLR.
Press the Attach Process... button to attach to the target process. Target PID can be found in the Task Manager, Details tab. F.e. for Olympia name would be ROOlympiaServer.exe
Wait for a couple of seconds and press Detach Process
Go to File -> Save Profile As and save the file
Close CLRProfiler and let the memory grow a bit then do the steps again to get the updated log (to where exactly the memory consumption is growing)
Send us both log files for investigation.

Need to say that this looks a bit suspicious (ideally these tables should contain equal number of rows)

5 rows on the Session table.
7 rows on EventUser
8 rows on EventSubscriptions

antonk · April 1, 2020, 2:47pm

Hello

You might be interested in this feature as it will improve session performance in your scenatio: https://talk.remobjects.com/t/remoting-sdk-for-net-vnext-new-features/18046/23

Delphi-side support for this feature is under development right now and should be available next week.

estebanp · April 1, 2020, 5:50pm

Hi Anton,

Sounds like an interesting feature. So we are basically checking session deltas prior to submit the values or we are doing some sort of check server side on each incoming message to compare deltas and ignore db updates if there are no changes?

antonk · April 2, 2020, 10:15am

This is the logic in details:

If this is a new session then mark it as changed immediately
If this is a session received from Olympia then mark it as unchanged
- If a value is removed from session then mark it as changed
- If a value is set to session then mark it as changed
- If a complex value is retrieved from session (an array or a complex-type value) then mark it as changed . This ensures that if someone retrieves an array from session and then changes some element in that array then this updated array will be stored back in the session. Same for complex-typed values - this ensures that the updated instance will be stored in the session.
When a session is released for sessions that are not marked as changed there is no need to send the serialized session data back to Olympia. So only a session Id is sent back to ensure that Olympia will update the LastAccessed timestamp.

Note that last step uses a separate method exposed by Olympia instead of just using existing .UpdateSession one and passing there nil as a session data stream. This means that for old build of Olympia server app will raise an MethodNotFound exception instead of silently destroying session data on the Olympia Server side.

By default this new behavior is disabled, so exsting code won’t be affected by this feature and won’t use it unless explicitly told to.

antonk · April 7, 2020, 5:26pm

To keep you on-track:
Seems we have managed to reproduce the CPU/memory spike issue (while investigating a completely different issue) so we’ll test a fix for it tomorrow.
Also it seems that CPU spike issue did hide another possible issue Delphi-side, so we’ll work on its fix as well.

estebanp · April 7, 2020, 10:29pm

Thank you very much,

Once you feel we have something solid, please let me know and I’ll upgrade some projects to see if stability improves.

estebanp · April 8, 2020, 4:51pm

Today after installing two new services in DEV and starting them, they worked well for a few minutes and then Olympia stopped responding while increasing its memory usage and stayed on 25%-30% cpu usage.

I dont know if I did it right but I captured both logs and sent them to support.

estebanp · April 15, 2020, 2:43pm

Hi there,

Did the measurements helped? Did I took them as it was supposed to?

Currently our production Olympia is sitting on a memory usage of 1.4GB. Considering we dont store much on sessions this should be something else, I’m expecting it to die later today.

Do we feel the fixes are production ready? Do you know the details of the possible issue on the Delphi side of things?

Thank you.