Olympia redundancy

estebanp · May 2, 2019, 7:23pm

Hi guys,

This is a story, so grab popcorn.

For us, high availability is extremely important. We always need redundancy in place. Every service has an identical counterpart running on a different server, and calls from the clients are distributed among them via a load balancer. We use SuperTCP channels with Delphi and .NET clients.

So regular API calls are answered by whichever service the client ended up connecting to, and all is good. Managing events is a bit more interesting: because every instance subscribes to the same service on a different machine and listens to the same events, duplicate calls or actions get triggered on those services.

In order to solve this, we implemented a common pattern where the first service of a type that has to send out messages registers itself as the “active” controller and all other services are marked as passive. Every x seconds (30/40), the active service revalidates its status as active, and the other services check whether they can become the new active controller or must remain passive. If a service goes down, that interval is the “down time” before another instance takes control and becomes the “active” controller.

If a message is sent to all services of a type, the ones that need to act or reply in some way will do so only if they are marked as “active”, so we don't get duplicated actions. If the service is not active, it simply exits the routine immediately.
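A minimal sketch of the active/passive pattern described above, in Python for brevity rather than Delphi or .NET. The names `LeaseStore`, `Service`, `try_acquire`, `heartbeat` and `on_broadcast` are hypothetical, and the in-memory store only stands in for a shared db row; a real implementation would need an atomic compare-and-set in the database.

```python
import time


class LeaseStore:
    """In-memory stand-in for the shared db row that records who is active."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.holder = None      # id of the service currently marked active
        self.expires_at = 0.0   # time after which the lease may be taken over

    def try_acquire(self, service_id, ttl):
        """Grant or renew the lease; return True if service_id is now active."""
        now = self._clock()
        if self.holder in (None, service_id) or now >= self.expires_at:
            self.holder = service_id
            self.expires_at = now + ttl
            return True
        return False


class Service:
    def __init__(self, service_id, store, ttl=30.0):
        self.service_id = service_id
        self.store = store
        self.ttl = ttl          # assumed lease length, matching the 30/40 s check
        self.active = False

    def heartbeat(self):
        """Called by a timer: renew the lease if active, else try to take over."""
        self.active = self.store.try_acquire(self.service_id, self.ttl)

    def on_broadcast(self, message):
        """Handler for a message sent to all services of this type."""
        if not self.active:
            return None  # passive: exit the routine immediately
        return f"{self.service_id} handled {message}"
```

With two services sharing one store, only the first to call `heartbeat()` acts on broadcasts; the other takes over once the lease has gone unrenewed for longer than `ttl`.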

All of this works perfectly: we have around 16 services on each of the 3 server nodes, with thousands of messages flying around, and everything works fine with minimal down time.

The problem starts with Olympia. We apply the same logic to the service that manages the sessions. We have several of them, and only the active controller will delete or update session data when an event is triggered from Olympia via the TROOlympia session component. It works great.

But Olympia itself has no redundancy. We can't have two Olympias running and receiving round-robin calls between them, and that is a major, major issue.

Creating a fully redundant solution for Olympia has been in the pipeline for some time, and having no access to its code, I can't implement a similar solution for Olympia on my own. So, is it possible to implement a simple active/passive solution inside Olympia that would allow two Olympias to listen on different machines, answer calls, etc., but have event sending and global cleanups done only from the active node, just like we do? And if the active node stops renewing its status as active, to simply let the other node take over?

The solution is simple and well known, and all the elements are in place for Olympia to have it: access to a db, internal timers, and control over the events that are triggered.

If not, what options do we have to implement redundancy for Olympia? Any directions or ideas on that aspect will definitely help. Right now that's the only major point of concern in our solution.

Thanks in advance.

estebanp · May 2, 2019, 7:45pm

Thinking out loud: what if we have two of them running, listening on the same port on different machines? What harm would it cause?

Each service connecting to Olympia will stay attached to that instance due to the nature of the SuperTCP connection, so that should not be an issue: there will basically be only one writer per service when Olympia is implemented over SQL or another major db. (With SQLite we would probably have issues.)

Both instances will attempt to do session cleaning, depending on how it works internally, but since the db is the last point of shared state, whoever gets there first will win, and if it is done via transactions it will probably survive.
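A rough sketch of the "whoever gets there first wins" idea, using Python's standard `sqlite3` module purely for illustration. The `sessions` table, its columns, and the `open_db` and `clean_expired` helpers are assumptions; this says nothing about how Olympia actually stores or cleans its sessions.

```python
import os
import sqlite3
import tempfile


def open_db(path):
    """Open one connection per 'instance' against the same shared database file."""
    conn = sqlite3.connect(path, timeout=5.0)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY, expires_at REAL)"
    )
    conn.commit()
    return conn


def clean_expired(conn, now):
    """Delete expired sessions in one transaction; return how many rows this caller removed."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("DELETE FROM sessions WHERE expires_at <= ?", (now,))
        return cur.rowcount


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "sessions.db")
    first, second = open_db(path), open_db(path)
    with first:
        first.execute("INSERT INTO sessions VALUES ('s1', 10.0)")
    # The first instance removes the row; the second finds nothing left to do.
    print(clean_expired(first, 50.0), clean_expired(second, 50.0))
```

Because the delete is a single transactional statement, the second caller simply affects zero rows instead of failing, which is the behavior the paragraph above is hoping for.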

This part I'm guessing: OnSessionDelete, OnBeforeSessionDelete, etc. messages will be triggered only on the service clients that connected directly to their specific Olympia server. That could be an issue, because the message will not be heard by all services with a TROOlympia regardless of which instance they connected to. If the message is sent to everyone then that's no problem: each service client will decide, based on its active/passive role, whether it is something to act on or not.

The same goes for all kinds of pub/sub situations: if the message goes out from both Olympias to their respective client connections regardless of who did the update, then we should be fine. Otherwise, we have an issue.

Interesting topic indeed.

antonk · May 3, 2019, 7:28pm

Hello

More than interesting

Well, the only simple thing here is that the Olympia sources are actually shipped.
They are shipped as a part of Data Abstract for .NET sources and can be found at ...\RemObjects Software\Data Abstract for .NET\Source\Olympia. Building them would require an active Data Abstract license and Elements compiler.

antonk · May 3, 2019, 7:53pm

Event management with multiple Olympia instances looking at the same database would work out of the box for the old-style poll events model.

Unfortunately, IIRC you use SuperTCP (and thus the push events model), so we need to work out some solution. At the moment, when something sends out an event, that event is relayed (when a super-channel is used) only to known client sessions. Other sessions (attached to another Olympia instance) won’t receive that event (although it should land in their db record).

Now we need to brainstorm some solution for this

Crazy solution: let the OlympiaMessageQueueManager connect and register itself in all Olympia instances simultaneously.

Then, regardless of which O instance tries to send out events, the events will go out to all listeners anyway.

The only difficulty I see here is that if one of the O instances crashes and then comes back up, the QueueManager will have to re-register its event clients in that instance.

For example, a server app will have a component (let’s call it OlympiaRegistry) that periodically pings known O instance addresses and ports (both running and stopped). Once a new O instance is discovered, or a known O instance doesn’t respond anymore, that Registry will ping the QueueManager either to register itself in the new O instance or to remove that O instance from the list of valid ones.
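To make the idea concrete, here is a small sketch of such a registry in Python. Everything in it is hypothetical: the `probe` callable stands in for whatever ping the real component would use, and `on_up` / `on_down` stand in for the QueueManager registering itself in, or dropping, an O instance. It is not an existing RemObjects API.

```python
class OlympiaRegistry:
    """Tracks which known Olympia endpoints currently respond to a ping."""

    def __init__(self, endpoints, probe, on_up, on_down):
        self.endpoints = list(endpoints)  # all known addresses, running or stopped
        self.probe = probe                # callable(endpoint) -> bool
        self.on_up = on_up                # e.g. QueueManager registers itself there
        self.on_down = on_down            # e.g. QueueManager drops that instance
        self.alive = set()

    def poll_once(self):
        """One ping round; a timer would call this periodically."""
        for endpoint in self.endpoints:
            try:
                up = bool(self.probe(endpoint))
            except OSError:
                up = False  # connection refused or timed out counts as down
            if up and endpoint not in self.alive:
                self.alive.add(endpoint)
                self.on_up(endpoint)
            elif not up and endpoint in self.alive:
                self.alive.discard(endpoint)
                self.on_down(endpoint)
```

The callbacks fire only on state changes, so an instance that stays up does not cause repeated registrations, and one that crashes and comes back triggers exactly one `on_down` followed by one `on_up`.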

What do you think about such an approach? Do you see any obvious caveats? Also, are your servers .NET or Delphi based?

estebanp · May 3, 2019, 8:33pm

Hi Anton,

Thank you very much for your reply.

Let's say for this conversation that most of them are Delphi-based servers (not all of them), and some of them will have Olympia classes (components) on them if they need to emit messages to everyone.

We implemented a “service subscription” registry on each service. Let me explain: currently, in RO, if a service crashes or is stopped, all client subscriptions to an “Event Sink” are lost and the responsibility for restoring them is passed to the client. This is fine if it happens on a session that is expired or timed out. But if that isn't the case, registrations to events of multiple services should be restored automatically; clients shouldn't be asked to do this.

So after a client registers to an event sink, we persist all the event subscriptions a given session has in db/memory. Initially we tried using the Olympia shared session itself to store them, but for some reason we started having odd cases where the key/value doesn't get stored. So we do it in our own db.

When a service starts/restarts/recovers, during the OnClientConnected event we check whether the session exists, and if it does, we look up all existing subscriptions for that session and register them with Olympia. Strangely enough, we have to do this with every service of the same type, because it seems that this data is held only in memory on a specific instance. After that, things work fine.
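Roughly, the restore step looks like the Python sketch below. The class and method names (`SubscriptionStore`, `ServiceInstance`, `on_client_connected`) are made up for illustration and the store is an in-memory stand-in for our own db; the real services are Delphi/.NET and register with Olympia rather than with a local set.

```python
from collections import defaultdict


class SubscriptionStore:
    """Stand-in for the db table that persists event subscriptions per session."""

    def __init__(self):
        self._by_session = defaultdict(set)

    def add(self, session_id, event_sink):
        self._by_session[session_id].add(event_sink)

    def for_session(self, session_id):
        return sorted(self._by_session.get(session_id, ()))


class ServiceInstance:
    def __init__(self, store, known_sessions):
        self.store = store
        self.known_sessions = known_sessions  # ids of sessions that still exist
        self.registered = set()               # in-memory only, lost on restart

    def subscribe(self, session_id, event_sink):
        self.registered.add((session_id, event_sink))
        self.store.add(session_id, event_sink)  # persist so it survives a restart

    def on_client_connected(self, session_id):
        """Restore subscriptions for a session that outlived a service restart."""
        if session_id not in self.known_sessions:
            return 0  # expired session: the client has to start over
        restored = 0
        for sink in self.store.for_session(session_id):
            if (session_id, sink) not in self.registered:
                self.registered.add((session_id, sink))
                restored += 1
        return restored
```

Since `registered` lives only in memory, every instance of the same service type has to run this restore for itself, which matches what we observed.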

Could that be similar to the idea you were having?

estebanp · May 7, 2019, 5:23am

So I did some tests.

I set up two Olympias on different computers with the same settings, sharing the db and Application Id.

The services connected to separate instances of Olympia.

By doing an all-sessions registration on each instance of TROOlympia wherever present, as described before, messages are being received by all services regardless of the Olympia they are connected to. Is this supposed to work? I'm happy it does, but I don't know if I might be getting wrong results.

OnBefore/Before etc. session operations from Olympia are not shared among Olympias, and that causes a problem if by any chance two services in charge of doing session operations on those Olympia events land on the same Olympia.

Or a “passive” service may land on the Olympia that got the event, but nothing will happen because that service is marked as passive.

But if I am indeed getting callbacks/event sinks across multiple Olympias, then I could simply use the events of the Olympia server as triggers for event sinks on other services that are ready to do those session cleaning operations, and only the active ones will act. I’ll then remove Olympia's session restrictions (passive/active) and move all that responsibility to the services that have that capability.

This all depends on whether I can get event/session data across multiple Olympias.

antonk · May 7, 2019, 8:21am

Well, this should work for sure if client-server communications in this scheme use simple channels:

Olympia <---> Server <---> Client

For super-channels, this needs to be tested. There will probably be some delays in delivering events if the receiving client is connected to a server that is connected to an O instance other than the one that received the event ping.

Actually, if you could test this in a production-like environment, I would be very, very grateful. You can use our Chat and SuperHTTP/SuperTCP chat samples as server/client pairs.