Hi guys,
This is a story, so grab popcorn.
For us, high availability is extremely important, so we always need redundancy in place. Every service has an identical instance running on a different server, and calls from the clients are distributed among them via a load balancer. We use SuperTCP channels with Delphi and .NET clients.
Regular API calls are answered by whichever instance the client ended up connecting to, and all is good. Managing events is a bit more interesting: every instance subscribes to the same service from a different machine and listens to the same events, so without coordination the same action would be triggered once per instance.
To solve this, we implemented a common pattern: the first service of a given type that has to send out messages registers itself as the “active” controller, and all other services of that type are marked as passive. Every x seconds (30 to 40), the active service renews its status as active, and the other services check whether they can become the new active controller or must remain passive. That interval is the worst-case “down time” after a service goes down, before another one takes control and becomes the “active” controller.
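To make the pattern concrete, here is a minimal sketch in Python (our real services are Delphi and .NET, so this is illustration only). The names `ActiveRegistry` and `heartbeat` are hypothetical, the registry is kept in memory instead of a shared store, and the clock is passed in as `now` so the timing is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    owner: Optional[str] = None   # node currently marked as "active"
    expires_at: float = 0.0       # time after which the lease may be taken over


class ActiveRegistry:
    """Hypothetical registry: one active controller per service type."""

    def __init__(self, ttl_seconds=30.0):
        self.ttl = ttl_seconds
        self.leases = {}

    def heartbeat(self, service_type, node, now):
        """Called by every node each interval; returns True if node is active."""
        lease = self.leases.setdefault(service_type, Lease())
        # Claim or renew if the lease is free, already ours, or has expired.
        if lease.owner in (None, node) or now >= lease.expires_at:
            lease.owner = node
            lease.expires_at = now + self.ttl
        return lease.owner == node


registry = ActiveRegistry(ttl_seconds=30.0)
a_first = registry.heartbeat("sessions", "A", now=0)      # A registers first
b_early = registry.heartbeat("sessions", "B", now=10)     # B stays passive
a_renew = registry.heartbeat("sessions", "A", now=30)     # A renews until t=60
b_takeover = registry.heartbeat("sessions", "B", now=60)  # A stopped renewing
```

In this sketch, node B only takes over once A has missed a renewal and the lease has run out, which is where the 30 to 40 seconds of down time comes from.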
When a message is sent to all services of a type, only the one marked as “active” acts on it or replies, so we don't get duplicated actions. A service that is not active simply exits the routine immediately.
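The guard itself is tiny. A rough Python sketch, assuming a hypothetical `Service` class whose `is_active` flag is kept up to date by the periodic check described earlier:

```python
class Service:
    """Hypothetical service instance; is_active is set by the periodic check."""

    def __init__(self, name, is_active):
        self.name = name
        self.is_active = is_active
        self.handled = []

    def on_message(self, message):
        # Passive instances exit the routine immediately.
        if not self.is_active:
            return False
        self.handled.append(message)  # only the active controller acts
        return True


def broadcast(services, message):
    """Deliver the message to every instance; return the names that acted."""
    return [s.name for s in services if s.on_message(message)]


nodes = [Service("node1", True), Service("node2", False), Service("node3", False)]
acted = broadcast(nodes, "session-expired")
```

Every instance still receives the message, but only `node1` ends up doing the work.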
All of this works perfectly: we have around 16 services on each of our 3 server nodes, with thousands of messages flying around, and everything runs fine with minimal down time.
The problem starts with Olympia. We apply the same logic to the service that manages the sessions: we run several instances of it, and only the active controller deletes or updates session data when an event is triggered from Olympia via the TROOlympia session component. It works great.
But Olympia itself has no redundancy. We can't have two Olympias running and receiving round-robin calls, and that is a major, major issue.
A fully redundant solution for Olympia has been in the pipeline for a while, and since I have no access to its code, I can't implement something similar for Olympia on my own. So, is it possible to implement a simple active/passive solution inside Olympia? It would allow two Olympias to listen on different machines and answer calls, but sending events out or doing global cleanups would only be done from the active node, just like we do. If the active node stops renewing its status as active, the other node would simply take over.
The solution is simple and well known, and all the elements are in place for Olympia to have it: access to a db, internal timers, and control over the events that are triggered.
If not, what options do we have to implement redundancy for Olympia? Any directions or ideas on that would definitely help. Right now that's the only major point of concern in our solution.
Thanks in advance.