I have just updated our copy of the RO SDK from 9.4 to 10.0 and have run our extensive unit test suites covering the products that use the SDK for their communication needs.
They all succeeded except for one case, where the following happens:
The client calls a method on the server.
The server starts a lengthy process, itself waiting for a TEvent to be signaled.
As this event is never signaled, the method raises an EWaitedTooLong exception.
The server method catches that exception and raises a new one of type EServerWaitedTooLong, which inherits from EROException so that it gets raised with the appropriate type on the client.
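For context, the pattern on the server boils down to something like this. It is a minimal sketch: the service class, fEvent and the 30 s value are made up, EWaitedTooLong stands in for whatever the waiting layer raises, and I'm assuming EROException comes from uROClient (adjust for your SDK version):

type
  // inherits from EROException so the SDK can marshal it back to the caller
  EServerWaitedTooLong = class(EROException);

procedure TMyService.LengthyProcess;
begin
  try
    // fEvent: TEvent from System.SyncObjs; in the failing test it is never signaled
    if fEvent.WaitFor(30000) <> wrSignaled then
      raise EWaitedTooLong.Create('gave up waiting for the event');
  except
    on E: EWaitedTooLong do
      // re-raise with a type the remoting layer knows how to transport
      raise EServerWaitedTooLong.Create(E.Message);
  end;
end;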
Now, with 9.4, the EServerWaitedTooLong exception is raised on the client and all is fine. But with 10.0, the exception is raised by the server, yet the client never sees it and finally aborts long after by raising EROTimeout.
I traced the code on the server, and I can see that TROBinMessage.WriteException runs to completion. However, on the client side, TROBinMessage.IntReadFromStream is never called with lHeader.MessageType set to MESSAGE_TYPE_EXCEPTION.
Continuing to trace on the server, I ended up inside TROInvokerQueueItem.Callback, where there is this line:

if l.IndexOf(FClient) = -1 then exit;

And sure enough, the method exits right there. I tried reproducing this in a test application but, as is almost always the case, everything works fine there: the exit above is never triggered.
I then wondered: what could be removing my client from the list? That's quite easy to see; it all happens in the TROBaseSuperTCPServer.IntExecute method, in its finally part. This means that fChannel.DoExecute has returned. In the test demo, this only happens when the client application exits. But in my unit test, it happens 4 times, removing all clients, right before the exception is raised. So when TROInvokerQueueItem.Callback is reached, it considers there is no one left to receive the message, even though my client application is still there waiting.
I dug deeper into fChannel.DoExecute and ended up at the fact that TROSynapseSuperTcpConnection.TryReadByte returns False because fConnection.RecvBufferEx has returned 0.
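Paraphrased, the spot looks like this; everything except RecvBufferEx is my naming, so treat it as a sketch rather than the SDK's literal code:

// fConnection is Synapse's TTCPBlockSocket; RecvBufferEx returns the
// number of bytes actually read within aTimeoutMs
function TryReadByteSketch(aTimeoutMs: Integer; out aByte: Byte): Boolean;
begin
  Result := fConnection.RecvBufferEx(@aByte, 1, aTimeoutMs) = 1;
  // in my trace it returns 0, so the caller concludes the connection is gone
end;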
This all looked very strange to me, and it got even stranger when I looked at the LastError and LastErrorDesc of fConnection, which said "Connection reset by peer".
So it appears to the server as if the client had decided to close the underlying connection, and yet that same client is waiting on the server to give it an answer; it cannot possibly be killing the connection.
I went on to inspect the client side, and sure enough, there is a timeout inside TROSuperTcpConnection.TryReadByte that is ultimately handed to the select function from WinSock. This value is built from PingSecs, and because it has its default value of 60, this gives 24000 ms for the socket timeout.
Alas, the process on the server takes at least 28 seconds, which is longer than that, and this explains why the socket looks "abandoned".
My first reflex was to double PingSecs, but there is an assert inside its setter that limits it to 60. So, lo and behold, I changed that 60 to 120 in the SDK source. But that does not help either; the timeout stays stuck at 24 seconds.
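For illustration only, the guard behaves like this; this is reconstructed, not the SDK's literal code, and the signature is approximate:

procedure SetPingSecs(const Value: Integer);
begin
  Assert(Value <= 60); // the 60 I patched to 120
  fPingSecs := Value;
end;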
So, before going any further, I changed the code inside TROSuperTcpConnection.TryReadByte so that the timeout is not computed from PingFrequency but is a constant 60 seconds instead.
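Concretely, the experiment amounted to this one-liner, where lTimeout is my name for the local value that ultimately reaches fConnection.CanRead:

// inside TROSuperTcpConnection.TryReadByte, bypassing the computed value
lTimeout := 60 * 1000; // hard-coded 60 s, expressed in milliseconds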
And yes, finally, my unit test works again, so clearly there must be a way to specify more than 24 seconds as the parameter to fConnection.CanRead.
Further investigation showed that PingSecs is only read by TROSCClientWorker.Create, meaning that if the channel sees its PingSecs value change after its creation, the change is not propagated to its fClient. Luckily for us, there is a setter for PingSecs, so it's quite easy to add this inside it:
if Assigned(fClient) then
begin
  // propagate the new value to the worker created earlier
  fClient.PingFrequency := fPingSecs;
  // keep the timeout at 1.5 times the ping frequency
  fClient.PingTimeout := fClient.PingFrequency * 15 div 10;
end;
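With that in place (and with the assert relaxed as described above), bumping the value on a live channel finally sticks; fChannel stands for whatever super-TCP channel instance is in use:

fChannel.PingSecs := 120;
// now propagated: fClient.PingFrequency = 120 and fClient.PingTimeout = 180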
And with this, I get it working. But I still wonder what the point is of this formula inside TROSuperTcpConnection.TryReadByte:

(PingFrequency * 10 div 25) * 1000

It is used in all implementations of the super servers and channels, but I don't understand why the ping frequency must be divided by 2.5 to obtain a timeout for the socket calls. Why not use PingFrequency directly?
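Worked out with the values at hand, the expression is just an integer division by 2.5 followed by a conversion to milliseconds:

// PingFrequency = 60:  (60 * 10) div 25 = 24  ->  24000 ms (the 24 s above)
// PingFrequency = 120: (120 * 10) div 25 = 48  ->  48000 ms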
Right now, to solve my issue, I'll take the shortest road and change the default value of all channels from 60 to 120, but I'd be very interested in answers to the following questions arising from my investigation:

Why is PingSecs limited to 60?
Why doesn't it propagate when changed?
Why is it divided by 2.5 before being used as a socket timeout?
Regards