I have just updated our copy of RO SDK from 9.4 to 10.0 and run our extensive unit test suites for the products that use the SDK for their communication needs.
They all succeeded except for one case, where the following happens (a sketch of the pattern is shown after the list):

1. The client calls a method on the server.
2. The server starts a lengthy process, itself waiting for a TEvent to be signaled.
3. As this event is never signaled, the wait times out and an exception is raised.
4. The server method catches that exception and raises a new one of type EServerWaitedTooLong, which inherits from EROException so that it gets raised with the appropriate kind on the client.
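For illustration, here is a minimal sketch of that pattern. EServerWaitedTooLong and EROException are the real names from above, but the service method name, the wait helper and the timeout value are hypothetical, and the unit declaring EROException may differ between SDK versions:

  type
    // Inheriting from EROException is what lets the SDK serialize the
    // exception and re-raise it with the same type on the client
    EServerWaitedTooLong = class(EROException);

  procedure TMyService.DoLengthyWork;  // hypothetical service method
  begin
    try
      // hypothetical helper: waits on the TEvent and raises when the
      // wait times out, which is what happens in the failing test
      WaitForEventOrRaise(fEvent, 30000);
    except
      on E: Exception do
        raise EServerWaitedTooLong.Create(E.Message);
    end;
  end;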
Now, with 9.4, the EServerWaitedTooLong exception is raised on the client and all is fine. But with 10.0, the exception is raised by the server, the client never sees it, and it finally aborts by raising EROTimeout long after.
I traced the code on the server, and I see that TROBinMessage.WriteException runs to completion. However, on the client side, TROBinMessage.IntReadFromStream is never called with lHeader.MessageType indicating an exception message.
Continuing tracing in the server, I ended up inside TROInvokerQueueItem.Callback, where there is this line:

  if l.IndexOf(FClient) = -1 then exit;

And sure enough, the method exits right here. I tried reproducing this in a test application but, as is almost always the case, everything works fine there: the exit above is not triggered.
I then wondered: what could be removing my client from the list? That's quite easy to see, it all happens in the finally part of the TROBaseSuperTCPServer.IntExecute method, which means that fChannel.DoExecute has returned. In the test demo, this only happens when the client application exits. But in my unit test, it happens 4 times, removing all clients, right before the exception is raised. So when TROInvokerQueueItem.Callback is reached, it considers there is no one left to receive the message, despite the fact that my client application is still there waiting.
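Schematically, and simplifying a lot (this is paraphrased from my tracing, not the literal SDK source; RemoveClient is a made-up name for whatever the finally block actually does), the structure is:

  procedure TROBaseSuperTCPServer.IntExecute;
  begin
    try
      // blocks for the lifetime of the connection
      fChannel.DoExecute;
    finally
      // once DoExecute returns, the client is removed from the list
      // that TROInvokerQueueItem.Callback checks with IndexOf
      RemoveClient(FClient);  // made-up name
    end;
  end;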
I dug deeper into fChannel.DoExecute and ended up on the fact that TROSynapseSuperTcpConnection.TryReadByte is returning False because fConnection.RecvBufferEx has returned 0.
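That is, a zero-byte read is treated as a dead connection, the usual convention for blocking socket reads. Schematically (names and signature simplified; lTimeout stands for whatever the timeout computation discussed below produces):

  function TROSynapseSuperTcpConnection.TryReadByte(out aByte: Byte): Boolean;
  begin
    // Synapse's RecvBufferEx returns the number of bytes actually read;
    // 0 means the read failed or the peer closed the connection
    Result := fConnection.RecvBufferEx(@aByte, 1, lTimeout) = 1;
  end;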
This all looked very strange to me, and it got even stranger as I looked at the last error on fConnection, which said "Connection reset by peer". So it appears to the server as if the client has decided to close the underlying connection; and yet that same client is waiting on the server to give it an answer, so it can't possibly be killing the connection.
I went on to inspect the client side and, sure enough, there is a timeout inside TROSuperTcpConnection.TryReadByte that is ultimately given to the select function from WinSock. This value is built from PingSecs, and because it has its default value of 60, this gives 24000 ms for the socket timeout. Alas, the process on the server takes at least 28 seconds, which is longer than the timeout and thus explains why the socket is "abandoned".
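To spell out the arithmetic, using the formula quoted further down:

  // socket timeout = (PingSecs * 10 div 25) * 1000
  //                = (60 * 10 div 25) * 1000
  //                = 24 * 1000 = 24000 ms, i.e. 24 seconds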
My first reflex was to double PingSecs, but there is an assert inside its setter that limits it to 60. So, lo and behold, I changed the 60 in that assert to 120. But that does not help either: the timeout is stuck at 24 seconds.
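The setter boils down to something like this (schematic; the surrounding class and field names are my guesses, not the literal SDK source):

  procedure SetPingSecs(const Value: Integer);
  begin
    Assert(Value <= 60);  // the assert that blocked my first attempt
    fPingSecs := Value;
  end;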
So, before going any further, I changed the code inside TROSuperTcpConnection.TryReadByte so that the timeout is not computed from PingFrequency but is rather a constant 60 seconds. And yes, finally, my unit test is working again, so clearly there must be a way to specify more than 24 seconds as the timeout given to select.
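Concretely, the temporary change inside TryReadByte amounted to this:

  // instead of computing the timeout from PingFrequency:
  //   lTimeout := (PingFrequency * 10 div 25) * 1000;
  // I hardcoded a constant 60 seconds:
  lTimeout := 60 * 1000;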
Further investigation showed that PingSecs is read by TROSCClientWorker.Create, meaning that if the channel sees its PingSecs value changed after its creation, the change is not propagated to its fClient. Luckily for us, there is a setter for PingSecs, so it's quite easy to add this inside it:
  if Assigned(fClient) then
  begin
    // propagate the new value to the existing worker, mirroring what
    // TROSCClientWorker.Create does at construction time
    fClient.PingFrequency := fPingSecs;
    // keep the timeout at 1.5 times the frequency
    fClient.PingTimeout := fClient.PingFrequency * 15 div 10;
  end;
And with this, I get it working. But I still wonder what the point is of the formula used to turn the ping frequency into a socket timeout:

  (PingFrequency * 10 div 25) * 1000

It's being used in all implementations of the super servers and channels, but I don't understand why the ping frequency must be divided by 2.5 to get a timeout for the socket calls. Why not use the ping frequency itself?
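For reference, here is what the formula yields for the two values discussed here; only 120 clears the 28 seconds my server process needs:

  // (PingFrequency * 10 div 25) * 1000
  //   PingFrequency = 60  -> 24000 ms (24 s)
  //   PingFrequency = 120 -> 48000 ms (48 s)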
Right now, to solve my issue, I'll take the shortest road and change the default value of all channels from 60 to 120, but I'd be very interested in answers to the following questions resulting from my investigation:

1. Why is PingSecs limited to 60?
2. Why doesn't it propagate when changed?
3. Why does it get divided by 2.5 to be used as a socket timeout?