I have just updated our copy of the RO SDK from 9.4 to 10.0 and have run our extensive unit test suites covering the products that use the SDK for their communication needs.
They all succeeded except for one case, where the following happens:
The client calls a method on the server.
The server starts a lengthy process, itself waiting for a TEvent to be signaled.
As this event is never signaled, the method raises an EWaitedTooLong exception.
The server method catches that exception and raises a new one of type EServerWaitedTooLong, which inherits from EROException so that it gets raised with the appropriate type on the client.
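For context, the pattern on the server boils down to something like this. It is a minimal sketch: the service class, fEvent and the 30 s value are made up, EWaitedTooLong stands in for whatever the waiting layer raises, and I'm assuming EROException comes from uROClient (adjust for your SDK version):

type
  // inherits from EROException so the SDK can marshal it back to the caller
  EServerWaitedTooLong = class(EROException);

procedure TMyService.LengthyProcess;
begin
  try
    // fEvent: TEvent from System.SyncObjs; in the failing test it is never signaled
    if fEvent.WaitFor(30000) <> wrSignaled then
      raise EWaitedTooLong.Create('gave up waiting for the event');
  except
    on E: EWaitedTooLong do
      // re-raise with a type the remoting layer knows how to transport
      raise EServerWaitedTooLong.Create(E.Message);
  end;
end;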
Now, with 9.4, the EServerWaitedTooLong exception is raised on the client and all is fine. But with 10.0, the exception is raised by the server, yet the client never sees it and finally aborts long after by raising EROTimeout.
I traced the code on the server, and I can see that TROBinMessage.WriteException runs to completion. However, on the client side, TROBinMessage.IntReadFromStream is never called with lHeader.MessageType set to MESSAGE_TYPE_EXCEPTION.
Continuing to trace on the server, I ended up inside TROInvokerQueueItem.Callback, where there is this line:

if l.IndexOf(FClient) = -1 then exit;

And sure enough, the method exits right there. I tried reproducing this in a test application but, as is almost always the case, everything works fine there: the exit above is never triggered.
I then wondered: what could be removing my client from the list? That's quite easy to see; it all happens in the TROBaseSuperTCPServer.IntExecute method, in its finally part. This means that fChannel.DoExecute has returned. In the test demo, this only happens when the client application exits. But in my unit test, it happens 4 times, removing all clients, right before the exception is raised. So when TROInvokerQueueItem.Callback is reached, it considers there is no one left to receive the message, even though my client application is still there waiting.
I dug deeper into fChannel.DoExecute and ended up at the fact that TROSynapseSuperTcpConnection.TryReadByte returns False because fConnection.RecvBufferEx has returned 0.
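Paraphrased, the spot looks like this; everything except RecvBufferEx is my naming, so treat it as a sketch rather than the SDK's literal code:

// fConnection is Synapse's TTCPBlockSocket; RecvBufferEx returns the
// number of bytes actually read within aTimeoutMs
function TryReadByteSketch(aTimeoutMs: Integer; out aByte: Byte): Boolean;
begin
  Result := fConnection.RecvBufferEx(@aByte, 1, aTimeoutMs) = 1;
  // in my trace it returns 0, so the caller concludes the connection is gone
end;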
This all looked very strange to me, and it got even stranger when I looked at the LastError and LastErrorDesc of fConnection, which said "Connection reset by peer".
So it appears to the server as if the client had decided to close the underlying connection, and yet that same client is waiting on the server to give it an answer; it cannot possibly be killing the connection.
I went on to inspect the client side, and sure enough, there is a timeout inside TROSuperTcpConnection.TryReadByte that is ultimately handed to the select function from WinSock. This value is built from PingSecs, and because it has its default value of 60, this gives 24000 ms for the socket timeout.
Alas, the process on the server takes at least 28 seconds, which is longer than that, and this explains why the socket looks "abandoned".
My first reflex was to double PingSecs, but there is an assert inside its setter that limits it to 60. So, lo and behold, I changed that 60 to 120 in the SDK source. But that does not help either; the timeout stays stuck at 24 seconds.
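For illustration only, the guard behaves like this; this is reconstructed, not the SDK's literal code, and the signature is approximate:

procedure SetPingSecs(const Value: Integer);
begin
  Assert(Value <= 60); // the 60 I patched to 120
  fPingSecs := Value;
end;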
So, before going any further, I changed the code inside TROSuperTcpConnection.TryReadByte so that the timeout is not computed from PingFrequency but is a constant 60 seconds instead.
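Concretely, the experiment amounted to this one-liner, where lTimeout is my name for the local value that ultimately reaches fConnection.CanRead:

// inside TROSuperTcpConnection.TryReadByte, bypassing the computed value
lTimeout := 60 * 1000; // hard-coded 60 s, expressed in milliseconds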
And yes, finally, my unit test works again, so clearly there must be a way to specify more than 24 seconds as the parameter to fConnection.CanRead.
Further investigation showed that PingSecs is only read by TROSCClientWorker.Create, meaning that if the channel sees its PingSecs value change after its creation, the change is not propagated to its fClient. Luckily for us, there is a setter for PingSecs, so it's quite easy to add this inside it:
if Assigned(fClient) then
begin
  // propagate the new value to the worker created earlier
  fClient.PingFrequency := fPingSecs;
  // keep the timeout at 1.5 times the ping frequency
  fClient.PingTimeout := fClient.PingFrequency * 15 div 10;
end;
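With that in place (and with the assert relaxed as described above), bumping the value on a live channel finally sticks; fChannel stands for whatever super-TCP channel instance is in use:

fChannel.PingSecs := 120;
// now propagated: fClient.PingFrequency = 120 and fClient.PingTimeout = 180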
And with this, I get it working. But I still wonder what the point is of this formula inside TROSuperTcpConnection.TryReadByte:

(PingFrequency * 10 div 25) * 1000

It is used in all implementations of the super servers and channels, but I don't understand why the ping frequency must be divided by 2.5 to obtain a timeout for the socket calls. Why not use PingFrequency directly?
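Worked out with the values at hand, the expression is just an integer division by 2.5 followed by a conversion to milliseconds:

// PingFrequency = 60:  (60 * 10) div 25 = 24  ->  24000 ms (the 24 s above)
// PingFrequency = 120: (120 * 10) div 25 = 48  ->  48000 ms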
Right now, to solve my issue, I'll take the shortest road and change the default value of all channels from 60 to 120, but I'd be very interested in answers to the following questions arising from my investigation:

Why is PingSecs limited to 60?
Why doesn't it propagate when changed?
Why is it divided by 2.5 before being used as a socket timeout?
Regards