Massive CPU load problem, test case

bobokonijn · July 21, 2021, 11:35am

Hello RemObjects team,

I have the problem that when a Remoting SDK server is offline, or when it refuses a connection, the CPU load on the client spikes unacceptably when the channel is destroyed. It maxes out one thread/core for 30 seconds! This is with version 10.0.0.1495.

This problem has been bugging me for ages.

I have a reproducible test case here. It can be reproduced with any trivial service (mine is called sessionservice). I would be utterly grateful for a fix.

Kind regards,
Arthur Hoornweg.

[edit]: More concise example

procedure TForm1.Test;
var service:iSessionService;   i:integer;
begin
   service:=cosessionservice.Create('Supertcp://www.google.com:30841');
   try
    i:=service.Serviceversion;
   Except
    screen.cursor:=crDefault;
    Showmessage('Please start the task manager and go to the CPU  Performance page. '+
    ' Press OK in this message box and look what happens to the CPU load.') ;
    screen.cursor:=crHourglass;
   end;
end;

procedure TForm1.Button1Click(Sender: TObject);
begin
   screen.cursor:=crHourglass;
   test;
   screen.cursor:=crDefault;
   Showmessage('The connection was finally destroyed...');
end;

Kind regards,
Arthur

RemObjectsSoftware · July 21, 2021, 12:50pm

Logged as bugs://D19119.

antonk · July 21, 2021, 12:52pm

Hello

I have logged this issue. Unfortunately, our Delphi team is on vacation this week. They will be back next Monday and will take care of it then.

Regards

bobokonijn · July 21, 2021, 1:22pm

I’m including a screenshot.

According to SamplingProfiler, the CPU hog is “TROBaseSuperChannelWorker.Disconnect” , more specifically the line that does “tThread.Yield”. Replacing this line with “Sleep(1)” makes the CPU load go away but it still takes 30 seconds to destroy the channel.

[edit]

There’s more going on here. Why does destroying a channel that was never connected take 30 seconds? I see that TROBaseSuperChannelWorker.Disconnect is waiting for DoExecute() to do something, but DoExecute is never called if the connection failed… Should fDisconnected not be true at this stage to facilitate a quick exit?

procedure TROBaseSuperChannelWorker.Disconnect;
begin
  if fDisconnected then Exit;   // if the connection failed, should this flag not be "true" ? 
  While... begin 
      .....// this loop takes 30 seconds
  end;
end;

EvgenyK · July 27, 2021, 1:18pm

Hi,

you can replace

service:=cosessionservice.Create('Supertcp://www.google.com:30841');

with

Channel.TargetUrl := 'Supertcp://www.google.com:30841';
service:=cosessionservice.Create(Message, Channel);

As a result, you will get the “The connection was finally destroyed...” message without delay.

It takes 24 seconds, to be exact. This is the timeout the channel uses while it waits for any data:

Result := fConnection.CanRead((PingFrequency * 10 div 25)* 1000);

PingFrequency is 60, so the line above becomes

Result := fConnection.CanRead(24000);

we already handle it as

procedure TROClientThread.IntExecute;
begin
  if fChannel.fOwner.InitialConnect then
    fChannel.DoExecute
  else
    fChannel.fExecuting := True;
end;

in your case, TROBaseSuperChannelWorker.Disconnect is called when fChannel.fOwner.InitialConnect isn’t finished yet …

you are right, it can be replaced with

  if not fExecuting then begin
    ROSwitchToThread;
    while not fExecuting do Sleep(10);
  end;

bobokonijn · July 28, 2021, 10:27am

Hi EvgenyK,

You wrote:

“you can replace
service:=cosessionservice.Create('Supertcp://www.google.com:30841');
with
Channel.TargetUrl := 'Supertcp://www.google.com:30841';
service:=cosessionservice.Create(Message, Channel);
As a result, you will get the ‘The connection was finally destroyed...’ message without delay.”

That doesn’t fix the problem. The service is destroyed, yes, but not the connection. The connection is still busy for a very long time even though it already knows that it can’t connect.

This is my main problem:

We find that in our VPN infrastructure, tRoSuperTCPChannel sometimes is unable to reconnect if a server has gone offline temporarily. When the server comes back online, it can be PINGed and other services such as Remote Desktop work again but tRoSuperTCPChannel won’t reconnect, it keeps throwing eRoSuperChannelException (“no connection available”) when I execute a service. And sometimes, rarely, the service call hangs without ever returning.

If such a thing happens at night, we get angry calls from customers so I really had to find a workaround.

A workaround that works reliably so far is to destroy the channel whenever a server becomes unreachable and to create a fresh one as soon as the server can be PINGed again.

This is when I noticed two severe side effects. First of all, the channel’s destructor totally hogs the CPU (which you just fixed) and secondly it takes an eternity to finish, blocking the entire thread.

So I had to resort to another really desperate workaround. If a channel must be destroyed, I hand it over to a dedicated “kill” thread whose only job it is to destroy the channel. Yes that works, but it is a measure born out of desperation. I soooo wish the destructor would simply be a good boy and terminate whatever it’s doing immediately.
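
A rough sketch of the “kill” thread idea described above, assuming the channel is held as a plain object reference that nothing else still uses. FreeInBackground and the unit name are hypothetical, not part of Remoting SDK; only TThread.CreateAnonymousThread is a standard RTL call.

```pascal
unit BackgroundFree;

interface

// Hypothetical helper (not a Remoting SDK API): destroys aObject on a
// background thread so that a slow destructor does not block the caller.
procedure FreeInBackground(aObject: TObject);

implementation

uses
  System.Classes;

procedure FreeInBackground(aObject: TObject);
begin
  // CreateAnonymousThread returns a suspended thread with
  // FreeOnTerminate = True; Start lets it run and clean itself up.
  TThread.CreateAnonymousThread(
    procedure
    begin
      // The destructor may still block for many seconds,
      // but only this background thread waits for it.
      aObject.Free;
    end).Start;
end;

end.
```

The caller would pass the old channel to FreeInBackground and set its own reference to nil right away, so nothing touches the object while it is being destroyed. This only hides the delay; it does not shorten it, and a destructor that is still running when the application exits may be cut short.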

EvgenyK · July 28, 2021, 11:44am

Hi,

try to use _AsyncEx service methods, they work more smoothly.
in your case, it will be

...
var service:iSessionService_AsyncEx;
begin
   service:=cosessionservice_AsyncEx.Create('Supertcp://www.google.com:30841');
   try
      service.BeginServiceversion(
             procedure(const aRequest: IROAsyncRequest) begin
               i := service.EndServiceversion(aRequest);
             end);
   Except

Note: Drop an email to support@ and I’ll send you an updated uROTransportChannel.pas for this scenario.

bobokonijn · August 3, 2021, 7:32am

Hi EvgenyK, ROSwitchToThread does not exist, I assume you mean SwitchToThread?

Anyway, I believe I may have a workaround here.

It appears that setting “active:=false” in the OnException event handler of the channel speeds up subsequent destruction a lot.

procedure TForm1.OnChannelException(Sender: TROTransportChannel; anException: Exception; var aRetry: Boolean);
begin
  IF (anException is eROTimeout) or (anException is eroSuperChannelException) then
  begin
    aretry:=False;
    (sender as troBasesupertcpchannel).active:=False;
  end;
end;

EvgenyK · August 3, 2021, 7:49am

Hi,

ROSwitchToThread is present in .1511 (preview channel).

This is a workaround for your specific case. It cannot be applied for all users because it breaks the AutoReconnect logic of the SuperTcp channel.

EvgenyK · August 3, 2021, 11:00am

Hi,

I think this fix doesn’t harm the current logic and can handle aRetry:

procedure DoException(anException: Exception; var aRetry: Boolean); override;
...
procedure TROBaseSuperChannel.DoException(anException: Exception;
  var aRetry: Boolean);
begin
  inherited;
  if not aRetry then Active := False;  
end;

fixed as #D19119

RemObjectsSoftware · August 3, 2021, 11:03am

bugs://D19119 was closed as fixed.

bobokonijn · November 15, 2021, 10:10am

Hi EvgenyK, I see in my IDE that TROBaseSuperChannel.DoException is often called from the thread context of fWorkerThread. Therefore it is not safe to call “Active:=False” here because method SetActive() is not threadsafe.

A race condition / potential access violation would occur if my main thread simultaneously sets Active to false or when it calls the destructor.

I strongly suggest wrapping TROBaseSuperChannel.SetActive in a critical section to make it threadsafe.
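
For illustration only, this is roughly what guarding a property setter with a critical section looks like. TGuardedChannel is a hypothetical stand-in class, not the real TROBaseSuperChannel, and the real SetActive certainly does more than flip a flag.

```pascal
unit GuardedChannel;

interface

uses
  System.SyncObjs;

type
  // Hypothetical stand-in for a channel class; not Remoting SDK code.
  TGuardedChannel = class
  private
    fLock: TCriticalSection;
    fActive: Boolean;
    procedure SetActive(const Value: Boolean);
  public
    constructor Create;
    destructor Destroy; override;
    property Active: Boolean read fActive write SetActive;
  end;

implementation

constructor TGuardedChannel.Create;
begin
  inherited Create;
  fLock := TCriticalSection.Create;
end;

destructor TGuardedChannel.Destroy;
begin
  fLock.Free;
  inherited;
end;

procedure TGuardedChannel.SetActive(const Value: Boolean);
begin
  // Only one thread at a time may run the body of the setter.
  fLock.Enter;
  try
    if fActive = Value then
      Exit; // the finally block still releases the lock
    // The real connect/disconnect work would go here.
    fActive := Value;
  finally
    fLock.Leave;
  end;
end;

end.
```

A lock inside the setter only serializes concurrent SetActive calls. It does not by itself protect against the destructor running while a worker thread is still inside the setter, so the worker would still have to be stopped before the object is freed.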

EvgenyK · November 15, 2021, 10:19am

Hi,

this fix (#D19119) was reverted in .1519 because it caused connection issues.