An interesting performance comparison: Island vs. Delphi

I had Island (Oxygene/Windows) and Delphi (10.4 Sydney) run the same computation, shown below:

    var sum: Int64 := 0;
    for I: Integer := 1 to 40000 do
      for J: Integer := 1 to 40000 do
        sum := sum + I * J;

The test results are quite interesting:

Build             Delphi     Island
32-bit debug      2.89 s     3.24 s
32-bit release    2.67 s     0.00000004 s (1 tick)
64-bit debug      4.07 s     2.54 s
64-bit release    4.06 s     0 s (0 ticks)

So basically, in a RELEASE build Island optimizes very well: it recognizes the loop structure, and the computation finishes almost instantly at run time. @ck what is the trick? Does the LLVM backend optimizer calculate sum at compile time?

The test also suggests that Delphi does almost no optimization of this loop in a RELEASE build (2.67s vs. 2.89s in 32-bit, and no change in 64-bit). Yuk!


Delphi code:

{$APPTYPE CONSOLE}

{$R *.res}

uses
  System.SysUtils, SynCommons;

var
  timer: TPrecisionTimer;
begin
  try
    timer.Start;
    var sum: Int64 := 0;
    for var I := 1 to 40000 do
      for var J := 1 to 40000 do
        sum := sum + I * J;
    Writeln('Takes: ', timer.Stop);
    Readln;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.

Island/Oxygene (Windows) code:

type
  Program = class
  public

    class method Main(args: array of String): Int32; 
    begin
      var lStart, lStop, lFreq: rtl.LARGE_INTEGER;
      rtl.QueryPerformanceFrequency(@lFreq);
      rtl.QueryPerformanceCounter(@lStart);

      var sum: Int64 := 0;
      for I: Integer := 1 to 40000 do
        for J: Integer := 1 to 40000 do
          sum := sum + I * J;

      rtl.QueryPerformanceCounter(@lStop);
      writeLn(sum);
      var lElapsedTicks := lStop.QuadPart - lStart.QuadPart;
      var lElapsedMilliseconds := 1.0 * lElapsedTicks/lFreq.QuadPart * 1000.0;
      writeLn(lElapsedTicks);
      writeLn(lElapsedMilliseconds);
      readLn;
    end;

  end;

end.

Yep. It reduces the loop to a constant.
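
For reference, the constant in question has a simple closed form, because the double sum factors into the square of a single sum. This is only the arithmetic behind the result; whether the optimizer arrives at it through this identity or through its own loop analysis is not something confirmed in this thread. With n = 40000:

```latex
\sum_{I=1}^{n}\sum_{J=1}^{n} I \cdot J
  = \Bigl(\sum_{I=1}^{n} I\Bigr)\Bigl(\sum_{J=1}^{n} J\Bigr)
  = \Bigl(\frac{n(n+1)}{2}\Bigr)^{2}

n = 40000:\quad
\Bigl(\frac{40000 \cdot 40001}{2}\Bigr)^{2}
  = 800020000^{2}
  = 640032000400000000
```

That value is well inside the Int64 range, and the largest single product, 40000 * 40000 = 1600000000, still fits in a signed 32-bit Integer, so the loop does not overflow.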

I did another follow-up test, adding C++ Builder 10.4.2 and the Microsoft C++ compiler 16.10.

The results are even more surprising:

Build             C++ Builder    MS C++
32-bit debug      0 s            0.001 s
32-bit release    0 s            0 s
64-bit debug      0 s            0.0009 s
64-bit release    0 s            0 s

I know C++ Builder is based on Clang version 5, but I am still quite surprised to see it being this fast.

C++ Builder’s debug build didn’t optimize the loop away: I can still step into the loop, and the generated assembly code does contain a loop structure (as identified using IDA Pro).

So I guess that for performance-critical applications, C++ would still be the winner.
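
For anyone who wants to repeat the C++ measurement: the C++ source for this test is not posted above, so the following is only a minimal sketch of what an equivalent harness could look like. It assumes std::chrono::steady_clock for timing (the timer actually used is unknown), and the names loop_sum and timed_loop_sum are made up for illustration. The sum is returned to the caller so the result stays in use, although an optimizer may still fold the whole loop into a constant.

```cpp
#include <chrono>
#include <cstdint>
#include <utility>

// Hypothetical helper: the same nested loop as the Delphi/Oxygene versions.
// The product is widened to 64-bit before multiplying so that larger n
// cannot overflow; for n = 40000 a 32-bit product would also fit.
std::int64_t loop_sum(int n) {
    std::int64_t sum = 0;
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= n; ++j)
            sum += static_cast<std::int64_t>(i) * j;
    return sum;
}

// Hypothetical helper: runs loop_sum(n) and returns {sum, elapsed seconds}.
std::pair<std::int64_t, double> timed_loop_sum(int n) {
    const auto start = std::chrono::steady_clock::now();
    const std::int64_t sum = loop_sum(n);
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    return {sum, elapsed.count()};
}
```

Calling timed_loop_sum(40000) from main and printing both values mirrors the Island test, which also writes sum after the loop.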