Difference between revisions of "Improving language shootout results"

From Lazarus wiki
Jump to navigationJump to search
(code highlight and category)
(Remove pointless comma)
 
(4 intermediate revisions by 3 users not shown)
Line 1: Line 1:
 
==About==
 
==About==
  
The computer language shootout (http://shootout.alioth.debian.org/) is a realistically flawed
+
The computer language shootout (https://benchmarksgame-team.pages.debian.net/benchmarksgame/) is a realistically flawed
benchmarking system which compares many languages on illogical basis. See their homepage for more info.
+
benchmarking system which compares many languages on a somewhat unequal basis. See the linked page for more info.
  
 
==Goals==
 
==Goals==
Line 16: Line 16:
  
 
It was decided that to keep memory requirements low it's best to use val() and str() and never
 
It was decided that to keep memory requirements low it's best to use val() and str() and never
include SysUtils unless really required. 32 bit integers should be used when possible to increase
+
include SysUtils unless really required.  
speed as the benchmarking machine is x86(athlon 1Gb, 256Mb ram).
+
 
 
Use of inline and records instead of classes can also improve performance considerably, but
 
Use of inline and records instead of classes can also improve performance considerably, but
 
additional testing and comparing with C and other languages should be the main factor.
 
additional testing and comparing with C and other languages should be the main factor.
Line 28: Line 28:
 
===SSE2 and comparions===
 
===SSE2 and comparions===
 
Consider the following code compiled with -Cfsse2:
 
Consider the following code compiled with -Cfsse2:
<delphi>
+
<syntaxhighlight lang=pascal>
 
const  
 
const  
 
   limit=4.0;
 
   limit=4.0;
Line 37: Line 37:
 
   if x < limit then
 
   if x < limit then
 
     x := x + x;
 
     x := x + x;
end.</delphi>
+
end.</syntaxhighlight>
  
 
For the condition '''x<limit''', no sse2 code will be generated, because the compiler assumes the const is an extended and therefore needs to perform the comparison with the x87 FPU unit.
 
For the condition '''x<limit''', no sse2 code will be generated, because the compiler assumes the const is an extended and therefore needs to perform the comparison with the x87 FPU unit.
  
<delphi>const  
+
<syntaxhighlight lang=pascal>const  
   limit = double(4.0);</delphi>
+
   limit = double(4.0);</syntaxhighlight>
  
 
... also doesn't work. However
 
... also doesn't work. However
  
<delphi>const  
+
<syntaxhighlight lang=pascal>const  
   limit: double = double(4.0);</delphi>
+
   limit: double = double(4.0);</syntaxhighlight>
 
   
 
   
 
does.
 
does.
  
 
===Binary trees===
 
===Binary trees===
The [http://shootout.alioth.debian.org/u64/benchmark.php?test=binarytrees&lang=fpascal binary trees benchmark] can be improved using a TNonFreePooledMemoryManager available in fpc 2.1.1. That way the allocation of nodes occurs in blocks and all de-allocation happens all at once for all nodes of the tree. Look for an example [http://www.hu.freepascal.org/fpcircbot/cgipastebin?msgid=126 here]. Such an allocator probably won't be allowed though.
+
<s>The [https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/binarytrees.html binary trees benchmark] can be improved using a TNonFreePooledMemoryManager available in fpc 2.1.1. That way the allocation of nodes occurs in blocks and all de-allocation happens all at once for all nodes of the tree. Look for an example [http://www.hu.freepascal.org/fpcircbot/cgipastebin?msgid=126 here]. Such an allocator probably won't be allowed though.</s>
 +
 
 +
FPC is currently in first place for this benchmark, and indeed TNonFreePooledMemoryManager (plus PasMP) were what did the trick.
  
 
===Mandelbrot===
 
===Mandelbrot===
The [http://shootout.alioth.debian.org/u64/benchmark.php?test=mandelbrot&lang=fpascal mandelbrot benchmark] can be improved with fpc 2.1.1 using the sse2 instructions. Note that sse2 instructions are volatile and the results of a calculation will be put back on the stack, if the same body contains a procedure call. Therefore, a sse2 calculation needs to be moved to a separate (nested) procedure. The mandelbrot benchmark would probably benefit from [[Single assignment|SSA]] in loops, because without it calculations are performed in a scratch register and the result moved to the final register. SSA will prevent this. At the moment fpc 2.1.1 supports SSA only in simple linear flow control without loops and branches.
+
<s>The [https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/mandelbrot-fpascal-8.html mandelbrot benchmark] can be improved with fpc 2.1.1 using the sse2 instructions. Note that sse2 instructions are volatile and the results of a calculation will be put back on the stack, if the same body contains a procedure call. Therefore, a sse2 calculation needs to be moved to a separate (nested) procedure. The mandelbrot benchmark would probably benefit from [[Single assignment|SSA]] in loops, because without it calculations are performed in a scratch register and the result moved to the final register. SSA will prevent this. At the moment fpc 2.1.1 supports SSA only in simple linear flow control without loops and branches.
 +
 
 +
The latest versions of the competition gain a lot (nearly a factor of 2) by using 128 bit xmm registers for cr, ci, tr, and ti and instructions to do 2 calculations at the same time. I do not see a way to do this with fpc. Please prove me wrong :-)</s>
  
The latest versions of the competition gain a lot (nearly a factor of 2) by using 128 bit xmm registers for cr, ci, tr, and ti and instructions to do 2 calculations at the same time. I do not see a way to do this with fpc. Please prove me wrong :-)
+
FPC's score in this benchmark is currently much better than it was previously (formerly 22.69 seconds, currently 8.52 seconds).
 +
Optimization improvements in FPC 3.2.0 versus the current 3.0.4 being used are likely to improve the time even further.
  
 
===N-body===
 
===N-body===
The [http://shootout.alioth.debian.org/u64/benchmark.php?test=nbody&lang=fpascal n-body benchmark] can be improved with fpc 2.1.1 using the sse2 instructions and factoring out common factors. Compiling with -Cfsse2 and fpc 2.0.4 gives an internal error.
+
<s>The [https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/nbody.html n-body benchmark] can be improved with fpc 2.1.1 using the sse2 instructions and factoring out common factors. Compiling with -Cfsse2 and fpc 2.0.4 gives an internal error.</s>
 
 
==Shootout criticism==
 
 
 
Any page about benchmarking should discuss the use of the benchmarks. FPC does relatively well overall,
 
but is relatively weak in its category.
 
 
 
Criticising the shootout is not difficult, just start with the [http://shootout.alioth.debian.org/flawed-benchmarks.php "Flawed Benchmarks"] page.
 
 
 
I mention a few points below.
 
  
* The main problem is that the applications are relatively short and highly computational. This has a lot of consequences:
+
FPC 2.1.1 is now very outdated and SSE2 works just fine, but this benchmark still needs improvement in general.
** In general they favour lower level languages, well, that is not to Pascal's disadvantage.
 
** these benchmarks also favour compilers that aggressively optimize tight heavy calculating loops and have code profiling. FPC is not in that league, but, like Delphi, more geared to overall application performance, and not numeric or visual computational work with lots of tight loops. This is also why frameworks like Java and .NET don't score _that_ bad. Simply because the JIT can do good work for these tight loops, and take really advantage of its profiling optimizations, and auto-selecting of the exact CPU architecture. Simply being able to use a system specific move is already a serious advantage in benchmarks.
 
* Language usability and overall speed of the development process is not measured at all. (How would one do that?)
 
** Systems with a heavy RTL are punished. Both in size and startup time. However for actually getting work done, as long as it is not extreme, a full RTL is nice. (Most of the programs run for seconds - startup time must be really long.)
 
** ease of debugging is not measured. (How would one do that?)
 
** Languages that have a lower and higher level mode (e.g. FPC with its Pascal and Delphi levels) typically choose the most beneficial mode (e.g. lowest possible version for lowlevel stuff, high level mode if linecount is important), even if that is not the one typically used for a certain job.
 
** (not a problem of the shootout but its use) People are not used to reading boxplot charts with median and quartiles.
 
* Currently the benchmark game uses a quad-core machine so programs that use all 4 cores obviously have the advantage.
 
* Rules ''can'' change.
 
  
 
[[Category:Promotion]]
 
[[Category:Promotion]]

Latest revision as of 22:25, 3 March 2020

About

The computer language shootout (https://benchmarksgame-team.pages.debian.net/benchmarksgame/) is a realistically flawed benchmarking system which compares many languages on a somewhat unequal basis. See the linked page for more info.

Goals

Our goals are to be on the highest possible positions. The requirements to reach them can be categorized into two.

1: Optimizations on the assembler level. (core devel work)

2: Optimizations of the benchmarks. (junior devel work)

Way to go

It was decided that to keep memory requirements low it's best to use val() and str() and never include SysUtils unless really required.

Use of inline and records instead of classes can also improve performance considerably, but additional testing and comparing with C and other languages should be the main factor. Special note: inlining of recursive functions is possible and sometimes increases speed.

Another thing to use is {$implicitexceptions off} to keep rtl hindering speed. Dropping down to pchar level in tight loops can be beneficial. Use also only native sized integers and floats if possible.

Benchmarks notes

SSE2 and comparions

Consider the following code compiled with -Cfsse2:

const 
  limit=4.0;
var 
  x : double;
begin
  x := 1.73;
  if x < limit then
    x := x + x;
end.

For the condition x<limit, no sse2 code will be generated, because the compiler assumes the const is an extended and therefore needs to perform the comparison with the x87 FPU unit.

const 
  limit = double(4.0);

... also doesn't work. However

const 
  limit: double = double(4.0);

does.

Binary trees

The binary trees benchmark can be improved using a TNonFreePooledMemoryManager available in fpc 2.1.1. That way the allocation of nodes occurs in blocks and all de-allocation happens all at once for all nodes of the tree. Look for an example here. Such an allocator probably won't be allowed though.

FPC is currently in first place for this benchmark, and indeed TNonFreePooledMemoryManager (plus PasMP) were what did the trick.

Mandelbrot

The mandelbrot benchmark can be improved with fpc 2.1.1 using the sse2 instructions. Note that sse2 instructions are volatile and the results of a calculation will be put back on the stack, if the same body contains a procedure call. Therefore, a sse2 calculation needs to be moved to a separate (nested) procedure. The mandelbrot benchmark would probably benefit from SSA in loops, because without it calculations are performed in a scratch register and the result moved to the final register. SSA will prevent this. At the moment fpc 2.1.1 supports SSA only in simple linear flow control without loops and branches.

The latest versions of the competition gain a lot (nearly a factor of 2) by using 128 bit xmm registers for cr, ci, tr, and ti and instructions to do 2 calculations at the same time. I do not see a way to do this with fpc. Please prove me wrong :-)

FPC's score in this benchmark is currently much better than it was previously (formerly 22.69 seconds, currently 8.52 seconds). Optimization improvements in FPC 3.2.0 versus the current 3.0.4 being used are likely to improve the time even further.

N-body

The n-body benchmark can be improved with fpc 2.1.1 using the sse2 instructions and factoring out common factors. Compiling with -Cfsse2 and fpc 2.0.4 gives an internal error.

FPC 2.1.1 is now very outdated and SSE2 works just fine, but this benchmark still needs improvement in general.