Difference between revisions of "Vectorization"

From Lazarus wiki
Jump to navigationJump to search
m (Minor note about PMULLD)
m (Fixed template loop; categorised)
 
(7 intermediate revisions by 3 users not shown)
Line 1: Line 1:
<!--- {{Vectorization}} --->
+
{{LanguageBar}}
{{stub}}
+
 
This page has been set up as a collaboration and design spec for the proposal to include vectorisation support in the x86 and x86_64 variant of the Free Pascal Compiler (using SSE or AVX to reduce the number of instructions, and hence the execution speed, required to encode functionality).
+
This page has been set up as a collaboration and design specification for the proposal to include vectorization support in the <tt>x86</tt> and <tt>x86_64</tt> variant of the [[FPC]] (using SSE or AVX to reduce the number of instructions, and hence the execution speed, required to encode functionality).
 +
Eventually this page can be converted into a guide to the optimizer once vectorization is at least partially supported.
 +
 
 
[[User:CuriousKit|CuriousKit]] ([[User talk:CuriousKit|talk]]) 22:52, 11 December 2017 (CET)
 
[[User:CuriousKit|CuriousKit]] ([[User talk:CuriousKit|talk]]) 22:52, 11 December 2017 (CET)
==Loop Optimization==
+
 
Coupled with loop unrolling, it is possible to theoretically reduce the number of cycles by a factor of 4 or 8, depending on if SSE2 or AVX2 is used. It depends on the code contained within the loop, but if it utilises only relatively simple arithmetic and pointer movement (e.g. reading and writing sequentially from and to an array), then it might be vectorised with relative ease.
+
== Vector Types ==
 +
As of 2017-12-12, only static arrays of [[Single|singles]] or [[Double|doubles]] are considered vector types, but they are unaligned unless enforced by additional compiler directives.
 +
[[Dynamic array|Dynamic arrays]] and [[Pointer|pointers]] to singles or doubles are currently not considered to be vectors.
 +
 
 +
== Loop Optimization ==
 +
Coupled with loop unrolling, it is possible to theoretically reduce the number of cycles by a factor of four or eight, depending on if SSE2 or AVX2 is used.
 +
It depends on the code contained within the loop, but if it utilizes only relatively simple arithmetic and pointer movement (e.g. reading and writing sequentially from and to an array), then it might be vectorized with relatively great ease.
  
 
For example:
 
For example:
<syntaxhighlight>
+
<syntaxhighlight lang="pascal">
 
var
 
var
  Input1, Input2, Output: array of Cardinal; X: Integer;
+
input0, input1, output: array of cardinal;
 +
x: integer;
  
 
begin
 
begin
  assert(Length(Input1) = Length(Input2));
+
assert(length(input0) = length(input1));
  SetLength(Output, Length(Input1));
+
SetLength(output, length(input0));
 
+
  for X := 0 to Length(Input1) do
+
for x := 0 to length(input0)-1 do
  begin
+
begin
    Output[X] := Input1[X] * Input2[X];
+
output[x] := input0[x] * input1[x];
  end;
+
end;
 
end;
 
end;
 
</syntaxhighlight>
 
</syntaxhighlight>
  
While the size of the inputs is not immediately known, it can be seen that the array sizes do not change within the loop and hence Length(Input1) is constant. Visualising how the compiler might evaluate the code, it could be internally changed to the following:
+
While the size of the inputs is not immediately known, it can be seen that the array sizes do not change within the loop and hence <syntaxhighlight lang="pascal" enclose="none">length(input0)</syntaxhighlight> is constant.
 +
Visualizing how the compiler might evaluate the code, it could be internally changed to the following:
  
<syntaxhighlight>
+
<syntaxhighlight lang="pascal" highlight="3,9,11,13-16">
 
var
 
var
  Input1, Input2, Output: array of Cardinal; X, ArrayLen: Integer;
+
input0, input1, output: array of cardinal;
 +
x, arrayLen: integer;
  
 
begin
 
begin
  assert(Length(Input1) = Length(Input2));
+
assert(length(input0) = length(input1));
  SetLength(Output, Length(Input1));
+
SetLength(output, length(input0));
 
+
  ArrayLen := Length(Input1) div 4;
+
arrayLen := length(input0) div 4;
 
+
  for X := 0 to ArrayLen - 1 do
+
for x := 0 to arrayLen - 1 do
  begin
+
begin
    Output[4 * X] :=     Input1[4 * X] *     Input2[4 * X];
+
output[4 * x + 0] := input0[4 * x + 0] * input1[4 * x + 0];
    Output[4 * X + 1] := Input1[4 * X + 1] * Input2[4 * X + 1];
+
output[4 * x + 1] := input0[4 * x + 1] * input1[4 * x + 1];
    Output[4 * X + 2] := Input1[4 * X + 2] * Input2[4 * X + 2];
+
output[4 * x + 2] := input0[4 * x + 2] * input1[4 * x + 2];
    Output[4 * X + 3] := Input1[4 * X + 3] * Input2[4 * X + 3];
+
output[4 * x + 3] := input0[4 * x + 3] * input1[4 * x + 3];
  end;
+
end;
 
+
  { Handle leftover array entries individually (count = Length(Input1) mod 4)) }
+
// Handle leftover array entries individually
 +
// (count = length(input0) mod 4).
 
end;
 
end;
 
</syntaxhighlight>
 
</syntaxhighlight>
  
As such, the 4 statements in the converted for-loop can be easily assembled into SSE2 opcodes (although PMULLD is SSE4.1), even with potentially unaligned memory. For example:
+
As such, the four statements in the converted for-loop can be easily assembled into SSE2 opcodes (although <syntaxhighlight lang="asm" enclose="none">pmulld</syntaxhighlight> is SSE4.1), even with potentially unaligned memory.
<syntaxhighlight>
+
For example:
  LEA R8Output[0]
+
<syntaxhighlight lang="asm" highlight="12-14">
  LEA  R9, Input1[0]
+
lea    r8,  output[0]          ; r8 := @output
  LEA R10, Input2[0]
+
lea    r9input0[0]           ; r9  := @input0
 +
lea    r10, input1[0]           ; r10 := @input1
 +
 +
mov    ecx, input0_len          ; Calculate length(input0), a function call, beforehand and store in ecx
 +
xor    rbx, rbx                ; rbx := 0 - this is our index array
 +
 +
shr    ecx, 2                  ; Divide the length by 4, since we're processing 4 values at once.
 +
jz    @loop_exit              ; Don't enter loop if the length is less than 4
 +
 +
@vectorized_loop:
 +
movdqu xmm0, [r9 + rbx * 4]    ; Take 4 DWords from input0 and store in xmm0
 +
pmulld xmm0, [r10 + rbx * 4]    ; xmm0 := xmm0[0..3] * (input1)^[0..3]
 +
movdqu [r8 + rbx * 4], xmm0    ; Store result in output.
 +
 +
inc    rbx                      ; increment index
 +
dec    ecx
 +
ja    @vectorized_loop        ; if ecx > 0 then repeat loop with incremented index
 +
 +
@loop_exit:
 +
        ; Handle last few entries separately (count = input0_len mod 4)
 +
</syntaxhighlight>
 +
Further optimizations can be achieved by, for example, performing two reads, multiplies and writes per loop cycle, taking advantage of the fact that modern processors have more than one SIMD port to send instructions to.
  
  MOV  ECX, ArrayLen
+
== see also ==
  XOR  RBX, RBX
+
* [https://www.freepascal.org/docs-html/prog/progch5.html Chapter “Intel MMX support” in the Programmer's Guide]
  TEST ECX, ECX
+
* {{Doc|package=RTL|unit=mmx|text=Unit <syntaxhighlight lang="pascal" enclose="none">mmx</syntaxhighlight>}}
  JZ  @LoopExit
 
  
@VectorisedLoop:
+
[[Category:FPC]]
  MOVDQU  XMM0, [R9+RBX*4]
+
[[Category:Proposals]]
  PMULLD  XMM0, [R10+RBX*4]
 
  MOVDQU  [R8*RBX*4], XMM0
 
 
 
  INC  RBX
 
  DEC  ECX
 
  JNZ  @VectorisedLoop
 
@LoopExit: 
 
</syntaxhighlight>
 
Further optimisations can be achieved by, for example, performing two reads, multiplies and writes per loop cycle, taking advantage of the fact that modern processors have more than one SIMD port to send instructions to.
 

Latest revision as of 07:43, 30 December 2019

English (en)

This page has been set up as a collaboration and design specification for the proposal to include vectorization support in the x86 and x86_64 variant of the FPC (using SSE or AVX to reduce the number of instructions, and hence the execution speed, required to encode functionality). Eventually this page can be converted into a guide to the optimizer once vectorization is at least partially supported.

CuriousKit (talk) 22:52, 11 December 2017 (CET)

Vector Types

As of 2017-12-12, only static arrays of singles or doubles are considered vector types, but they are unaligned unless enforced by additional compiler directives. Dynamic arrays and pointers to singles or doubles are currently not considered to be vectors.

Loop Optimization

Coupled with loop unrolling, it is possible to theoretically reduce the number of cycles by a factor of four or eight, depending on if SSE2 or AVX2 is used. It depends on the code contained within the loop, but if it utilizes only relatively simple arithmetic and pointer movement (e.g. reading and writing sequentially from and to an array), then it might be vectorized with relatively great ease.

For example:

var
	input0, input1, output: array of cardinal;
	x: integer;

begin
	assert(length(input0) = length(input1));
	SetLength(output, length(input0));
	
	for x := 0 to length(input0)-1 do
	begin
		output[x] := input0[x] * input1[x];
	end;
end;

While the size of the inputs is not immediately known, it can be seen that the array sizes do not change within the loop and hence length(input0) is constant. Visualizing how the compiler might evaluate the code, it could be internally changed to the following:

var
	input0, input1, output: array of cardinal;
	x, arrayLen: integer;

begin
	assert(length(input0) = length(input1));
	SetLength(output, length(input0));
	
	arrayLen := length(input0) div 4;
	
	for x := 0 to arrayLen - 1 do
	begin
		output[4 * x + 0] := input0[4 * x + 0] * input1[4 * x + 0];
		output[4 * x + 1] := input0[4 * x + 1] * input1[4 * x + 1];
		output[4 * x + 2] := input0[4 * x + 2] * input1[4 * x + 2];
		output[4 * x + 3] := input0[4 * x + 3] * input1[4 * x + 3];
	end;
	
	// Handle leftover array entries individually
	// (count = length(input0) mod 4).
end;

As such, the four statements in the converted for-loop can be easily assembled into SSE2 opcodes (although pmulld is SSE4.1), even with potentially unaligned memory. For example:

	lea    r8,  output[0]           ; r8  := @output
	lea    r9,  input0[0]           ; r9  := @input0
	lea    r10, input1[0]           ; r10 := @input1
	
	mov    ecx, input0_len          ; Calculate length(input0), a function call, beforehand and store in ecx
	xor    rbx, rbx                 ; rbx := 0 - this is our index array
	
	shr    ecx, 2                   ; Divide the length by 4, since we're processing 4 values at once.
	jz     @loop_exit               ; Don't enter loop if the length is less than 4
	
@vectorized_loop:
	movdqu xmm0, [r9  + rbx * 4]    ; Take 4 DWords from input0 and store in xmm0
	pmulld xmm0, [r10 + rbx * 4]    ; xmm0 := xmm0[0..3] * (input1)^[0..3]
	movdqu [r8 + rbx * 4], xmm0     ; Store result in output.
	
	inc    rbx                      ; increment index
	dec    ecx
	ja     @vectorized_loop         ; if ecx > 0 then repeat loop with incremented index
	
@loop_exit:
        ; Handle last few entries separately (count = input0_len mod 4)

Further optimizations can be achieved by, for example, performing two reads, multiplies and writes per loop cycle, taking advantage of the fact that modern processors have more than one SIMD port to send instructions to.

see also