Difference between revisions of "Vectorization"
CuriousKit (talk | contribs) (Loop Optimization Stub) |
CuriousKit (talk | contribs) m (Minor note about PMULLD) |
||
Line 46: | Line 46: | ||
</syntaxhighlight> | </syntaxhighlight> | ||
− | As such, the 4 statements in the converted for-loop can be easily assembled into SSE2 opcodes, even with potentially unaligned memory. For example: | + | As such, the 4 statements in the converted for-loop can be easily assembled into SSE2 opcodes (although PMULLD is SSE4.1), even with potentially unaligned memory. For example: |
<syntaxhighlight> | <syntaxhighlight> | ||
LEA R8, Output[0] | LEA R8, Output[0] |
Revision as of 02:49, 12 December 2017
An editor has declared this article to be a stub, meaning that it needs more information. Can you help out and add some? If you have some useful information, you can help the Free Pascal Wiki by clicking on the edit box on the left and expanding this page.
This page has been set up as a collaboration and design spec for the proposal to include vectorisation support in the x86 and x86_64 variant of the Free Pascal Compiler (using SSE or AVX to reduce the number of instructions, and hence the execution speed, required to encode functionality). CuriousKit (talk) 22:52, 11 December 2017 (CET)
Loop Optimization
Coupled with loop unrolling, it is possible to theoretically reduce the number of cycles by a factor of 4 or 8, depending on if SSE2 or AVX2 is used. It depends on the code contained within the loop, but if it utilises only relatively simple arithmetic and pointer movement (e.g. reading and writing sequentially from and to an array), then it might be vectorised with relative ease.
For example:
var
Input1, Input2, Output: array of Cardinal; X: Integer;
begin
assert(Length(Input1) = Length(Input2));
SetLength(Output, Length(Input1));
for X := 0 to Length(Input1) do
begin
Output[X] := Input1[X] * Input2[X];
end;
end;
While the size of the inputs is not immediately known, it can be seen that the array sizes do not change within the loop and hence Length(Input1) is constant. Visualising how the compiler might evaluate the code, it could be internally changed to the following:
var
Input1, Input2, Output: array of Cardinal; X, ArrayLen: Integer;
begin
assert(Length(Input1) = Length(Input2));
SetLength(Output, Length(Input1));
ArrayLen := Length(Input1) div 4;
for X := 0 to ArrayLen - 1 do
begin
Output[4 * X] := Input1[4 * X] * Input2[4 * X];
Output[4 * X + 1] := Input1[4 * X + 1] * Input2[4 * X + 1];
Output[4 * X + 2] := Input1[4 * X + 2] * Input2[4 * X + 2];
Output[4 * X + 3] := Input1[4 * X + 3] * Input2[4 * X + 3];
end;
{ Handle leftover array entries individually (count = Length(Input1) mod 4)) }
end;
As such, the 4 statements in the converted for-loop can be easily assembled into SSE2 opcodes (although PMULLD is SSE4.1), even with potentially unaligned memory. For example:
LEA R8, Output[0]
LEA R9, Input1[0]
LEA R10, Input2[0]
MOV ECX, ArrayLen
XOR RBX, RBX
TEST ECX, ECX
JZ @LoopExit
@VectorisedLoop:
MOVDQU XMM0, [R9+RBX*4]
PMULLD XMM0, [R10+RBX*4]
MOVDQU [R8*RBX*4], XMM0
INC RBX
DEC ECX
JNZ @VectorisedLoop
@LoopExit:
Further optimisations can be achieved by, for example, performing two reads, multiplies and writes per loop cycle, taking advantage of the fact that modern processors have more than one SIMD port to send instructions to.