This page has been set up as a collaboration and design specification for the proposal to include vectorization support in the x86 and x86_64 variant of the FPC (using SSE or AVX to reduce the number of instructions, and hence the execution speed, required to encode functionality). Eventually this page can be converted into a guide to the optimizer once vectorization is at least partially supported.
As of 2017-12-12, only static arrays of singles or doubles are considered vector types, but they are unaligned unless enforced by additional compiler directives. Dynamic arrays and pointers to singles or doubles are currently not considered to be vectors.
Coupled with loop unrolling, it is possible to theoretically reduce the number of cycles by a factor of four or eight, depending on if SSE2 or AVX2 is used. It depends on the code contained within the loop, but if it utilizes only relatively simple arithmetic and pointer movement (e.g. reading and writing sequentially from and to an array), then it might be vectorized with relatively great ease.
var input0, input1, output: array of cardinal; x: integer; begin assert(length(input0) = length(input1)); setLength(output, length(input0)); for x := 0 to length(input0)-1 do begin output[x] := input0[x] * input1[x]; end; end;
While the size of the inputs is not immediately known, it can be seen that the array sizes do not change within the loop and hence
length(input0) is constant.
Visualizing how the compiler might evaluate the code, it could be internally changed to the following:
var input0, input1, output: array of cardinal; x, arrayLen: integer; begin assert(length(input0) = length(input1)); setLength(output, length(input0)); arrayLen := length(input0) div 4; for x := 0 to arrayLen - 1 do begin output[4 * x + 0] := input0[4 * x + 0] * input1[4 * x + 0]; output[4 * x + 1] := input0[4 * x + 1] * input1[4 * x + 1]; output[4 * x + 2] := input0[4 * x + 2] * input1[4 * x + 2]; output[4 * x + 3] := input0[4 * x + 3] * input1[4 * x + 3]; end; // Handle leftover array entries individually // (count = length(input0) mod 4). end;
As such, the four statements in the converted for-loop can be easily assembled into SSE2 opcodes (although
pmulld is SSE4.1), even with potentially unaligned memory.
lea r8, output ; r8 := @output lea r9, input0 ; r9 := @input0 lea r10, input1 ; r10 := @input1 mov ecx, arrayLen ; ecx := arrayLen xor rbx, rbx ; rbx := 0 ; do not enter loop with empty array test ecx, ecx ; ecx = 0 ? jz @loop_exit ; if ecx = 0 then goto exit @vectorized_loop: movdqu xmm0, [r9 + rbx * 4] ; xmm0 := (r9+4rbx)^ pmulld xmm0, [r10 + rbx * 4] ; xmm0 := xmm0[0..1] * (r10+4rbx)^[0..1] movdqu [r8 * rbx * 4], xmm0 ; ??? inc rbx ; inc(rbx) loop @vectorized_loop ; dec(ecx); if ecx <> 0 then goto loop @loop_exit:
Further optimizations can be achieved by, for example, performing two reads, multiplies and writes per loop cycle, taking advantage of the fact that modern processors have more than one SIMD port to send instructions to.