Difference between revisions of "Case Compiler Optimization"

From Lazarus wiki
Jump to navigationJump to search
m (Binary search algorithm link)
m (Assumption of %key)
Line 25: Line 25:
  
 
====Loop Code====
 
====Loop Code====
Start with %lo = -1, %hi = number of case labels, %key = value to be searched, %table being the address of the map that contains the pointers (which may have to be set beforehand due to the requirement of relocatable code on x86-64), and %mid and %temp undefined.  The reason for "lea 2(%lo), %temp" is explained later.
+
Start with %lo = -1, %hi = number of case labels, %key = value to be searched (assumed to be a signed integer), %table being the address of the map that contains the pointers (which may have to be set beforehand due to the requirement of relocatable code on x86-64), and %mid and %temp undefined.  The reason for "lea 2(%lo), %temp" is explained later.
 
<syntaxhighlight lang="asm">
 
<syntaxhighlight lang="asm">
 
@LoopBegin:
 
@LoopBegin:

Revision as of 15:46, 5 August 2018

An editor has declared this article to be a stub, meaning that it needs more information. Can you help out and add some? If you have some useful information, you can help the Free Pascal Wiki by clicking on the edit box on the left and expanding this page.

i386 and x86-64

Binary Search Lookup

Main article: Binary search algorithm

Requirements

  • 32 or more items.
  • All branches call an identical function. Branches that have multiple case values or ranges are treated as individual branches for each single value.
  • The range of case labels is large, which would make a direct index or hash lookup impractical.

Iteration Count

A binary search divides the eligible list in half with each iteration, meaning its complexity grows with the base-2 logarithm of the list size -- that is, the algorithm is O(log n). While this gives us a rough guide as to the number of times the critical loop will run, it is not an exact answer, and log2(n) is only integral when n is a power of 2. However, if we analyse the workings of the binary search, we can deduce the iteration count:

BinSearch8.png

If we take a list size of 8 and evaluate the midpoint (either one of the two middle values will suffice) and determine it is not a match, we can exclude this value and hence are left with one of two sublists to evaluate, one of size 3 and one of size 4. Given the search may take us to the larger of the two, we have to assume worst-case behaviour. Similarly, at size 4, the sublists are of size 1 and 2, and then at size 2, we are left with either a sublist of 1, a match, or a mismatch, since there is no second sublist. At size 1, we just check if this is the value we are looking for, and if not, then we can conclude that the value does not appear in the list.

This evaluates to 3 iterations and a final comparison with the 1-sized sublist. This agrees with the value of log2(8) = 3. This is the simplest case though where the list size is a power of two. If we go one higher to a list size of 9...

BinSearch9.png

Because we can exclude the midpoint, both sublists have a size of 4, and following the worst-case process listed above, this also results in 3 iterations. From this, we can deduce that the largest sublist size from a list of size n is equal to floor(n/2). Taken to the logical extreme where the sublist size is an odd number in each division stage, we find that only 3 iterations are required up to an initial list size of 15, since this divides down to 7, then 3, and then we are left with one of two 1-sized sublists that just need a comparison. Such number sequences are all 1s in binary, and this corresponds with values of the form 2a - 1, one below the argument that causes log2 to reach its next integral value. Therefore, we can conclude that the total iteration count is a remarkably simple equation:

floor(log2(n))

This is relatively fast to calculate, as it's equivalent to the position of the most significant 1-bit in the binary value of n.

Loop Code

Start with %lo = -1, %hi = number of case labels, %key = value to be searched (assumed to be a signed integer), %table being the address of the map that contains the pointers (which may have to be set beforehand due to the requirement of relocatable code on x86-64), and %mid and %temp undefined. The reason for "lea 2(%lo), %temp" is explained later.

@LoopBegin:
  lea    1(%lo,%hi,1), %mid
  shr    %mid
  cmp    (%table,%mid,8), %key
  cmovge %mid,   %lo
  cmovle %mid,   %hi
  lea    2(%lo), %temp
  cmp    %temp,  %hi
  jge    @LoopBegin

With loop unrolling, this can be simplified further:

  lea    1(%lo,%hi,1), %mid
  shr    %mid
  cmp    (%table,%mid,8), %key
  je     @CaseMatch
  cmovge %mid,   %lo
  cmovle %mid,   %hi

... with “@CaseMatch” appearing later. This block of code is repeated for each iteration. If there is concern that the conditional jump may incur a performance penalty, it can be removed without incident, especially during the last couple of iterations (an exact match will set %lo and %hi to the same value).

For 64-bit systems, the comparison is more complicated because each entry in the table is 16 bytes, and there is no facility to multiply by 16 in reference operands. Nevertheless, with some rearrangement of the commands to take advantage of a CPU core's multiple ports and the availability of additional registers, this can be resolved relatively painlessly:

  lea    1(%lo,%hi,1), %index
  lea    1(%lo,%hi,1), %mid
  and    $-2,    %index ; Clear lsb
  shr    %mid
  cmp    (%table,%index,8), %key
  je     @CaseMatch
  cmovge %mid,   %lo
  cmovle %mid,   %hi

During the search, the value of %lo is one below the lowest valid index, while %hi is one above the highest valid index. If at any point that %lo + 1 ≥ %hi, the entire list has been searched. Due to how %mid is calculated, if execution passes through the unrolled loop without jumping out, %lo and %hi will have a difference of 2 and sit either side of the index of the last remaining entry in the list that hasn't been checked, hence one final lookup in the table against %key will confirm a match, with a jump to the case's "else" block (or the end of the block if one isn't present) if it's not.

For i386:

  lea    1(%lo,%hi), %mid
  shr    %mid
  cmp    (%table,%mid,8), %key
  jne    @ElseBlock
@CaseMatch:
  mov    4(%table,%mid,8), %procaddr
  ; Do common parameter set-up
  call   %procaddr
  jmp    @CaseBlockEnd
@ElseBlock:
  ; Else block code
  ...
@CaseBlockEnd:
  ...

For x86-64 (note the use of %index in the @CaseMatch setion):

  lea    1(%lo,%hi), %index
  and    $-2, %index
  cmp    (%table,%index,8), %key
  jne    @ElseBlock
@CaseMatch:
  mov    8(%table,%index,8), %procaddr
  ; Do common parameter set-up
  call   %procaddr
  jmp    @CaseBlockEnd
@ElseBlock:
  ; Else block code
  ...
@CaseBlockEnd:
  ...

Buffer Overrun Trap

One drawback to this highly optimised binary search is that it doesn't trap the situation where the input key is greater than the largest key in the table, and if this happens, the loop may exit with %lo + 1 = %hi = list length, then "lea 1(%lo, %hi), %mid" sets %mid to %lo + 1, which is out of bounds. To get around this, we recall that in normal operation, %lo and %hi have a difference of 2 if the flow reaches the end of the loop, therefore their sum is a multiple 2, and so the "+1" part of the LEA instruction has no effect after dividing the sum by 2, so the simple fix is to replace the LEA instruction with "lea (%lo, %hi), %mid". In other words, for i386:

  lea    (%lo,%hi), %mid
  shr    %mid
  cmp    (%table,%mid,8), %key
  jne    @ElseBlock
@CaseMatch:
  mov    4(%table,%mid,8), %procaddr
  ; Do common parameter set-up
  call   %procaddr
  jmp    @CaseBlockEnd
@ElseBlock:
  ; Else block code
  ...
@CaseBlockEnd:
  ...

For x86-64:

  lea    (%lo,%hi), %index
  and    $-2, %index
  cmp    (%table,%index,8), %key
  jne    @ElseBlock
@CaseMatch:
  mov    8(%table,%index,8), %procaddr
  ; Do common parameter set-up
  call   %procaddr
  jmp    @CaseBlockEnd
@ElseBlock:
  ; Else block code
  ...
@CaseBlockEnd:
  ...

Putting it all together

Now that the registers and exceptional cases are explained, we can put the entire thing together.

For i386:

@LoopBegin:
  lea    1(%lo,%hi,1), %mid
  shr    %mid
  cmp    (%table,%mid,8), %key
  cmovge %mid,   %lo
  cmovle %mid,   %hi
  lea    2(%lo), %temp
  cmp    %temp,  %hi
  jge    @LoopBegin
  ; Unroll the above if desired

  lea    (%lo,%hi), %mid
  shr    %mid
  cmp    (%table,%mid,8), %key
  jne    @ElseBlock
@CaseMatch:
  mov    4(%table,%mid,8), %procaddr
  ; Do common parameter set-up
  call   %procaddr
  jmp    @CaseBlockEnd
@ElseBlock:
  ; Else block code
  ...
@CaseBlockEnd:
  ...

For x86-64:

@LoopBegin:
  lea    1(%lo,%hi,1), %index
  lea    1(%lo,%hi,1), %mid
  and    $-2,    %index ; Clear lsb
  shr    %mid
  cmp    (%table,%index,8), %key
  je     @CaseMatch
  cmovge %mid,   %lo
  cmovle %mid,   %hi
  lea    2(%lo), %temp
  cmp    %temp,  %hi
  jge    @LoopBegin
  ; Unroll the above if desired

  lea    (%lo,%hi), %index
  and    $-2, %index
  cmp    (%table,%index,8), %key
  jne    @ElseBlock
@CaseMatch:
  mov    8(%table,%index,8), %procaddr
  ; Do common parameter set-up
  call   %procaddr
  jmp    @CaseBlockEnd
@ElseBlock:
  ; Else block code
  ...
@CaseBlockEnd:
  ...