Difference between revisions of "Varlen Encoding"

From Lazarus wiki
(Added the rest of the encoding)
(More formatting and explanation behind the offsets)
Line 14: Line 14:
 
! Description
 
! Description
 
|-
 
|-
| '''0#######'''
+
| <code>'''0#######'''</code>
! style="text-align: center;" | $00..$7F
+
| <center><code>$00..$7F</code></center>
 
| No data bytes; '''#''''s encode a 7-bit value between $00 to $7F (0 to 127)
 
| No data bytes; '''#''''s encode a 7-bit value between $00 to $7F (0 to 127)
 
|-
 
|-
| '''10######'''
+
| <code>'''10######'''</code>
! style="text-align: center;" | $0080..$407F
+
| <center><code>$0080..$407F</code></center>
 
| 1 data byte follows; '''#''''s encode a 14-bit value offset by $80 (add $80 to the value to obtain the actual integer)
 
| 1 data byte follows; '''#''''s encode a 14-bit value offset by $80 (add $80 to the value to obtain the actual integer)
 
|-
 
|-
| '''110#####'''
+
| <code>'''110#####'''</code>
! style="text-align: center;" | $004080..$20407F
+
| <center><code>$004080..$20407F</code></center>
 
| 2 data bytes follow; '''#''''s encode a 21-bit value offset by $4080
 
| 2 data bytes follow; '''#''''s encode a 21-bit value offset by $4080
 
|-
 
|-
| '''1110####'''
+
| <code>'''1110####'''</code>
! style="text-align: center;" | $00204080..$1020407F
+
| <center><code>$00204080..$1020407F</code></center>
 
| 3 data bytes follow; '''#''''s encode a 28-bit value offset by $204080
 
| 3 data bytes follow; '''#''''s encode a 28-bit value offset by $204080
 
|-
 
|-
| '''11110###'''
+
| <code>'''11110###'''</code>
! style="text-align: center;" | $00 10204080..$08 1020407F
+
| <center><code>$00&nbsp;10204080..$08&nbsp;1020407F</code></center>
 
| 4 data bytes follow; '''#''''s encode a 35-bit value offset by $10204080
 
| 4 data bytes follow; '''#''''s encode a 35-bit value offset by $10204080
 
|-
 
|-
| '''111110##'''
+
| <code>'''111110##'''</code>
! style="text-align: center;" | $0008 10204080..$0408 1020407F
+
| <center><code>$0008&nbsp;10204080..$0408&nbsp;1020407F</code></center>
| 5 data bytes follow; '''#''''s encode a 42-bit value offset by $08 10204080
+
| 5 data bytes follow; '''#''''s encode a 42-bit value offset by $08&nbsp;10204080
 
|-
 
|-
| '''1111110#'''
+
| <code>'''1111110#'''</code>
! style="text-align: center;" | $000408 10204080..$020408 1020407F
+
| <center><code>$000408&nbsp;10204080..$020408&nbsp;1020407F</code></center>
| 6 data bytes follow; '''#''''s encode a 49-bit value offset by $0408 10204080
+
| 6 data bytes follow; '''#''''s encode a 49-bit value offset by $0408&nbsp;10204080
 
|-
 
|-
| '''11111110'''
+
| <code>'''11111110'''</code>
! style="text-align: center;" | $00020408 10204080..$01020408 1020407F
+
| <center><code>$00020408&nbsp;10204080..$01020408&nbsp;1020407F</code></center>
| 7 data bytes follow; '''#''''s encode a 56-bit value offset by $020408 10204080
+
| 7 data bytes follow; '''#''''s encode a 56-bit value offset by $020408&nbsp;10204080
 
|-
 
|-
| '''11111111'''
+
| <code>'''11111111'''</code>
! style="text-align: center;" | $01020408 10204080..$FFFFFFFF FFFFFFFF
+
| <center><code>$01020408&nbsp;10204080..$FFFFFFFF&nbsp;FFFFFFFF</code></center>
| 8 data bytes follow; '''#''''s encode a 64-bit value offset by $01020408 10204080
+
| 8 data bytes follow; '''#''''s encode a 64-bit value offset by $01020408&nbsp;10204080
 
|}
 
|}
 +
It is permissible to remove the offsets without sacrificing the full QWord domain.  The use of offsets remove multiple encodings for the same value (e.g. '''00000000''' and '''10000000&nbsp;00000000''' for zero) and slightly increase the chance that an integer is stored in fewer bytes (e.g. 16,384, or $4000, would otherwise take 3 bytes to store without the offset (as '''11000000&nbsp;00000000&nbsp;00000000'''), whereas with the offset, it only takes 2 bytes (as '''10111111&nbsp;10000000''')).

Revision as of 03:23, 23 March 2019

Varlen Encoding (short for "variable length") is a way of storing a QWord so that small values take as few bytes as possible.

Inspiration

The inspiration and design of Varlen encoding came from two sources:

Some data fields are statistically more likely to contain small values, such as those used for sequential indices and string lengths, but whose upper limit may be the size of a LongWord or a QWord. If many hundreds of these values are stored, space can be quickly wasted when most of the bytes that make up the full integer value are zero, and this can be problematic on media with limited capacity. The design of Varlen Encoding is an attempt to reduce this wastage with minimal performance impact.

Format

A Varlen-encoded integer (henceforth just called a Varlen) takes between 1 and 9 bytes to store. The first byte, known as a Lead Byte, encodes a byte count and the most significant bits of the number, stored in big-endian order. The bit pattern of the Lead Byte dictates the number of Data Bytes that follow:

Lead Byte Value Range Description
0#######
$00..$7F
No data bytes; #'s encode a 7-bit value between $00 to $7F (0 to 127)
10######
$0080..$407F
1 data byte follows; #'s encode a 14-bit value offset by $80 (add $80 to the value to obtain the actual integer)
110#####
$004080..$20407F
2 data bytes follow; #'s encode a 21-bit value offset by $4080
1110####
$00204080..$1020407F
3 data bytes follow; #'s encode a 28-bit value offset by $204080
11110###
$00 10204080..$08 1020407F
4 data bytes follow; #'s encode a 35-bit value offset by $10204080
111110##
$0008 10204080..$0408 1020407F
5 data bytes follow; #'s encode a 42-bit value offset by $08 10204080
1111110#
$000408 10204080..$020408 1020407F
6 data bytes follow; #'s encode a 49-bit value offset by $0408 10204080
11111110
$00020408 10204080..$01020408 1020407F
7 data bytes follow; #'s encode a 56-bit value offset by $020408 10204080
11111111
$01020408 10204080..$FFFFFFFF FFFFFFFF
8 data bytes follow; #'s encode a 64-bit value offset by $01020408 10204080

It is permissible to remove the offsets without sacrificing the full QWord domain. The use of offsets remove multiple encodings for the same value (e.g. 00000000 and 10000000 00000000 for zero) and slightly increase the chance that an integer is stored in fewer bytes (e.g. 16,384, or $4000, would otherwise take 3 bytes to store without the offset (as 11000000 00000000 00000000), whereas with the offset, it only takes 2 bytes (as 10111111 10000000)).