Difference between revisions of "Varlen Encoding"
CuriousKit (talk | contribs) m (Brief clarification) |
CuriousKit (talk | contribs) (Signed Varlen) |
||
Line 63: | Line 63: | ||
* Hashes | * Hashes | ||
* Signed integers (since negative numbers map onto the largest positive integers) | * Signed integers (since negative numbers map onto the largest positive integers) | ||
+ | |||
+ | ==Signed Varlen== | ||
+ | The regular Varlen can only encode unsigned integers, and while two's complement permits negative numbers to map over the top half of an unsigned integer domain, these values take up the most bytes under Varlen Encoding and so are likely to increase storage requirements if negative numbers close to zero are frequent. Therefore, for signed integers, a different encoding system is required. | ||
+ | |||
+ | ===Format=== | ||
+ | As with the regular Varlen, a Signed Varlen contains a ''Lead Byte'', but there is now always at least one ''Data Byte''. This is because one of the Lead Byte's bits must be used as a sign bit and so a data byte count of 9 different values (0-8) cannot be encoded, only 8. | ||
+ | |||
+ | {| class="wikitable" | ||
+ | |+Positive Values | ||
+ | |-! style="text-align: center;" | ||
+ | ! Lead Byte | ||
+ | ! Value Range | ||
+ | ! Description | ||
+ | |- | ||
+ | | <code>'''00######'''</code> | ||
+ | | <center><code>$0000..$3FFF</code></center> | ||
+ | | 1 data byte follows; '''#''''s encode a 14-bit value between $00 to $3FFF (0 to 16,383) | ||
+ | |- | ||
+ | | <code>'''010#####'''</code> | ||
+ | | <center><code>$004000..$203FFF</code></center> | ||
+ | | 2 data bytes follow; '''#''''s encode a 21-bit value offset by $4000 (add $4000 to the value to obtain the actual integer) | ||
+ | |- | ||
+ | | <code>'''0110####'''</code> | ||
+ | | <center><code>$00204000..$10203FFF</code></center> | ||
+ | | 3 data bytes follow; '''#''''s encode a 28-bit value offset by $204000 | ||
+ | |- | ||
+ | | <code>'''01110###'''</code> | ||
+ | | <center><code>$00 10204000..$08 10203FFF</code></center> | ||
+ | | 4 data bytes follow; '''#''''s encode a 35-bit value offset by $10204000 | ||
+ | |- | ||
+ | | <code>'''011110##'''</code> | ||
+ | | <center><code>$0008 10204000..$0408 10203FFF</code></center> | ||
+ | | 5 data bytes follow; '''#''''s encode a 42-bit value offset by $08 10204000 | ||
+ | |- | ||
+ | | <code>'''0111110#'''</code> | ||
+ | | <center><code>$000408 10204000..$020408 10203FFF</code></center> | ||
+ | | 6 data bytes follow; '''#''''s encode a 49-bit value offset by $0408 10204000 | ||
+ | |- | ||
+ | | <code>'''01111110'''</code> | ||
+ | | <center><code>$00020408 10204000..$01020408 10203FFF</code></center> | ||
+ | | 7 data bytes follow; '''#''''s encode a 56-bit value offset by $020408 10204000 | ||
+ | |- | ||
+ | | <code>'''01111111'''</code> | ||
+ | | <center><code>$01020408 10204000..$7FFFFFFF FFFFFFFF</code></center> | ||
+ | | 8 data bytes follow; '''#''''s encode a 64-bit value offset by $01020408 10204000 | ||
+ | |} | ||
+ | |||
+ | {| class="wikitable" | ||
+ | |+Negative Values | ||
+ | |-! style="text-align: center;" | ||
+ | ! Lead Byte | ||
+ | ! Value Range | ||
+ | ! Description | ||
+ | |- | ||
+ | | <code>'''11######'''</code> | ||
+ | | <center><code>$FFFFFFFF FFFFC000..$FFFFFFFF FFFFFFFF</code></center> | ||
+ | | 1 data byte follows; '''#''''s encode a 14-bit value between $FFFFFFFF FFFFC000 to $FFFFFFFF FFFFFFFF (-16,384 to -1) | ||
+ | |- | ||
+ | | <code>'''101#####'''</code> | ||
+ | | <center><code>$FFFFFFFF FFDFC000..$FFFFFFFF FFFFBFFF</code></center> | ||
+ | | 2 data bytes follow; '''#''''s encode a 21-bit value offset by $4000 ('''subtract''' $4000 to the value to obtain the actual integer) | ||
+ | |- | ||
+ | | <code>'''1001####'''</code> | ||
+ | | <center><code>$FFFFFFFF EFDFC000..$FFFFFFFF FFDFBFFF</code></center> | ||
+ | | 3 data bytes follow; '''#''''s encode a 28-bit value offset by $204000 | ||
+ | |- | ||
+ | | <code>'''10001###'''</code> | ||
+ | | <center><code>$FFFFFFF7 EFDFC000..$FFFFFFFF EFDFBFFF</code></center> | ||
+ | | 4 data bytes follow; '''#''''s encode a 35-bit value offset by $10204000 | ||
+ | |- | ||
+ | | <code>'''100001##'''</code> | ||
+ | | <center><code>$FFFFFBF7 EFDFC000..$FFFFFFF7 EFDFBFFF</code></center> | ||
+ | | 5 data bytes follow; '''#''''s encode a 42-bit value offset by $08 10204000 | ||
+ | |- | ||
+ | | <code>'''1000001#'''</code> | ||
+ | | <center><code>$FFFDFBF7 EFDFC000..$FFFFFBF7 EFDFBFFF</code></center> | ||
+ | | 6 data bytes follow; '''#''''s encode a 49-bit value offset by $0408 10204000 | ||
+ | |- | ||
+ | | <code>'''10000001'''</code> | ||
+ | | <center><code>$FEFDFBF7 EFDFC000..$FFFDFBF7 EFDFBFFF</code></center> | ||
+ | | 7 data bytes follow; '''#''''s encode a 56-bit value offset by $020408 10204000 | ||
+ | |- | ||
+ | | <code>'''10000000'''</code> | ||
+ | | <center><code>$80000000 00000000..$FEFDFBF7 EFDFBFFF</code></center> | ||
+ | | 8 data bytes follow; '''#''''s encode a 64-bit value offset by $01020408 10204000 | ||
+ | |} |
Revision as of 06:03, 23 March 2019
Varlen Encoding (short for "variable length") is a way of storing a QWord so that small values take as few bytes as possible.
Inspiration
The inspiration and design of Varlen encoding came from two sources:
- UTF-8 encoding
- The CompactIndex type used in old versions of the Unreal Engine.
Some data fields are statistically more likely to contain small values, such as those used for sequential indices and string lengths, but whose upper limit may be the size of a LongWord or a QWord. If many hundreds of these values are stored, space can be quickly wasted when most of the bytes that make up the full integer value are zero, and this can be problematic on media with limited capacity. The design of Varlen Encoding is an attempt to reduce this wastage with minimal performance impact.
Format
A Varlen-encoded integer (henceforth just called a Varlen) takes between 1 and 9 bytes to store. The first byte, known as a Lead Byte, encodes a byte count and the most significant bits of the integer, stored in big-endian order. The bit pattern of the Lead Byte dictates the number of Data Bytes that follow, which contain the rest of the bits for the stored integer:
Lead Byte | Value Range | Description |
---|---|---|
0#######
|
$00..$7F |
No data bytes; #'s encode a 7-bit value between $00 to $7F (0 to 127) |
10######
|
$0080..$407F |
1 data byte follows; #'s encode a 14-bit value offset by $80 (add $80 to the value to obtain the actual integer) |
110#####
|
$004080..$20407F |
2 data bytes follow; #'s encode a 21-bit value offset by $4080 |
1110####
|
$00204080..$1020407F |
3 data bytes follow; #'s encode a 28-bit value offset by $204080 |
11110###
|
$00 10204080..$08 1020407F |
4 data bytes follow; #'s encode a 35-bit value offset by $10204080 |
111110##
|
$0008 10204080..$0408 1020407F |
5 data bytes follow; #'s encode a 42-bit value offset by $08 10204080 |
1111110#
|
$000408 10204080..$020408 1020407F |
6 data bytes follow; #'s encode a 49-bit value offset by $0408 10204080 |
11111110
|
$00020408 10204080..$01020408 1020407F |
7 data bytes follow; #'s encode a 56-bit value offset by $020408 10204080 |
11111111
|
$01020408 10204080..$FFFFFFFF FFFFFFFF |
8 data bytes follow; #'s encode a 64-bit value offset by $01020408 10204080 |
It is permissible to remove the offsets without sacrificing the full QWord domain. The use of offsets remove multiple encodings for the same value (e.g. 00000000 and 10000000 00000000 for zero) and slightly increase the chance that an integer is stored in fewer bytes (e.g. 16,384, or $4000, would otherwise take 3 bytes to store without the offset (as 11000000 00000000 00000000), whereas with the offset, it only takes 2 bytes (as 10111111 10000000)).
Suitability
As stated earlier, Varlen Encoding is best suited for situations where the field is more likely to contain a small value over a large one. As a result, not all LongWord and QWord values should be stored as Varlens because an extra byte is required to store the largest of values. For example, the following field types are good candidates for Varlen Encoding:
- Index values
- String and buffer lengths
- Counts
- Offsets (as long as they are strictly positive)
Conversely, the following would make for poor candidates:
- 32-bit pointers
- Hashes
- Signed integers (since negative numbers map onto the largest positive integers)
Signed Varlen
The regular Varlen can only encode unsigned integers, and while two's complement permits negative numbers to map over the top half of an unsigned integer domain, these values take up the most bytes under Varlen Encoding and so are likely to increase storage requirements if negative numbers close to zero are frequent. Therefore, for signed integers, a different encoding system is required.
Format
As with the regular Varlen, a Signed Varlen contains a Lead Byte, but there is now always at least one Data Byte. This is because one of the Lead Byte's bits must be used as a sign bit and so a data byte count of 9 different values (0-8) cannot be encoded, only 8.
Lead Byte | Value Range | Description |
---|---|---|
00######
|
$0000..$3FFF |
1 data byte follows; #'s encode a 14-bit value between $00 to $3FFF (0 to 16,383) |
010#####
|
$004000..$203FFF |
2 data bytes follow; #'s encode a 21-bit value offset by $4000 (add $4000 to the value to obtain the actual integer) |
0110####
|
$00204000..$10203FFF |
3 data bytes follow; #'s encode a 28-bit value offset by $204000 |
01110###
|
$00 10204000..$08 10203FFF |
4 data bytes follow; #'s encode a 35-bit value offset by $10204000 |
011110##
|
$0008 10204000..$0408 10203FFF |
5 data bytes follow; #'s encode a 42-bit value offset by $08 10204000 |
0111110#
|
$000408 10204000..$020408 10203FFF |
6 data bytes follow; #'s encode a 49-bit value offset by $0408 10204000 |
01111110
|
$00020408 10204000..$01020408 10203FFF |
7 data bytes follow; #'s encode a 56-bit value offset by $020408 10204000 |
01111111
|
$01020408 10204000..$7FFFFFFF FFFFFFFF |
8 data bytes follow; #'s encode a 64-bit value offset by $01020408 10204000 |
Lead Byte | Value Range | Description |
---|---|---|
11######
|
$FFFFFFFF FFFFC000..$FFFFFFFF FFFFFFFF |
1 data byte follows; #'s encode a 14-bit value between $FFFFFFFF FFFFC000 to $FFFFFFFF FFFFFFFF (-16,384 to -1) |
101#####
|
$FFFFFFFF FFDFC000..$FFFFFFFF FFFFBFFF |
2 data bytes follow; #'s encode a 21-bit value offset by $4000 (subtract $4000 to the value to obtain the actual integer) |
1001####
|
$FFFFFFFF EFDFC000..$FFFFFFFF FFDFBFFF |
3 data bytes follow; #'s encode a 28-bit value offset by $204000 |
10001###
|
$FFFFFFF7 EFDFC000..$FFFFFFFF EFDFBFFF |
4 data bytes follow; #'s encode a 35-bit value offset by $10204000 |
100001##
|
$FFFFFBF7 EFDFC000..$FFFFFFF7 EFDFBFFF |
5 data bytes follow; #'s encode a 42-bit value offset by $08 10204000 |
1000001#
|
$FFFDFBF7 EFDFC000..$FFFFFBF7 EFDFBFFF |
6 data bytes follow; #'s encode a 49-bit value offset by $0408 10204000 |
10000001
|
$FEFDFBF7 EFDFC000..$FFFDFBF7 EFDFBFFF |
7 data bytes follow; #'s encode a 56-bit value offset by $020408 10204000 |
10000000
|
$80000000 00000000..$FEFDFBF7 EFDFBFFF |
8 data bytes follow; #'s encode a 64-bit value offset by $01020408 10204000 |