How does FeatureBase encode non-integer values?
FeatureBase uses equality encoding to create a Boolean relationship between a value and its unique identifier.
Before you begin
What data types are equality-encoded?
Each value of user data mapped to one of the following data types is converted to an equality-encoded bitmap:
User data | FeatureBase data type |
---|---|
Boolean | BOOL |
Unsigned integer | ID |
Alphanumeric | STRING |
Low cardinality | SET, SETQ |
How does equality encoding work?
FeatureBase equality encoding:
- uses the actual value as a column name, saved to disk
- represents the relationship between unique identifier and column in Boolean terms:
  - 1 indicates the relationship exists
  - 0 indicates it does not
How does FeatureBase equality encode data?
The following table represents the historical names for FeatureBase and downloads for each.
ID | historical_name | downloads |
---|---|---|
1 | Pilosa | 10000 |
2 | Molecula | 18524 |
3 | FeatureBase | 50000 |
The historical_name data can be equality-encoded as follows:
| ID | Pilosa | Molecula | FeatureBase |
| --- | ------ | -------- | ----------- |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
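The encoding above can be sketched in a few lines of Python. This is a minimal illustration, not FeatureBase's implementation: the function names `equality_encode` and `as_bits` are hypothetical, and each bitmap is modeled as a plain Python set of record IDs.

```python
from collections import defaultdict

def equality_encode(records):
    """Map each distinct value to the set of record IDs that hold it."""
    bitmaps = defaultdict(set)
    for record_id, value in records:
        bitmaps[value].add(record_id)
    return dict(bitmaps)

def as_bits(bitmaps, ids):
    """Render each bitmap as a list of 0/1 flags, one flag per record ID."""
    return {
        value: [1 if record_id in members else 0 for record_id in ids]
        for value, members in bitmaps.items()
    }

# Sample data from the historical_name column (ID, value).
records = [(1, "Pilosa"), (2, "Molecula"), (3, "FeatureBase")]
bitmaps = equality_encode(records)
bits = as_bits(bitmaps, ids=[1, 2, 3])

for name, flags in bits.items():
    print(name, flags)
```

Each distinct value gets its own bitmap, so the number of bitmaps grows with the number of distinct values in the column.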
Equality encoding integer values
Equality encoding integer values is less effective because every distinct integer needs its own bitmap, which makes Boolean relationships harder to represent compactly.
Equality encoding specific values
Using the downloads column as the unique identifier, the data can be encoded as follows:
| id-downloads | Pilosa | Molecula | FeatureBase |
| ------------ | ------ | -------- | ----------- |
| 10000 | 1 | 0 | 0 |
| 18524 | 0 | 1 | 0 |
| 50000 | 0 | 0 | 1 |
Encoding integer values as a range
Values can be encoded as a range, which reduces both the number of bitmaps and the number of create/delete operations.
| id-download-range | Pilosa | Molecula | FeatureBase |
| ----------------- | ------ | -------- | ----------- |
| 0-25000 | 1 | 1 | 0 |
| 25001-50000 | 0 | 0 | 1 |
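A minimal Python sketch of the range table above, assuming inclusive bucket boundaries. The function name `range_encode` and the set-based representation are illustrative only, not FeatureBase's implementation.

```python
def range_encode(records, buckets):
    """Assign each (name, value) record to the first bucket that contains value."""
    encoded = {bucket: set() for bucket in buckets}
    for name, value in records:
        for low, high in buckets:
            if low <= value <= high:  # boundaries are inclusive
                encoded[(low, high)].add(name)
                break
    return encoded

downloads = [("Pilosa", 10000), ("Molecula", 18524), ("FeatureBase", 50000)]
buckets = [(0, 25000), (25001, 50000)]
ranges = range_encode(downloads, buckets)

for (low, high), names in ranges.items():
    print(f"{low}-{high}", sorted(names))
```

Two bitmaps now cover all three records, but the result only records that Pilosa and Molecula fall somewhere in 0-25000. The exact download counts cannot be recovered from the encoded data.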
Issues with equality encoding integer values
The following issues occur with equality encoding integers.
Method | Issue |
---|---|
Encode values | Updating a value requires two operations, which incurs a processing overhead: (1) create a new bitmap with the updated value, then (2) delete the original bitmap |
Range encoding | Specific values are lost |
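The two-step update described for the "Encode values" method can be sketched as follows. This is an illustrative model that stores each bitmap as a Python set keyed by the integer value. The helper `update_value` is hypothetical and does not correspond to a FeatureBase API.

```python
def update_value(bitmaps, record, old_value, new_value):
    """Move record from the bitmap for old_value to the bitmap for new_value."""
    if old_value == new_value:
        return bitmaps  # nothing to do
    # Operation 1: create (or extend) the bitmap for the new value.
    bitmaps.setdefault(new_value, set()).add(record)
    # Operation 2: remove the record from the old bitmap and delete it if empty.
    bitmaps[old_value].discard(record)
    if not bitmaps[old_value]:
        del bitmaps[old_value]
    return bitmaps

by_downloads = {10000: {"Pilosa"}, 18524: {"Molecula"}, 50000: {"FeatureBase"}}
update_value(by_downloads, "FeatureBase", 50000, 50001)
print(sorted(by_downloads))
```

A single additional download forces one bitmap to be created and another to be deleted, which is the overhead the table refers to.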
FeatureBase avoids these issues by bit-slicing integer values.
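As a rough illustration of the idea behind bit-slicing, the sketch below stores one bitmap per binary digit rather than one bitmap per distinct value. It is a simplified model under stated assumptions (non-negative integers, a fixed `bit_depth`, bitmaps as Python sets) and is not FeatureBase's actual on-disk layout. The names `bit_slice` and `read_value` are hypothetical.

```python
def bit_slice(values, bit_depth):
    """Build one bitmap (set of record IDs) per binary digit of the values."""
    slices = [set() for _ in range(bit_depth)]
    for record_id, value in values.items():
        for bit in range(bit_depth):
            if (value >> bit) & 1:
                slices[bit].add(record_id)
    return slices

def read_value(slices, record_id):
    """Reassemble the integer stored for record_id from the bit slices."""
    return sum(1 << bit for bit, members in enumerate(slices) if record_id in members)

# Download counts keyed by record ID; 16 bits cover values up to 65535.
downloads = {1: 10000, 2: 18524, 3: 50000}
slices = bit_slice(downloads, bit_depth=16)

print(len(slices), read_value(slices, 2))
```

In this model the number of bitmaps depends on the bit depth rather than on how many distinct values exist, and exact values remain recoverable, unlike with range encoding.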
Bitmap storage overheads
Encoding data as base-2 equality-encoded or bit-slice bitmaps makes queries faster but incurs storage overheads because the number of bitmaps scales with:
- the number of values, and
- the cardinality of those values
For example, the average storage overheads for a 10,000-value dataset are as follows:
Database | Dataset saved as | Average storage overhead (KB) |
---|---|---|
RDBMS | Row and column based structure | 20480 - 30720 |
FeatureBase | Equality-encoded bitmaps and bit-slice bitmaps | 1280000 |
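The source does not state how the 1280000 KB figure is derived. One set of assumptions that reproduces it is sketched below: every one of the 10,000 values is distinct and gets its own uncompressed bitmap, and each bitmap is 2^20 bits wide. Both constants are assumptions made for illustration.

```python
SHARD_WIDTH_BITS = 2 ** 20  # assumed width of one uncompressed bitmap, in bits
DISTINCT_VALUES = 10_000    # assumed: one bitmap per distinct value

bytes_per_bitmap = SHARD_WIDTH_BITS // 8    # 8 bits per byte
kb_per_bitmap = bytes_per_bitmap // 1024    # 1024 bytes per KB
total_kb = kb_per_bitmap * DISTINCT_VALUES

print(kb_per_bitmap, total_kb)
```

Under these assumptions each bitmap occupies 128 KB, and 10,000 of them occupy 1,280,000 KB before any compression is applied.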
FeatureBase overcomes this issue by compressing all bitmap data using Roaring Bitmap Format, based on Roaring Bitmaps.