
How does FeatureBase encode non-integer values?

FeatureBase uses equality encoding to create a Boolean relationship between a value and its unique identifier.

Before you begin

What data types are equality-encoded?

Each value of user data mapped to one of the following data types is converted to an equality-encoded bitmap:

| User data        | FeatureBase data type |
| ---------------- | --------------------- |
| Boolean          | BOOL                  |
| Unsigned integer | ID                    |
| Alphanumeric     | STRING                |
| Low cardinality  | SET, SETQ             |

How does equality encoding work?

FeatureBase equality encoding:

  • uses the actual value as a column name, saved to disk
  • represents the relationship between the unique identifier and the column in Boolean terms:
    • 1 indicates the relationship exists
    • 0 indicates it does not.

How does FeatureBase equality encode data?

The following table lists the historical names for FeatureBase and the downloads for each.

| ID  | historical_name | downloads |
| --- | --------------- | --------- |
| 1   | Pilosa          | 10000     |
| 2   | Molecula        | 18524     |
| 3   | FeatureBase     | 50000     |

The historical_name data can be equality-encoded as follows:

| ID  | Pilosa | Molecula | FeatureBase |
| --- | ------ | -------- | ----------- |
| 1   | 1      | 0        | 0           |
| 2   | 0      | 1        | 0           |
| 3   | 0      | 0        | 1           |
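
The encoding above can be sketched in a few lines of Python. This is a minimal illustration, not FeatureBase's implementation: the helpers `equality_encode` and `as_bits` are hypothetical names, and each bitmap is modeled as a plain Python set of IDs rather than an on-disk bitmap.

```python
from collections import defaultdict


def equality_encode(records):
    """Map each distinct value to the set of IDs that hold it (one 'bitmap' per value)."""
    bitmaps = defaultdict(set)
    for record_id, value in records:
        bitmaps[value].add(record_id)
    return dict(bitmaps)


def as_bits(bitmaps, record_id):
    """Return the 0/1 row for one ID across every value column."""
    return {value: int(record_id in ids) for value, ids in bitmaps.items()}


# Data taken from the historical_name table.
records = [(1, "Pilosa"), (2, "Molecula"), (3, "FeatureBase")]
bitmaps = equality_encode(records)
row_2 = as_bits(bitmaps, 2)
```

Here `row_2` holds a 1 only in the Molecula column, matching the row for ID 2 in the table.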

Equality encoding integer values

Equality encoding integer values is less effective because Boolean relationships are harder to represent.

Equality encoding specific values

Using the downloads column as the unique identifier, the data can be encoded as follows:

| id-downloads | Pilosa | Molecula | FeatureBase |
| ------------ | ------ | -------- | ----------- |
| 10000        | 1      | 0        | 0           |
| 18524        | 0      | 1        | 0           |
| 50000        | 0      | 0        | 1           |

Encoding integer values as a range

Values can be encoded as a range, which reduces the number of bitmaps and the number of create/delete operations.

| id-download-range | Pilosa | Molecula | FeatureBase |
| ----------------- | ------ | -------- | ----------- |
| 0-25000           | 1      | 1        | 0           |
| 25001-50000       | 0      | 0        | 1           |
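
As a rough sketch of range encoding, the Python below groups the same data into the two ranges shown in the table. The function `range_encode` and its `bucket_size` parameter are assumptions made for illustration and are not part of FeatureBase.

```python
def range_encode(downloads, bucket_size=25000):
    """Group names into download ranges; the exact counts are not kept."""
    buckets = {}
    for name, count in downloads.items():
        # Ranges follow the table above: 0-25000, 25001-50000, and so on.
        index = 0 if count == 0 else (count - 1) // bucket_size
        low = 0 if index == 0 else index * bucket_size + 1
        high = (index + 1) * bucket_size
        buckets.setdefault(f"{low}-{high}", set()).add(name)
    return buckets


downloads = {"Pilosa": 10000, "Molecula": 18524, "FeatureBase": 50000}
ranges = range_encode(downloads)
```

Only two bitmaps are needed instead of three, but nothing in `ranges` records whether Pilosa had 10000 or 24999 downloads.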

Issues with equality encoding integer values

The following issues occur with equality encoding integers.

| Method         | Issue |
| -------------- | ----- |
| Encode values  | Two operations are required to update a value, which incurs a processing overhead: create a new bitmap with the updated value, then delete the original bitmap |
| Range encoding | Specific values are lost |
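
The two-step update for equality-encoded integers can be illustrated as follows. This is a simplified sketch that models bitmaps as Python sets keyed by the downloads value; `update_value` is a hypothetical helper, not a FeatureBase call.

```python
def update_value(bitmaps, record, old_value, new_value):
    """Move a record from the bitmap for old_value to the bitmap for new_value."""
    if old_value == new_value:
        return bitmaps
    # Operation 1: create the bitmap for the updated value.
    bitmaps.setdefault(new_value, set()).add(record)
    # Operation 2: delete the original bitmap once it no longer holds any record.
    bitmaps[old_value].discard(record)
    if not bitmaps[old_value]:
        del bitmaps[old_value]
    return bitmaps


bitmaps = {10000: {"Pilosa"}, 18524: {"Molecula"}, 50000: {"FeatureBase"}}
update_value(bitmaps, "FeatureBase", 50000, 50001)
```

A single extra download changes the key from 50000 to 50001, so every such update costs one create and one delete.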

FeatureBase avoids these issues by bit-slicing integer values.
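
For intuition only, a generic bit-sliced layout stores one bitmap per binary digit position instead of one per distinct value. The sketch below is an assumption-laden illustration of that general idea, with hypothetical helpers `bit_slice` and `read_value`. It does not describe FeatureBase's actual storage format.

```python
def bit_slice(values, bit_depth):
    """Return one bitmap (set of record IDs) per binary digit position."""
    slices = [set() for _ in range(bit_depth)]
    for record_id, value in values.items():
        for position in range(bit_depth):
            if (value >> position) & 1:
                slices[position].add(record_id)
    return slices


def read_value(slices, record_id):
    """Rebuild the exact integer for one record from its bits."""
    return sum(1 << position for position, ids in enumerate(slices) if record_id in ids)


downloads = {1: 10000, 2: 18524, 3: 50000}
slices = bit_slice(downloads, bit_depth=16)  # 16 bits cover values up to 65535
restored = {record_id: read_value(slices, record_id) for record_id in downloads}
```

The number of bitmaps is fixed by `bit_depth` rather than by the number of distinct values, and `restored` recovers every exact value.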

Bitmap storage overheads

Encoding data as base-2 equality-encoded or bit-slice bitmaps makes queries faster but incurs storage overheads because the number of bitmaps scales with:

  • the number of values, and
  • the cardinality of those values

For example, the average storage overheads for a dataset of 10,000 values are as follows:

| Database    | Dataset saved as                               | Average storage overhead (KB) |
| ----------- | ---------------------------------------------- | ----------------------------- |
| RDBMS       | Row and column based structure                 | 20480 - 30720                 |
| FeatureBase | Equality-encoded bitmaps and bit-slice bitmaps | 1280000                       |

FeatureBase overcomes this issue by compressing all bitmap data using Roaring Bitmap Format, based on Roaring Bitmaps.