When ingesting into FeatureBase, each record must be associated with a key.
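Ingesters support four ways to do this, three suitable for production workloads: id-field, to use an existing integer field as the record ID; primary-key-fields, to build a string key from one or more fields; external-generate, to use the FeatureBase ID allocator, optionally including offset-mode; and auto-generate, suitable for testing.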
The id-field option should be considered when there is an existing field in the data which uniquely identifies each record and consists of contiguous positive integers. For example, the auto-incremented ID field from a relational database is usually perfect for this.
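As a rough illustration of the conditions above, the following sketch checks whether a column of values looks like a good candidate for id-field. The helper name `is_suitable_id_field` is hypothetical and not part of FeatureBase, and it takes the "unique, contiguous, positive" wording literally, which may be stricter than an ingester actually requires.

```python
def is_suitable_id_field(values):
    """Return True if values are unique, positive, contiguous integers.

    Hypothetical pre-ingest check; not a FeatureBase API.
    """
    ids = list(values)
    if not ids:
        return False
    # Reject non-integers (and booleans, which Python treats as ints)
    # as well as zero or negative values.
    if any(isinstance(v, bool) or not isinstance(v, int) or v <= 0 for v in ids):
        return False
    unique = set(ids)
    # Duplicates mean the column does not uniquely identify records.
    if len(unique) != len(ids):
        return False
    # Contiguous: the span from min to max has no gaps.
    return max(unique) - min(unique) + 1 == len(unique)
```

A column that fails this check, such as one with gaps or repeated values, is probably better served by primary-key-fields.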
In most other cases, the primary-key-fields option should be used. This uses one or more fields, converted to strings, then concatenated (using | as the delimiter), to create unique record IDs. When only a single field is used for this, it will not be indexed as a field in FeatureBase. When multiple source fields are used, each individual field will be indexed in FeatureBase, in addition to being used for the record ID.
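The concatenation described above can be sketched in a few lines. This mirrors the description only and is not the ingester's actual code; `build_record_key` and the sample record are hypothetical.

```python
def build_record_key(record, key_fields, delimiter="|"):
    """Join the chosen fields, converted to strings, into one record key.

    Illustrative only; assumes every name in key_fields exists in record.
    """
    return delimiter.join(str(record[name]) for name in key_fields)


# Hypothetical record with a string field and an integer field.
example = {"school": "north", "studentID": 17, "grade": 9}
single_key = build_record_key(example, ["studentID"])
composite_key = build_record_key(example, ["school", "studentID"])
```

With a single key field the key is just that field's string form. With several, the order of the fields determines the order of the parts in the key.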
As an example, consider a data set of students across multiple schools, perhaps with a different CSV file for each school:
The studentID column, unique within a single school, serves as an identifier. When ingesting a single file corresponding to a single school, an ingest option like --id-field=studentID might work well. This will result in an index with studentID values as FeatureBase record IDs, and every other column potentially represented as a FeatureBase field.
To ingest multiple files without conflicting IDs, a different approach is required. When an appropriate identifier like a UUID is available, that can be used directly, with an option like --primary-key-fields=UUID. This will result in an index with UUID values as FeatureBase record keys, so the index depends on key translation to convert UUID string values to integer record IDs. Every other column would potentially be represented as a FeatureBase field.
Sometimes, an appropriate unique identifier is not directly available, or a data set is designed to use a composite key as its unique identifier; for example, if the students data set did not include a UUID column. In this case, multiple values can be combined to produce a composite identifier that is unique. One option that would work well here is the pair (school, studentID), which would be specified by naming both fields in the primary-key-fields option. Again, this index would depend on key translation. This index, in contrast to the previous, could include every column as a FeatureBase field, including both school and studentID.
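To see why the pair (school, studentID) avoids conflicts where studentID alone does not, the sketch below looks for duplicate keys in a small, made-up set of rows. The rows, the school names, and the `duplicate_keys` helper are all hypothetical. The key construction follows the |-delimited concatenation described earlier and is not taken from the ingester's source.

```python
from collections import Counter


def duplicate_keys(rows, key_fields, delimiter="|"):
    """Return the sorted list of keys that occur in more than one row."""
    keys = [delimiter.join(str(row[f]) for f in key_fields) for row in rows]
    counts = Counter(keys)
    return sorted(key for key, n in counts.items() if n > 1)


# Made-up rows standing in for two per-school CSV files.
rows = [
    {"school": "north", "studentID": 1},
    {"school": "north", "studentID": 2},
    {"school": "south", "studentID": 1},
]

# studentID is only unique within one school, so it collides across files.
collisions_single = duplicate_keys(rows, ["studentID"])
# The composite key stays unique across schools.
collisions_composite = duplicate_keys(rows, ["school", "studentID"])
```

Running a check like this over a sample of the data before ingest is one way to confirm that a proposed composite key is actually unique.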
The auto-generate option can create auto-incrementing integer IDs, for example when generating test data or when ingesting from a CSV source. This option is suitable for quick testing purposes, but does not support using multiple ingest processes or stopping and restarting ingest.
Setting external-generate in addition to auto-generate makes the ingester use FeatureBase’s ID generation feature. Additionally, offset-mode can be set for use with Kafka.