Community Quick Start Guide
Welcome to FeatureBase!
Follow the guide outlined below for a hands-on demonstration of low-latency queries at scale using FeatureBase Community edition. As you work through the guide, please note any questions or feedback that you may have for the FeatureBase team. We’re always looking for ways to improve the experience!
In this demonstration you will:
- Download a FeatureBase binary
- Configure and run FeatureBase
- Restore two large-scale datasets into FeatureBase
- Run a set of analytics queries
If you run into any roadblocks or have questions, please join our Discord.
Download FeatureBase Binary
To download releases of FeatureBase, you must have a GitHub account.
You can see the latest FeatureBase release here.
Please note that by downloading FeatureBase you agree to the License and to receive occasional marketing emails from the FeatureBase team. You also understand that we will process your personal information in accordance with our Privacy Policy.
Click on the option that best meets your hardware’s needs to download the tarball containing the latest FeatureBase binary.
If you need to download the release programmatically, you can use a `wget` command.
Example for linux arm64:
wget https://github.com/FeatureBaseDB/featurebase/releases/latest/download/featurebase-v3.32.0-linux-arm64.tar.gz
Configuring FeatureBase - macOS
Note: The FeatureBase executables are built for x86_64 and ARM64 CPU architectures. If you aren’t running macOS, there are two other options: you can install and run FeatureBase on Linux or in a Docker container.
Note: Make sure you have `wget` installed if you want to download the release programmatically. To install it, run `brew install wget`.
First, open the Terminal application and run the command below to change to the `Downloads` folder.
cd ~/Downloads
Next, extract the `.tar.gz` file, copy the FeatureBase binary located in the extracted `featurebase-<version>-<architecture>/` directory to your `/usr/local/bin` folder, and make it executable.
For the macOS AMD64 Version:
tar -zxvf featurebase-v3.32.0-darwin-amd64.tar.gz
sudo cp ~/Downloads/featurebase-v3.32.0-darwin-amd64/featurebase /usr/local/bin/
chmod ugo+x /usr/local/bin/featurebase
For the macOS ARM64 Version:
tar -zxvf featurebase-v3.32.0-darwin-arm64.tar.gz
sudo cp ~/Downloads/featurebase-v3.32.0-darwin-arm64/featurebase /usr/local/bin/
chmod ugo+x /usr/local/bin/featurebase
Note: The copy (`cp`) command above copies the FeatureBase binary to `/usr/local/bin/`. Ensure this folder is in your `PATH` variable by running `echo $PATH` in the command line and confirming `/usr/local/bin` is there. If it’s not, run `export PATH=$PATH:/usr/local/bin` to append it to your `PATH` variable.
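Note that `export` only applies to your current shell session. A minimal sketch for making the change persistent, assuming zsh (the default shell on modern macOS):
# Append /usr/local/bin to PATH for future shell sessions
echo 'export PATH="$PATH:/usr/local/bin"' >> ~/.zshrc
# Reload the profile and confirm /usr/local/bin now appears
source ~/.zshrc
echo $PATH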
Next, configure the FeatureBase server by creating and editing the configuration file. Go to FeatureBase Configuration for more on configuring FeatureBase.
sudo touch /etc/featurebase.conf
sudo chown `whoami` /etc/featurebase.conf
sudo nano /etc/featurebase.conf
Copy and paste the contents below. Then save and exit.
name = "featurebase"
bind = "0.0.0.0:10101"
bind-grpc = "0.0.0.0:20101"
data-dir = "/var/lib/molecula"
log-path = "/var/log/molecula/featurebase.log"
[cluster]
name = "cluster"
replicas = 1
[etcd]
listen-client-address = "http://localhost:10401"
listen-peer-address = "http://localhost:10301"
initial-cluster = "featurebase=http://localhost:10301"
In the code above, `bind` tells FeatureBase to listen for HTTP requests on all IPv4 addresses on the local machine at port `10101`. `bind-grpc` tells FeatureBase to listen for gRPC requests on all IPv4 addresses on the local machine at port `20101`. `data-dir` and `log-path` tell FeatureBase to write data and logs to their respective locations.
You’ll notice there is a generic featurebase.conf within the release that can be referenced and used outside of this guide.
Set up Log and Data Folders
`featurebase.conf` defines the `data-dir` and the `log-path`. Here we’ll create those folders and make your current user the owner. The data directory is where FeatureBase puts the startup log file and the actual data that comprises the FeatureBase indexes.
Create a log and data folder and change the owner:
sudo mkdir -p /var/log/molecula && sudo chown $USER /var/log/molecula
sudo mkdir -p /var/lib/molecula && sudo chown $USER /var/lib/molecula
Run FeatureBase
Next, start the `featurebase server` process in the background by running:
featurebase server -c /etc/featurebase.conf &
Note: You may see an error message similar to “featurebase” can’t be opened because Apple cannot check it for malicious software. or "featurebase" cannot be opened because the developer cannot be verified. If so, you can resolve this issue by navigating to “Security & Privacy” -> “General” settings and allowing the binary.
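Once the server is up, you can do a quick sanity check over HTTP. A minimal sketch, assuming the default `bind` address from the config above and FeatureBase’s `/status` HTTP endpoint:
# Confirm the server responds on the HTTP port configured by `bind`
curl http://localhost:10101/status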
This concludes the setup of FeatureBase on macOS. Continue with Restore 1B Records of Demo Data from S3 to load data into your FeatureBase database.
Configuring FeatureBase - Linux
Note: If you aren’t running Linux, there are two other options: you can install and run FeatureBase on macOS or in a Docker container. Note that the FeatureBase executables are built for x86_64 and ARM64 CPU architectures.
Note: Make sure you have `wget` installed if you want to download the release programmatically. To install it, run `sudo yum install wget` or `sudo apt install wget`.
First, download the appropriate binary for your system here.
Next, extract the `.tar.gz` file, copy the `featurebase` binary located in the extracted `featurebase-<version>-<architecture>/` directory to your `/usr/local/bin` folder, and make it executable.
tar -zxvf featurebase-v3.32.0-linux-arm64.tar.gz
chmod +x featurebase-v3.32.0-linux-arm64/featurebase
sudo cp featurebase-v3.32.0-linux-arm64/featurebase /usr/local/bin
Note: The copy (`cp`) command above copies the `featurebase` binary to `/usr/local/bin/`. Ensure this folder is in your `PATH` variable by running `echo $PATH` in the command line and confirming `/usr/local/bin` is there. If it’s not, run `export PATH=$PATH:/usr/local/bin` to append it to your `PATH` variable.
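Note that `export` only lasts for the current shell session. A minimal sketch for making it persistent, assuming bash (the default shell on most Linux distributions):
# Append /usr/local/bin to PATH for future shell sessions
echo 'export PATH="$PATH:/usr/local/bin"' >> ~/.bashrc
# Reload the profile and confirm /usr/local/bin now appears
source ~/.bashrc
echo $PATH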
Next, create the configuration file:
sudo nano /etc/featurebase.conf
Copy and paste the contents below. Then save and exit.
name = "featurebase"
bind = "0.0.0.0:10101"
bind-grpc = "0.0.0.0:20101"
data-dir = "/var/lib/molecula"
log-path = "/var/log/molecula/featurebase.log"
[cluster]
name = "cluster"
replicas = 1
[etcd]
listen-client-address = "http://localhost:10401"
listen-peer-address = "http://localhost:10301"
initial-cluster = "featurebase=http://localhost:10301"
Above, `bind` tells FeatureBase to listen for HTTP requests on all IPv4 addresses on the local machine at port `10101`. `bind-grpc` tells FeatureBase to listen for gRPC requests on all IPv4 addresses on the local machine at port `20101`. `data-dir` and `log-path` tell FeatureBase to write data and logs to their respective locations.
You’ll notice there is a generic featurebase.conf within the release that can be referenced and used outside of this guide.
Configure FeatureBase Service Unit
On Linux, we recommend running FeatureBase as a `systemd` service unit. To learn more about `systemd` and units, go here. To configure the FeatureBase service unit’s function, we’ll create a `.service` file.
sudo nano /etc/systemd/system/featurebase.service
Copy and paste the contents below. Then save and exit.
[Unit]
Description="Service for FeatureBase"
[Service]
RestartSec=30
Restart=on-failure
User=molecula
ExecStart=/usr/local/bin/featurebase server -c /etc/featurebase.conf
[Install]
WantedBy=multi-user.target
The service file above tells `systemd` we want to run `/usr/local/bin/featurebase server -c /etc/featurebase.conf` as the `molecula` user. This starts a FeatureBase server and configures it based on `/etc/featurebase.conf`. Additionally, in the event the process fails, it tells `systemd` to try to restart it after 30 seconds.
Set up Log and Data Folders
`featurebase.conf` defines the `data-dir` and the `log-path`. Here we’ll create those folders and set the `molecula` user as the owner. The data directory is where FeatureBase puts the startup log file and the actual data that comprises the FeatureBase indexes.
Create the molecula user:
sudo adduser molecula
Create a log and data folder and change the owner:
sudo mkdir -p /var/log/molecula && sudo chown molecula:molecula /var/log/molecula
sudo mkdir -p /var/lib/molecula && sudo chown molecula:molecula /var/lib/molecula
Run FeatureBase
Refresh `systemd` so the FeatureBase service unit will load. For more information on `daemon-reload`, go here.
sudo systemctl daemon-reload
If you’d like to automatically start FeatureBase after a reboot, you can run:
sudo systemctl enable featurebase.service
Finally, start the FeatureBase service unit:
sudo systemctl start featurebase
Verify it started successfully and is running:
sudo systemctl status featurebase
You should see a message that looks like this:
● featurebase.service - "Service for FeatureBase"
Loaded: loaded (/etc/systemd/system/featurebase.service; disabled; vendor preset: disabled)
Active: active (running) since Wed 2022-09-07 20:13:13 UTC; 6s ago
Main PID: 30396 (featurebase)
CGroup: /system.slice/featurebase.service
└─30396 /usr/local/bin/featurebase server -c /etc/featurebase.conf
Sep 07 20:13:13 ip-10-0-0-5.us-east-2.compute.internal systemd[1]: Started "Service for FeatureBase".
The key here is `Active: active (running)...`. If you see something else, the following command is a good place to start troubleshooting.
journalctl -u featurebase -r
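To keep the output manageable, standard `journalctl` flags can narrow it down, for example:
# Show the 50 most recent featurebase log entries, newest first
journalctl -u featurebase -r -n 50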
Note: If you get a `(code=exited, status=203/EXEC)` error, you likely grabbed the wrong executable (e.g., you’re working on an arm64 processor but grabbed the binary compiled for an x86_64/amd64 processor).
If you do see `Active: active (running)...`, then you can try to stop, start, and restart FeatureBase.
# stop FeatureBase services
sudo systemctl stop featurebase
# verify stop
sudo systemctl status featurebase
# start FeatureBase services
sudo systemctl start featurebase
# verify start
sudo systemctl status featurebase
# restart FeatureBase
sudo systemctl restart featurebase
# verify restart
sudo systemctl status featurebase
Restore 1B Records of Demo Data from S3
In order to use FeatureBase, you’ll need data! In a real-life situation, the FeatureBase team will provide guided onboarding and data modeling for your organization’s data. In this exercise, we’ll be working with curated demo data to showcase the low-latency capabilities of FeatureBase. Instructions to load the demo data are laid out below.
First, download the demo dataset from S3 by clicking this link.
Please note that this file is large (~14GB) and contains over 1B records. Make sure that you have sufficient local storage. On average, it takes about 30 minutes to download, but actual download time will depend on your connection speed.
To use your own data, check out our documentation for creating new sources here.
Next, in a new terminal window, make a directory for the data:
mkdir ~/Cseg0_backup
And extract it into the directory:
tar -zxvf ~/Downloads/cseg0_backup.tar.gz -C ~/Cseg0_backup
Now, restore the data into FeatureBase:
featurebase restore --host localhost:10101 -s ~/Cseg0_backup/
Once it’s completed successfully, you’ll see this message:
... INFO: http: Server closed
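Before moving on, you can optionally confirm the restored tables are visible. A minimal check, assuming FeatureBase’s `/schema` HTTP endpoint, which lists the indexes the server knows about:
# List the restored tables/indexes over the HTTP API
curl http://localhost:10101/schema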
Now, you can run queries in the FeatureBase web application.
Note: If the link above doesn’t work for you:
- Try using Google Chrome
- Paste `localhost:10101` into your browser
You can monitor health of the FeatureBase cluster and other activity on the homepage.
Data Exploration of Customer Segmentation Feature Table
It’s always a good idea to understand what the dataset you’re working with contains before you get started. To do this, click on the ‘Tables’ section in the application to display the names of the tables in your database.
Click on a table to show its contents. FeatureBase can ingest and represent a wide range of field types. One that may not be familiar is the SET field. SET fields are multi-valued and allow FeatureBase to collapse traditional data models, like the star schema, by efficiently storing multiple values for a single field.
To understand the shape of the data contained in the customer dataset that was preloaded into this environment, first navigate to the Query page by clicking “Query” on the left navigation bar.
Let’s start by running a simple SQL statement to extract 10 records to explore.
SELECT * FROM cseg LIMIT 10;
Viewing this tabular output we can see each record contains several fields (attributes) and data types. Scroll left and right in the application to explore the full list of fields. For example, names and cities are captured in STRING fields. Income is captured in an INT field that will allow for range queries. You can also see that “education” is a SET field with multiple values in a single field.
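For example, a range predicate on the INT income field can be combined with a row limit for quick exploration. A hedged sketch; the income threshold and the selected fields here are illustrative:
SELECT _id, age, income, education
FROM cseg
WHERE income > 100000
LIMIT 5;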
Next, let’s check the full scale of this dataset by using another familiar SQL statement, the COUNT.
SELECT COUNT(*) FROM cseg;
This query will return the COUNT of records in the entire table and demonstrates we are working with a dataset of 1 billion records. Each record in the cseg table has 16 attributes.
Performing Large Aggregations
Aggregation workflows often require the ability to SUM large amounts of individual INT elements. This could be transaction amounts such as dollars (decimals), whole integers (counts, bandwidth), or any variation requiring a SUM across many records. In this example, we will SUM the income field across all 1 billion records.
SELECT SUM(income) FROM cseg;
It is unlikely that you’ll need to SUM in this manner across all records. It is much more common to introduce complex conditions to SUM a segment of records that meets specified criteria.
Here we introduce comparative and logical operators including GREATER THAN, AND, and OR.
SELECT SUM(income)
FROM cseg
WHERE income > 5000 AND age = 45 AND (skills='Ms Office' OR skills='Excel');
As you can see, the latency is in the sub-second time frame even when using complex searching criteria through 1 billion records.
NOTE: When aggregating over a SET field, values for a record will be included in multiple groups if not excluded in the query. For example, when SUM(income) is used with a GROUP BY of “education”, income for a record with both ‘Bachelor’s degree’ and ‘High school diploma or GED’ will be included in both groups.
NOTE: It is common for a single person to have multiple values for a field that may seem contradictory or redundant, like “education”. This may be due to differences in status over time as data are collected and aggregated. A person may be categorized as having “education” status of “Some college” and later be categorized as having a “Bachelor’s degree”. When those two data sources are matched up, the person may have multiple values associated with them.
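To see such overlapping values directly, you can require two education values at once; with SET semantics, both equality predicates can hold for the same record. A sketch using the values from the note above (actual matches depend on the demo data):
SELECT _id, education
FROM cseg
WHERE education = 'Some college' AND education = 'Bachelor''s degree'
LIMIT 5;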
Additionally, aggregations may use the AVG (average) function.
SELECT AVG(income) FROM cseg;
Note that we don’t currently support full SQL, but we are working toward expanding SQL functionality. For example, the AVG function is not currently supported in GROUP BY queries; that support will be added soon.
If you have issues with your queries, please contact a FeatureBase Representative. FeatureBase has a native language, called PQL, and we can help you translate your SQL queries to get the desired results.
INNER JOINs at Scale
FeatureBase can merge at ingest and eliminate preprocessing in cases where performant INNER JOINs are required. Data from two separate tables or sources can be merged into a single normalized table by matching on a unique key in each dataset. Since FeatureBase can execute queries very quickly, workflows requiring INNER JOINs can be simplified with FeatureBase by merging disparate datasets at ingest. In the following example, we are combining many of the queries in this guide and adding the INNER JOIN.
The INNER JOIN facilitates a COUNT of records, or people, who are available for hire, as indicated by the string value 'available_for_hire' in the bools field of the skills table, and who have the string value 'Teaching' in the hobbies field. In other words, we would like to count the number of people who are teachers and are also available for hire. The latency on this type of INNER JOIN at the billion-record scale is still sub-second, allowing for several interesting data models.
SELECT count(*) AS count
FROM cseg AS t1
INNER JOIN skills AS t2 ON t1._id = t2.id
WHERE t2.bools = 'available_for_hire' and t1.hobbies = 'Teaching';
NOTE: Included in the demo dataset is a table known as skills; please use the discovery queries to take a look at it!
TopK - A FeatureBase Superpower
Ranking queries are notorious for being computationally intensive - aka slow. Some solutions will use statistics to speed up a ranking query by approximating the true results, but that’s not always a desirable option. In addition to SQL, FeatureBase has a native language called PQL. In PQL, TopK
queries can be run to return exact results in milliseconds.
This query returns the top five hobbies across all customers from the cseg table, sifting through a billion records.
[cseg]TopK(hobbies, k=5);
Getting more complex, the next query returns the top ten hobbies among females who also like scuba diving from the cseg table, again in milliseconds. Even when adding complex filtering, TopK queries return exact results at scale without impacting query latency.
[cseg]TopK(hobbies, k=10, filter=Intersect(Row(sex=Female),Row(hobbies='Scuba Diving')));
Grouping with Complex Conditions and Aggregating
Another query commonly seen in aggregation-related use cases is the GROUP BY
. For example, let’s group by the hobbies counting only those with ultimate COUNT
above 200,000,000. We’ll execute this query in PQL.
[cseg]GroupBy(
Rows(hobbies),
sort="count desc",
having=CONDITION(count > 200000000)
);
Another useful facet of `GROUP BY` is the ability to add an aggregate argument and utilize the low-latency aggregation in another capacity. We’ll execute this query in SQL.
SELECT education, SUM(income)
FROM cseg
WHERE age=18
GROUP BY education;
NOTE: At this point, we encourage you to mix and match segmentation criteria to experience low-latency queries even as complex conditions are added. You can use the Query Builder to help construct queries in PQL.
What’s Next?
We hope that this hands-on experience has further demonstrated the ability of FeatureBase to power real-time analytics workflows at scale. While this example focused on a customer segmentation use case, the same types of workflows are often used in anomaly detection or business process optimization use cases, and FeatureBase continues to perform as workloads grow to trillions of records. Additionally, FeatureBase excels at combining streaming and historical data in real time, allowing you to analyze data as soon as it is available in FeatureBase with no need for time-consuming preprocessing or preaggregation. From here, partner with your FeatureBase representative to better understand how FeatureBase will work for your organization’s specific needs.
Queries
Data Exploration
SELECT * FROM cseg LIMIT 10;
SELECT COUNT(*) FROM cseg;
Complex Segmentation
SELECT count(*) FROM cseg WHERE age = 45 AND income>50000 AND (skills='Ms Office' OR skills='Excel');
Aggregations
SELECT SUM(income) FROM cseg;
SELECT SUM(income) FROM cseg where income > 5000;
SELECT AVG(income) FROM cseg;
JOINs
SELECT count(*) AS count
FROM cseg AS t1
INNER JOIN skills AS t2 ON t1._id = t2.id
WHERE t2.bools = 'available_for_hire' AND t1.hobbies = 'Teaching';
Grouping with Complex Conditions
[cseg]GroupBy(
Rows(hobbies),
sort="count desc",
having=CONDITION(count > 200000000)
);
Top K
[cseg]TopK(hobbies, k=5);
[cseg]TopK(hobbies, k=10, filter=Intersect(Row(sex=Female),Row(hobbies='Scuba Diving')));