How do I evaluate FeatureBase sample data?
The sample database is provided for users and organizations evaluating FeatureBase at scale. It contains one billion rows with attributes commonly used for segmentation queries.
Learn how fast the FeatureBase database is by running queries against the cseg
(customer segmentation) and skills
tables
Before you begin
Table statements
Run the SHOW CREATE TABLE statement to output CREATE TABLE statements for both tables.
SHOW CREATE TABLE cseg;
SHOW CREATE TABLE skills;
What are IDSET
and STRINGSET
data types?
The cseg
table has multiple columns assigned to IDSET
and STRINGSET
data types. These data types enable FeatureBase to store low-cardinality data (1:many relationships) in a single column without needing to rely on traditional data models such as the star schema.
SQL queries
The following queries demonstrate sub-second latency in the FeatureBase database
Queries
Data Exploration
Prove the sample database contains 1000025000 rows, the majority contained in cseg
.
SELECT COUNT(*) FROM cseg;
SELECT COUNT(*) FROM skills;
Output the top 10 rows in each table:
SELECT TOP(10) * FROM cseg;
SELECT TOP(10) * FROM skills;
Complex Segmentation
SELECT SUM(income) FROM cseg
WHERE income > 5000 AND age = 45 AND (SETCONTAINSANY(skills,['Ms Office','Excel']));
Aggregations
SELECT SUM(income) FROM cseg;
SELECT SUM(income) FROM cseg where income > 5000;
SELECT AVG(income) FROM cseg;
Grouping with Complex Conditions
SELECT hobbies, COUNT(*) as cnt
FROM cseg
WITH (flatten(hobbies))
GROUP BY hobbies
HAVING COUNT(*) > 200000000
ORDER BY cnt DESC;
SELECT education, SUM(income)
FROM cseg
WITH (flatten(education))
WHERE age=18
GROUP BY education;
JOINS
[cseg]Count(Intersect(
Row(hobbies="Teaching"),
Distinct(Row(bools='available_for_hire'), field= id, index=skills)))
Top K
[cseg]TopK(hobbies, k=5)
[cseg]TopK(hobbies, k=10, filter=Intersect(Row(sex=Female),Row(hobbies='Scuba Diving')))
Next step
- Delete the sample database to reduce costs as soon as testing concludes