Documentation

GroupBy()

GroupBy() can be used similar to SELECT-FROM-WHERE-GROUP BY… in SQL. See below for the current functionality.

GroupBy() can be a very memory intensive opperation in FeatureBase. Ensure you have adequate memory to run large GroupBy() queries or use the max-query-memory FeatureBase configuration option to prevent out of memory issues.

Call Definition

GroupBy(
  ROWS_CALL, 
  ... , 
  filter=ROW_CALL, 
  having=CONDITION, 
  aggregate=AGG_CALL, 
  sort=SORT_DIR, 
  limit=UINT, 
  offset=UINT
)

Mandatory Arguments

  • ROWS_CALL : This argument tells GroupBy() which fields / field values to group by. Note that Rows() usually only accepts Set, Mutex, and Time fields as arguments. When passed as an argument to GroupBy(), Rows() also accept Int and Timestamp fields.
    • Paging through results is supported by passing the previous argument to each of the Rows() calls in the GroupBy(). Take the last result from your previous GroupBy() query, and pass each row ID in that result as the previous argument to each of the respective Rows() queries in your next GroupBy() query.

Optional Arguments

  • ... : Any number of additional rows calls used for grouping records by – see ROWS_CALL above.
  • filter : ROW_CALL is a row call used to filter records before grouping – analagous to the WHERE clause in SQL.
  • having : a condition that can be used to filter groups returned by GroupBy() based on the value of their aggragate statistics
    • CONDITION : the value of the having argument. This value should be structured in one of two ways:
      • count condition : this means we keeps groups with counts that meet a certain condition
        • structure : Condition(VALUE OPERATOR count OPERATOR VALUE) or Condition(count OPERATOR VALUE)
        • example : Condition(10 < count < 20)
        • the example above only outputs groups with a count between 10 and 20
      • sum condition : this means we keeps groups with sums that meet a certain condition
        • structure : Condition(VALUE OPERATOR sum OPERATOR VALUE) or Condition(sum OPERATOR VALUE)
        • example : Condition(100 < sum < 200)
        • the example above only outputs groups where the value summed is between 100 and 200
  • aggregate : an aggregate statistic applied over groups – similar functionality to SUM() in the SELECT clause of a SQL GROUP BY query.
    • AGG_CALL : one of the following:
      • Sum(): will return the sum and count of each group – note the count here does not include the count of records with null values in the field being summed.
      • a Count(Distinct()) query
  • sort : this argument can be used to sort the output.
    • SORT_DIR : the value of the sort argument. Below are the available sort directives for the sort argument.
      • "count asc" or "count desc" : this will sort the output ascending or descending by the count or records in each group
      • "aggregate asc" or "aggregate desc" : this will sort the output ascending or descending by the aggregate statistic provided in the aggregate argument. Again asc for ascending and desc for descending.
      • "count asc, aggregate desc" : finally, we can sort by count and then aggregate doing something like this. Note we can use any combination of the two bullets above. This example will first sort by count in ascending order. Then, groups with the same count will be secondarily sorted by the aggregate statistic in descending order.
  • limit : this argument can be used to limit the results to a specific number of groups (i.e. we can return the first two groups by supplying limit=2)
  • offset : this argument can be used to offset the results to a specific number (i.e. we can skip the first two groups by supplying offset=2)

Returns

  • List of “groups”. Each group is an object with a group key and a count key. The count is an integer, and the group is an array of objects which specify the field and row for each row that was intersected to get that result.

Examples

Data:

Index: customer (non keyed index)

 _id | age (Int) | has_purchased (Set) | last_purchase (Timestamp)
-----+-----------+---------------------+---------------------------
 0   |    23     | ["brand1","brand2"] | 2021-01-05T08:30:00Z
 1   |    31     | ["brand1","brand3"] | 2020-09-12T12:30:00Z
 2   |    23     | ["brand1","brand3"] | 2021-08-06T16:15:00Z
 3   |    19     | []                  | null
 4   |    25     | ["brand1","brand4"] | 2021-10-01T20:45:00Z
 5   |    40     | ["brand4"]          | 2022-01-13T11:00:00Z

Example 1

Count the number of customers that have purchased from each brand - sort by that count.

Query

[customer]GroupBy(
  Rows(has_purchased),
  sort="count desc"
)

Tabular Response

 has_purchased | count
---------------+-------
 brand1        | 4
 brand3        | 2
 brand4        | 2
 brand2        | 1

HTTP Response

{
  "results": [
    [
      {
        "group": [
          {
            "field": "has_purchased",
            "rowKey": "brand1"
          }
        ],
        "count": 4
      },
      {
        "group": [
          {
            "field": "has_purchased",
            "rowKey": "brand3"
          }
        ],
        "count": 2
      },
      {
        "group": [
          {
            "field": "has_purchased",
            "rowKey": "brand4"
          }
        ],
        "count": 2
      },
      {
        "group": [
          {
            "field": "has_purchased",
            "rowKey": "brand2"
          }
        ],
        "count": 1
      }
    ]
  ]
}

Explanation

4 customers purchased from brand 1, 2 customer purchased from brand 3 and 4, and 1 customer purchased from brand 2.


Example 2

What are the counts and sum of ages for each brand? Sorted by sum.

Query

[customer]GroupBy(
  Rows(has_purchased),
  aggregated=Sum(field=age)
)

Tabular Response

 has_purchased | count | sum
---------------+-------+-----
 brand1        | 4     | 107
 brand2        | 1     | 23
 brand3        | 2     | 59
 brand4        | 2     | 65

HTTP Response

{
  "results": [
    [
      {
        "group": [
          {
            "field": "has_purchased",
            "rowKey": "brand1"
          }
        ],
        "count": 4,
        "sum": 107
      },
      {
        "group": [
          {
            "field": "has_purchased",
            "rowKey": "brand2"
          }
        ],
        "count": 1,
        "sum": 23
      },
      {
        "group": [
          {
            "field": "has_purchased",
            "rowKey": "brand3"
          }
        ],
        "count": 2,
        "sum": 59
      },
      {
        "group": [
          {
            "field": "has_purchased",
            "rowKey": "brand4"
          }
        ],
        "count": 2,
        "sum": 65
      }
    ]
  ]
}

Explanation

Our groups are the the values in the has_purchased field due to the Rows(has_purchased) argument. Then, for each customer, add their age to the sum. The output is above.


Example 3

Group by has_purchased and age and return groups with counts of more than one.

Query

[customer]GroupBy(
  Rows(has_purchased), 
  Rows(age), 
  having=Condition(count > 1)
)

Tabular Response

 has_purchased | age | count
---------------+-----+-------
 brand1        | 23  | 2

HTTP Response

{
  "results": [
    [
      {
        "group": [
          {
            "field": "has_purchased",
            "rowKey": "brand1"
          },
          {
            "field": "age",
            "value": 23
          }
        ],
        "count": 2
      }
    ]
  ]
}

Explanation

There is only one group with a count over 1 - brand1 and age 23.


Example 4

For customers younger than 35, what are the counts of customer for each brand those customers have purchased from?

Query

[customer]GroupBy(
  Rows(has_purchased), 
  filter=Row(age < 35)
)

Tabular Response

 has_purchased | count
---------------+-------
 brand1        | 4
 brand2        | 1
 brand3        | 2
 brand4        | 1

HTTP Response

{
  "results": [
    [
      {
        "group": [
          {
            "field": "has_purchased",
            "rowKey": "brand1"
          }
        ],
        "count": 4
      },
      {
        "group": [
          {
            "field": "has_purchased",
            "rowKey": "brand2"
          }
        ],
        "count": 1
      },
      {
        "group": [
          {
            "field": "has_purchased",
            "rowKey": "brand3"
          }
        ],
        "count": 2
      },
      {
        "group": [
          {
            "field": "has_purchased",
            "rowKey": "brand4"
          }
        ],
        "count": 1
      }
    ]
  ]
}

Explanation

The filter Row(age < 35) is applied before counts are computed for each has_purchased group. Thus, customer 5 isn’t included and the count is 1 for brand4.

Something missing or incorrect?
Help improve this article or join us on Discord!

© 2022 Molecula Corp. (DBA FeatureBase). All rights reserved.