ClickHouse Data Cookbook: Visitor Segmentation with Theta Sketches

Data Cookbook Kitchen

ClickHouse has a lot of built-in functionality that is just there for analysts and data engineers to discover. Today, I am once more going to revisit one of my old Druid blogs. It’s about the difficulty of counting unique things, which is particularly relevant for any kind of clickstream or media analytics, where you want to count unique visitors over arbitrary segments of content, time, demographics, or other dimensions. Let’s take the exact same data and questions as back then, and see how elegantly ClickHouse handles this!

Recap: Unique counts and how to approximate them

When counting unique visitors, ClickHouse offers a choice of different algorithms including an adaptive uniqCombined function that does an exact count, or a hash table, or a HyperLogLog (HLL) sketch, depending on the expected cardinality of the result. What these approaches have in common is: they allow you to merge the results to form a union of segments.

But what if we want to express more complex set operations like the intersection or set difference of segments? This is what theta sketches are for.

Let’s find out how to use theta sketches in ClickHouse!

Modeling the data

The data set is going to be the same as in the Druid blog. There, I inlined the data in the ingestion wizard. Can we do something similar in ClickHouse? Yes, we can!!

Much like the Unix shell, ClickHouse allows inlining data blocks (or other multiline string literals) using here-docs. The text block has to be enclosed by a custom string literal, placed between two $ symbols. In the very simplest case, this would be the empty string, so anything between $$ would form the here-doc.

That way, we get our data as a string literal. To parse this into a table, we use the format()table function. We use the CSVWithNames format and try a DESCRIBE query first in order to see if ClickHouse is able to infer the correct data types for us.

describe format(CSVWithNames,
$$date,uid,show,episode
2022-05-19,alice,Game of Thrones,S1E1
2022-05-19,alice,Game of Thrones,S1E2
2022-05-19,alice,Game of Thrones,S1E1
2022-05-19,bob,Bridgerton,S1E1
2022-05-20,alice,Game of Thrones,S1E1
2022-05-20,carol,Bridgerton,S1E2
2022-05-20,dan,Bridgerton,S1E1
2022-05-21,alice,Game of Thrones,S1E1
2022-05-21,carol,Bridgerton,S1E1
2022-05-21,erin,Game of Thrones,S1E1
2022-05-21,alice,Bridgerton,S1E1
2022-05-22,bob,Game of Thrones,S1E1
2022-05-22,bob,Bridgerton,S1E1
2022-05-22,carol,Bridgerton,S1E2
2022-05-22,bob,Bridgerton,S1E1
2022-05-22,erin,Game of Thrones,S1E1
2022-05-22,erin,Bridgerton,S1E2
2022-05-23,erin,Game of Thrones,S1E1
2022-05-23,alice,Game of Thrones,S1E1
$$) settings schema_inference_make_columns_nullable = 0;

   ┌─name────┬─type───┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
1. │ date    │ Date   │              │                    │         │                  │                │
2. │ uid     │ String │              │                    │         │                  │                │
3. │ show    │ String │              │                    │         │                  │                │
4. │ episode │ String │              │                    │         │                  │                │
   └─────────┴────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

This looks good! (The astute reader may notice that I sneaked in a schema_inference_make_columns_nullable setting - I know there are no null values in the data so it is good practice to avoid the overhead of allowing nullable columns.)

Now let’s define the aggregation. The way to do this in ClickHouse is to use an AggregatingMergeTree table. As a general rule, if you use AggregatingMergeTree then:

the dimensions are what you group by in the INSERT query, and they form the order key;
the measures are defined as AggregateFunctions.

Here, we will group by date, show, and episode; measures will be the count of original events, and the theta sketch representing the unique visitors. Here’s the DDL for the table:

CREATE TABLE theta_test
(
    `date` Date,
    `show` String,
    `episode` String,
    `count` AggregateFunction(count, String),
    `theta_uid` AggregateFunction(uniqTheta, String)
)
ENGINE = AggregatingMergeTree
ORDER BY (date, show, episode);

and the INSERT query, using the same here-doc as above:

INSERT INTO theta_test SELECT date, show, episode, countState(uid) AS count, uniqThetaState(uid) AS theta_uid FROM format(CSVWithNames,
$$date,uid,show,episode
2022-05-19,alice,Game of Thrones,S1E1
2022-05-19,alice,Game of Thrones,S1E2
2022-05-19,alice,Game of Thrones,S1E1
2022-05-19,bob,Bridgerton,S1E1
2022-05-20,alice,Game of Thrones,S1E1
2022-05-20,carol,Bridgerton,S1E2
2022-05-20,dan,Bridgerton,S1E1
2022-05-21,alice,Game of Thrones,S1E1
2022-05-21,carol,Bridgerton,S1E1
2022-05-21,erin,Game of Thrones,S1E1
2022-05-21,alice,Bridgerton,S1E1
2022-05-22,bob,Game of Thrones,S1E1
2022-05-22,bob,Bridgerton,S1E1
2022-05-22,carol,Bridgerton,S1E2
2022-05-22,bob,Bridgerton,S1E1
2022-05-22,erin,Game of Thrones,S1E1
2022-05-22,erin,Bridgerton,S1E2
2022-05-23,erin,Game of Thrones,S1E1
2022-05-23,alice,Game of Thrones,S1E1
$$) GROUP BY 1,2,3 settings schema_inference_make_columns_nullable = 0;

This stores the theta sketches in the aggregate table and as we expect, querying the columns directly yields garbled binary data:

SELECT *
FROM theta_test:

    ┌───────date─┬─show────────────┬─episode─┬─count─┬─theta_uid─┐
│ 2022-05-19 │ Bridgerton      │ S1E1    │       │ ̓_2???'       │
│ 2022-05-19 │ Game of Thrones │ S1E1    │       │ ̓W??9??k       │
│ 2022-05-19 │ Game of Thrones │ S1E2    │       │ ̓W??9??k       │
│ 2022-05-20 │ Bridgerton      │ S1E1    │       │ ̓?՝??Օ        │
│ 2022-05-20 │ Bridgerton      │ S1E2    │       │ ̓#NL??g      │
│ 2022-05-20 │ Game of Thrones │ S1E1    │       │ ̓W??9??k       │
│ 2022-05-21 │ Bridgerton      │ S1E1    │       │  ̓#NL??gW??9??k  │
│ 2022-05-21 │ Game of Thrones │ S1E1    │       │  ̓???96DGW??9??k  │
│ 2022-05-22 │ Bridgerton      │ S1E1    │       │ ̓_2???'       │
│ 2022-05-22 │ Bridgerton      │ S1E2    │       │  ̓???96DG#NL??g │
│ 2022-05-22 │ Game of Thrones │ S1E1    │       │  ̓_2???'???96DG  │
│ 2022-05-23 │ Game of Thrones │ S1E1    │       │  ̓???96DGW??9??k  │
    └────────────┴─────────────────┴─────────┴───────┴───────────┘

Simple counting

We have to use the -Merge combinator to extract the final numbers - here the number of unique visitors by day:

SELECT
    date,
    countMerge(count),
    uniqThetaMerge(theta_uid)
FROM theta_test
GROUP BY 1;

   ┌───────date─┬─countMerge(count)─┬─uniqThetaMerge(theta_uid)─┐
│ 2022-05-19 │                 4 │                         2 │
│ 2022-05-20 │                 3 │                         3 │
│ 2022-05-21 │                 4 │                         3 │
│ 2022-05-22 │                 6 │                         3 │
│ 2022-05-23 │                 2 │                         2 │
   └────────────┴───────────────────┴───────────────────────────┘

We can merge and collapse the aggregation states at the original granularity:

SELECT
    date,
    show,
    episode,
    uniqThetaMerge(theta_uid) AS users
FROM theta_test
GROUP BY
    1,
    2,
    3;

    ┌───────date─┬─show────────────┬─episode─┬─users─┐
│ 2022-05-20 │ Game of Thrones │ S1E1    │     1 │
│ 2022-05-21 │ Game of Thrones │ S1E1    │     2 │
│ 2022-05-19 │ Game of Thrones │ S1E1    │     1 │
│ 2022-05-20 │ Bridgerton      │ S1E1    │     1 │
│ 2022-05-22 │ Bridgerton      │ S1E2    │     2 │
│ 2022-05-19 │ Bridgerton      │ S1E1    │     1 │
│ 2022-05-21 │ Bridgerton      │ S1E1    │     2 │
│ 2022-05-23 │ Game of Thrones │ S1E1    │     2 │
│ 2022-05-19 │ Game of Thrones │ S1E2    │     1 │
│ 2022-05-22 │ Game of Thrones │ S1E1    │     2 │
│ 2022-05-20 │ Bridgerton      │ S1E2    │     1 │
│ 2022-05-22 │ Bridgerton      │ S1E1    │     1 │
    └────────────┴─────────────────┴─────────┴───────┘

or just merge and keep the binary state representation using -MergeState - remember this because it will come back later:

SELECT
    date,
    show,
    episode,
    uniqThetaMergeState(theta_uid) AS users
FROM theta_test
GROUP BY
    1,
    2,
    3;

    ┌───────date─┬─show────────────┬─episode─┬─users─────┐
│ 2022-05-20 │ Game of Thrones │ S1E1    │ ̓W??9??k       │
│ 2022-05-21 │ Game of Thrones │ S1E1    │  ̓???96DGW??9??k  │
│ 2022-05-19 │ Game of Thrones │ S1E1    │ ̓W??9??k       │
│ 2022-05-20 │ Bridgerton      │ S1E1    │ ̓?՝??Օ        │
│ 2022-05-22 │ Bridgerton      │ S1E2    │  ̓???96DG#NL??g │
│ 2022-05-19 │ Bridgerton      │ S1E1    │ ̓_2???'       │
│ 2022-05-21 │ Bridgerton      │ S1E1    │  ̓#NL??gW??9??k  │
│ 2022-05-23 │ Game of Thrones │ S1E1    │  ̓???96DGW??9??k  │
│ 2022-05-19 │ Game of Thrones │ S1E2    │ ̓W??9??k       │
│ 2022-05-22 │ Game of Thrones │ S1E1    │  ̓_2???'???96DG  │
│ 2022-05-20 │ Bridgerton      │ S1E2    │ ̓#NL??g      │
│ 2022-05-22 │ Bridgerton      │ S1E1    │ ̓_2???'       │
    └────────────┴─────────────────┴─────────┴───────────┘

Filtering the Sketch column

Each aggregate function can take a filter modifier by appending the suffix -If to the function name. This works like the FILTER(WHERE ...) clause in other SQL dialects but is arguably simpler and more concise. Give me the unique visitors who watched Bridgerton:

SELECT uniqThetaMergeIf(theta_uid, show = 'Bridgerton')
FROM theta_test;

   ┌─uniqThetaMer⋯idgerton'))─┐
1. │                        5 │
   └──────────────────────────┘

We can string many modifiers together and keep the sketch rather than collapsing it:

SELECT uniqThetaMergeStateIf(theta_uid, show = 'Bridgerton')
FROM theta_test;

   ┌─uniqThetaMer⋯idgerton'))─┐
1. │ 8̓?՝??Օ_2???'???96DG#NL??gW??9??k        │
   └──────────────────────────┘

And there we have it. This binary blob defines the segment of unique visitors that watched Bridgerton.

Set operations

With that, let’s get to the meat of the post: set operations on visitor segments. How many users watched both episodes of Bridgerton?

We introduce two new functions:

uniqThetaIntersect creates the intersection of two theta sketch objects, which is another theta sketch
finalizeAggregation collapses an aggregation state (in our case the resulting theta sketch) into a number.

The query will be

SELECT
    finalizeAggregation(
        uniqThetaIntersect(
            uniqThetaMergeStateIf(theta_uid, (show = 'Bridgerton') AND (episode = 'S1E1')),
            uniqThetaMergeStateIf(theta_uid, (show = 'Bridgerton') AND (episode = 'S1E2'))
        )
    ) AS users
FROM theta_test;

   ┌─users─┐
1. │     1 │
   └───────┘

This works similarly for set unions (How many users watched any of the episodes of Bridgerton?)

SELECT
    finalizeAggregation(
        uniqThetaUnion(
            uniqThetaMergeStateIf(theta_uid, (show = 'Bridgerton') AND (episode = 'S1E1')),
            uniqThetaMergeStateIf(theta_uid, (show = 'Bridgerton') AND (episode = 'S1E2'))
        )
    ) AS users
FROM theta_test;

   ┌─users─┐
1. │     5 │
   └───────┘

and for the set difference (How many users watched episode 1 of Bridgerton but not episode 2?):

SELECT
    finalizeAggregation(
        uniqThetaNot(
            uniqThetaMergeStateIf(theta_uid, (show = 'Bridgerton') AND (episode = 'S1E1')),
            uniqThetaMergeStateIf(theta_uid, (show = 'Bridgerton') AND (episode = 'S1E2'))
        )
    ) AS users
FROM theta_test;

   ┌─users─┐
1. │     3 │
   └───────┘

Conclusion

This post touches a number of interesting techniques in ClickHouse SQL:

Theta sketches allow for counting unique things such as visitors, and applying arbitrary set operations on the segments thus obtained. This is great for affinity analysis (which segments correlate or overlap by how much?)
Here documents and the format table function enable the inlining of small data sets into SQL queries.
-Merge merges and collapses an aggregation state into a number, while -MergeState keeps the intermediate state.
finalizeAggregation collapses an aggregation state (such as obtained by -MergeState) into a single number.
-If adds a filter to any aggregation, like a FILTER WHERE clause.

“This image is taken from Page 377 of Praktisches Kochbuch für die gewöhnliche und feinere Küche” by Medical Heritage Library, Inc. is licensed under CC BY-NC-SA 2.0 .