Data Cookbook Kitchen

After a brief hiatus, the Data Cookbook Elf is back! I have a new job and am now working at ClickHouse, so the data engineering kitchen will be changing its menu a bit and will be dealing with facts and features around ClickHouse. I am truly happy to work with this database now, and the richness of ClickHouse’s SQL dialect is simply amazing!

Today, the Data Cookbook Elf will be looking at some of the array processing features that come built in with ClickHouse.

Custom aggregation over an array

In ClickHouse, you can define user defined functions (UDF) in SQL. UDFs use a lambda notation so a simple square function would look like this:

CREATE OR REPLACE FUNCTION mySquare AS (x) -> x * x;

The return type would be inferred from the argument types.

But arguments to lambda functions are not limited to primitive data types. You can use any type that ClickHouse knows about. How about some array functions?

The universal Lego block that will help us here is the function arrayFold(). arrayFold(), in the simplest case, takes three arguments:

  • a lambda expression
  • the input array that is processed
  • a variable or data structure that accumulates state.

The first argument of the lambda expression is the current value of the accumulator, the second argument is populated by the elements of the input array, one at a time. The result of the lambda expression becomes the new value of the accumulator after each step. The return value of arrayFold() is the final content of the accumulator.

To make this easier to understand, let’s look at a simple example. We will define a custom function that sums up all elements in a array of numbers. The accumulator starts at an initial value of 0.0, and each subsequent processing step adds the value of the current array element to the accumulator.

Here is the code to achieve this:

CREATE OR REPLACE FUNCTION myArraySum AS (arr) ->
    arrayFold(
        (acc, x) -> acc + x,
        arr,
        0::Float64
    );

Sure there are simpler ways to achieve this result in ClickHouse, but I wanted to illustrate the general principle.

More than one array

Just for completeness, you can feed more than one array to arrayFold(). In the case of multiple input arrays, corresponding elements of each array are fed into the lambda expression (which has to change its signature accordingly.) All input arrays have to have the same number of elements, otherwise arrayFold() fails with an error!

Here is an example. This function computes the dot product of two numeric vectors:

CREATE OR REPLACE FUNCTION myDotProduct AS (a1, a2) ->
    arrayFold(
        (acc, x, y) -> acc + x * y,
        a1,
        a2,
        0::Float64
    );

(Again, ClickHouse has this as a builtin function already!)

Returning an array instead of a single number?

But can we transform arrays using this technique? Yes, we can. Nothing stops us from using a more complex data type as the accumulator. And with that, the return value of our UDF can be an array, too.

In the next example, let us return an array of the differences of neighboring elements in the input array. So an input of [1, 4, 9] would yield [3, 5].

In this example, the accumulator is actually a tuple. The first element of the tuple is the array we want to return, the second element is the previously seen number.

CREATE OR REPLACE FUNCTION arrayDiff AS (x) ->
    arrayFold(
        (acc, x) -> (CAST(arrayPushBack(acc.1, x - acc.2), 'Array(Int32)'), toInt32(x)),
        arrayPopFront(x),
        (emptyArrayInt32(), x[1]::Int32)
    ).1;

Note how we use the dot notation to return only the first element of the tuple, since the second element is used only for housekeeping.

One interesting idiom here is the use of arrayPopFront() to chop off the first element of the array before feeding it to the lambda function. That first element is instead used to prepopulate the accumulator, so as to avoid off-by-one errors. I’ll use this technique in the following examples, too.

Weeding out incorrect state transitions in an online app

Let’s look at another example: we want to know if, in an array, each number is greater or equal than the previous one.

Why is this interesting? Imagine state transitions in a web app, such as a MOOC or an ecommerce shop. Say you can go from product to cart, cart to checkout, but not from checkout to product. Any case of a non-allowed state transition is bad data. Instead of the greater than or equal operator, you would have a custom function that returns true only for allowed transitions.

In this case, the accumulator is a tuple of the last seen number and the cumulative state (which is just a boolean). We return the final state. If only one bad transition is in the session, the entire session is discarded.

CREATE OR REPLACE FUNCTION isMonotonous AS (x) ->
    arrayFold(
        (acc, x) -> ( x::Int32, acc.2 AND x >= acc.1 ),
        arrayPopFront(x),
        (x[1]::Int32, 1::UInt8)
    ).2;

Now let’s modify the example: We don’t want to discard all the data of one session, instead we only drop the incorrect rows:

CREATE OR REPLACE FUNCTION eliminateBacksteps AS (x) ->
    arrayFold(
        (acc, x) -> ( if(x >= acc.1[-1], arrayPushBack(acc.1, x), acc.1), x >= acc.1[-1] ),
        arrayPopFront(x),
        ([ x[1]::Int32 ], 1::UInt8)
    ).1;

Here, the first element of the accumulator is the cleansed array. The second element is 1 if the last element in the array is part of the cleansed data. This can be interesting information to keep if you are inserting data into a set of good/bad record tables. In that case you would let the eliminateBacksteps function return the entire accumulator, not just the first element.

Conclusion

  • The combination of arrayFold() and UDFs unlocks powerful custom aggregations and transformations over arrays.
  • arrayFold() uses an accumulator to track state but you can make this accumulator arbitrarily complex to emulate additional local variables.
  • This is valuable for funnel quality analytics.

This image is taken from Page 377 of Praktisches Kochbuch für die gewöhnliche und feinere Küche” by Medical Heritage Library, Inc. is licensed under CC BY-NC-SA 2.0 .