Druid Cookbook

In the latest versions of Druid, handling of logical expressions, and in particular of NULL (unknown) values, has been changed to match the SQL standard. This leads to some behavior that may be surprising to long time Druid users. Let’s take a look at some examples!

Server configuration settings that affect NULL handling

There are three configuration settings that affect NULL handling:

  • druid.generic.useDefaultValueForNull
  • druid.expressions.useStrictBooleans
  • druid.generic.useThreeValueLogicForNativeFilters

The Status tile in the Druid web console indicates whether Druid is configured in SQL compliant mode; if you hover over the corresponding text, it shows a detailed breakdown of the settings:

Null mode settings

What do these settings do? Let’s look at them in a bit more detail.

useDefaultValueForNull

Originally, Druid did not have a separate representation for empty or NULL values. A NULL value would be simply treated as an empty string, or as a numeric 0 (zero); and it would be equivalent to these default values for all intents and purposes.

The new default for this setting is false.

useStrictBooleans

This setting now defaults to true. If set, it forces the result of all logical expressions to be 0 or 1. Older versions of Druid would keep the original values of input parameters, since anything that was not 0 or an empty string would be considered a true value. (This resembles the way logical expressions are handled in some scripting languages.)

useThreeValueLogicForNativeFilters

This setting defaults to true, too. Its effect is that NULL is kept as a distinct value that is separate from either true or false, in all logical evaluations. Any expression that contains a NULL value would yield a result of NULL, too. This has a number of interesting ramifications. Let’s look at some of them today!

This tutorial can be done using the Druid 29.0.1 quickstart.

Ingestion

First of all, let’s create a very simple data set.

Ingestion

The table I am going to use has but four rows of data. There is a column color, which can be either a string, or NULL.

Here is the Druid SQL to create the sample data set:

REPLACE INTO "inline_data" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"inline","data":"{\"id\": 1, \"color\": \"red\"}\n{\"id\": 1, \"color\": null}\n{\"id\": 1, \"color\": \"blue\"}\n{\"id\": 1, \"color\": \"green\"}\n"}',
      '{"type":"json"}'
    )
  ) EXTEND ("id" BIGINT, "color" VARCHAR)
)
SELECT
  TIMESTAMP '2000-01-01 00:00:00' AS "__time",
  "id",
  "color"
FROM "ext"
PARTITIONED BY ALL

If you ingest this data set and list the entire table, you should get something like this:

Select all data

Note the NULL value in the color column.

Let’s run some more queries now!

Comparing against single values

Since a comparison with NULL is never true, NULL is not even considered equal to itself. This is why the following query

Select null

yields no rows in return. Even more, if you change the condition to WHERE color <> NULL you get nothing, too!

You have to use the operators IS NULL and IS NOT NULL, instead.

In a similar vein, let’s get all the entries whose color is not red. Naïvely, we try:

Select out single value, does not work

We get only the blue and green entries - the NULL value is, again, not caught by the operator.

You could construct a combined filter clause handling the NULL case separately. But there is a special operator that allows one to treat NULL values like regular values:

Select out single value using IS DISTINCT FROM

The IS DISTINCT FROM operator does what we need: it treats NULL values as equivalent and distinct from any other value.

Comparing against multiple values

How about filtering multiple values with an IN clause? Let’s try to retrieve only those rows that have color red or NULL:

Select with IN clause

After the previous experiments, this should not come as a surprise: The query returns only the row for red. But what if we invert the condition?

Select with NOT IN clause

This one returns no rows at all! Supposedly, the comparison with anything that contains NULL would always give a NULL result, and in fact the entire filter is optimized out of the query plan.

There is a workaround though. Instead of using a list with IN, we can also try an array literal like so:

Select with array

This gives the same result as the IN filter. But if we invert the filter condition, we get something that is more along the lines of the expectation:

Select with array, inverted

Learnings

  • Druid’s handling of unknown values has been made SQL compliant.
  • This can lead to unexpected results since any comparison with a NULL value yields a NULL value itself: NULL is equal to nothing, but is also not equal to nothing - not even to itself!
  • In order to handle NULL values properly, special operators exist, such as IS NULL and IS DISTINCT FROM.
  • Beware of NULL values in IN () filter clauses! Using an array literal instead can help.

This image is taken from Page 500 of Praktisches Kochbuch für die gewöhnliche und feinere Küche” by Medical Heritage Library, Inc. is licensed under CC BY-NC-SA 2.0 .