Druid Cookbook

Let’s look at a neat new feature in Druid 30 that makes using the API more flexible.

For this tutorial, download a fresh copy of Druid 30. Run a local quickstart instance and ingest the Wikipedia sample data as per this tutorial.

Recap: Parameters in the query API

Druid is typically queried through a REST API; the payload is documented here. The API supports parameterizing queries, which is frequently used by programming language specific clients that create a wrapper layer around the API calls.

For instance, a simple query with a filter could be written like this:

{
    "query": "SELECT COUNT(*) FROM wikipedia WHERE channel = ?",
    "parameters": [
        {
            "type": "VARCHAR",
            "value": "#en.wikipedia"
        }
    ]
}

You can submit this query to the SQL endpoint using curl like so:

curl --location 'http://localhost:8888/druid/v2/sql' \
--header 'Content-Type: application/json' \
--data '{
    "query": "SELECT COUNT(*) FROM wikipedia WHERE channel = ?",
    "parameters": [
        {
            "type": "VARCHAR",
            "value": "#en.wikipedia"
        }
    ]
}'

or you can use Postman:

Postman query

Let’s make the query a bit more complex. We want to count the rows for more than one channel with a simple GROUP BY and an IN clause:

{
    "query": "SELECT COUNT(*) FROM wikipedia WHERE channel IN ?",
    "parameters": [
        {
            "type": "VARCHAR",
            "value": "(#en.wikipedia, #de.wikipedia, #fr.wikipedia)"
        }
    ]
}

Alas, this fails.

Failing Postman query

And until Druid 29, you would have to work around this problem because ARRAYs as parameters weren’t really supported.

Two new features in Druid

Druid 30 brings two new features that help us here:

  • Druid 30 supports passing ARRAYs as parameters. (https://github.com/apache/druid/pull/16274) The way to do this is to specify a type of "ARRAY" and to use a JSON array as the value.
  • There is a new SCALAR_IN_ARRAY function that checks for presence of a particular value in an array. In fact, a conventional IN filter would internally use this functionality if the number of elements in the list is large enough. (https://github.com/apache/druid/pull/16306)

With that, we have everything to make the query work.

Now, let’s make it work

We’ll pass the list of channels as an ARRAY type parameter and use SCALAR_IN_ARRAY in place of IN:

{
    "query": "SELECT channel, COUNT(*) FROM wikipedia WHERE SCALAR_IN_ARRAY(channel, ?) GROUP BY 1",
    "parameters": [
        {
            "type": "ARRAY",
            "value": ["#en.wikipedia", "#de.wikipedia", "#fr.wikipedia"]
        }
    ]
}

The query works and returns the expected result:

Successful Postman query

Conclusion

We’ve learned that a simple IN filter cannot be parameterized through the Druid REST API. However, Druid 30 introduces an alternative because it supports array parameters. With array parameters and the new SCALAR_IN_ARRAY function, you can efficiently parameterize filter lists through the SQL REST API.


This image is taken from Page 500 of Praktisches Kochbuch für die gewöhnliche und feinere Küche” by Medical Heritage Library, Inc. is licensed under CC BY-NC-SA 2.0 .