Map

A lesser-known feature of Apache Druid is its ability to handle spatial data directly. There are a number of built-in functions that allow query filtering on spatial structures such as rectangles or polygons. There is also a data type for spatial coordinates that you can specify in the ingestion spec.

Let’s look at a simple example. I am going to use a list of German cities that I generated with the Python Faker module:

latitude,longitude,place_name,country_code,timezone
50.09019,8.4493,Hofheim am Taunus,DE,Europe/Berlin
52.47774,10.5511,Gifhorn,DE,Europe/Berlin
52.53048,13.29371,Charlottenburg-Nord,DE,Europe/Berlin
48.21644,9.02596,Albstadt,DE,Europe/Berlin
52.53048,13.29371,Charlottenburg-Nord,DE,Europe/Berlin
49.68369,8.61839,Bensheim,DE,Europe/Berlin
50.64336,7.2278,Bad Honnef,DE,Europe/Berlin
48.46458,9.22796,Pfullingen,DE,Europe/Berlin
53.59337,9.47629,Stade,DE,Europe/Berlin
50.80904,8.77069,Marburg an der Lahn,DE,Europe/Berlin
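
By the way, a list like this can be produced with just a few lines of Python. The snippet below is a minimal sketch using Faker's local_latlng() provider; the exact call, row count, and file name are my assumptions rather than the original script:

# Minimal sketch: generate German place coordinates with Faker.
# local_latlng() returns (latitude, longitude, place name, country code, timezone).
import csv
from faker import Faker

fake = Faker()

with open("geo_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["latitude", "longitude", "place_name", "country_code", "timezone"])
    for _ in range(10):
        writer.writerow(fake.local_latlng(country_code="DE"))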

I created an ingestion spec for these data:

{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "inline",
        "data": "latitude,longitude,place_name,country_code,timezone\n50.09019,8.4493,Hofheim am Taunus,DE,Europe/Berlin\n52.47774,10.5511,Gifhorn,DE,Europe/Berlin\n52.53048,13.29371,Charlottenburg-Nord,DE,Europe/Berlin\n48.21644,9.02596,Albstadt,DE,Europe/Berlin\n52.53048,13.29371,Charlottenburg-Nord,DE,Europe/Berlin\n49.68369,8.61839,Bensheim,DE,Europe/Berlin\n50.64336,7.2278,Bad Honnef,DE,Europe/Berlin\n48.46458,9.22796,Pfullingen,DE,Europe/Berlin\n53.59337,9.47629,Stade,DE,Europe/Berlin\n50.80904,8.77069,Marburg an der Lahn,DE,Europe/Berlin"
      },
      "inputFormat": {
        "type": "csv",
        "findColumnsFromHeader": true
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "dynamic"
      }
    },
    "dataSchema": {
      "dataSource": "geo_data",
      "timestampSpec": {
        "column": "!!!_no_such_column_!!!",
        "missingValue": "2010-01-01T00:00:00Z"
      },
      "dimensionsSpec": {
        "spatialDimensions": [
          {
            "dimName": "coordinates",
            "dims": [
              "latitude",
              "longitude"
            ]
          }
        ],
        "dimensions": [
          {
            "type": "double",
            "name": "latitude"
          },
          {
            "type": "double",
            "name": "longitude"
          },
          "place_name",
          "country_code",
          "timezone"
        ]
      },
      "granularitySpec": {
        "queryGranularity": "none",
        "rollup": false,
        "segmentGranularity": "day"
      }
    }
  }
}

Note how we have a spatialDimensions spec in addition to the regular dimensions.
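
The point of a spatial dimension is that it can be used in spatial query filters. Just to illustrate, here is a sketch of a native query with a rectangular spatial filter on the new coordinates dimension, submitted from Python; the endpoint and port assume a local quickstart setup and are not taken from the original post:

# Sketch: select rows whose coordinates fall inside a lat/lon bounding box.
import json
import requests

query = {
    "queryType": "scan",
    "dataSource": "geo_data",
    "intervals": ["2000-01-01/2030-01-01"],
    "filter": {
        "type": "spatial",
        "dimension": "coordinates",
        "bound": {
            "type": "rectangular",
            "minCoords": [48.0, 8.0],   # south-west corner (latitude, longitude)
            "maxCoords": [51.0, 10.0]   # north-east corner (latitude, longitude)
        }
    },
    "columns": ["place_name", "coordinates"]
}

response = requests.post("http://localhost:8888/druid/v2", json=query)
print(json.dumps(response.json(), indent=2))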

Let’s query the data!

[Screenshot: GeoQuery]

Aha! The spatial data seems to be represented internally as a string dimension, but unfortunately our original coordinate fields are gone. As a general rule, in Druid, you can use each data field only once as a dimension. If you want to use the same field twice, you need to declare a logical duplicate using a transform spec.

Let’s try this:

{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "inline",
        "data": "latitude,longitude,place_name,country_code,timezone\n50.09019,8.4493,Hofheim am Taunus,DE,Europe/Berlin\n52.47774,10.5511,Gifhorn,DE,Europe/Berlin\n52.53048,13.29371,Charlottenburg-Nord,DE,Europe/Berlin\n48.21644,9.02596,Albstadt,DE,Europe/Berlin\n52.53048,13.29371,Charlottenburg-Nord,DE,Europe/Berlin\n49.68369,8.61839,Bensheim,DE,Europe/Berlin\n50.64336,7.2278,Bad Honnef,DE,Europe/Berlin\n48.46458,9.22796,Pfullingen,DE,Europe/Berlin\n53.59337,9.47629,Stade,DE,Europe/Berlin\n50.80904,8.77069,Marburg an der Lahn,DE,Europe/Berlin"
      },
      "inputFormat": {
        "type": "csv",
        "findColumnsFromHeader": true
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "dynamic"
      }
    },
    "dataSchema": {
      "dataSource": "geo_data",
      "timestampSpec": {
        "column": "!!!_no_such_column_!!!",
        "missingValue": "2010-01-01T00:00:00Z"
      },
      "dimensionsSpec": {
        "spatialDimensions": [
          {
            "dimName": "coordinates",
            "dims": [
              "lat1",
              "lon1"
            ]
          }
        ],
        "dimensions": [
          {
            "type": "double",
            "name": "latitude"
          },
          {
            "type": "double",
            "name": "longitude"
          },
          "place_name",
          "country_code",
          "timezone"
        ]
      },
      "granularitySpec": {
        "queryGranularity": "none",
        "rollup": false,
        "segmentGranularity": "day"
      },
      "transformSpec": {
        "transforms": [
          {
            "type": "expression",
            "expression": "longitude",
            "name": "lon1"
          },
          {
            "type": "expression",
            "expression": "latitude",
            "name": "lat1"
          }
        ]
      }
    }
  }
}

And this time, it works! We get both the spatial dimension and the regular fields.
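
A quick way to verify this is a SQL query against the datasource. This is my own check (assuming a local quickstart router on port 8888), not something from the original post:

# Sketch: confirm that both the original lat/lon fields and the
# string-encoded spatial dimension are present after ingestion.
import requests

sql = """
SELECT place_name, latitude, longitude, coordinates
FROM geo_data
LIMIT 5
"""

response = requests.post("http://localhost:8888/druid/v2/sql", json={"query": sql})
for row in response.json():
    print(row)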

In a future post, I’ll look at what we can do with those data.

Learnings

  • Druid has built-in support for spatial data.
  • You may need a dimension transformation (a transform spec) if you want to preserve the original fields that go into a spatial dimension.
  • As of now, this needs to be entered manually as a JSON spec; the ingestion wizard does not support spatial dimensions.