Migration guide: MVDs to arrays
Druid now supports SQL-compliant arrays. We recommend using arrays over multi-value dimensions (MVDs) whenever possible. For new projects and complex use cases involving multiple data types, use arrays. Use MVDs for specific use cases, such as operating directly on individual elements like regular strings. If your operations involve entire arrays of values, including the ordering of values within a row, use arrays over MVDs.
Comparison between arrays and MVDs
The following table compares the general behavior between arrays and MVDs. For specific query differences between arrays and MVDs, see Querying arrays and MVDs.
Array | MVD | |
---|---|---|
Data types | Supports VARCHAR, BIGINT, and DOUBLE types (ARRAY<STRING>, ARRAY<LONG>, ARRAY<DOUBLE>) | Only supports arrays of strings (VARCHAR) |
SQL compliance | Behaves like standard SQL arrays with SQL-compliant behavior | Behaves like SQL VARCHAR rather than standard SQL arrays and requires special SQL functions to achieve array-like behavior. See the examples. |
Ingestion |
|
|
Filtering and grouping |
|
|
Conversion | Convert an MVD to an array using MV_TO_ARRAY | Convert an array to an MVD using ARRAY_TO_MV |
Querying arrays and MVDs
In SQL queries, Druid operates on arrays differently than MVDs. A value in an array column is treated as a single array entity (SQL ARRAY), whereas a value in an MVD column is treated as individual strings (SQL VARCHAR). This behavior applies even though multiple string values within the same MVD are still stored as a single field in the MVD column.
For example, consider the same value, ['a', 'b', 'c']
ingested into an array column and an MVD column.
In your query, you want to filter results by comparing some value with ['a', 'b', 'c']
.
For array columns, Druid only returns the row when an equality filter matches the entire array.
For example:WHERE "array_column" = ARRAY['a', 'b', 'c']
.For MVD columns, Druid returns the row when an equality filter matches any value of the MVD.
For example, any of the following filters return the row for the query:
WHERE "mvd_column" = 'a'
WHERE "mvd_column" = 'b'
WHERE "mvd_column" = 'c'
Note this difference between arrays and MVDs when you write queries that involve filtering or grouping.
When your query applies both filters and grouping, MVDs may return rows that don't seem to match the filter, since the grouping occurs after Druid applies the filter. For an example, see Filter and group by array elements.
Examples
The following examples highlight a few analogous queries between arrays and MVDs. For more information and examples, see Querying arrays and Querying multi-value dimensions.
Filter by an array element
Filter rows that have a certain value in the array or MVD.
Array
SELECT label, tags
FROM "array_example"
WHERE ARRAY_CONTAINS(tags, 't3')
MVD
SELECT label, tags
FROM "mvd_example"
WHERE tags = 't3'
Filter by one or more elements
Filter rows for which the array or MVD contains one or more elements. Notice that ARRAY_OVERLAP checks for any overlapping elements, whereas ARRAY_CONTAINS in the previous example checks that all elements are included.
Array
SELECT *
FROM "array_example"
WHERE ARRAY_OVERLAP(tags, ARRAY['t1', 't7'])
MVD
SELECT *
FROM "mvd_example"
WHERE tags = 't1' OR tags = 't7'
Filter using array equality
Filter rows for which the array or MVD is equivalent to a reference array.
Array
SELECT *
FROM "array_example"
WHERE tags = ARRAY['t1', 't2', 't3']
MVD
SELECT *
FROM "mvd_example"
WHERE MV_TO_ARRAY(tags) = ARRAY['t1', 't2', 't3']
Group results by array
Group results by the array or MVD.
Array
SELECT label, tags
FROM "array_example"
GROUP BY 1, 2
MVD
SELECT label, MV_TO_ARRAY(tags)
FROM "mvd_example"
GROUP BY 1, 2
Group by array elements
Group results by individual array or MVD elements.
Array
SELECT label, strings
FROM "array_example" CROSS JOIN UNNEST(tags) as u(strings)
GROUP BY 1, 2
MVD
SELECT label, tags
FROM "mvd_example"
GROUP BY 1, 2
Filter and group by array elements
Filter rows that have a certain value, then group by elements in the array or MVD. This example illustrates that while the results of filtering may match between arrays and MVDs, be aware that MVDs implicitly unnest their values so that results differ when you also apply a GROUP BY.
For example, consider the queries from Filter by an array element. Both queries return the following rows:
{"label":"row1","tags":["t1","t2","t3"]}
{"label":"row2","tags":["t3","t4","t5"]}
However, adding GROUP BY 1, 2
to both queries changes the output.
The two queries are now:
-- Array
SELECT label, tags
FROM "array_example"
WHERE ARRAY_CONTAINS(tags, 't3')
GROUP BY 1, 2
-- MVD
SELECT label, tags
FROM "mvd_example"
WHERE tags = 't3'
GROUP BY 1, 2
The array query returns the following:
{"label":"row1","tags":["t1","t2","t3"]}
{"label":"row2","tags":["t3","t4","t5"]}
The MVD query returns the following:
{"label":"row1","tags":"t1"}
{"label":"row1","tags":"t2"}
{"label":"row1","tags":"t3"}
{"label":"row2","tags":"t3"}
{"label":"row2","tags":"t4"}
{"label":"row2","tags":"t5"}
The MVD results appear to show four extra rows for which tags
does not equal t3
.
However, the rows match the filter based on how Druid evaluates equalities for MVDs.
For the equivalent query on MVDs, use the MV_FILTER_ONLY function:
SELECT label, MV_FILTER_ONLY(tags, ARRAY['t3'])
FROM "mvd_example"
WHERE tags = 't3'
GROUP BY 1, 2
How to ingest data as arrays
You can ingest arrays in Druid as follows:
For native batch and streaming ingestion, configure the dimensions in
dimensionsSpec
. WithindimensionsSpec
, set"useSchemaDiscovery": true
, and usedimensions
to list the array inputs with typeauto
.
For an example, see Ingesting arrays: Native batch and streaming ingestion.For SQL-based batch ingestion, include the query context parameter
"arrayIngestMode": "array"
and reference the relevant array type (VARCHAR ARRAY
,BIGINT ARRAY
, orDOUBLE ARRAY
) in the EXTEND clause that lists the column names and data types. For examples, see Ingesting arrays: SQL-based ingestion.As a best practice, always use the ARRAY data type in your input schema. If you want to ingest MVDs, explicitly wrap the string array in ARRAY_TO_MV. For an example, see Multi-value dimensions: SQL-based ingestion.