Generalized Inverted Index (GIN) (BSON)
On this page
Generalized Inverted Index (GIN) is a multi-valued index that maps values to the row IDs containing the value.
GIN stores pairs consisting of a value and a list of row IDs in which the value occurs.
SingleStore's GIN index is fundamentally a hash index and is built on the same effective technology as the regular columnar hash indices.
BSON type includes extending BSON_
with equality filters inside to be optimized using GIN.
-
BSON_
MATCH_ ANY(MATCH_ PARAM_ BSON() = 'value', 'path') -
BSON_
MATCH_ ANY(MATCH_ PARAM_ BSON() IN ( 'value1', 'value2', 'value3'), 'path')
Syntax
For SQL Queries
Adding GIN index to new tables:
CREATE TABLE <tablename>(<col1> BSON NULL,GIN INDEX(col1) INDEX_OPTIONS='<options>';
Adding GIN index to existing tables:
ALTER TABLE <tableName>ADD GIN INDEX (col1) INDEX_OPTIONS='<options>';
where '<options>'
is a JSON object to specify the tokenizer and other metadata related to it.
'{"tokenizer":"MATCH_ANY", "path":[<path_array>]}'
where path_
is a comma-separated list of values that specify the path to the field that needs to be indexed."PATH":[]
, is specified, then top-level elements in the array are indexed (such as [1,2,3]).
For MongoDB® Queries
db.<collection>.createIndex({"<path_name>":"gin"})
where <path_
is the path to a property in the BSON document.
Remarks
-
Currently, GIN is supported only on BSON type columns.
-
GIN index is case-sensitive (for BSON types).
-
GIN index activates when matched using the following operators:
=
,<=>
, andIN()
.The following table lists the expressions that support GIN index for each of these operators:
Operator
Supported Expression
=
,<=>
-
Constant literals, for example strings,
NULL
,DOUBLE
,INT
, etc. -
Typecast operators
:>
and!:>
, for example,100001:>BSON
-
User-defined variables (UDVs), for example:
SET @obj = '{"city": "New York"}':>BSON;SELECT a:>JSON FROM tWHERE BSON_MATCH_ANY(MATCH_PARAM_BSON()=@obj, a, 'city'); -
Procedural SQL variables (stored procedure arguments and local variables)
-
Nested constant (deterministic) built-in expressions, for example:
... BSON_BUILD_ARRAY(BSON_SET_BSON('{}', 'a', @bson), '{"a":4}':>JSON)
-
A combination of all of the above
IN()
-
Hexadecimal literal of type X'.
. . ' in non-parameterized IN
lists, for example:SELECT ... BSON_MATCH_ANY(MATCH_PARAM_BSON() IN ( X'D204000010', X'A186010010'), product_ids, 'product_type') -
Parameterized
IN
lists with:-
Literals of the same types
-
Single argument built-in expressions of the same shape and literal types in the list, for example,
IN(HEX('a'), HEX('b'))
-
-
Unsupported Features or Expressions
GIN index is not activated when used in the following:
-
Calls to user-defined functions (UDFs)
-
References to other unindexed (GIN) fields on the right-hand side of the expression
-
Non-deterministic built-in functions, for example
RAND()
-
Aggregate and window functions
-
MATCH_
expressions other than BSON type on the right-hand side, because it's evaluated at runtimePARAM_ <type> -
BSON_
predicate withMATCH_ ANY MATCH_
optionELEMENTS -
!=
and other comparison operators (excluding=
and<=>
) -
NOT IN
lists
Examples
Example 1 - Using SQL
Consider the following table named orders.
CREATE TABLE orders(id BIGINT PRIMARY KEY,created DATETIME(6),product_ids BSON,GIN INDEX (product_ids) INDEX_OPTIONS='{"TOKENIZER":"MATCH_ANY","PATH":[]}');INSERT INTO orders VALUES(1, '2025-03-03 12:34:56.000001', '100001':>BSON),(2, '2025-03-03 12:34:56.000002', '100002':>BSON),(3, '2025-03-03 12:34:56.000003', '100003':>BSON),(4, '2025-03-03 12:34:56.000004', '100004':>BSON);
Optimize the table:
OPTIMIZE TABLE orders FULL;
Perform the lookup:
SELECT id, created, product_ids:>JSON AS product_idsFROM ordersWHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = 100001:>BSON, product_ids);
+----+----------------------------+-------------+
| id | created | product_ids |
+----+----------------------------+-------------+
| 1 | 2025-03-03 12:34:56.000001 | 100001 |
+----+----------------------------+-------------+
The product_
column is cast to JSON for clarity.
The query execution benefits from the GIN index, note the ColumnStoreFilter [orders.
in the output:
EXPLAIN SELECT id, created, product_ids:>JSON AS product_idsFROM ordersWHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = 100001:>BSON, product_ids);
+-----------------------------------------------------------------------------------------------------------+
| EXPLAIN |
+-----------------------------------------------------------------------------------------------------------+
| Gather partitions:all alias:remote_0 parallelism_level:segment |
| Project [orders.id, orders.created, (orders.product_ids:>JSON COLLATE utf8mb4_bin NULL) AS product_ids] |
| ColumnStoreFilter [BSON_MATCH_ANY(MATCH_PARAM_BSON() = (100001:>bson NULL),orders.product_ids) gin index] |
| ColumnStoreScan dbTest.orders, SORT KEY __UNORDERED () table_type:sharded_columnstore |
+-----------------------------------------------------------------------------------------------------------+
Example 2 - Using GIN at Scale
The following example shows how a large data set can benefit from GIN at scale.
Create a table bookings, and add GIN to the product_
CREATE TABLE bookings(id BIGINT PRIMARY KEY,created DATETIME(6),product_ids BSON,GIN INDEX (product_ids) INDEX_OPTIONS='{"TOKENIZER":"MATCH_ANY","PATH":[]}');-- Add 1Million rows, each with 5 products sampled from 100000 product idsDELIMITER //DO DECLAREarr ARRAY(RECORD(id BIGINT, created DATETIME(6), product_ids BSON)) =CREATE_ARRAY(1000000);product_ids BSON;n BIGINT;BEGINFOR i IN 0..999999 LOOPproduct_ids = CONCAT('[',CONCAT_WS(',', (RAND()*100000):>INT,(RAND()*100000):>INT, (RAND()*100000):>INT,(RAND()*100000):>INT, (RAND()*100000):>INT),']');arr[i] = ROW(i, NOW(), product_ids);END LOOP;n = INSERT_ALL('bookings', arr);END //DELIMITER ;
Optimize the table:
OPTIMIZE TABLE bookings FULL;
Perform the lookup:
SELECT id, created, product_ids:>JSON AS product_idsFROM bookingsWHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = 94567:>BSON, product_ids);
+--------+----------------------------+---------------------------------+
| id | created | product_ids |
+--------+----------------------------+---------------------------------+
| 719886 | 2025-03-03 12:50:49.000000 | [77807,55647,59828,9374,94567] |
| 500322 | 2025-03-03 12:50:49.000000 | [94567,58653,77465,55399,16804] |
| 207632 | 2025-03-03 12:50:48.000000 | [63421,19400,16959,65091,94567] |
| 943351 | 2025-03-03 12:50:50.000000 | [4616,41546,31047,94567,68472] |
| 363185 | 2025-03-03 12:50:48.000000 | [50258,88567,99311,94567,2009] |
| 803463 | 2025-03-03 12:50:49.000000 | [22582,64484,15540,44440,94567] |
| 698599 | 2025-03-03 12:50:49.000000 | [84654,32124,94567,26840,90261] |
| 105530 | 2025-03-03 12:50:47.000000 | [57273,21771,31151,4171,94567] |
| 781873 | 2025-03-03 12:50:49.000000 | [2862,12649,71063,90341,94567] |
--- Output truncated ---
+--------+----------------------------+---------------------------------+
The product_
column is cast to JSON for clarity.
The query execution benefits from the GIN index (the ColumnStoreFilter [bookings.
in the output indicates that the query used an index):
EXPLAIN SELECT id, created, product_ids:>JSON AS product_idsFROM bookingsWHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = 94567:>BSON, product_ids);
+---------------------------------------------------------------------------------------------------------------+
| EXPLAIN |
+---------------------------------------------------------------------------------------------------------------+
| Gather partitions:all alias:remote_0 parallelism_level:segment |
| Project [bookings.id, bookings.created, (bookings.product_ids:>JSON COLLATE utf8mb4_bin NULL) AS product_ids] |
| ColumnStoreFilter [BSON_MATCH_ANY(MATCH_PARAM_BSON() = (94567:>bson NULL),bookings.product_ids) gin index] |
| ColumnStoreScan dbTest.bookings, SORT KEY __UNORDERED () table_type:sharded_columnstore |
+---------------------------------------------------------------------------------------------------------------+
Last modified: May 12, 2025