Generalized Inverted Index (GIN) (BSON)

GIN stores pairs consisting of a value and a list of row IDs in which the value occurs. Each value is stored only once, so a GIN index remains compact even when the same value appears multiple times. Queries that search for specific words in documents perform better when those keywords are indexed using GIN.

SingleStore's GIN index is fundamentally a hash index and is built on effectively the same technology as the regular columnar hash indexes.

For the BSON type, GIN optimizes BSON_MATCH_ANY predicates that contain equality filters. The following query shapes benefit from a GIN index:

BSON_MATCH_ANY(MATCH_PARAM_BSON() = 'value', 'path')
BSON_MATCH_ANY(MATCH_PARAM_BSON() IN ( 'value1', 'value2', 'value3'), 'path')

Syntax

For SQL Queries

Adding GIN index to new tables:

SQL

CREATE TABLE <tablename>
  (<col1> BSON NULL,
   GIN INDEX (<col1>) INDEX_OPTIONS='<options>');

Adding GIN index to existing tables:

SQL

ALTER TABLE <tablename>
  ADD GIN INDEX (<col1>) INDEX_OPTIONS='<options>';

where '<options>' is a JSON object to specify the tokenizer and other metadata related to it.

'{"tokenizer":"MATCH_ANY", "path":[<path_array>]}'

where path_array is a comma-separated list of values that specify the path to the field that needs to be indexed. If an empty path array, for example, "PATH":[], is specified, then top-level elements in the array are indexed (such as [1,2,3]). Given a BSON column and an optional path, the GIN index extracts all values at that path.
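As a minimal sketch of a non-empty path, the statements below index the values stored under a single top-level field and then query that same path. The table name shipments, the column name details, and the field name product_type are hypothetical and chosen only for illustration. The sketch also assumes that the path passed to BSON_MATCH_ANY has to mirror the path given in INDEX_OPTIONS for the index to apply.

```sql
-- Hypothetical table, column, and field names, used only for illustration.
CREATE TABLE shipments (
  id BIGINT PRIMARY KEY,
  details BSON NULL,
  -- Index the values found at the top-level field "product_type".
  GIN INDEX (details) INDEX_OPTIONS='{"TOKENIZER":"MATCH_ANY","PATH":["product_type"]}'
);

-- Assumption: the path argument here mirrors the indexed path above.
SELECT id
  FROM shipments
  WHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = 1234:>BSON, details, 'product_type');
```

Running EXPLAIN on the SELECT is a reasonable way to confirm whether the index is actually used for a given path.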

For MongoDB® Queries

MongoDB

db.<collection>.createIndex({"<path_name>":"gin"})

where <path_name> is the path to a property in the BSON document.

Remarks

Currently, GIN is supported only on BSON type columns.
GIN index is case-sensitive (for BSON types).

GIN index activates when matched using the following operators: =, <=>, and IN().

The following table lists the expressions that support GIN index for each of these operators:

Operator

Supported Expression

=, <=>

Constant literals, for example strings, NULL, DOUBLE, INT, etc.
Typecast operators :> and !:>, for example, 100001:>BSON

User-defined variables (UDVs), for example:

SQL

SET @obj = '{"city": "New York"}':>BSON;

SELECT a:>JSON FROM t 
WHERE BSON_MATCH_ANY(MATCH_PARAM_BSON()=@obj, a, 'city');

Procedural SQL variables (stored procedure arguments and local variables)

Nested constant (deterministic) built-in expressions, for example:

... BSON_BUILD_ARRAY(BSON_SET_BSON('{}', 'a', @bson), '{"a":4}':>JSON)

A combination of all of the above

IN()

Hexadecimal literal of type X'...' in non-parameterized IN lists, for example:

SQL

SELECT ... BSON_MATCH_ANY(MATCH_PARAM_BSON() IN ( X'D204000010', X'A186010010'), product_ids, 'product_type')

Parameterized IN lists with:
- Literals of the same types
- Single argument built-in expressions of the same shape and literal types in the list, for example, IN(HEX('a'), HEX('b'))
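The hexadecimal IN list shape can be sketched against the orders table from Example 1 in the Examples section. The two literals are assumed to be the BSON encodings of the integers 100001 and 100002 (four little-endian value bytes followed by a type byte); treat that encoding as an assumption and verify it against your own data before relying on it.

```sql
-- Assumes the orders table defined in Example 1.
-- Assumption: X'A186010010' encodes 100001 and X'A286010010' encodes 100002.
SELECT id, created, product_ids:>JSON AS product_ids
  FROM orders
  WHERE BSON_MATCH_ANY(
    MATCH_PARAM_BSON() IN (X'A186010010', X'A286010010'),
    product_ids);
```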

Unsupported Features or Expressions

The GIN index is not activated when the predicate involves any of the following:

Calls to user-defined functions (UDFs)
References to other unindexed (GIN) fields on the right-hand side of the expression
Non-deterministic built-in functions, for example RAND()
Aggregate and window functions
MATCH_PARAM_<type> expressions other than BSON type on the right-hand side, because they are evaluated at runtime
BSON_MATCH_ANY predicate with MATCH_ELEMENTS option
!= and other comparison operators (excluding = and <=>)
NOT IN lists
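One way to check whether a particular predicate falls into an unsupported case is to compare query plans. The sketch below assumes the orders table from Example 1 and contrasts an equality filter with an inequality filter. Based on the list above, only the first plan would be expected to carry the gin index annotation on its ColumnStoreFilter line, as in the EXPLAIN output shown in Example 1.

```sql
-- Assumes the orders table defined in Example 1.

-- Equality predicate: eligible for the GIN index.
EXPLAIN SELECT id
  FROM orders
  WHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = 100001:>BSON, product_ids);

-- Inequality predicate: listed above as unsupported, so the plan is
-- not expected to show the index annotation.
EXPLAIN SELECT id
  FROM orders
  WHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() != 100001:>BSON, product_ids);
```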

Examples

Example 1 - Using SQL

Consider the following table named orders. Note that the GIN index is added to the product_ids column:

SQL

CREATE TABLE orders(  
  id BIGINT PRIMARY KEY,   
  created DATETIME(6),  
  product_ids BSON,  
  GIN INDEX (product_ids) INDEX_OPTIONS='{"TOKENIZER":"MATCH_ANY","PATH":[]}');

INSERT INTO orders VALUES
  (1, '2025-03-03 12:34:56.000001', '100001':>BSON),
  (2, '2025-03-03 12:34:56.000002', '100002':>BSON),
  (3, '2025-03-03 12:34:56.000003', '100003':>BSON),
  (4, '2025-03-03 12:34:56.000004', '100004':>BSON);

Optimize the table:

SQL

OPTIMIZE TABLE orders FULL;

Perform the lookup:

SQL

SELECT id, created, product_ids:>JSON AS product_ids
  FROM orders 
  WHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = 100001:>BSON, product_ids);

+----+----------------------------+-------------+
| id | created                    | product_ids |
+----+----------------------------+-------------+
|  1 | 2025-03-03 12:34:56.000001 | 100001      |
+----+----------------------------+-------------+

The product_ids column is cast to JSON for clarity.

The query execution benefits from the GIN index. Note the gin index annotation on the ColumnStoreFilter line in the output:

SQL

EXPLAIN SELECT id, created, product_ids:>JSON AS product_ids
  FROM orders 
  WHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = 100001:>BSON, product_ids);

+-----------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                   |
+-----------------------------------------------------------------------------------------------------------+
| Gather partitions:all alias:remote_0 parallelism_level:segment                                            |
| Project [orders.id, orders.created, (orders.product_ids:>JSON COLLATE utf8mb4_bin NULL) AS product_ids]   |
| ColumnStoreFilter [BSON_MATCH_ANY(MATCH_PARAM_BSON() = (100001:>bson NULL),orders.product_ids) gin index] |
| ColumnStoreScan dbTest.orders, SORT KEY __UNORDERED () table_type:sharded_columnstore                     |
+-----------------------------------------------------------------------------------------------------------+

Example 2 - Using GIN at Scale

The following example shows how a large data set can benefit from GIN at scale. You add a GIN index and then query the indexed column or path.

Create a table named bookings, and add a GIN index to the product_ids column:

SQL

CREATE TABLE bookings(  
  id BIGINT PRIMARY KEY,   
  created DATETIME(6),  
  product_ids BSON,  
  GIN INDEX (product_ids) INDEX_OPTIONS='{"TOKENIZER":"MATCH_ANY","PATH":[]}'
);

-- Add 1 million rows, each with 5 product IDs sampled from 100,000 possible IDs

DELIMITER //

DO DECLARE
  arr ARRAY(RECORD(id BIGINT, created DATETIME(6), product_ids BSON)) =
    CREATE_ARRAY(1000000);
  product_ids BSON;
  n BIGINT;
BEGIN
  FOR i IN 0..999999 LOOP
    product_ids = CONCAT('[', CONCAT_WS(',',
      (RAND()*100000):>INT, (RAND()*100000):>INT, (RAND()*100000):>INT,
      (RAND()*100000):>INT, (RAND()*100000):>INT), ']');
    arr[i] = ROW(i, NOW(), product_ids);
  END LOOP;
  n = INSERT_ALL('bookings', arr);
END //

DELIMITER ;