Load Avro Files with LOAD DATA
Data from Avro files can be loaded using the LOAD DATA command.
Avro LOAD DATA
Syntax for LOAD DATA Local Infile
LOAD DATA [LOCAL] INFILE 'file_name'
    WHERE/SET/SKIP ERRORS
    [REPLACE | SKIP { CONSTRAINT | DUPLICATE KEY } ERRORS]
    INTO TABLE tbl_name
    FORMAT AVRO SCHEMA REGISTRY {"IP" | "Hostname"}
    subvalue_mapping
    [SET col_name = expr,...]
    [WHERE expr,...]
    [MAX_ERRORS number]
    [ERRORS HANDLE string]
    [SCHEMA 'avro_schema']

subvalue_mapping:
    ( {col_name | @variable_name} <- subvalue_path, ...)

subvalue_path:
    {% | [%::]ident [::ident ...]}
See the associated GitHub repo.
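For illustration, a statement following this syntax might look like the sketch below. The file path (/tmp/users.avro), table name (users), and Avro field names (id, name) are hypothetical; the sketch assumes an Avro Object Container File whose records contain those two fields.

LOAD DATA LOCAL INFILE '/tmp/users.avro'   -- hypothetical Object Container File
    INTO TABLE users                       -- hypothetical destination table
    FORMAT AVRO
    ( id <- %::id,                         -- map the record's "id" field to column id
      name <- %::name );                   -- map the record's "name" field to column name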
Syntax for LOAD DATA AWS S3 Source
Avro-formatted data stored in an AWS S3 bucket can use a LOAD DATA query without a pipeline.
LOAD DATA S3 '<bucket name>'
    CONFIG '{"region" : "<region_name>"}'
    CREDENTIALS '{"aws_access_key_id" : "<key_id>",
                  "aws_secret_access_key": "<access_key>"}'
    INTO TABLE <table_name>
    (`<col_a>` <- %,
     `<col_b>` <- % DEFAULT NULL)
    FORMAT AVRO;
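As a concrete sketch (the bucket, object path, region, table, and field names are hypothetical, and the credentials are placeholders):

LOAD DATA S3 'my-bucket/avro/events.avro'        -- hypothetical bucket and object path
    CONFIG '{"region" : "us-east-1"}'
    CREDENTIALS '{"aws_access_key_id" : "<key_id>",
                  "aws_secret_access_key": "<access_key>"}'
    INTO TABLE events
    (`id` <- %::id,
     `payload` <- %::payload DEFAULT NULL)
    FORMAT AVRO;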
This data can also be loaded from S3 with a connection link.
LOAD DATA LINK <link_name> '<bucket name>'
    INTO TABLE <table_name>
    (`<col_a>` <- %,
     `<col_b>` <- % DEFAULT NULL)
    FORMAT AVRO;
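The connection link itself is created separately. A minimal sketch, assuming a hypothetical S3 link named my_s3_link and the same hypothetical bucket and table as above (the exact CREATE LINK form may differ; see the connection link documentation):

CREATE LINK my_s3_link AS S3                     -- hypothetical link name
    CONFIG '{"region" : "us-east-1"}'
    CREDENTIALS '{"aws_access_key_id" : "<key_id>",
                  "aws_secret_access_key": "<access_key>"}';

LOAD DATA LINK my_s3_link 'my-bucket/avro/events.avro'
    INTO TABLE events
    (`id` <- %::id,
     `payload` <- %::payload DEFAULT NULL)
    FORMAT AVRO;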
Semantics
Error Logging and Error Handling are discussed in the LOAD DATA page.
LOAD DATA for Avro does not support file name globbing (for example: LOAD DATA INFILE '/data/nfs/gp1/*.'). LOAD DATA for Avro only supports loading a single file per statement.
Extract specified subvalues from each Avro value in file_name. Assign them to specified columns of a new row in tbl_name, or to variables used for a column assignment in a SET clause. Resulting rows may be filtered out by a WHERE clause, as shown in the sketch below. To specify the compression type of an input file, use the COMPRESSION clause.
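A minimal sketch combining a subvalue mapping with SET and WHERE (the file, table, and Avro field names are hypothetical):

LOAD DATA LOCAL INFILE '/tmp/orders.avro'       -- hypothetical file
    INTO TABLE orders                           -- hypothetical table
    FORMAT AVRO
    ( order_id <- %::id,                        -- assign a subvalue directly to a column
      @status  <- %::status )                   -- capture a subvalue in a variable
    SET is_open = (@status = 'OPEN')            -- derive a column from the variable
    WHERE order_id IS NOT NULL;                 -- discard rows without an id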
Avro LOAD DATA expects Avro data in one of two sub-formats, depending on the SCHEMA clause.
If no SCHEMA clause is provided, file_name must name an Avro Object Container File as described in version 1 of the Avro specification, with the following restrictions:
- The compression codec of the file must be null.
- Array and map values must not have more than 16384 elements.
- The type name of a record must not be used in a symbolic reference to a previously defined name in any of its fields. It may still be used in a symbolic reference outside the record definition, however. For example, self-referential schemas like the following are rejected by LOAD DATA:

  {
    "type": "record",
    "name": "PseudoLinkedList",
    "fields" : [
      {"name": "value", "type": "long"},
      {"name": "next", "type": ["null", "PseudoLinkedList"]}
    ]
  }
If a SCHEMA clause is provided, the file must be a raw stream consisting of only the concatenated binary encodings of instances of avro_schema. avro_schema must be a SQL string containing a JSON Avro schema.
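A minimal sketch of loading a raw stream with an inline schema (the file, table, and schema are hypothetical; the schema string must match the binary encodings in the file):

LOAD DATA LOCAL INFILE '/tmp/points.raw'        -- hypothetical raw stream file
    INTO TABLE points                           -- hypothetical table
    FORMAT AVRO
    ( x <- %::x,
      y <- %::y )
    SCHEMA '{
      "type": "record",
      "name": "Point",
      "fields": [
        {"name": "x", "type": "long"},
        {"name": "y", "type": "long"}
      ]
    }';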
Warning
It’s an error to provide a SCHEMA clause when loading an Object Container File, because an Object Container File already contains metadata alongside the encoded values.
All optional Avro schema attributes except the namespace attribute are ignored. logicalType attributes are ignored.
If file_name ends in .gz or .lz4, it will be decompressed.
Writing to multiple databases in a transaction is not supported.
The SCHEMA REGISTRY {"IP" | "Hostname"} option allows LOAD DATA to pull the schema from a schema registry.
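A minimal sketch using a schema registry (the registry endpoint, file, table, and field names are hypothetical placeholders):

LOAD DATA LOCAL INFILE '/tmp/events.avro'       -- hypothetical file
    INTO TABLE events                           -- hypothetical table
    FORMAT AVRO
    SCHEMA REGISTRY "my-registry-host:8081"     -- hypothetical schema registry endpoint
    ( id   <- %::id,
      body <- %::body );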