Insert rows in BigQuery tables with complex columns

Insert rows in BigQuery tables with complex columns

By César BEREZOWSKI

Nov 22, 2019

Categories: Cloud Computing, Data Engineering | Tags: GCP, Schema, BigQuery, SQL [more][less]

Google’s BigQuery is a cloud data warehousing system designed to process enormous volumes of data with several features available. Out of all those features, let’s talk about the support of Struct data types and repeated columns.

Complex columns

Struct type columns (we’ll call them complex columns) allow you to define the content of the column as a struct with multiple typed properties (STRUCT).

For example, let’s say we want to store the following informations about a person:

  • First name
  • Last name
  • Date of birth
  • Address

    • Street
    • City
    • Zip code
    • Country

In a traditional SQL system, you could make a second table to keep the addresses and have a foreign key on your person’s row or flatten the address object in the person’s row, or a number of other possibilities.

With BigQuery, complex columns allows us to store addresses as objects so we’re adding only one column to our schema and we don’t have to use joins to access the information.

To define the schema of a complex column, you set the column type value to RECORD and add the property fields which is an array of fields. Here’s what the schema of our table would look like in JSON:

[
    { 
        "name": "FIRST_NAME",
        "type": "STRING",
        "mode": "REQUIRED"
    },
    { 
        "name": "LAST_NAME",
        "type": "STRING",
        "mode": "REQUIRED"
    },
    { 
        "name": "DATE_OF_BIRTH",
        "type": "DATE",
        "mode": "REQUIRED"
    },
    { 
        "name": "ADDRESS",
        "type": "RECORD",
        "mode": "NULLABLE",
        "fields": [
            { 
                "name": "STREET",
                "type": "STRING",
                "mode": "REQUIRED"
            },
            { 
                "name": "CITY",
                "type": "STRING",
                "mode": "REQUIRED"
            },
            { 
                "name": "ZIP_CODE",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            { 
                "name": "COUNTRY",
                "type": "STRING",
                "mode": "REQUIRED"
            }
        ]
    }
]

When reading the schema in BigQuery’s UI, the complex column will first appear with it’s defined type and mode (record, nullable) and then be repeated for each fields with the format column.field and the type and mode of the field.

Person's table schema with complex column in BigQuery's UI

Now that the schema is defined, let’s see how to insert data. There’s a lot of documentation from Google on how to insert rows but, surprisingly, not with plain old SQL INSERT INTO (at publication time of this article). Basically, you set the complex values between parenthesis (as a tuple). However, to not have issues with the fields type you should specify the columns you are inserting into and might have to cast the values if your inserting NULL. For example:

INSERT INTO `project.dataset.person_table` (FIRST_NAME, LAST_NAME, DATE_OF_BIRTH, ADDRESS)
VALUES
    ("Jeff", "Smith", "1980-10-10", ("#1 7th Avenue", "New York", "100011", "United States")),
    ("Charlotte", "Lalande", "1990-01-01", ("3Bis Avenue des Champs Élysées", "Paris", STRING(NULL), "France"))

To display complex columns, BigQuery’s UI will apply the same logic as to the schema, each field of the complex column appears and is named column.field.

Person's table content with complex column in BigQuery's UI

Repeated columns

BigQuery also allows us to define repeated columns, which basically amounts to setting the type to ARRAY.

We’ll update our previous table to apply the following changes:

  • A person can have middle names
  • A person can have a secondary adresses which we’ll translate as a column of repeated adresses.

Here’s our new schema:

[
    { 
        "name": "FIRST_NAME",
        "type": "STRING",
        "mode": "REQUIRED"
    },
    { 
        "name": "MIDDLE_NAMES",
        "type": "STRING",
        "mode": "REPEATED"
    },
    { 
        "name": "LAST_NAME",
        "type": "STRING",
        "mode": "REQUIRED"
    },
    { 
        "name": "DATE_OF_BIRTH",
        "type": "DATE",
        "mode": "REQUIRED"
    },
    { 
        "name": "ADDRESS",
        "type": "RECORD",
        "mode": "NULLABLE",
        "fields": [
            { 
                "name": "STREET",
                "type": "STRING",
                "mode": "REQUIRED"
            },
            { 
                "name": "CITY",
                "type": "STRING",
                "mode": "REQUIRED"
            },
            { 
                "name": "ZIP_CODE",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            { 
                "name": "COUNTRY",
                "type": "STRING",
                "mode": "REQUIRED"
            }
        ]
    },
    { 
        "name": "SECONDARY_ADDRESS",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
            { 
                "name": "STREET",
                "type": "STRING",
                "mode": "REQUIRED"
            },
            { 
                "name": "CITY",
                "type": "STRING",
                "mode": "REQUIRED"
            },
            { 
                "name": "ZIP_CODE",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            { 
                "name": "COUNTRY",
                "type": "STRING",
                "mode": "REQUIRED"
            }
        ]
    }    
]

BigQuery UI’s will display them in the schema like any other column except that its mode is REPEATED.

Person's table schema with repeated columns in BigQuery's UI

To insert in a repeated field you have to set the value to [VALUE1, VALUE2, ...]:

INSERT INTO `project.dataset.person_table` (FIRST_NAME, MIDDLE_NAMES, LAST_NAME, DATE_OF_BIRTH, ADDRESS, SECONDARY_ADDRESS)
VALUES
    ("Jeff", ["Pierre", "Jack"], "Smith", "1980-10-10", ("#1 7th Avenue", "New York", "100011", "United States"), [("3Bis Avenue des Champs Élysées", "Paris", "75008", "France")]),
    ("Charlotte", ["Marie"], "Lalande", "1990-01-01", ("3Bis Avenue des Champs Élysées", "Paris", STRING(NULL), "France"), NULL)

To show the repeated fields, the UI will add more lines to the same row: each line containing a new value of the repeated field, and non-repeated fields will be grayed out.

Person's table content with repeated columns in BigQuery's UI

One thing to note, repeated fields are not optimized for querying. If you are looking to filter a data from a repeated field in your table, it is better to duplicate this data on a specific column or simply not to put it in the repeated field.

Congratulations, now you know how to manually insert data in complex & repeated columns!

Canada - Morocco - France

International locations

10 rue de la Kasbah
2393 Rabbat
Canada

We are a team of Open Source enthusiasts doing consulting in Big Data, Cloud, DevOps, Data Engineering, Data Science…

We provide our customers with accurate insights on how to leverage technologies to convert their use cases to projects in production, how to reduce their costs and increase the time to market.

If you enjoy reading our publications and have an interest in what we do, contact us and we will be thrilled to cooperate with you.