How to Create an External Table with Avro Format in BigQuery

3/4/2024



How to Create an External Table with Avro Format in BigQuery: A Step-by-Step Guide

Introduction to Avro Format in BigQuery

Avro is a popular data serialization system used in big data environments. It relies on a schema defined in JSON to specify the structure of your data. BigQuery, Google Cloud's fully-managed data warehouse, supports Avro format for external tables, enabling seamless integration with platforms like Apache Hive, Spark, and Hadoop.

In this article, we’ll walk you through the process of creating an external table with Avro format in BigQuery, including how to define partition columns and configure table options.

What is Avro Format?

Avro is a binary data format that is compact and efficient for serializing data. It uses JSON for defining schemas, making it human-readable and easy to integrate with various big data tools. Avro is widely used in distributed systems for its schema evolution capabilities and compatibility with multiple programming languages.

Steps to Create an External Table with Avro Format in BigQuery

To create an external table in BigQuery using Avro format, follow these steps:

1. Define the Avro Schema

Since Avro requires a schema, you need to define the structure of your data in JSON format. Here’s an example of an Avro schema:

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string"}
  ]
}
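To see how such a schema maps onto BigQuery column types, here is a minimal Python sketch. The helper name and the mapping table are ours, and it covers only the flat primitive types used in the example (BigQuery's full Avro conversion handles many more cases, such as unions and logical types):

```python
import json

# Hypothetical helper: derive BigQuery column definitions from a flat Avro
# record schema. The mapping covers only Avro primitive types.
AVRO_TO_BIGQUERY = {
    "int": "INT64",
    "long": "INT64",
    "float": "FLOAT64",
    "double": "FLOAT64",
    "boolean": "BOOL",
    "string": "STRING",
    "bytes": "BYTES",
}

def avro_fields_to_bigquery(schema_json: str) -> list[tuple[str, str]]:
    """Return (column_name, bigquery_type) pairs for a flat Avro record."""
    schema = json.loads(schema_json)
    return [(f["name"], AVRO_TO_BIGQUERY[f["type"]]) for f in schema["fields"]]

user_schema = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string"}
  ]
}
"""

print(avro_fields_to_bigquery(user_schema))
# [('id', 'INT64'), ('name', 'STRING'), ('email', 'STRING')]
```

Note that for external tables you normally do not declare these columns yourself: BigQuery reads the schema embedded in the Avro files.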

2. Create the External Table

Use the following SQL syntax to create an external table in BigQuery with Avro format:

CREATE EXTERNAL TABLE `project_id.dataset_id.table_name`
WITH PARTITION COLUMNS (
  -- Define partition columns here
  date STRING
)
OPTIONS (
  format = 'AVRO',
  uris = ['gs://your-bucket/path/to/avro/files/*'],
  hive_partition_uri_prefix = 'gs://your-bucket/path/to/avro/files/',
  use_avro_logical_types = true
);
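If you create several such tables, it can help to assemble the DDL programmatically before submitting it. Below is a hypothetical Python helper (the function name and signature are ours) that builds the statement above from its variable parts:

```python
# Hypothetical helper that assembles the CREATE EXTERNAL TABLE statement
# from its variable parts, so the same DDL template can be reused per table.
def build_avro_external_table_ddl(
    table: str,
    source_uri: str,
    partition_prefix: str,
    partition_columns: dict[str, str],
) -> str:
    cols = ",\n  ".join(f"{name} {typ}" for name, typ in partition_columns.items())
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        f"WITH PARTITION COLUMNS (\n  {cols}\n)\n"
        "OPTIONS (\n"
        "  format = 'AVRO',\n"
        f"  uris = ['{source_uri}'],\n"
        f"  hive_partition_uri_prefix = '{partition_prefix}'\n"
        ");"
    )

ddl = build_avro_external_table_ddl(
    "project_id.dataset_id.table_name",
    "gs://your-bucket/path/to/avro/files/*",
    "gs://your-bucket/path/to/avro/files/",
    {"date": "STRING"},
)
print(ddl)
```

The resulting string can be passed to any BigQuery client (for example, `client.query(ddl)` with the google-cloud-bigquery library) or pasted into the console.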

Explanation of the Code:

  • project_id.dataset_id.table_name: Replace with your project ID, dataset ID, and table name.
  • PARTITION COLUMNS: Declare the partition columns (e.g., date STRING); each must correspond to a key=value directory level in the GCS path.
  • OPTIONS: Configure table options:
    • format: Set to 'AVRO'.
    • uris: Specify the path to your Avro files in Google Cloud Storage (GCS).
    • hive_partition_uri_prefix: The common GCS prefix that ends immediately before the first partition key directory (e.g., before date=2024-03-04/). The uris path must fall under this prefix.
    • use_avro_logical_types: Set to true so Avro logical types (such as timestamp-micros) are converted to their corresponding BigQuery types rather than their raw underlying types.
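Hive-style partitioning means the partition column's value lives in the directory name rather than inside the Avro files themselves. The following illustrative Python sketch (the helper and paths are hypothetical) shows how a date value would be recovered from a file path under the partition prefix:

```python
# Illustrative sketch of Hive-style partitioning: the partition column's
# value is encoded in the directory name, not inside the Avro files.
def extract_partition_values(uri: str, prefix: str) -> dict[str, str]:
    """Parse key=value directory segments that follow the partition prefix."""
    relative = uri[len(prefix):]
    values = {}
    for segment in relative.split("/")[:-1]:  # last segment is the file name
        key, _, value = segment.partition("=")
        values[key] = value
    return values

uri = "gs://your-bucket/path/to/avro/files/date=2024-03-04/users.avro"
prefix = "gs://your-bucket/path/to/avro/files/"
print(extract_partition_values(uri, prefix))
# {'date': '2024-03-04'}
```

BigQuery performs this extraction automatically for files under hive_partition_uri_prefix, which is why the declared partition columns must match the directory keys.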

Advantages of Using Avro Format in BigQuery

  1. Schema Evolution: Avro supports schema evolution, allowing you to modify the schema without breaking existing data.
  2. Compact Storage: Avro files are binary-encoded, making them smaller in size compared to text-based formats like JSON or CSV.
  3. Compatibility: Avro is compatible with various big data tools like Apache Hive, Spark, and Hadoop.
  4. Efficient Data Processing: BigQuery can efficiently process Avro files, reducing query execution time.

Conclusion

Creating an external table with Avro format in BigQuery is a powerful way to integrate and analyze data from distributed systems. By following the steps outlined in this guide, you can efficiently manage and query large datasets stored in Avro format.

Whether you're working with Apache Hive, Spark, or Hadoop, Avro’s schema flexibility and compact storage make it an excellent choice for big data environments. Start leveraging Avro in BigQuery today to streamline your data workflows.

 
