Step-by-Step Guide: Handling Avro File Format in Hive


Handling the Avro file format in Hive involves using the AvroSerDe (Serializer/Deserializer), which allows Hive to read and write data in Avro format. This lets you take advantage of Avro's schema evolution capabilities and compact binary serialization within your Hive environment.

Step 1: What is Avro Format in Hive?

  • Avro is a row-based storage format whose schemas are defined in JSON.

  • It supports schema evolution, making it ideal when the data structure changes over time.

Benefits:

  • Compact binary format → saves storage space

  • Self-describing (schema embedded)

  • Supports schema evolution

  • Interoperable with many big data tools (Hive, Pig, Spark)


Step 2: Creating an Avro Table Using Avro Schema

You can create a Hive table using an external Avro schema file.

CREATE TABLE employees_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
  'avro.schema.url'='hdfs:///user/hive/schemas/employees.avsc'
);
  • employees.avsc is an Avro schema file stored on HDFS.

Sample Avro Schema (employees.avsc):

{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "department", "type": "string"},
    {"name": "salary", "type": "double"}
  ]
}
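
Once the table exists, a quick way to confirm that Hive picked up the columns from the schema file is to describe it. This is a minimal check, assuming the employees_avro table created above:

-- Columns (id, name, department, salary) should match employees.avsc.
DESCRIBE employees_avro;

-- Shows the SerDe, input/output formats, and table properties in detail.
DESCRIBE FORMATTED employees_avro;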

Step 3: Creating an Avro Table Without External Schema

Alternatively, you can define the schema inline.

CREATE TABLE employees_avro_inline (
  id INT,
  name STRING,
  department STRING,
  salary DOUBLE
)
STORED AS AVRO;
  • Hive internally generates an Avro schema for the table. (A variant that embeds the schema directly in the table properties is sketched below.)
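
If you prefer to keep the Avro schema with the table definition rather than derive it from the column list, Hive also accepts an embedded schema through the avro.schema.literal table property. The statement below is a sketch of that variant; the table name employees_avro_literal is illustrative:

CREATE TABLE employees_avro_literal
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
  'avro.schema.literal'='{
    "type": "record",
    "name": "Employee",
    "fields": [
      {"name": "id", "type": "int"},
      {"name": "name", "type": "string"},
      {"name": "department", "type": "string"},
      {"name": "salary", "type": "double"}
    ]
  }'
);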


Step 4: Loading Data into Avro Table

You can load Avro data from HDFS.

LOAD DATA INPATH '/user/hive/input/employees.avro'
INTO TABLE employees_avro;
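
Note that LOAD DATA INPATH moves the source files into the table's warehouse directory. If the Avro files already sit in an HDFS directory that you want to leave in place, a common alternative is an external table over that location. The path below is illustrative:

-- External table: Hive reads the Avro files in place and does not move them.
CREATE EXTERNAL TABLE employees_avro_ext (
  id INT,
  name STRING,
  department STRING,
  salary DOUBLE
)
STORED AS AVRO
LOCATION '/user/hive/input/avro/';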

Step 5: Converting Text Data to Avro Format (CTAS)

Use Create Table As Select (CTAS) to convert an existing text-format table to Avro. The new table is named employees_avro_ctas here because employees_avro was already created in Step 2.

CREATE TABLE employees_avro_ctas
STORED AS AVRO
AS
SELECT * FROM employees_text;
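
If the target Avro table already exists (as employees_avro does from Step 2), use INSERT instead of CTAS; a minimal sketch, assuming the same employees_text source table:

-- Populate the existing Avro table, replacing its current contents.
INSERT OVERWRITE TABLE employees_avro
SELECT * FROM employees_text;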

Step 6: Querying Avro Tables

You can query Avro tables like any other Hive table.

SELECT name, department FROM employees_avro WHERE salary > 50000;
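
Standard HiveQL constructs such as joins, aggregation, and grouping also work unchanged on Avro-backed tables. For example:

-- Average salary per department, computed directly on the Avro data.
SELECT department, AVG(salary) AS avg_salary
FROM employees_avro
GROUP BY department
ORDER BY avg_salary DESC;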

Step 7: Evolving Avro Schemas in Hive

If your schema changes (like adding new columns):

  1. Update your .avsc schema file.

  2. Update the table property:

ALTER TABLE employees_avro
SET TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hive/schemas/employees_v2.avsc');
  • Avro can still read older data files after the change as long as each newly added field declares a default value; an illustrative updated schema is shown below.
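
As an illustration, a hypothetical employees_v2.avsc might add a bonus field with a default of 0.0 so that files written with the original schema remain readable:

{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "department", "type": "string"},
    {"name": "salary", "type": "double"},
    {"name": "bonus", "type": "double", "default": 0.0}
  ]
}

Queries against data written with the old schema then return 0.0 for bonus rather than failing.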


✅ Best Practices

  • Use Avro when schema evolution is expected.

  • Store the schema in a version-controlled location (such as Git) or a central, stable path on HDFS.

  • Always validate the .avsc schema (valid JSON and a valid Avro record definition) before use.

  • Combine Avro with partitioning for better performance.

  • For analytics-heavy workloads, convert Avro to ORC or Parquet after ingestion (see the sketch after this list).
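
As a sketch of the last two practices, the statements below create a partitioned Avro table for ingestion and a columnar ORC copy for analytics; the table names employees_avro_part and employees_orc are illustrative:

-- Partitioned Avro table: each department's data lands in its own partition directory.
CREATE TABLE employees_avro_part (
  id INT,
  name STRING,
  salary DOUBLE
)
PARTITIONED BY (department STRING)
STORED AS AVRO;

-- Columnar copy for analytics-heavy workloads.
CREATE TABLE employees_orc
STORED AS ORC
AS
SELECT * FROM employees_avro;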


This guide helps you handle Avro files in Hive efficiently, with support for schema evolution and straightforward integration with the rest of the Hadoop ecosystem.
