Step-by-Step Guide: Handling Avro File Format in Hive
Avro File Format in Hive
Avro is a row-based storage format that defines schemas in JSON and serializes data in a compact binary form.
It supports schema evolution, making it ideal when the data structure changes over time.
Benefits:
Compact binary format → saves storage space
Self-describing (schema embedded)
Supports schema evolution
Interoperable with many big data tools (Hive, Pig, Spark)
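Much of Avro's compactness comes from zig-zag, variable-length encoding of integers: small magnitudes take a single byte regardless of the declared type. A minimal Python sketch of that encoding (the helper name encode_long is our own illustration, not part of any Avro library):

```python
def encode_long(n: int) -> bytes:
    """Encode an integer with Avro's zig-zag + variable-length scheme."""
    # Zig-zag maps signed values to unsigned so small magnitudes stay small:
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    # Emit 7 bits per byte; the high bit marks "more bytes follow".
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

print(encode_long(3).hex())      # small values fit in one byte
print(encode_long(-3).hex())
print(len(encode_long(50000)))   # even a salary-sized value needs only 3 bytes
```

This is why Avro files are typically much smaller than the equivalent delimited text.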
You can create a Hive table using an external Avro schema file.
CREATE TABLE employees_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.url'='hdfs:///user/hive/schemas/employees.avsc'
);
employees.avsc is an Avro schema file stored on HDFS.
Sample Avro Schema (employees.avsc):
{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "department", "type": "string"},
    {"name": "salary", "type": "double"}
  ]
}
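Before publishing a schema file to HDFS, it is worth sanity-checking it. A minimal structural check using only the Python standard library (a production setup would use the Avro library's own schema parser; the function check_record_schema below is an illustrative sketch):

```python
import json

REQUIRED_FIELD_KEYS = {"name", "type"}

def check_record_schema(text: str) -> list:
    """Return a list of problems found in an Avro record schema (empty = OK)."""
    problems = []
    try:
        schema = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if schema.get("type") != "record":
        problems.append("top-level type must be 'record'")
    if "name" not in schema:
        problems.append("record is missing a 'name'")
    for i, field in enumerate(schema.get("fields", [])):
        missing = REQUIRED_FIELD_KEYS - set(field)
        if missing:
            problems.append(f"field #{i} is missing {sorted(missing)}")
    return problems

employees_avsc = """
{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "department", "type": "string"},
    {"name": "salary", "type": "double"}
  ]
}
"""
print(check_record_schema(employees_avsc))  # [] -> schema looks structurally sound
```

Catching a malformed schema here is far cheaper than debugging a failed Hive query later.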
Alternatively, you can define the schema inline.
CREATE TABLE employees_avro_inline (
id INT,
name STRING,
department STRING,
salary DOUBLE
)
STORED AS AVRO;
With STORED AS AVRO, Hive derives the Avro schema automatically from the column definitions.
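For primitive columns the derivation is a straightforward type mapping (INT → int, STRING → string, DOUBLE → double, and so on). A rough sketch of that idea, assuming primitive columns only (the mapping table and function below are illustrative, not Hive internals; real Hive also handles nullability and complex types):

```python
import json

# Illustrative subset of the Hive-to-Avro primitive type mapping.
HIVE_TO_AVRO = {
    "INT": "int",
    "BIGINT": "long",
    "FLOAT": "float",
    "DOUBLE": "double",
    "STRING": "string",
    "BOOLEAN": "boolean",
}

def table_to_avro_schema(name: str, columns: list) -> dict:
    """Build an Avro record schema dict from (column, hive_type) pairs."""
    return {
        "type": "record",
        "name": name,
        "fields": [
            {"name": col, "type": HIVE_TO_AVRO[hive_type]}
            for col, hive_type in columns
        ],
    }

schema = table_to_avro_schema(
    "employees_avro_inline",
    [("id", "INT"), ("name", "STRING"), ("department", "STRING"), ("salary", "DOUBLE")],
)
print(json.dumps(schema, indent=2))
```

The result matches the hand-written employees.avsc above, which is why the inline and external-schema approaches are interchangeable for simple tables.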
You can load existing Avro data files from HDFS. Note that LOAD DATA INPATH moves the files into the table's directory rather than copying them.
LOAD DATA INPATH '/user/hive/input/employees.avro'
INTO TABLE employees_avro;
Use CREATE TABLE AS SELECT (CTAS) to convert existing text data to Avro. The target table must not already exist, so a new name is used here.
CREATE TABLE employees_avro_ctas
STORED AS AVRO
AS
SELECT * FROM employees_text;
You can query Avro tables like any other Hive table.
SELECT name, department FROM employees_avro WHERE salary > 50000;
If your schema changes (like adding new columns):
Update your .avsc schema file.
Update the table property:
ALTER TABLE employees_avro
SET TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hive/schemas/employees_v2.avsc');
Avro tolerates newly added fields as long as they declare default values, so data files written with the old schema remain readable under the new one.
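The rule above can be checked mechanically: every field present in the new schema but absent from the old one needs a default. A small Python sketch (the v2 field email and its null default are our own example, not from the guide):

```python
import json

v1 = {
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "department", "type": "string"},
        {"name": "salary", "type": "double"},
    ],
}

# v2 adds a hypothetical 'email' column; the default is what lets readers
# using the new schema still process files written with v1.
v2 = json.loads(json.dumps(v1))  # deep copy
v2["fields"].append({"name": "email", "type": ["null", "string"], "default": None})

def added_fields_have_defaults(old: dict, new: dict) -> bool:
    """Compatibility check: every field added in `new` must carry a default."""
    old_names = {f["name"] for f in old["fields"]}
    return all("default" in f for f in new["fields"] if f["name"] not in old_names)

print(added_fields_have_defaults(v1, v2))  # True -> old files stay readable
```

Running a check like this before flipping avro.schema.url to the v2 file helps avoid breaking existing readers.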
Use Avro when schema evolution is expected.
Keep schemas under version control (e.g., Git) and publish them to a stable HDFS location for Hive to reference.
Always validate .avsc schema before use.
Combine Avro with partitioning for better performance.
For analytics-heavy workloads, convert Avro to ORC/Parquet after ingestion.
This guide helps you handle Avro files in Hive efficiently, with schema evolution support and smooth integration.