Step-by-Step Guide: Handling Avro File Format in Hive
Avro File Format in Hive
Avro is a row-based storage format that defines schemas in JSON and serializes data in a compact binary form.
It supports schema evolution, making it ideal when the data structure changes over time.
Benefits:
Compact binary format → saves storage space
Self-describing (schema embedded)
Supports schema evolution
Interoperable with many big data tools (Hive, Pig, Spark)
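Much of Avro's compactness comes from zig-zag, variable-length encoding of integers: small magnitudes take a single byte regardless of the declared type. A minimal Python sketch of that encoding (the helper name encode_long is our own illustration, not part of any Avro library):

```python
def encode_long(n: int) -> bytes:
    """Encode an integer with Avro's zig-zag + variable-length scheme."""
    # Zig-zag maps signed values to unsigned so small magnitudes stay small:
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    # Emit 7 bits per byte; the high bit marks "more bytes follow".
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

print(encode_long(3).hex())      # small values fit in one byte
print(encode_long(-3).hex())
print(len(encode_long(50000)))   # even a salary-sized value needs only 3 bytes
```

This is why Avro files are typically much smaller than the equivalent delimited text.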
You can create a Hive table using an external Avro schema file.
CREATE TABLE employees_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.url'='hdfs:///user/hive/schemas/employees.avsc'
);
employees.avsc is an Avro schema file stored on HDFS.
Sample Avro Schema (employees.avsc):
{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "department", "type": "string"},
    {"name": "salary", "type": "double"}
  ]
}
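Before publishing a schema file to HDFS, it is worth sanity-checking it. A minimal structural check using only the Python standard library (a production setup would use the Avro library's own schema parser; the function check_record_schema below is an illustrative sketch):

```python
import json

REQUIRED_FIELD_KEYS = {"name", "type"}

def check_record_schema(text: str) -> list:
    """Return a list of problems found in an Avro record schema (empty = OK)."""
    problems = []
    try:
        schema = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if schema.get("type") != "record":
        problems.append("top-level type must be 'record'")
    if "name" not in schema:
        problems.append("record is missing a 'name'")
    for i, field in enumerate(schema.get("fields", [])):
        missing = REQUIRED_FIELD_KEYS - set(field)
        if missing:
            problems.append(f"field #{i} is missing {sorted(missing)}")
    return problems

employees_avsc = """
{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "department", "type": "string"},
    {"name": "salary", "type": "double"}
  ]
}
"""
print(check_record_schema(employees_avsc))  # [] -> schema looks structurally sound
```

Catching a malformed schema here is far cheaper than debugging a failed Hive query later.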
Alternatively, you can define the schema inline.
CREATE TABLE employees_avro_inline (
id INT,
name STRING,
department STRING,
salary DOUBLE
)
STORED AS AVRO;
With STORED AS AVRO, Hive derives the Avro schema automatically from the column definitions.
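For primitive columns the derivation is a straightforward type mapping (INT → int, STRING → string, DOUBLE → double, and so on). A rough sketch of that idea, assuming primitive columns only (the mapping table and function below are illustrative, not Hive internals; real Hive also handles nullability and complex types):

```python
import json

# Illustrative subset of the Hive-to-Avro primitive type mapping.
HIVE_TO_AVRO = {
    "INT": "int",
    "BIGINT": "long",
    "FLOAT": "float",
    "DOUBLE": "double",
    "STRING": "string",
    "BOOLEAN": "boolean",
}

def table_to_avro_schema(name: str, columns: list) -> dict:
    """Build an Avro record schema dict from (column, hive_type) pairs."""
    return {
        "type": "record",
        "name": name,
        "fields": [
            {"name": col, "type": HIVE_TO_AVRO[hive_type]}
            for col, hive_type in columns
        ],
    }

schema = table_to_avro_schema(
    "employees_avro_inline",
    [("id", "INT"), ("name", "STRING"), ("department", "STRING"), ("salary", "DOUBLE")],
)
print(json.dumps(schema, indent=2))
```

The result matches the hand-written employees.avsc above, which is why the inline and external-schema approaches are interchangeable for simple tables.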
You can load existing Avro data files from HDFS. Note that LOAD DATA INPATH moves the files into the table's directory rather than copying them.
LOAD DATA INPATH '/user/hive/input/employees.avro'
INTO TABLE employees_avro;
Use CREATE TABLE AS SELECT (CTAS) to convert existing text data to Avro. The target table must not already exist, so a new name is used here.
CREATE TABLE employees_avro_ctas
STORED AS AVRO
AS
SELECT * FROM employees_text;
You can query Avro tables like any other Hive table.
SELECT name, department FROM employees_avro WHERE salary > 50000;
If your schema changes (like adding new columns):
Update your .avsc schema file.
Update the table property:
ALTER TABLE employees_avro
SET TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hive/schemas/employees_v2.avsc');
Avro tolerates newly added fields as long as they declare default values, so data files written with the old schema remain readable under the new one.
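The rule above can be checked mechanically: every field present in the new schema but absent from the old one needs a default. A small Python sketch (the v2 field email and its null default are our own example, not from the guide):

```python
import json

v1 = {
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "department", "type": "string"},
        {"name": "salary", "type": "double"},
    ],
}

# v2 adds a hypothetical 'email' column; the default is what lets readers
# using the new schema still process files written with v1.
v2 = json.loads(json.dumps(v1))  # deep copy
v2["fields"].append({"name": "email", "type": ["null", "string"], "default": None})

def added_fields_have_defaults(old: dict, new: dict) -> bool:
    """Compatibility check: every field added in `new` must carry a default."""
    old_names = {f["name"] for f in old["fields"]}
    return all("default" in f for f in new["fields"] if f["name"] not in old_names)

print(added_fields_have_defaults(v1, v2))  # True -> old files stay readable
```

Running a check like this before flipping avro.schema.url to the v2 file helps avoid breaking existing readers.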
Use Avro when schema evolution is expected.
Keep schemas under version control (e.g., Git) and publish them to a stable HDFS location for Hive to reference.
Always validate .avsc schema before use.
Combine Avro with partitioning for better performance.
For analytics-heavy workloads, convert Avro to ORC/Parquet after ingestion.
This guide helps you handle Avro files in Hive efficiently, with schema evolution support and smooth integration.