Step-by-Step Guide: Handling ORC (Optimized Row Columnar) Files in Hive
ORC (Optimized Row Columnar) is a self-describing, type-aware columnar file format designed for Hadoop workloads. It offers an efficient way to store, read, and process data, particularly within Apache Hive, and was introduced to address limitations of earlier Hive file formats such as RCFile. Compared with the TextFile or SequenceFile formats, ORC provides better compression, faster reads, and more compact storage. In this article we walk through handling ORC files in Hive step by step.
Benefits:
High compression ratio → saves storage space (a codec example follows this list)
Columnar storage → reads only required columns
Splittable → supports parallel processing
Built-in indexes → improves query performance
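For example, the compression codec can be chosen per table through the orc.compress table property, which accepts ZLIB (the default), SNAPPY, or NONE. A minimal sketch using a hypothetical sales_orc table:
CREATE TABLE sales_orc (
  order_id INT,
  amount DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");  -- SNAPPY trades some compression ratio for faster reads and writes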
Use the STORED AS ORC clause while creating a Hive table.
CREATE TABLE employees_orc (
  id INT,
  name STRING,
  department STRING,
  salary DOUBLE
)
STORED AS ORC;
This creates an empty table in ORC format.
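To confirm that the table really uses ORC, you can inspect its metadata; in the output, the InputFormat and OutputFormat entries should reference the ORC classes (exact class names vary slightly by Hive version):
DESCRIBE FORMATTED employees_orc;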
You can load data files from HDFS into the ORC table. Keep in mind that LOAD DATA only moves (or copies) files into the table's directory; it does not convert their format, so the files must already be in ORC format.
LOAD DATA INPATH '/user/hive/input/employees_orc'
INTO TABLE employees_orc;
If your source data is in text format, do not load it directly; convert it first, for example with CTAS as shown in the next step.
Use CREATE TABLE AS SELECT (CTAS) to convert an existing text-based table to ORC. CTAS creates a new table, so if you already created an empty employees_orc table above, drop it first or use the INSERT approach shown after this example.
CREATE TABLE employees_orc
STORED AS ORC
AS
SELECT * FROM employees_text;
This copies data from the text-based table to a new ORC table.
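Alternatively, if the ORC table already exists (for example the empty employees_orc table created earlier), you can populate it from the text-based table with a plain INSERT; this sketch assumes the same employees_text staging table:
-- Rewrites the ORC table's contents from the text-based staging table
INSERT OVERWRITE TABLE employees_orc
SELECT * FROM employees_text;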
Enable ORC optimizations to get better performance.
SET hive.exec.orc.split.strategy = ETL;
SET hive.optimize.index.filter = true;
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;
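Two related settings are also worth checking, although their defaults vary by Hive version: hive.optimize.ppd controls predicate pushdown in general, and hive.exec.orc.default.compress sets the default codec for newly written ORC files.
SET hive.optimize.ppd = true;                  -- predicate pushdown (usually enabled by default)
SET hive.exec.orc.default.compress = SNAPPY;   -- default codec for new ORC files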
You can query ORC tables like any other Hive table.
SELECT department, AVG(salary) FROM employees_orc GROUP BY department;
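To check whether a filter is actually pushed down to the ORC reader, you can look at the query plan; the exact plan text depends on the Hive version and execution engine, and the department value here is just a placeholder:
EXPLAIN SELECT name, salary FROM employees_orc WHERE department = 'Engineering';
With hive.optimize.index.filter enabled, the TableScan operator in the plan should show the filter expression being pushed into the scan.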
You can view ORC file metadata and compression details using:
hive --orcfiledump /user/hive/warehouse/employees_orc/000000_0
Always use ORC for large analytical datasets.
Enable vectorized execution and predicate pushdown.
Combine ORC with partitioning and bucketing for best performance.
Periodically analyze tables to keep optimizer statistics up to date.
Avoid too many small ORC files; merge them when possible (example commands for both follow this list).
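Hive has built-in statements for those last two points; a sketch against the employees_orc table (for a partitioned table, add a PARTITION clause to the CONCATENATE statement):
-- Refresh table-level and column-level statistics used by the optimizer
ANALYZE TABLE employees_orc COMPUTE STATISTICS;
ANALYZE TABLE employees_orc COMPUTE STATISTICS FOR COLUMNS;
-- Merge small ORC files within the table into larger ones
ALTER TABLE employees_orc CONCATENATE;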
This guide helps you handle ORC files in Hive efficiently, ensuring faster queries and optimized storage.