Step-by-Step Guide: Handling ORC (Optimized Row Columnar) Files in Hive
ORC (Optimized Row Columnar) is a self-describing, type-aware columnar file format designed for Hadoop workloads. It offers an efficient way to store, read, and process data, particularly within Apache Hive, and was introduced to address limitations of earlier Hive file formats such as RCFile. Compared with the TextFile or SequenceFile formats, ORC provides better compression, faster reads, and more compact storage. In this article we walk through handling ORC files in Hive step by step.
Benefits:
High compression ratio → saves storage space (a codec example follows this list)
Columnar storage → reads only required columns
Splittable → supports parallel processing
Built-in indexes → improves query performance
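For example, the compression codec can be chosen per table through the orc.compress table property, which accepts ZLIB (the default), SNAPPY, or NONE. A minimal sketch using a hypothetical sales_orc table:
CREATE TABLE sales_orc (
  order_id INT,
  amount DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");  -- SNAPPY trades some compression ratio for faster reads and writes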
Use the STORED AS ORC clause while creating a Hive table.
CREATE TABLE employees_orc (
  id INT,
  name STRING,
  department STRING,
  salary DOUBLE
)
STORED AS ORC;
This creates an empty table in ORC format.
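To confirm that the table really uses ORC, you can inspect its metadata; in the output, the InputFormat and OutputFormat entries should reference the ORC classes (exact class names vary slightly by Hive version):
DESCRIBE FORMATTED employees_orc;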
You can load data files from HDFS into the ORC table. Keep in mind that LOAD DATA only moves (or copies) files into the table's directory; it does not convert their format, so the files must already be in ORC format.
LOAD DATA INPATH '/user/hive/input/employees_orc'
INTO TABLE employees_orc;
If your source data is in text format, do not load it directly; convert it first, for example with CTAS as shown in the next step.
Use CREATE TABLE AS SELECT (CTAS) to convert an existing text-based table to ORC. CTAS creates a new table, so if you already created an empty employees_orc table above, drop it first or use the INSERT approach shown after this example.
CREATE TABLE employees_orc
STORED AS ORC
AS
SELECT * FROM employees_text;
This copies data from the text-based table to a new ORC table.
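Alternatively, if the ORC table already exists (for example the empty employees_orc table created earlier), you can populate it from the text-based table with a plain INSERT; this sketch assumes the same employees_text staging table:
-- Rewrites the ORC table's contents from the text-based staging table
INSERT OVERWRITE TABLE employees_orc
SELECT * FROM employees_text;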
Enable ORC optimizations to get better performance.
SET hive.exec.orc.split.strategy = ETL;
SET hive.optimize.index.filter = true;
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;
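Two related settings are also worth checking, although their defaults vary by Hive version: hive.optimize.ppd controls predicate pushdown in general, and hive.exec.orc.default.compress sets the default codec for newly written ORC files.
SET hive.optimize.ppd = true;                  -- predicate pushdown (usually enabled by default)
SET hive.exec.orc.default.compress = SNAPPY;   -- default codec for new ORC files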
You can query ORC tables like any other Hive table.
SELECT department, AVG(salary) FROM employees_orc GROUP BY department;
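To check whether a filter is actually pushed down to the ORC reader, you can look at the query plan; the exact plan text depends on the Hive version and execution engine, and the department value here is just a placeholder:
EXPLAIN SELECT name, salary FROM employees_orc WHERE department = 'Engineering';
With hive.optimize.index.filter enabled, the TableScan operator in the plan should show the filter expression being pushed into the scan.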
You can view ORC file metadata and compression details using:
hive --orcfiledump /user/hive/warehouse/employees_orc/000000_0
Always use ORC for large analytical datasets.
Enable vectorized execution and predicate pushdown.
Combine ORC with partitioning and bucketing for best performance.
Periodically analyze tables to keep optimizer statistics up to date.
Avoid too many small ORC files; merge them when possible (example commands for both follow this list).
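Hive has built-in statements for those last two points; a sketch against the employees_orc table (for a partitioned table, add a PARTITION clause to the CONCATENATE statement):
-- Refresh table-level and column-level statistics used by the optimizer
ANALYZE TABLE employees_orc COMPUTE STATISTICS;
ANALYZE TABLE employees_orc COMPUTE STATISTICS FOR COLUMNS;
-- Merge small ORC files within the table into larger ones
ALTER TABLE employees_orc CONCATENATE;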
This guide helps you handle ORC files in Hive efficiently, ensuring faster queries and optimized storage.