Step-by-Step Guide: Handling Parquet Files in Hive
This article walks through Hive's integration with Parquet step by step. Hive provides native support for reading and writing Parquet files, leveraging their columnar storage and efficient compression for improved query performance.
Parquet is a columnar storage format supported by Hive.
It offers high compression, efficient encoding, and fast query performance.
Benefits:
Columnar format → reads only necessary columns
High compression → saves storage space
Splittable → supports parallel processing
Widely used with big data frameworks (Hive, Spark, Impala)
Use the STORED AS PARQUET clause while creating a Hive table.
CREATE TABLE employees_parquet (
id INT,
name STRING,
department STRING,
salary DOUBLE
)
STORED AS PARQUET;
This creates an empty table using the Parquet format.
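To confirm the table actually uses Parquet, you can inspect its metadata with a standard Hive command (the exact output fields vary by Hive version):

```sql
-- For a Parquet table, the output lists ParquetHiveSerDe and the
-- MapredParquetInputFormat/MapredParquetOutputFormat classes.
DESCRIBE FORMATTED employees_parquet;
```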
You can load data from HDFS into the Parquet table. Note that LOAD DATA only moves files into the table's directory; it does not convert them, so the file must already be in Parquet format.

LOAD DATA INPATH '/user/hive/input/employees_parquet.csv'
INTO TABLE employees_parquet;

If the source data is in text format, as the CSV above suggests, loading it this way will cause read errors at query time; convert it using CTAS instead (Step 4).
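If the CSV data is not yet in Hive at all, a common pattern is to first stage it in a text-format table, where LOAD DATA works because the file format matches the table's storage format; the CTAS in the next step then converts it to Parquet. This is a sketch, assuming the same four columns and a comma delimiter:

```sql
-- Text-format staging table matching the CSV layout.
CREATE TABLE employees_text (
  id INT,
  name STRING,
  department STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- LOAD DATA simply moves the file, which is correct here since
-- both the file and the table are plain text.
LOAD DATA INPATH '/user/hive/input/employees_parquet.csv'
INTO TABLE employees_text;
```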
Use Create Table As Select (CTAS) to convert existing data to Parquet.
CREATE TABLE employees_parquet
STORED AS PARQUET
AS
SELECT * FROM employees_text;
This copies data from the text-based table into a new Parquet table. The target table must not already exist, so choose a different name if employees_parquet was created in the earlier step.
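Conversion can also be combined with partitioning. This is a sketch assuming a department-partitioned layout and a hypothetical table name; older Hive versions do not support PARTITIONED BY directly in CTAS, so the table is created first and populated with a dynamic-partition insert:

```sql
-- Dynamic partitioning must be enabled for this insert pattern.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE employees_parquet_part (
  id INT,
  name STRING,
  salary DOUBLE
)
PARTITIONED BY (department STRING)
STORED AS PARQUET;

-- The partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE employees_parquet_part PARTITION (department)
SELECT id, name, salary, department FROM employees_text;
```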
Enable Parquet-related optimizations to get better performance.

SET hive.vectorized.execution.enabled = true;         -- process rows in batches
SET hive.vectorized.execution.reduce.enabled = true;  -- vectorize the reduce side
SET parquet.memory.min.chunk.size = 134217728;        -- minimum chunk size (128 MB)
You can query Parquet tables like any other Hive table.
SELECT department, COUNT(*) FROM employees_parquet GROUP BY department;
You can inspect Parquet file metadata using the parquet-tools command (if installed):
parquet-tools meta /user/hive/warehouse/employees_parquet/000000_0
Use Parquet for large analytics datasets.
Enable vectorized execution for faster queries.
Combine Parquet with partitioning and bucketing.
Periodically ANALYZE TABLE to refresh statistics.
Avoid many small Parquet files; merge them for better performance.
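The last two practices above map to standard Hive commands and settings. The merge-related properties below are a sketch; their defaults and effects vary by Hive version and execution engine:

```sql
-- Refresh table- and column-level statistics for the optimizer.
ANALYZE TABLE employees_parquet COMPUTE STATISTICS;
ANALYZE TABLE employees_parquet COMPUTE STATISTICS FOR COLUMNS;

-- Ask Hive to merge small output files at the end of jobs.
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
SET hive.merge.smallfiles.avgsize = 134217728;  -- target ~128 MB average file size
```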
This guide helps you handle Parquet files in Hive efficiently for faster queries and better storage optimization.