Step-by-Step Guide: Handling Parquet Files in Hive
This article walks through Hive's integration with Parquet step by step. Hive provides native support for reading and writing Parquet files, leveraging their columnar storage and efficient compression for improved query performance.
Parquet is a columnar storage format supported by Hive.
It offers high compression, efficient encoding, and fast query performance.
Benefits:
Columnar format → reads only necessary columns
High compression → saves storage space
Splittable → supports parallel processing
Widely used with big data frameworks (Hive, Spark, Impala)
Use the STORED AS PARQUET clause while creating a Hive table.
CREATE TABLE employees_parquet (
id INT,
name STRING,
department STRING,
salary DOUBLE
)
STORED AS PARQUET;
This creates an empty table using the Parquet format.
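To confirm the table actually uses Parquet, you can inspect its metadata with a standard Hive command (the exact output fields vary by Hive version):

```sql
-- For a Parquet table, the output lists ParquetHiveSerDe and the
-- MapredParquetInputFormat/MapredParquetOutputFormat classes.
DESCRIBE FORMATTED employees_parquet;
```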
You can load data from HDFS into the Parquet table. Note that LOAD DATA only moves files into the table's directory; it does not convert them, so the file must already be in Parquet format.

LOAD DATA INPATH '/user/hive/input/employees_parquet.csv'
INTO TABLE employees_parquet;

If the source data is in text format, as the CSV above suggests, loading it this way will cause read errors at query time; convert it using CTAS instead (Step 4).
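If the CSV data is not yet in Hive at all, a common pattern is to first stage it in a text-format table, where LOAD DATA works because the file format matches the table's storage format; the CTAS in the next step then converts it to Parquet. This is a sketch, assuming the same four columns and a comma delimiter:

```sql
-- Text-format staging table matching the CSV layout.
CREATE TABLE employees_text (
  id INT,
  name STRING,
  department STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- LOAD DATA simply moves the file, which is correct here since
-- both the file and the table are plain text.
LOAD DATA INPATH '/user/hive/input/employees_parquet.csv'
INTO TABLE employees_text;
```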
Use Create Table As Select (CTAS) to convert existing data to Parquet.
CREATE TABLE employees_parquet
STORED AS PARQUET
AS
SELECT * FROM employees_text;
This copies data from the text-based table into a new Parquet table. The target table must not already exist, so choose a different name if employees_parquet was created in the earlier step.
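Conversion can also be combined with partitioning. This is a sketch assuming a department-partitioned layout and a hypothetical table name; older Hive versions do not support PARTITIONED BY directly in CTAS, so the table is created first and populated with a dynamic-partition insert:

```sql
-- Dynamic partitioning must be enabled for this insert pattern.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE employees_parquet_part (
  id INT,
  name STRING,
  salary DOUBLE
)
PARTITIONED BY (department STRING)
STORED AS PARQUET;

-- The partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE employees_parquet_part PARTITION (department)
SELECT id, name, salary, department FROM employees_text;
```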
Enable Parquet-related optimizations to get better performance.

SET hive.vectorized.execution.enabled = true;         -- process rows in batches
SET hive.vectorized.execution.reduce.enabled = true;  -- vectorize the reduce side
SET parquet.memory.min.chunk.size = 134217728;        -- minimum chunk size (128 MB)
You can query Parquet tables like any other Hive table.
SELECT department, COUNT(*) FROM employees_parquet GROUP BY department;
You can inspect Parquet file metadata using the parquet-tools command (if installed):
parquet-tools meta /user/hive/warehouse/employees_parquet/000000_0
Use Parquet for large analytics datasets.
Enable vectorized execution for faster queries.
Combine Parquet with partitioning and bucketing.
Periodically ANALYZE TABLE to refresh statistics.
Avoid many small Parquet files; merge them for better performance.
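The last two practices above map to standard Hive commands and settings. The merge-related properties below are a sketch; their defaults and effects vary by Hive version and execution engine:

```sql
-- Refresh table- and column-level statistics for the optimizer.
ANALYZE TABLE employees_parquet COMPUTE STATISTICS;
ANALYZE TABLE employees_parquet COMPUTE STATISTICS FOR COLUMNS;

-- Ask Hive to merge small output files at the end of jobs.
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
SET hive.merge.smallfiles.avgsize = 134217728;  -- target ~128 MB average file size
```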
This guide helps you handle Parquet files in Hive efficiently for faster queries and better storage optimization.