Step-by-Step Guide: Handling Sequence Files in Hive
Sequence Files in Hive
A SequenceFile is a binary, flat file format from Hadoop that stores data as key-value pairs. Hive can create tables in this format and query them like any other table. SequenceFiles are commonly used as the input and output of MapReduce jobs and are often more efficient than plain text, particularly when many small files need to be packed into fewer large ones or when compression is required.
Unlike a gzip-compressed text file, a SequenceFile remains splittable when compressed, which makes it a good fit for large datasets.
Benefits:
Supports compression (block or record level)
Splittable → supports parallel processing
Typically faster to read and write than uncompressed plain text, mainly because compressed data means less disk and network I/O
Widely supported by Hadoop ecosystem
Use the STORED AS SEQUENCEFILE clause while creating a Hive table.
CREATE TABLE employees_seq (
id INT,
name STRING,
department STRING,
salary DOUBLE
)
STORED AS SEQUENCEFILE;
This creates an empty table using SequenceFile format.
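To confirm that the table really uses SequenceFile, you can inspect its metadata. A minimal sketch, assuming the employees_seq table above was created in the current database:

```sql
-- Show the table's storage metadata, including its input and output formats.
DESCRIBE FORMATTED employees_seq;

-- For a table declared STORED AS SEQUENCEFILE, the storage section should list:
--   InputFormat:  org.apache.hadoop.mapred.SequenceFileInputFormat
--   OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
```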
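You can load data from HDFS into the SequenceFile table. LOAD DATA only moves files into the table's directory and does not convert them, so the files you load must already be in SequenceFile format, whatever their name or extension.
LOAD DATA INPATH '/user/hive/input/employees_seq.csv'
INTO TABLE employees_seq;
If your source data is plain text, such as a real CSV file, do not load it directly into this table. Depending on the Hive version and settings, the load is either rejected or later queries fail when Hive tries to read the text as a SequenceFile. Load the file into a text table first, then convert it to SequenceFile with a CTAS statement.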
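For text sources, one common pattern is to stage the file in a text table first. The sketch below assumes a comma-delimited file with no header row. The table name employees_text matches the one used in the CTAS statement, while the HDFS path is a hypothetical placeholder.

```sql
-- Staging table that matches the layout of the CSV file.
-- Assumption: fields are comma-separated and there is no header row.
CREATE TABLE employees_text (
  id INT,
  name STRING,
  department STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Hypothetical path: replace it with the real location of your CSV file.
-- Note that LOAD DATA INPATH moves the file out of its original HDFS location.
LOAD DATA INPATH '/user/hive/input/employees.csv'
INTO TABLE employees_text;
```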
Use Create Table As Select (CTAS) to convert existing data to SequenceFile.
CREATE TABLE employees_seq
STORED AS SEQUENCEFILE
AS
SELECT * FROM employees_text;
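This copies data from the text-based table into a new SequenceFile table. CTAS fails if employees_seq already exists, so drop the table created earlier or choose a different name before running it.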
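If the SequenceFile table already exists, for example because you created it earlier with an explicit schema, INSERT ... SELECT performs the same conversion without recreating the table. A sketch, assuming employees_text and employees_seq have the same columns in the same order:

```sql
-- Replace the contents of the existing SequenceFile table with the rows
-- from the text table. Hive writes the rows in the target table's format.
INSERT OVERWRITE TABLE employees_seq
SELECT id, name, department, salary
FROM employees_text;

-- To append rows instead of replacing them, use INSERT INTO:
-- INSERT INTO TABLE employees_seq
-- SELECT id, name, department, salary FROM employees_text;
```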
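You can enable block-level compression to reduce storage size and I/O. These are session settings and only affect data written after they are set, so run them before the CTAS or INSERT statement that populates the table.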
SET hive.exec.compress.output = true;
SET mapreduce.output.fileoutputformat.compress = true;
SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.output.fileoutputformat.compress.type = BLOCK;
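Because these settings apply per session, they have to be in effect when the data is written. The sketch below assumes the SET statements above have already been run in the same session. The table name employees_seq_compressed is hypothetical and used only for illustration.

```sql
-- Calling SET with only a property name prints its current value,
-- which confirms that the session picked up the settings.
SET hive.exec.compress.output;
SET mapreduce.output.fileoutputformat.compress.type;

-- Write a new copy of the data. Files produced by this statement
-- use the compression settings that are active in the session.
CREATE TABLE employees_seq_compressed
STORED AS SEQUENCEFILE
AS
SELECT * FROM employees_text;
```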
You can query SequenceFile tables like any other Hive table.
SELECT department, COUNT(*) FROM employees_seq GROUP BY department;
Use SequenceFile when you need splittable compressed storage.
Prefer block compression over record compression: it compresses many records together, which usually gives a better compression ratio.
Combine with partitioning and bucketing to improve query speed.
For analytical queries, consider converting to a columnar format such as ORC or Parquet later, since these let Hive read only the columns a query needs.
Avoid too many small SequenceFiles; merge them regularly.
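The last three recommendations can be sketched as follows. The table names employees_seq_part and employees_orc are hypothetical, and the merge thresholds are illustrative values rather than tuned recommendations.

```sql
-- Partitioning: one directory per department, each stored as SequenceFile.
CREATE TABLE employees_seq_part (
  id INT,
  name STRING,
  salary DOUBLE
)
PARTITIONED BY (department STRING)
STORED AS SEQUENCEFILE;

-- Allow Hive to derive partition values from the query output.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE employees_seq_part PARTITION (department)
SELECT id, name, salary, department
FROM employees_seq;

-- Small files: ask Hive to merge small output files at the end of a job.
-- These two switches apply to the MapReduce execution engine.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- Merge when the average output file is smaller than 128 MB (illustrative).
SET hive.merge.smallfiles.avgsize=134217728;
-- Aim for merged files of about 256 MB (illustrative).
SET hive.merge.size.per.task=268435456;

-- Rewriting a non-transactional table onto itself compacts its files
-- using the merge settings above. Try this on a copy of the table first.
INSERT OVERWRITE TABLE employees_seq
SELECT * FROM employees_seq;

-- Analytical workloads: create a columnar copy in ORC.
CREATE TABLE employees_orc
STORED AS ORC
AS
SELECT * FROM employees_seq;
```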
With these steps you can create, load, compress, and query SequenceFile tables in Hive, which generally offer more compact storage and better performance than plain text.