Step-by-Step Guide: Handling Sequence Files in Hive

In Hive, SequenceFile refers to a Hadoop file format that Hive can both write and query. A SequenceFile is a flat binary file that stores data as key-value pairs. The format is common in Hadoop MapReduce jobs and can be more efficient than plain text, especially when many small files need to be packed into fewer large ones or when compression is desired.

Step 1: What is SequenceFile Format in Hive?

SequenceFile is a flat file format consisting of binary key-value pairs.
It is splittable and compressible, making it efficient for large datasets.

Benefits:

Supports compression (block or record level)
Splittable, even when compressed → supports parallel processing
Usually faster to read and write than plain text, particularly with block compression enabled
Widely supported by Hadoop ecosystem

Step 2: Creating a Table with SequenceFile Format

Use the STORED AS SEQUENCEFILE clause while creating a Hive table.

CREATE TABLE employees_seq (
  id INT,
  name STRING,
  department STRING,
  salary DOUBLE
)
STORED AS SEQUENCEFILE;

This creates an empty table using SequenceFile format.
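
To confirm the storage format, you can inspect the table metadata. This is an optional check, not a required step; for a SequenceFile table the output should list org.apache.hadoop.mapred.SequenceFileInputFormat as the input format.

-- Show table metadata, including InputFormat and OutputFormat
DESCRIBE FORMATTED employees_seq;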

Step 3: Loading Data into SequenceFile Table

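LOAD DATA moves files from HDFS into the table's directory without converting them, so the source file must already be in SequenceFile format. Loading a CSV or other text file this way will not work: depending on configuration, Hive either rejects the file or the table becomes unreadable.

LOAD DATA INPATH '/user/hive/input/employees.seq'
INTO TABLE employees_seq;

If your source data is in text format, convert it using CTAS (Step 4).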
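
Step 4 reads from a text table named employees_text, which this guide does not define. A minimal sketch of such a staging table follows, assuming comma-delimited input without a header row; the file path is hypothetical.

-- Staging table for the raw text data (assumes comma-separated fields)
CREATE TABLE employees_text (
  id INT,
  name STRING,
  department STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Hypothetical HDFS path; the file is moved into the table directory
LOAD DATA INPATH '/user/hive/input/employees.csv'
INTO TABLE employees_text;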

Step 4: Converting Text Data to SequenceFile Format (CTAS)

Use Create Table As Select (CTAS) to convert existing data to SequenceFile.

CREATE TABLE employees_seq
STORED AS SEQUENCEFILE
AS
SELECT * FROM employees_text;

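This copies the rows of the text-based table into a new SequenceFile table. CTAS fails if employees_seq already exists, so skip Step 2 or drop that table first when using this approach.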
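
If employees_seq was already created in Step 2, an INSERT statement is one alternative to CTAS. A minimal sketch, assuming employees_text has the same four columns as employees_seq:

-- Rewrite the text rows into the existing SequenceFile table
INSERT OVERWRITE TABLE employees_seq
SELECT id, name, department, salary
FROM employees_text;

INSERT OVERWRITE replaces any existing rows; use INSERT INTO to append instead.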

Step 5: Enabling Compression with SequenceFile

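You can enable block-level compression to reduce storage and I/O. These settings are session-level and affect only data written after they are set, so run them before the CTAS or INSERT that populates the table.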

SET hive.exec.compress.output = true;
SET mapreduce.output.fileoutputformat.compress = true;
SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.output.fileoutputformat.compress.type = BLOCK;
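
As an illustration, the settings can be followed in the same session by a statement that writes data. The sketch below assumes the Snappy codec is available on the cluster and uses a hypothetical table name, employees_seq_snappy.

-- Written with the compression settings above in effect
CREATE TABLE employees_seq_snappy
STORED AS SEQUENCEFILE
AS
SELECT * FROM employees_text;

Comparing the sizes of the table directories with hdfs dfs -du -h gives a rough view of the compression ratio.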

Step 6: Querying SequenceFile Tables

You can query SequenceFile tables like any other Hive table.

SELECT department, COUNT(*) FROM employees_seq GROUP BY department;

✅ Best Practices

Use SequenceFile when you need splittable compressed storage.
Prefer block compression over record compression; it compresses many records together and usually gives a better ratio.
Combine with partitioning and bucketing to improve query speed.
For analytical queries, consider converting to ORC/Parquet later.
Avoid too many small SequenceFiles; merge them regularly.
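
Several of these practices can be sketched together. The example below assumes the MapReduce execution engine for the merge settings and uses hypothetical table names (employees_seq_part, employees_orc); adjust both to your environment.

-- Partitioned SequenceFile table (hypothetical name)
CREATE TABLE employees_seq_part (
  id INT,
  name STRING,
  salary DOUBLE
)
PARTITIONED BY (department STRING)
STORED AS SEQUENCEFILE;

-- Allow partitions to be derived from the data
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Ask Hive to merge small output files at the end of the job
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;

-- The partition column must come last in the SELECT list
INSERT OVERWRITE TABLE employees_seq_part PARTITION (department)
SELECT id, name, salary, department
FROM employees_seq;

-- Later conversion to ORC for analytical workloads (hypothetical name)
CREATE TABLE employees_orc
STORED AS ORC
AS
SELECT * FROM employees_seq;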

With these steps you can create, load, compress, and query SequenceFile tables in Hive, with better storage efficiency and performance than plain text.