Step-by-Step Guide: Handling Sequence Files in Hive

9/11/2025

SequenceFile is a Hadoop file format that Hive can use as a table storage format and query directly. SequenceFiles are binary flat files that store data as key-value pairs. They are widely used in Hadoop MapReduce jobs and can be more efficient than plain text files, especially when many small files need to be packed into larger ones or when compression is desired.

Step 1: What is SequenceFile Format in Hive?

  • SequenceFile is a flat file format consisting of binary key-value pairs.

  • It is splittable and compressible, making it efficient for large datasets.

Benefits:

  • Supports compression (block or record level)

  • Splittable, even when compressed → supports parallel processing

  • Often faster to read and write than plain text, particularly when compression is enabled

  • Widely supported by Hadoop ecosystem


Step 2: Creating a Table with SequenceFile Format

Use the STORED AS SEQUENCEFILE clause while creating a Hive table.

CREATE TABLE employees_seq (
  id INT,
  name STRING,
  department STRING,
  salary DOUBLE
)
STORED AS SEQUENCEFILE;
  • This creates an empty table using SequenceFile format.
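
To confirm the storage format after creating the table, a quick check along these lines can help. It assumes the employees_seq table from above exists in the current database; the class names in the comments are what Hive typically reports for SequenceFile tables, and the exact output layout may differ between versions.

```sql
-- Show the table metadata, including the storage information section.
DESCRIBE FORMATTED employees_seq;

-- In the output, look for lines similar to:
--   InputFormat:  org.apache.hadoop.mapred.SequenceFileInputFormat
--   OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
```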


Step 3: Loading Data into SequenceFile Table

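You can load data from HDFS into the SequenceFile table. LOAD DATA only moves files into the table's directory and does not convert them, so the source files must already be in SequenceFile format.

LOAD DATA INPATH '/user/hive/input/employees_seq'
INTO TABLE employees_seq;
  • If your source data is in text format (for example, CSV), do not load it directly into this table; load it into a text table and convert it using CTAS (Step 4).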
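
For text input, one possible way to stage the data is sketched below. The table name employees_text matches the source table used in Step 4, but its definition, the comma delimiter, and the HDFS path are assumptions for illustration.

```sql
-- Hypothetical staging table for comma-delimited text input.
CREATE TABLE employees_text (
  id INT,
  name STRING,
  department STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Move the CSV file from HDFS into the staging table (no conversion happens).
-- The path is an assumed example location.
LOAD DATA INPATH '/user/hive/input/employees.csv'
INTO TABLE employees_text;
```

Because the staging table is declared as TEXTFILE, the moved file matches the table's format and can be queried right away.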


Step 4: Converting Text Data to SequenceFile Format (CTAS)

Use Create Table As Select (CTAS) to convert existing data to SequenceFile.

CREATE TABLE employees_seq
STORED AS SEQUENCEFILE
AS
SELECT * FROM employees_text;
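  • This copies data from the text-based table into a new SequenceFile table.

  • CTAS fails if employees_seq already exists (for example, from Step 2), so drop that table first or choose a different table name.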
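
If the SequenceFile table already exists, an INSERT can perform the same conversion without recreating it. This is a minimal sketch, assuming employees_seq was created as in Step 2 and employees_text has the same four columns.

```sql
-- Replace the contents of the existing SequenceFile table with rows
-- read from the text table. Hive writes the rows in the target table's
-- storage format, so the output files are SequenceFiles.
INSERT OVERWRITE TABLE employees_seq
SELECT id, name, department, salary
FROM employees_text;
```

Use INSERT INTO instead of INSERT OVERWRITE to append rows rather than replace them.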


Step 5: Enabling Compression with SequenceFile

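You can enable block-level compression to reduce storage size and I/O. These settings affect only data written after they are set, so run them in the same session before the INSERT or CTAS that writes the table. The Snappy codec must be available on the cluster.

-- Compress the final output of Hive queries.
SET hive.exec.compress.output=true;
-- Compress job output files.
SET mapreduce.output.fileoutputformat.compress=true;
-- Use the Snappy codec.
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Compress groups of records together rather than each record separately.
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;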
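
As a sketch of how these settings might be used, the statements below first print the session values and then write a new table while compression is active. The table name employees_seq_snappy is hypothetical, and employees_text is assumed to be an existing text table.

```sql
-- SET with a key and no value prints the current setting,
-- which confirms the session picked up the configuration.
SET hive.exec.compress.output;
SET mapreduce.output.fileoutputformat.compress.type;

-- Data written from this point on in the session is compressed.
CREATE TABLE employees_seq_snappy
STORED AS SEQUENCEFILE
AS
SELECT * FROM employees_text;
```

Tables written before the settings were applied stay uncompressed until their data is rewritten.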

Step 6: Querying SequenceFile Tables

You can query SequenceFile tables like any other Hive table.

SELECT department, COUNT(*) FROM employees_seq GROUP BY department;

✅ Best Practices

  • Use SequenceFile when you need splittable compressed storage.

  • Prefer block compression over record compression; compressing many records together usually gives a better compression ratio.

  • Combine with partitioning and bucketing to improve query speed.

  • For analytical queries, consider converting to ORC/Parquet later.

  • Avoid too many small SequenceFiles; merge them regularly.
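
Several of these practices can be combined, as in the following sketch. The table names employees_seq_part and employees_orc are hypothetical, the merge settings shown apply to MapReduce execution, and the right values depend on the cluster.

```sql
-- Hypothetical partitioned SequenceFile table; department is the partition column.
CREATE TABLE employees_seq_part (
  id INT,
  name STRING,
  salary DOUBLE
)
PARTITIONED BY (department STRING)
STORED AS SEQUENCEFILE;

-- Allow Hive to create partitions from the values in the data.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Ask Hive to merge small output files at the end of the job.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;

-- The partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE employees_seq_part PARTITION (department)
SELECT id, name, salary, department
FROM employees_seq;

-- Later, for analytical workloads, copy the data into a columnar format.
CREATE TABLE employees_orc
STORED AS ORC
AS
SELECT * FROM employees_seq;
```

Queries that filter on department can then skip the other partitions, and the ORC copy can serve column-oriented analytical queries.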


This guide shows how to create, load, compress, and query SequenceFile tables in Hive, which can offer better storage efficiency and performance than plain text.
