Step-by-Step Guide: Hive Integration with Apache Pig
Apache Pig and Apache Hive, both integral to the Hadoop ecosystem, can be integrated to leverage their respective strengths: Pig for data transformation and ETL, and Hive for data warehousing and SQL-like querying. The primary mechanism for this integration is HCatalog, Hive's table and storage management layer.

Apache Pig is a high-level platform for analyzing large datasets using Pig Latin scripts, while Hive stores structured data and its metadata in a centralized Metastore. Integrating the two lets Pig read and write Hive tables directly, without exporting data to intermediate files.
Before you begin, make sure you have:
- Apache Pig and Apache Hive installed
- Hadoop and HDFS configured
- The Hive Metastore running
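If the Metastore is not already running as a standalone service, you can typically start it with the command below (exact setup varies by installation; this assumes hive is on your PATH):

hive --service metastore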
Pig can access Hive tables using HCatLoader (part of HCatalog, which ships with Hive).
Make sure hcatalog-core.jar and hive-hcatalog-pig-adapter.jar are available on Pig's classpath; the example script below registers both explicitly.
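Alternatively, if your Pig installation includes HCatalog support, launching Pig with the -useHCatalog flag puts the required jars on the classpath automatically, so the REGISTER statements below can be omitted (myscript.pig stands in for your own script name):

pig -useHCatalog myscript.pig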
Example Pig script:
-- Register HCatalog libraries
REGISTER /usr/lib/hive-hcatalog/share/hcatalog/hcatalog-core.jar;
REGISTER /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-pig-adapter.jar;
-- Load data from Hive table using HCatLoader
emp_data = LOAD 'employees' USING org.apache.hive.hcatalog.pig.HCatLoader();
-- Display the first 10 rows (DUMP alone would print the entire relation)
top_rows = LIMIT emp_data 10;
DUMP top_rows;
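Because HCatLoader reads the table definition from the Hive Metastore, the relation already has a schema and no AS clause is needed; you can also load from a non-default database with the 'dbname.tablename' syntax. As a hypothetical illustration, assuming the employees table has name and salary columns:

-- Inspect the schema pulled from the Metastore
DESCRIBE emp_data;

-- Hypothetical transformation: assumes 'name' and 'salary' columns exist
high_paid = FILTER emp_data BY salary > 75000.0;
names = FOREACH high_paid GENERATE name, salary;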
You can also write Pig results back into Hive tables using HCatStorer:
STORE emp_data INTO 'employee_summary'
USING org.apache.hive.hcatalog.pig.HCatStorer();
This writes the processed data into a Hive-managed table. Note that HCatStorer does not create tables: employee_summary must already exist in the Hive Metastore, and its column names and types must match the schema of the relation being stored.
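The target table must therefore be created in Hive beforehand. A minimal, hypothetical DDL sketch: if you stored the two-column names relation from the earlier example instead of emp_data, the matching table could look like this (adjust the columns to whatever relation you actually store):

-- Hypothetical DDL: schema must match the stored Pig relation
CREATE TABLE employee_summary (
  name   STRING,
  salary DOUBLE
)
STORED AS ORC;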
After Pig finishes writing, switch to the Hive CLI (or Beeline) and verify the results:
SELECT * FROM employee_summary LIMIT 5;
Best practices:
- Register the HCatalog jars that match your Hive version.
- Use ORC or Parquet storage formats for better query performance.
- Partition your Hive tables so Pig jobs read only the data they need (see the sketch after this list).
- Define meaningful, explicit schemas in Hive for seamless integration.
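For the partitioning tip above: when the target table is partitioned, HCatStorer accepts a partition specification string, letting each Pig run write into a specific partition. A sketch, assuming a hypothetical dept partition column on employee_summary:

-- Write the results into the dept='sales' partition
STORE names INTO 'employee_summary'
    USING org.apache.hive.hcatalog.pig.HCatStorer('dept=sales');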
With this setup you can:
- Use HCatLoader and HCatStorer for integration
- Run Pig scripts directly on Hive tables
- Move data seamlessly between Hive and Pig
This tutorial helps you integrate Apache Hive with Apache Pig for efficient big data processing and analytics.