External Tables in Hive – A Complete Guide

Apache Hive is a powerful data warehousing tool built on top of Hadoop that allows users to query and manage large datasets using a SQL-like language called HiveQL. One of Hive’s key features is its support for external tables, which provide a flexible way to manage data stored outside Hive’s warehouse directory.

In this article, we’ll explore what external tables are, how they work, their benefits, and how to create them with practical examples.

What Are External Tables in Hive?

In Hive, tables are used to store and organize data. There are two types of tables: Managed Tables and External Tables.

Managed Table: Hive owns both the table metadata and the underlying data. If you drop the table, Hive deletes both metadata and data.
External Table: Hive only manages the table metadata, while the actual data stays in its original location. Dropping an external table only removes the table metadata; the data remains intact.

This means external tables are ideal when your data is shared across multiple tools or systems, or when you don’t want Hive to delete the data on table drop.

Syntax to Create External Table

Here’s the basic syntax to create an external table in Hive:

CREATE EXTERNAL TABLE table_name (
    column1_name column1_type,
    column2_name column2_type,
    ...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://path/to/external/data';

Key Points:

The EXTERNAL keyword defines the table as external.
The LOCATION clause specifies the path of data files.
The schema is stored in Hive Metastore, while the data stays in HDFS or any other supported storage.

Example: Creating an External Table

Let’s say you have CSV data stored in HDFS at /user/hive/external/customers. To create an external table on top of it:

CREATE EXTERNAL TABLE customers (
    id INT,
    name STRING,
    email STRING,
    country STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/external/customers';

Now, you can query this table like a normal Hive table:

SELECT * FROM customers;

If you drop this table:

DROP TABLE customers;

The table metadata will be removed, but the actual CSV files in /user/hive/external/customers will remain untouched.

When to Use External Tables in Hive

External tables are best suited for:

Data that is shared between multiple applications or systems.
Read-only or reference datasets that should not be deleted accidentally.
Data stored in cloud storage or remote HDFS locations.
Situations where you need to preserve the data even after dropping the table.

Best Practices

Always specify the EXTERNAL keyword and LOCATION when creating external tables.
Ensure the data path is accessible by the Hive user.
Maintain proper permissions and backup for external data files.
Use Partitioning and Bucketing on external tables for better query performance on large datasets.

Conclusion

External tables in Hive provide an efficient and safe way to work with existing datasets stored outside the Hive warehouse. They allow you to leverage Hive’s powerful query engine without giving Hive ownership of your data. By understanding when and how to use external tables, you can build a more scalable and flexible data architecture on Hadoop.

Quick Summary

External tables store only metadata in Hive.
Data remains intact even if the table is dropped.
Best used for shared, read-only, or persistent datasets.