Step-by-Step Guide to Creating Tables in Hive
This guide walks you through creating Hive tables from zero to production-ready, with commands you can copy-paste into Hive CLI or Beeline.
Before creating tables, understand the two main types:
Managed (Internal) Table — Hive owns the data and metadata. Dropping the table deletes both.
External Table — Hive stores only metadata; data remains at the original location even if the table is dropped.
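A table's type can also be switched after creation through a table property; a minimal sketch, assuming a table named some_table (the name is illustrative):
ALTER TABLE some_table SET TBLPROPERTIES ('EXTERNAL'='TRUE');   -- managed to external
ALTER TABLE some_table SET TBLPROPERTIES ('EXTERNAL'='FALSE');  -- external back to managed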
Prerequisites:
Hadoop HDFS running and accessible to the Hive user.
Hive installed and Hive Metastore configured.
Java and required services running on your cluster.
Proper HDFS permissions for any LOCATION you use.
Connect with either client:
Hive CLI (legacy):
hive
Beeline (recommended):
beeline -u "jdbc:hive2://<host>:10000/default" -n <user> -p <password>
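Beeline can also run a saved script non-interactively; a sketch, assuming the HiveQL lives in create_tables.hql (the file name is a placeholder):
beeline -u "jdbc:hive2://<host>:10000/default" -n <user> -p <password> -f create_tables.hql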
Create a database and switch to it:
CREATE DATABASE IF NOT EXISTS analytics;
USE analytics;
Create the table (managed):
CREATE TABLE employees (
id INT,
name STRING,
department STRING,
salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
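If employees.csv begins with a header row, a variant of the definition can skip it via a table property (the table name is illustrative; omit the property for headerless files):
CREATE TABLE employees_with_header (
id INT,
name STRING,
department STRING,
salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count'='1');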
Stage the data file in HDFS and load it (LOAD DATA INPATH moves the file into the table's warehouse directory):
hdfs dfs -mkdir -p /user/hive/warehouse/employees
hdfs dfs -put employees.csv /user/hive/warehouse/employees/
LOAD DATA INPATH '/user/hive/warehouse/employees/employees.csv' INTO TABLE employees;
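If the CSV is still on the client's local filesystem, LOAD DATA LOCAL INPATH copies it in one step (the path is illustrative); note that plain INPATH moves the HDFS file rather than copying it:
LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE employees;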
Verify:
SELECT COUNT(*) FROM employees;
For an external table, upload the files to a shared HDFS location:
hdfs dfs -mkdir -p /data/customers
hdfs dfs -put customers.csv /data/customers/
Create the external table:
CREATE EXTERNAL TABLE customers (
id INT,
name STRING,
email STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/customers';
Verify and note behavior on DROP:
SELECT * FROM customers LIMIT 10;
DROP TABLE customers; removes metadata only — files remain at /data/customers.
Create a table from a query (CTAS), stored as Parquet:
CREATE TABLE sales_summary
STORED AS PARQUET
AS
SELECT store_id, SUM(amount) AS total_sales
FROM sales
GROUP BY store_id;
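To copy only a table's schema, without any data, CREATE TABLE ... LIKE works as well (the target name is illustrative):
CREATE TABLE sales_summary_empty LIKE sales_summary;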
Create a partitioned table (here, partitioned by date):
CREATE TABLE logs (
event_time STRING,
event_type STRING,
message STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Static load into partition:
LOAD DATA INPATH '/data/logs/2025-09-11/logs.tsv' INTO TABLE logs PARTITION (dt='2025-09-11');
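If a partition's files are already in HDFS, the partition can be registered without moving anything; a sketch with an illustrative date and path:
ALTER TABLE logs ADD IF NOT EXISTS PARTITION (dt='2025-09-12') LOCATION '/data/logs/2025-09-12';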
Dynamic partitioning insert:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO TABLE logs PARTITION (dt)
SELECT event_time, event_type, message, dt
FROM staging_logs;
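Inserts that create many partitions can hit Hive's default dynamic-partition limits; they can be raised per session if needed (the values shown are illustrative):
SET hive.exec.max.dynamic.partitions = 5000;
SET hive.exec.max.dynamic.partitions.pernode = 1000;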
Create a bucketed table:
CREATE TABLE users_bucketed (
id INT,
name STRING,
email STRING
)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC;
Enable bucketing enforcement and insert (on Hive 2.x and later this setting has been removed and bucketing is always enforced):
SET hive.enforce.bucketing = true;
INSERT INTO TABLE users_bucketed SELECT id, name, email FROM users;
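One payoff of bucketing is cheap sampling: TABLESAMPLE can read a single bucket instead of scanning the whole table:
SELECT * FROM users_bucketed TABLESAMPLE(BUCKET 1 OUT OF 8 ON id);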
Parquet/ORC for analytics:
CREATE TABLE events_parquet (id INT, type STRING, payload STRING) STORED AS PARQUET;
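ORC pairs well with compression; a sketch that enables Snappy through a table property (the table name is illustrative):
CREATE TABLE events_orc (id INT, type STRING, payload STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');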
JSON SerDe example:
CREATE EXTERNAL TABLE events_json (id INT, type STRING, payload STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/data/events/json';
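The openx SerDe above is a third-party library, so its jar must be added to the session first (the jar path is a placeholder); Hive also provides org.apache.hive.hcatalog.data.JsonSerDe as a built-in alternative (the table name below is illustrative):
ADD JAR /path/to/json-serde.jar;
CREATE EXTERNAL TABLE events_json_builtin (id INT, type STRING, payload STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/data/events/json';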
Inspect schema and location:
DESCRIBE FORMATTED table_name;
SHOW CREATE TABLE table_name;
SHOW PARTITIONS table_name;
Repair partitions:
MSCK REPAIR TABLE table_name;
ALTER TABLE table_name RECOVER PARTITIONS; (equivalent command on Amazon EMR Hive)
Add a column:
ALTER TABLE employees ADD COLUMNS (birth_date STRING);
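Renaming a column or changing its declared type uses CHANGE COLUMN; this only updates metadata, the underlying files are not rewritten (the new name is illustrative):
ALTER TABLE employees CHANGE COLUMN birth_date date_of_birth STRING;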
Change location:
ALTER TABLE customers SET LOCATION '/new/path/customers';
Drop behavior:
DROP TABLE employees; (managed) deletes data and metadata.
DROP TABLE customers; (external) deletes metadata only.
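When HDFS trash is enabled, dropped managed-table data goes to the trash first; PURGE skips the trash so the data cannot be recovered:
DROP TABLE IF EXISTS employees PURGE;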
Common issues:
Permission issues: use hdfs dfs -chown and hdfs dfs -chmod appropriately.
Parsing errors: verify delimiters and SerDe settings.
Missing partitions: run MSCK REPAIR TABLE.
Small files: use INSERT OVERWRITE or compaction to reduce file count.
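For the small-files item above, a common compaction pattern is to rewrite a partition onto itself with merge settings enabled (settings and values are illustrative and engine-dependent):
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
SET hive.merge.smallfiles.avgsize = 134217728;
INSERT OVERWRITE TABLE logs PARTITION (dt='2025-09-11')
SELECT event_time, event_type, message FROM logs WHERE dt='2025-09-11';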
Best practices:
Use columnar formats (Parquet/ORC) with compression for analytics.
Partition sensibly (dates, regions) but avoid too many tiny partitions.
Use bucketing where it benefits joins.
Backup Hive Metastore and test schema evolution.