Step-by-Step Guide to Creating Tables in Hive

9/11/2025

This guide walks you through creating Hive tables from zero to production-ready, with commands you can copy and paste into the Hive CLI or Beeline.

Before creating tables, understand the two main types:

  • Managed (Internal) Table — Hive owns the data and metadata. Dropping the table deletes both.

  • External Table — Hive stores only metadata; data remains at the original location even if the table is dropped.
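
To check which type an existing table is, inspect its metadata (the table name here is illustrative):

-- "Table Type" in the output reads MANAGED_TABLE or EXTERNAL_TABLE
DESCRIBE FORMATTED my_table;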

1. Prerequisites

  • Hadoop HDFS running and accessible to the Hive user.

  • Hive installed and Hive Metastore configured.

  • Java and required services running on your cluster.

  • Proper HDFS permissions for any LOCATION you use.
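
A quick way to confirm the last two prerequisites is to list the warehouse directory and set ownership and permissions on any external location up front (user, group, and paths here are illustrative):

hdfs dfs -ls /user/hive/warehouse
hdfs dfs -mkdir -p /data/customers
hdfs dfs -chown hive:hadoop /data/customers
hdfs dfs -chmod 770 /data/customers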

2. Connect to Hive

  • Hive CLI (legacy):

hive
  • Beeline (recommended):

beeline -u "jdbc:hive2://<host>:10000/default" -n <user> -p <password>

3. Create or switch to a database

CREATE DATABASE IF NOT EXISTS analytics;
USE analytics;
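
To confirm the database exists and see where its data will be stored:

SHOW DATABASES;
DESCRIBE DATABASE analytics;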

4. Create a managed table (step-by-step)

  1. Create the table (managed):

CREATE TABLE employees (
  id INT,
  name STRING,
  department STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
  2. Stage the data file in HDFS and load it (LOAD DATA INPATH moves the file from the staging path into the table's warehouse directory; any staging path outside the table directory works, /tmp/staging is used here):

hdfs dfs -mkdir -p /tmp/staging/employees
hdfs dfs -put employees.csv /tmp/staging/employees/
LOAD DATA INPATH '/tmp/staging/employees/employees.csv' INTO TABLE employees;
  3. Verify:

SELECT COUNT(*) FROM employees;
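
Because LOAD DATA INPATH moves the staged file, the data now lives under the table's directory; assuming the default warehouse location, you can confirm with:

hdfs dfs -ls /user/hive/warehouse/analytics.db/employees/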

5. Create an external table (step-by-step)

  1. Upload files to a shared HDFS location:

hdfs dfs -mkdir -p /data/customers
hdfs dfs -put customers.csv /data/customers/
  2. Create the external table:

CREATE EXTERNAL TABLE customers (
  id INT,
  name STRING,
  email STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/customers';
  3. Verify and note behavior on DROP:

  • SELECT * FROM customers LIMIT 10;

  • DROP TABLE customers; removes metadata only — files remain at /data/customers.
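
Before dropping, you can confirm the files will survive by listing the external location; after a DROP, re-creating the table with the same LOCATION makes the data queryable again:

hdfs dfs -ls /data/customers/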

6. Create Table As Select (CTAS)

CREATE TABLE sales_summary
STORED AS PARQUET
AS
SELECT store_id, SUM(amount) total_sales
FROM sales
GROUP BY store_id;
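
CTAS derives the schema from the SELECT (column names follow the aliases), and in most Hive versions the CTAS target cannot be an external or partitioned table. A quick check of the result:

DESCRIBE sales_summary;
SELECT * FROM sales_summary LIMIT 10;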

7. Partitioned tables (static & dynamic)

  • Create partitioned table:

CREATE TABLE logs (
  event_time STRING,
  event_type STRING,
  message STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
  • Static load into partition:

LOAD DATA INPATH '/data/logs/2025-09-11/logs.tsv' INTO TABLE logs PARTITION (dt='2025-09-11');
  • Dynamic partitioning insert:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE logs PARTITION (dt)
SELECT event_time, event_type, message, dt
FROM staging_logs;
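
After the insert, one partition should exist per distinct dt value in staging_logs; verify with:

SHOW PARTITIONS logs;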

8. Bucketing (clustering)

  • Create bucketed table:

CREATE TABLE users_bucketed (
  id INT,
  name STRING,
  email STRING
)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC;
  • Enable bucketing enforcement (needed on Hive 1.x; Hive 2.0 and later always enforce bucketing) and insert:

SET hive.enforce.bucketing = true;
INSERT INTO TABLE users_bucketed SELECT id, name, email FROM users;
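
With bucketing enforced, the insert typically writes one file per bucket, so the table directory should contain 8 files (the path assumes the default warehouse location and the analytics database):

hdfs dfs -ls /user/hive/warehouse/analytics.db/users_bucketed/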

9. SerDe and file formats

  • Parquet/ORC for analytics:

CREATE TABLE events_parquet (id INT, type STRING, payload STRING) STORED AS PARQUET;
  • JSON SerDe example:

CREATE EXTERNAL TABLE events_json (id INT, type STRING, payload STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/data/events/json';
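
The OpenX JSON SerDe is a third-party library, so its jar must be on Hive's classpath before the table is queried (the jar path below is illustrative); the SerDe shipped with Hive, org.apache.hive.hcatalog.data.JsonSerDe, is another option.

-- illustrative path; point this at your JSON SerDe jar
ADD JAR /path/to/json-serde-with-dependencies.jar;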

10. View metadata & maintenance

  • Inspect schema and location:

DESCRIBE FORMATTED table_name;
SHOW CREATE TABLE table_name;
SHOW PARTITIONS table_name;
  • Repair partitions (MSCK REPAIR TABLE is the Apache Hive command; ALTER TABLE ... RECOVER PARTITIONS is the equivalent on Amazon EMR):

MSCK REPAIR TABLE table_name;
ALTER TABLE table_name RECOVER PARTITIONS;

11. Alter and drop table

  • Add a column:

ALTER TABLE employees ADD COLUMNS (birth_date STRING);
  • Change location:

ALTER TABLE customers SET LOCATION '/new/path/customers';
  • Drop behavior:

    • DROP TABLE employees; (managed) deletes data and metadata.

    • DROP TABLE customers; (external) deletes metadata only.
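
A table can also be switched between managed and external by updating its table properties, which changes the DROP behavior accordingly (the property value is case-sensitive in older Hive releases):

ALTER TABLE employees SET TBLPROPERTIES ('EXTERNAL'='TRUE');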

12. Troubleshooting

  • Permission issues: use hdfs dfs -chown and hdfs dfs -chmod appropriately.

  • Parsing errors: verify delimiters and SerDe settings.

  • Missing partitions: run MSCK REPAIR TABLE.

  • Small files: use INSERT OVERWRITE or compaction to reduce file count.
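
For the small-files issue, a common pattern is to rewrite a partition in place so Hive merges many small files into fewer, larger ones; the merge settings and partition below are illustrative and may need tuning for your cluster:

SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
INSERT OVERWRITE TABLE logs PARTITION (dt='2025-09-11')
SELECT event_time, event_type, message FROM logs WHERE dt='2025-09-11';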

13. Best practices

  • Use columnar formats (Parquet/ORC) with compression for analytics.

  • Partition sensibly (dates, regions) but avoid too many tiny partitions.

  • Use bucketing where it benefits joins.

  • Backup Hive Metastore and test schema evolution.
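
As a concrete example of the columnar-format recommendation above, an ORC table with Snappy compression (table and columns are illustrative):

CREATE TABLE sales_orc (
  store_id INT,
  amount DOUBLE,
  sold_at TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');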
