Managed Table and External Table in Spark

8/6/2021

Table Types in Spark and Hive

Introduction

Apache Spark and Apache Hive support different types of tables for efficient data management. Understanding these table types is essential for handling large-scale datasets and optimizing storage.

Table Types in Spark

Spark 2.0 introduced native DDL support, which makes it possible to control where table data is stored. The type of an existing table can be checked through the Catalog API:

spark.catalog.getTable("table_name").tableType  # "MANAGED" or "EXTERNAL"

Managed Table in Spark

A Managed Table means that Spark handles both the metadata and the data. When a managed table is dropped, both the table data and metadata are deleted.

Key Features:

  • Spark stores the data in its default warehouse directory (spark-warehouse/, configurable via spark.sql.warehouse.dir).
  • No need to specify a location while creating a table.
  • When the table is dropped, the data is also deleted.
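
The drop behavior described above can be sketched end to end in Spark SQL (the developer table and its columns are illustrative):

```sql
-- No LOCATION clause, so Spark owns both the metadata and the files
CREATE TABLE developer (id INT, name STRING);

-- The "Type" row reports MANAGED and "Location" points under spark-warehouse/
DESCRIBE TABLE EXTENDED developer;

-- Deletes the catalog entry AND the data files in the warehouse
DROP TABLE developer;
```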

Example of Creating a Managed Table:

CREATE TABLE developer (id INT, name STRING);

Or, using the DataFrame API with the Delta format (where batched_orders is an existing DataFrame):

(batched_orders.write
    .format("delta")
    .partitionBy("submitted_yyyy_mm")
    .mode("overwrite")
    .saveAsTable("orders_table"))

External Table in Spark

An External Table means that Spark manages only the metadata, while the actual data is stored at a user-defined location. When the table is dropped, only the metadata is removed, but the data remains intact.

Key Features:

  • Explicit location is required while creating an External Table.
  • Data is not deleted when dropping the table.
  • Useful for sharing datasets across multiple Spark sessions.

Example of Creating an External Table:

CREATE EXTERNAL TABLE developer (id INT, name STRING) LOCATION '/tmp/tables/developer';
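
Since Spark tracks only the metadata here, the table can later be dropped and re-created over the same files. A minimal sketch, assuming Hive support is enabled for the CREATE EXTERNAL TABLE syntax:

```sql
-- Removes only the catalog entry; files under /tmp/tables/developer survive
DROP TABLE developer;

-- Re-attach the surviving data as an external table again
CREATE EXTERNAL TABLE developer (id INT, name STRING)
LOCATION '/tmp/tables/developer';
```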

Creating Tables in Different Formats

1. Creating Managed Table Using Delta Format

(batched_orders.write
    .format("delta")
    .partitionBy("submitted_yyyy_mm")
    .mode("overwrite")
    .saveAsTable("orders_table"))

2. Creating External Table Using Delta Format

CREATE TABLE orders USING DELTA LOCATION '/path/to/data';

3. Creating Managed Table Using Parquet Format

CREATE TABLE developer (id INT, name STRING) USING PARQUET;

4. Creating External Table Using Parquet Format

CREATE TABLE developer (id INT, name STRING) USING PARQUET OPTIONS ('path'='/tmp/tables/table6');
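
To confirm which variant was created (note that both Parquet statements above use the same table name, so run only one of them), Spark can print the reconstructed DDL:

```sql
-- The generated statement includes a path/LOCATION only for the external variant
SHOW CREATE TABLE developer;
```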

Conclusion

In this article, we covered Managed Tables and External Tables in Spark, their key differences, and how to create them using Delta and Parquet formats. Spark also supports other formats like Avro, ORC, and JSON.

For more updates on Big Data and Apache Spark, follow us on Instagram and Facebook!

For more related articles on Hive external and internal tables, please check the link below:

 

Article