# Logical Planning and Physical Planning in Spark
#spark #optimization #sparkrdd #sparkdataframe #sql
The first phase of execution takes user code and converts it into a logical plan. The logical plan represents only a set of abstract transformations; it does not refer to executors or drivers, and exists purely to convert the user's set of expressions into the most optimized version. Spark starts by converting the user code into an unresolved logical plan: although your code might be valid, the tables or columns it refers to may or may not exist. The analyzer resolves those columns and tables against the catalog, Spark's repository of all table and DataFrame information, and rejects the unresolved logical plan if a required table or column name does not exist there. If the analyzer can resolve the plan, the result is passed through the Catalyst Optimizer, a collection of rules that attempt to optimize the logical plan, for example by pushing down predicates or pruning unneeded columns. Packages can extend Catalyst with their own rules for domain-specific optimizations.
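You can watch these phases yourself with `explain` and the `queryExecution` field on a Dataset. Below is a minimal sketch, assuming a local SparkSession; the app name, columns (`id`, `label`), and data are made up purely for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("plan-demo")   // illustrative name
  .master("local[*]")     // assuming local mode for this sketch
  .getOrCreate()

import spark.implicits._

// A toy DataFrame; column names are made up for illustration
val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")
val query = df.filter($"id" > 1).select($"label")

// explain(true) prints all four stages: the parsed (unresolved)
// logical plan, the analyzed logical plan, the optimized logical
// plan, and the physical plan
query.explain(true)

// Each plan is also available individually on queryExecution
println(query.queryExecution.logical)        // unresolved logical plan
println(query.queryExecution.analyzed)       // after the analyzer resolves names
println(query.queryExecution.optimizedPlan)  // after Catalyst's rule-based passes
```

If the analyzer cannot resolve a name, the failure surfaces here: selecting a nonexistent column such as `$"labl"` would throw an `AnalysisException` before any job runs.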
After successfully creating an optimized logical plan, Spark begins the physical planning process. The physical plan, often called the Spark plan, specifies how the logical plan will execute on the cluster: Spark generates different physical execution strategies and compares them through a cost model. One example of this cost comparison is choosing how to perform a given join by looking at the physical attributes of the tables involved, for instance broadcasting a small table rather than shuffling both sides. The result of physical planning is a series of RDDs and transformations. This is why Spark is sometimes described as a compiler: it takes queries expressed in DataFrames, Datasets, and SQL and compiles them into RDD transformations.
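The join example can be observed directly. Below is a sketch under the same assumptions as before (local SparkSession, made-up `facts`/`dims` data); with one tiny side, the planner typically selects a `BroadcastHashJoin`, and disabling the broadcast threshold makes it fall back to a `SortMergeJoin`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("physical-plan-demo") // illustrative name
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Made-up tables: a larger "fact" side and a tiny "dimension" side
val facts = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")
val dims  = Seq((1, "x"), (2, "y")).toDF("id", "tag")

val joined = facts.join(dims, "id")

// sparkPlan is the physical plan the planner selected; executedPlan
// adds final preparations such as whole-stage code generation
println(joined.queryExecution.sparkPlan)
println(joined.queryExecution.executedPlan)

// Setting the broadcast threshold to -1 disables broadcast joins,
// so the same query now plans a SortMergeJoin instead
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
println(facts.join(dims, "id").queryExecution.sparkPlan)

// The end product of physical planning is a chain of RDD
// transformations over Spark's internal row format
val rdd = joined.queryExecution.toRdd // RDD[InternalRow]
println(rdd.toDebugString)            // shows the compiled RDD lineage
```

The `toDebugString` output at the end is the "compiler" claim made concrete: the DataFrame query has become an ordinary RDD lineage that the cluster executes.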