Friday, December 8, 2023
HomeEducationWhat is Spark SQL? Libraries, Features and more

What is Spark SQL? Libraries, Features and more

what is spark sql

What is Spark SQL?

Spark was developed by Matei Zaharia in 2009 in UC Berkeley’s AMPLab as a sub-project of Hadoop. In 2010, it was open-sourced and donated to Apache Software Foundation in 2013. Now, Apache takes care of all of its versions and updates. Spark SQL is a module based on a cluster computing framework. Apache Spark is mainly used for the fast computation of clusters, and it can be integrated with its functional programming to do the relational processing of the data. Spark SQL is capable of in-memory computation of clusters that results in increased processing speed of the application. 

Several tasks can be performed easily with the help of Spark SQL, like iterative workloads, batch processing, interactive queries, algorithms used for various processes, and streaming. Spark also makes the management of separate tools easier. It is also popular as it can handle structured and semi-structured data. Structured data, as the name suggests, is the data in a proper format with a schema of the known field of sets. On the other hand, semi-structured data cannot be separated from the schema and has a limited known field of sets. 

Some people think Spark SQL is a database, but it is majorly used to abstract DataFrames and implement a distributed SQL query engine. Spark SQL comes with the use of DataFrames, a distributed data set. A data frame is used to organize columns adequately, and DataFrames can be easily constructed from different array sources such as Hive Tables or structured data files. DataFrames are very useful as they provide API for programming languages such as Java, Python, R, and Scala. 

This article will be beneficial for beginners and intermediate-level users of Spark. In this article, we will discuss various concepts of Spark SQL, like spark basics, libraries and features. We’ll also see some important examples of Spark SQL, such as how to query Relational databases, add Schema to Relational databases, and much more.

Why is Spark SQL used?

Spark SQL was created to resolve the limitations of Apache Hive, which were as below. 

1. You cannot resume the workflow in Hive if the processing is stopped in the middle of the workflow. So, it doesn’t matter how much the work was processed, and if it’s stopped, you will have to restart the whole process again. 

2. Whenever you are executing ad-hoc queries in Apache Hive, it launches MapReduce jobs that reduce the performance in analyzing even medium-sized datasets. So, we cannot think about processing a large-sized dataset (>200GB) for analysis in Apache Hive. 

3. Removing encrypted databases in Apache Hive is impossible and may result in an execution error. So, the encrypted data cannot be moved into the trash whenever it is enabled. But, to remove it permanently, the trash needed to be skipped, which will not let us restore the deleted databases. 

4. Apache Hive doesn’t support using subqueries to process the data from databases which is a drawback for its users. 

5. It doesn’t allow its users to query in real-time, which means you cannot query and get the result at the same time. 

So, the above discussed were some of the limitations of Apache Hive that let the development of Spark SQL overcome its limitations and drawbacks. 

Spark SQL simplifies the workload through its faster computation power. 

It is a module of Spark used for processing structured and semi-structured datasets. The datasets are processed much faster than other SQL like MySQL. Spark SQL deployment can run up to 100X faster for existing datasets. It is also one of the reasons for using Spark SQL. As the datasets become larger, processing big data becomes difficult for other frameworks. But, it can be processed much faster in Spark SQL because it uses all cores for the cluster nodes to process the queries over a large dataset. 

Spark SQL is based on a key idea such as Resilient Distributed Datasets (RDDs). The datasets in RDD are divided into some partitions, such as a logical partition that will compute the different nodes of the cluster. Also, the object created is sharable among other jobs, networks, and storage, making data sharing faster. RDDs can contain objects of any programming language like Python, Java, R, and Scala, including the user-defined classes of these programming languages. This is the main reason to use Spark SQL. 

How does Spark SQL work?

In this section, we will discuss the working Architecture of Spark SQL. The architecture of Spark consists of three main layers that include the following:

1. Language API: The language API is the top layer of Spark SQL Architecture that shows the compatibility of Spark SQL with different languages such as Python, Scala, Java, HiveQL, etc. 

2. Schema RDD: This is the middle layer of Spark SQL Architecture responsible for tables, records, and schemas. The Schema RDD can be used as a temporary table and called a Data Frame. 

3. Data Sources: Data Sources are the last layer of the Architecture where the data sources are usually text files, databases, tables, etc. Spark SQL has different data sources such as JSON documents, HIVE tables, Parquet files, and the Cassandra database. 

It doesn’t provide a strong relation between Resilient Distributed Datasets (RDDs) and relational tables. But integration between the relational and procedural processing is robust. The reason behind this is the declaration of DataFrame APIs integrated with Spark Code. Spark SQL also provides highly optimized results of datasets. Spark SQL is extremely useful for optimizing the current users and adding new users. The DataFrame APIs used in Spark SQL is capable of performing the relational operations on each source of data like the external sources and its inbuilt distributed collection of datasets. 

It supports a range of data sources and the algorithms used for Big-data processing through its extensible optimization called Catalyst. It improves the overall productivity of a developer in dealing with written queries and is helpful in the transformation of relational queries for their execution. 

Spark works on all operating systems, including Windows, Linux, and macOS. Therefore, it becomes easy to run Spark SQL locally on a system. To run Spark SQL, you first need to install Java on your system PATH and add its path to the environment variables on your machine. After that, you need to install Scala. You can download and install Apache Spark on your machine when all these installations are done. 

When you have successfully installed Apache Spark on your system, you can verify its installation by the using the following command in the Spark Shell:


This command will show you some output without any error if the installation is successful and also show you the version of Spark on your system. 

Spark SQL Libraries

Spark SQL libraries are very useful as they interact with relational and procedural processing for managing the data frames. The libraries of Spark SQL are as follows:

1. DataFrame API: 

DataFrame is a distributed collection of data where you will find the columns listed in an organized form. This is similar to the optimization techniques used in relational tables. An array of sources like Hive Tables, external databases, RDDs, data files, etc., can be used to construct a DataFrame. 

2. SQL Interpreter and Optimizer: 

The SQL interpreter and optimizer depend on the functional programming that can be done in the Scala programming language. As it is the most helpful component of SparkSQL, it provides the framework that can be used to transform trees, graphs, etc. This transformed data is useful for analysis, planning, optimization, and code-spawning at run time. 

3. Data Source API: 

Data Source API is the Universal API for fetching structured data from the data sources. The features of this API are as follows:

  • It supports various data sources such as Avro files, Hive databases, Parquet files, JSON documents, JDBC, etc. 
  • It also supports smart sources of data. 
  • You can also integrate it with third-party packages of Spark.
  • Data Source API can easily integrate with Big Data tools and frameworks through Spark-core. 
  • It provides API for programming languages like Python, Scala, Java, and R. 
  • It can process the data in massive amounts like the size of kilobytes or Petabytes. 
  • The data processing can be done on a single node or multiple node clusters based on the data size. 
  • This API works for data abstraction and domain-specific language for structured and semi-structured data. 

4. SQL Service: 

To work with structured data in Spark SQL, the SQL service is the first step you need to do. You can create DataFrame objects with the help of SQL service. You can also execute the SQL queries by using this library. 

Features of Spark SQL

Spark SQL provides a large number of features, and that is the reason it is mostly used over Apache Hive. Some of the features of Spark SQL are as follows:

  • Spark Integration: The Spark SQL queries can be integrated easily with the Spark programs. You can also query the structured data in these programs using SQL or DataFrame APIs. 
  • Performance: Spark SQL has high performance over Hadoop and provides better performance with increased iterations for datasets due to its in-memory processing power. 
  • Scalability: Spark SQL can be used with a code-based optimizer, columnar storage, or code generator that makes most of the queries agile along with computing the nodes through Spark Engine. This makes scalability easier and uses extra information to read data from multiple sources. 
  • Connectivity: The connectivity of Spark SQL can be done through JDBC or ODBC without any problem. These are very helpful for data connectivity and work as a business intelligence tool.
  • Hive Compatibility: The unmodified queries of Spark SQL can be run on the current data. Spark SQL is also compatible with rewriting Hive front-end and meta store data. 
  • Uniform Data Access: There is a common way to access various data sources that joins the data across Data Frames and SQL. The uniform data access method is very helpful in assisting all its users with Spark SQL.
  • Support for existing Data Formats: There are several data formats, and Spark SQL supports all these data formats like Apache HIVE, JSON document, Parquet file, Cassandra, etc. 
  • Analysis of Structured and Semi-structured data: The analysis of structured and semi-structured data can be done more accurately in Spark SQL.
  • Data Transformations: The RDD API of Spark SQL is very useful as it provides the best performance for the transformations. The transformations with SQL queries are convertible to RDDs. 
  • Relational Processing: The relational processing ability of Spark SQL comes under its functional programming. 

Querying using Spark SQL

Here, we will see how you can query using Spark SQL. The queries in Spark SQL are very similar to the popular SQL clients. First, we need to launch the Spark shell, where you will write SQL queries. So, there will be two files that we’ll use to execute queries. The first is a text file, and the other is a JSON document. You can follow up on the code for both these files below:


Edward, 22

Jack, 23

Ashu, 21

Robin, 24

Richie, 31


{“name”: “Edward”, “age”: 22}

{“name”: “Jack”, “age”: 23}

{“name”: “Ashu”, “age”: 21}

{“name”: “Robin”, “age”: 24}

{“name”: “Richie”, “age”: 31}

So, both of these files should be saved under the directory ‘sfiles/src/main/scala/org/apache/spark/sql/two.files.scala’. After putting these files here in this directory, you need to set the path of the files in the lines of code below:

import org.apache.spark.sql.SparkSession
val sparksql = SparkSession.builder().nameOfapp(“Spark SQL Query Runner”).config(“spark.some.config.option”, “app-value”).getOrCreate()
import spark.implicits._
val dataframe =“sfiles/src/main/scala/file.json”)

Explanation of Code:

We imported a Spark Session to create a session using the ‘builder()’ function. We also imported a class of Spark named ‘implicit’ to our spark session. Then we created a data frame in our code to import the data from file.json. After that, the data frame will read the JSON file and show the data as a table in the output. 

Displaying only names in the output using SQL queries:

The following code may be carried out to list only names as output:

import spark.implicits._

Explanation of the code:

Here we imported the ‘implicit’ class in our spark session. After that, we used our data frame to print the schema by selecting only the ‘name’ column of the table. This will print only the name column in a table format as output. 

Modifying or updating the data using Spark SQL:

In this example, we will modify the data we have in our files by using the SQL query. See the below code for your reference:$”name”, $”age” +3).show()
dataframe.filter($”age” >25).show()


In the above code, we first selected the column names from the table and then added 3 to the second column named ‘age.’ So, now the data in this column ‘age’ will be incremented by 3. In the second line of our code, we used an expression that will only show the data where age is greater than 25. We used ‘gt’ in our code, which means “greater than” in Spark SQL. 

So, the output will be two tables for these two lines. First is the table for incremented age, and the second for the age greater than 25. 

Counting the total number of entries for each age in our data:

val dataframeOfsql = spark.sql(“SELECT * FROM file”)


Our code makes it very easy to understand what it does. First, it counts the total number of entries for each age in our data, then shows the data for the second line of code. It simply outputs the data type for both columns, such as name and age. And lastly, it shows the complete table as output. 

Now, let us see How to create Datasets using Spark SQL:

case class file(name: String, age: Long)
val dataset = Seq(file(“Ashu”, 21)).toDS()
val mainDataset = Seq(1, 2, 3).toDS()
() + 1).collect()


In the above code, we created a class named ‘file’ where we specified the data type to the columns ‘name’ and ‘age.’ Then, we created a dataset that will record the data of ‘Ashu’ in it. In the following line, it will show the dataset as output. And then, we used ‘Seq’ in our code to find the datatype of our dataset. Lastly, it will map the dataset and give us the output. 

Adding Schema to RDDs

RDD is an abbreviation used for Resilient Distributed Dataset, an immutable fault-tolerant for the datasets operated on it in parallel. RDD may contain some object created by picking up an external dataset. Schema RDD is used to run the SQL queries on it. But Schema RDD is more than that of SQL as it is a Unified Interface for our structured data. 

Let us see an example to create DataFrame for our transformations:

import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.Encoder
import spark.implicits._
val fileDataFrame = spark.sparkContext.textfile(“sfiles/src/main/scala/file.txt”).map(_.split(“,”)).map(attributes = > File(attributes(0), attributes(1).trim.toInt)).toDF()
val secondDF = spark.sql(“SELECT name, age FROM file WHERE age BETWEEN 20 AND 25”) =&apm;gt; “Name: “ + second(0)).show()

Explanation of the code: 

In the above code, we imported specific encoders for RDD into the shell. Then we create a data frame for our text file. After that, we defined our data frame to list all the names and ages where the age is between 18 to 25, and it just represents the table as output. 

To do mapping using the data frames, the following code should be used: => “Name: “ + second.getAs[String](“name”)).show()
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]] = > seoncd.getValuesMap[Any](List(“name”,”age”))).collect()


The above code is used for converting the mapped names into the format of strings for transformations. 

We use mapEncoder that we just imported as a class that will map the names to the ages in string. The result will show the names that are mapped to their respective ages. 

RDDs support two types of operations:

  • Actions: The operations like count, first, run, reduce, etc., are actions that return something after running a computation on an RDD. For example, Reduce is an action in which we reduce the values of the elements in the RDD by using some function, and the driver program returns the final result. 
  • Transformations: These are the operations responsible for creating a new dataset from the existing one. For example, a transformation is done through a map that passes the dataset element by a function and gives an output of a new RDD representing the new dataset. 

The transformation operations in Spark are considered ‘lazy,’ meaning they do not compute the results in less time resulting in a long time for computation. The transformations are better only because they remember the operations to be performed, and transformations also remember the datasets on which the operations are to be performed. 

Transformations and actions are somewhat related because transformations are performed only when an action is called and after the result of that action is returned by the driver program. The result returned is stored as DAG(Directed Acyclic Graphs). The design of these graphs allows Spark to run more efficiently and return the result as quickly as possible. Suppose a big file was transformed in different ways and first passed through the action to get a DAG as an output. Spark would only process and give the output as the first line rather than doing the entire work on the file. 

Some default methods for transformed RDD recompute each time whenever the action operation is performed on it. But, it also persists an RDD in the memory by using the cache method. In this case, Spark saves the elements around the cluster for much faster computation whenever you run the query next time on it. 

RDDs as Relations

RDDs (Resilient Distributed Datasets) are distributed among the memory abstraction that helps the programmers to run computations for in-memory processing of large clusters. This computation is done in a fault-tolerant manner that reduces the probability of fault occurrence. It is possible to create RDDs from any data source, e.g., local files, Hadoop, amazon cloud, Hive, JSON document, and HBase Table. 

Let’s understand the code for specifying a schema in RDDs.

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val staffRDD = spark.sparkContext.textFile(“sfiles/src/main/scala/file.txt”)
val schemaForString = “name age”
val fields = schemaForString.split(“ ”).map(fieldname = > StructField(fieldname, StringType, nullable = true))
val schema = StructType(fields)

Explanation of the code: 

In the above code, we imported some classes such as ‘types’ and ‘Row’ to our Spark Shell. The Row class is used when you want to map the RDD schema. After that, we created an RDD named ‘staffRDD’ from our data source file ‘file.txt.’ Then, we defined a schema as ‘schemaForString’ with the ‘name age’ value used for mapping the columns of the Resilient Distributed Dataset. In the following line, we defined fields that split the Schema by a blank space. And lastly, we map the ‘fields’ RDD into ‘schema.’  

Now, let us see the result of the RDD transformation:

val firstRDD =“,”)).map(attributes = > Row(attributes(0), attributes(1).trim))
val staffDF = spark.createDataFrame(firstRDD, schema)
val results = spark.sql(“SELECT name FROM employee”) = > “Name:” + attributes(0)).show()


Now, we created a new RDD called ‘firstRDD’ that transforms the ‘staffRDD’ by using the ‘map’ function to our ‘firstRDD’. After that, we defined a new Dataframe as ‘staffDF’ that stores the RDD schema. In the following line of our code, we are creating a temporary view of our Dataframe to view the text file. Then, we performed a SQL operation to select all the names from the table and display the name only as an output. 

It doesn’t matter in RDDs if they are defined or not, but they don’t contain any kind of data in them. RDDs implement the computation for creating the datasets in them only when the data is referenced. For instance, Writing the results in RDD or writing the cache data in RDDs. 

Caching Tables In-Memory

Spark SQL uses an in-memory columnar format for caching tables from our data sources. It performs the following operations while caching tables in in-memory format.

  • Allocation of limited objects
  • Scanning only the required fields and columns
  • Selecting the best comparison without any manual intervention. 

Here we will see the code for loading the data:

import spark.implicits._
Val staffDF =“sfiles/src/main/scala/file.json”)

Explanation of code:

Here we just imported the implicits class in our Spark Shell and then created a Dataframe to read our data source. The dataframe will read the JSON file we provided as a data source. This way, the loading of the JSON file is completed in our Spark.

Now, let us see how you can display the results from a Parquet DataFrame:

val parquetDF =“staff.parquet”)
val dfNames = spark.sql(“SELECT name FROM fileparquet WHERE age BETWEEN 20 AND 25”) = > “Name:” + attributes(0)).show()


Here, we created a dataframe ‘parquetDF’ for the temporary view of our main dataframe ‘staffDF.’ We also gave an expression to select only the entries for the ages between 20 and 25 from the parquet file. Then lastly, the displaying of the result is done through Spark SQL operation. 

Now, we will see how we can perform operations on the JSON dataset as a data source:

First, we will import the JSON document as Spark SQL also supports the JSON dataset, and we will create a dataframe to perform the operations on it. Then we will specify the schema for this dataframe and display the results of ages between 20 and 25. 


val filepath = “sfiles/src/main/scala/file.json”
val staffDF =
val namesDF = spark.sql(“SELECT name FROM Staff WHERE age BETWEEN 20 AND 25”)


This code created a variable for storing the path to our data source or the input JSON document ‘file.json.’ then, we created a DataFrame as ‘staffDF’ that reads the data from our JSON file. Next, we used printSchema for our DataFrame to print the schema of ‘staffDF.’ Then, we created a temporary view of our dataframe to map with ‘Staff.’ Lastly, we gave an expression to pick only those entries whose ages value is between 20 and 25 and display the contents of our DataFrame as output. 

For RDDs transformations on JSON document, the following code incorporates:

val staffITRDD = spark.sparkContext.makeRDD(“””{“name”: “Ashu”, “address”:{“city”: “Mumbai”, “state”:”Maharashtra”}}””” :: Nil)
val staffIT =


In this code, we created an RDD ‘staffITRDD’ for the IT staff of an organization where we gave the content of a staff member Ashu with the address Mumbai, Maharashtra. After that, we assigned the new RDD to ‘staffIT.’ Lastly, we display the results using the show() command. 

Let us see how you can perform operations using Hive tables in Spark:

import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
case class Record (key: Int, value: String)
val storageloc = “spark-warehouse”
val spark = SparkSession.builder().appName(“Hive Tables in Spark”).config(“spark.sql.warehouse.dir”, storageloc).enableHiveSupport().getOrCreate()
import spark.implicits._
import spark.sql


In this example, we performed some operations using Hive Tables where we first imported the classes ‘Row’, ‘Record,’ and Spark session into the Spark Shell. The Row class is mainly used for mapping the RDD schema to our data source. Then we set the location of ‘storageloc’ to the Spark Warehouse. Then we created a spark session ‘spark’ that will take the Hive tables in Spark SQL. Lastly, we created a table using a SQL query with the src key to store the values and datatypes. 

Selection of data from our Hive Tables:


sql(“LOAD DATA LOCAL INPATH ‘sfiles/src/main/scala/file.txt’ INTO TABLE src”)
sql(“SELECT * FROM src”).show()


Here we loaded the data content from the source by providing the path. In this example, we used the ‘file.txt’ and a query that will show the contents of this file in a table. 

Create DataFrames by using Hive Tables:


sql(“SELECT COUNT(*) FROM src”).show()
val queryDF = sql(“SELECT key, value FROM src WHERE key & amp;amp;amp;lt; 10 ORDER BY key”)val theDS ={case Row(key: Int, value: String) = > s”Key: $key, Value: $value}


In this example, we performed the ‘count’ operation to select a total number of keys in our ‘src’ table. To select all the records of our ‘src’ table, the expression for the key-value ‘less than 10’ is used, and the values are stored in the DataFrame ‘queryDF.’ Then, we created a dataset ‘theDS’ from our dataframe ‘queryDF.’ Now, lastly, we will display the contents of the ‘theDS’ dataset. 

In this example, we will record the results of hive operations in Spark SQL.


val recordResults = spark.createDataFRame((1 to 100).map(i=> Record(I, s”val_$i”)))
sql(“SELECT * FROM records r JOIN src s ON r.key = s.key”).show()


In this example, we created a DataFrame to store the results of the hive tables. Then we specified the records to be from 1 to 100 that will be stored in our DataFrame. After that, we created a temporary view of our records for the ‘recordResults’ dataframe. Lastly, we used the show() command to display the contents of our joined tables with ‘records’ and ‘src.’ In this join operation, we set ‘key’ as the primary key for both these tables ‘records’ and ‘src.’ 


Is Spark SQL a database?

Anything mentioned with SQL doesn’t mean that it’s a database. Therefore, Spark SQL is also not a database. But it is a module of Spark where you can process structured and semi-structured datasets where majorly you deal with DataFrames. The DataFrames processed in Spark SQL are usually based on the programming abstraction and act like a distributed SQL query engine. Spark SQL allows you to run unmodified Hive queries much faster for the existing data and deployments. 
However, Spark also works as a database where you can create managed tables and maintain your data with the available SQL tools. You can connect JDBC-ODBC with the Spark database using SQL queries and expressions. It also allows you to integrate with third-party tools like Tableau, Power BI, and Talend. 

Is Spark SQL the same as MySQL?

Spark SQL is a module of Spark for processing structured data. Whereas MySQL is used for the management of the relational database. SQL is the primary query language for processing queries, and MySQL enables the handling, modifications, storing, and deletion of data in a well-organized way. 
The Spark SQL gives you a Spark SQL environment for processing queries which MySQL also provides. The main motive is to scan the whole data for processing the query, but in the case of Spark SQL, it scans the required data only. 
The main difference between Spark SQL and MySQL is that Spark SQL makes the queries run 10x faster. MySQL uses only one CPU core for a single query, whereas Spark SQL uses all cores on all cluster nodes for running the queries. 

What is the advantage of Spark SQL?

The advantages of Spark SQL are as follows:
It provides security by using SSL and HTTP protocols. The encryption of these protocols makes it more secure. 
Many features are supported by Spark SQL, including analysis of large amounts of data, integration of Spark SQL with Spark itself, fast processing speed, real-time stream processing, and more. 
Spark SQL is dynamic, which makes it efficient for continuously changing data. 
The demand for Spark SQL developers is high in the market. 
By using Spark SQL, you can access big data without any hassle. 
Spark SQL supports different types of data for use in Machine Learning.
It becomes easy to add more optimization rules in Spark SQL. You can also add more data types and data sources with the help of the Scala programming language. 
The data pipeline can be written easily in Spark SQL.
The DataFrames of Spark SQL can process large sets of structured and semi-structured data. It can also handle petabytes of data. 
Spark SQL API supports programming languages such as Java, Python, R, and Scala. 
The observations in Spark DataFrame are in a well-organized format with the name of columns that helps to identify the data and schema efficiently. 

Is Spark SQL and PySpark SQL the same?

Spark SQL and PySpark SQL aren’t the same, but we can integrate PySpark SQL with Spark SQL for processing the relational databases. It can be done through the functional programming API of Spark SQL. Using an SQL query language, you can also extract the data from the database. You need to use SQL before you write the SQL queries to extract the data, which will return the data based on the query. 

Source: GreatLearning Blog

- Advertisment -

Most Popular

Recent Comments