SQL on Hadoop is common these days. Various new SQL-on-Hadoop projects are emerging, and Spark SQL is one of the most promising.
Spark SQL is a strong candidate for the future of Hadoop processing because of its simple APIs and its flexible, efficient execution model.
Apache Spark is a cluster-computing engine that is compatible with Hadoop.
Spark enables fast data processing by pinning datasets into memory across a cluster.
It also supports a variety of processing models, including MapReduce, iterative processing, and graph processing.
Spark has an expressive API, defined in Python, Scala, and Java, that lets users get up and running quickly.
Recently, various Apache Mahout machine-learning algorithms were reimplemented on Spark.
Difference between Spark SQL and Shark:
Shark was the first system to provide SQL capabilities on Spark. It uses Hive for query planning and Spark for query execution.
Spark SQL, by contrast, uses its own query planner instead of Hive's.
Spark consists of a core set of APIs and an execution engine, on top of which other Spark systems provide APIs and processing for specialized workloads such as stream pipelines.
The Spark stack consists of:
1. Shark (SQL and Hive support)
2. Spark Streaming (stream processing that uses the same API as batch processing)
3. MLlib (scalable machine learning)
4. GraphX (graph processing; works with graphs and collections)
5. Spark core (a generalized processing engine that supports distributed datasets)
Any Spark system can operate on an RDD generated by another Spark system, which allows processing code to be colocated.
Data in Spark is represented using RDDs (resilient distributed datasets), which are an abstraction over a collection of items.
An RDD is distributed across the cluster, so each node stores and manages a subset of its items.
RDDs can be created from many sources, such as regular Scala collections or data in HDFS.
RDDs can reside in memory, on disk, or both.
Example: creating an RDD
val stocks = sc.textFile("stocks.txt")
stocks: org.apache.spark.rdd.RDD[String] = MappedRDD at textFile
An RDD supports various operations, which fall into two categories: transformations and actions.
1. Transformations operate on an RDD to create a new RDD. Examples: map, flatMap.
2. Actions perform computation on an RDD and return results to the driver. Example: collect.
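The RDD API deliberately mirrors the methods on ordinary Scala collections, so the transformation/action distinction can be sketched with plain Scala and no cluster (a sketch, not Spark itself; on a real RDD the same map, flatMap, and collect calls apply, only distributed):

```scala
// Plain-Scala sketch of the transformation/action distinction.
// On an RDD, map and flatMap are lazy transformations and
// collect is an action; the method names are the same.
val lines = List("IBM 100", "AAPL 200")

// Transformation-style calls: derive a new collection from an old one.
val tokens  = lines.flatMap(_.split(" "))  // List(IBM, 100, AAPL, 200)
val lengths = lines.map(_.length)          // List(7, 8)

// Action-style call: materialize results for the driver.
println(tokens.mkString(","))
```

On an RDD the transformations above would build a lineage of deferred steps, and only the action would ship results back to the driver.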
Spark on Hadoop:
Spark supports several cluster managers, including YARN.
In this mode Spark executors run as YARN containers, and the Spark ApplicationMaster is responsible for managing the executors and sending them commands. The Spark driver can run either in the client process or in the ApplicationMaster.
- In client mode the driver resides inside the client process, so a series of Spark tasks executing in this mode is interrupted if the client is terminated.
- In cluster mode the driver runs inside the ApplicationMaster and does not rely on the client for execution.
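As a sketch, assuming the standard spark-submit launcher and a hypothetical application jar and class, the two modes are selected with the --deploy-mode flag:

```shell
# Hypothetical jar and class names; --deploy-mode selects where the driver runs.

# Client mode: driver runs in this process and dies with it.
spark-submit --master yarn --deploy-mode client \
  --class com.example.StockApp stock-app.jar

# Cluster mode: driver runs inside the YARN ApplicationMaster.
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.StockApp stock-app.jar
```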
Example: Calculating stock averages with Spark SQL in Scala
Step 1: create a case class that will represent each record in the Spark table
case class Stock(symbol: String, price: Double)
Step 2: register an RDD of these Stock objects as a table to perform SQL operations
- Create a SQL context
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
- Import the context's members to bring the SQL functions into scope
- Create an RDD of Stock objects by loading the stock data from a text file, tokenizing each line, and creating Stock instances
- Register the RDD as a table called stocks
Step 3: issue queries against the stocks table. The following shows how to calculate the average price for each symbol:
val stock_averages = sql("SELECT symbol, AVG(price) FROM stocks GROUP BY symbol")
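To make the semantics of that query concrete, here is the same per-symbol averaging in plain Scala over a small hypothetical in-memory dataset (the sample values are made up); Spark SQL performs the equivalent grouping and averaging in a distributed fashion:

```scala
case class Stock(symbol: String, price: Double)

// Hypothetical sample data standing in for stocks.txt.
val stocks = List(
  Stock("AAPL", 100.0), Stock("AAPL", 110.0),
  Stock("GOOG", 500.0)
)

// Equivalent of: SELECT symbol, AVG(price) FROM stocks GROUP BY symbol
val averages: Map[String, Double] =
  stocks.groupBy(_.symbol).map { case (sym, rows) =>
    sym -> rows.map(_.price).sum / rows.size
  }

println(averages("AAPL")) // 105.0
```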
We have used the Scala API instead of the Java API because it is more concise.
Spark evaluates transformations lazily, so we need to invoke an action to trigger execution.
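This laziness can be illustrated with Scala's own lazy views, which behave analogously (this is plain Scala, not Spark, but the principle is the same): the map below is recorded, not executed, until a terminal operation forces it.

```scala
var evaluated = 0

// A view is lazy: this map is recorded, not executed.
val pending = List(1, 2, 3).view.map { n => evaluated += 1; n * 2 }
println(evaluated) // 0 -- nothing has run yet

// Forcing the view (like calling an action on an RDD) triggers the work.
val result = pending.toList
println(evaluated) // 3
println(result)    // List(2, 4, 6)
```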
I have tried to explain the core concepts of Spark SQL with examples. The technology evolves quickly, and there will be further advancements.
Hope it helps. Keep learning!