Q1 Name a few commonly used Spark ecosystem components
Answer:
· Spark SQL (Shark)
· Spark Streaming
· GraphX
· MLlib
· SparkR
Q2 What is “Spark SQL”?
Answer:
Spark SQL is a Spark interface for working with structured as well as semi-structured data. It can load data from multiple structured sources such as text files, JSON files, and Parquet files. Spark SQL provides a special type of RDD called SchemaRDD, composed of Row objects, where each object represents a record.
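A minimal sketch of loading a structured source, assuming the classic Spark 1.x Scala API (where queries yield SchemaRDDs) and a hypothetical people.json file:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("SparkSQLIntro"))
val sqlContext = new SQLContext(sc)

// Load a JSON file into a SchemaRDD; the schema is inferred automatically.
val people = sqlContext.jsonFile("people.json")  // placeholder path
people.printSchema()                  // show the inferred columns and types
people.registerTempTable("people")    // make it queryable via SQL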
Q3 Can we do real-time processing using Spark SQL?
Answer:
Not directly, but we can register an existing RDD as a SQL table and trigger SQL queries on top of it, as sketched below.
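A minimal sketch, assuming the Spark 1.x Scala API and a made-up Person record type:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)  // illustrative record type

val sc = new SparkContext(new SparkConf().setAppName("RddAsTable"))
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD          // implicit RDD -> SchemaRDD conversion

// Build an ordinary RDD, register it as a SQL table, then query it.
val people = sc.parallelize(Seq(Person("Ann", 32), Person("Bob", 17)))
people.registerTempTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)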
Q4 Explain about the major libraries that constitute the Spark Ecosystem
Answer:
Spark MLlib – Spark's machine learning library, covering commonly used learning algorithms such as clustering, regression, and classification.
Spark Streaming – a library used to process real-time streaming data.
Spark GraphX – the Spark API for graph-parallel computation, with basic operators like joinVertices, subgraph, and aggregateMessages.
Spark SQL – helps execute SQL-like queries on Spark data, usable from standard visualization or BI tools.
Q5 What is Spark SQL?
Answer:
Spark SQL, better known in its early form as Shark, is a module introduced in Spark to work with structured data and perform structured data processing. Through this module, Spark executes relational SQL queries on the data. The core of the component supports an altogether different RDD called SchemaRDD, composed of Row objects and schema objects defining the data type of each column in a row. It is similar to a table in a relational database.
Q6 What is a Parquet file?
Answer:
Parquet is a columnar format file supported by many other data processing systems. Spark SQL performs both read and write operations with Parquet files and considers it to be one of the best big data analytics formats so far.
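A short sketch of the round trip, again assuming the Spark 1.x API and placeholder file paths:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("ParquetDemo"))
val sqlContext = new SQLContext(sc)

// Write a dataset out in columnar Parquet format, then read it back.
val people = sqlContext.jsonFile("people.json")
people.saveAsParquetFile("people.parquet")

val parquetPeople = sqlContext.parquetFile("people.parquet")
parquetPeople.registerTempTable("parquetPeople")
sqlContext.sql("SELECT name FROM parquetPeople").collect().foreach(println)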
Q7 List the functions of Spark SQL.
Answer:
Spark SQL is capable of:
· Loading data from a variety of structured sources.
· Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC), for instance business intelligence tools like Tableau.
· Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more (see the sketch after this list).
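The third point can be illustrated with a custom function. A minimal sketch, assuming the Spark 1.x registerFunction API (renamed sqlContext.udf.register in later releases) and an illustrative strLen function:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("UdfDemo"))
val sqlContext = new SQLContext(sc)
sqlContext.jsonFile("people.json").registerTempTable("people")  // placeholder path

// Expose an ordinary Scala function to SQL, then call it from a query.
sqlContext.registerFunction("strLen", (s: String) => s.length)
sqlContext.sql("SELECT name, strLen(name) FROM people").collect().foreach(println)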
Q8 What is Spark?
Answer:
Spark is a parallel data processing framework. It allows developers to build fast, unified big data applications that combine batch, streaming, and interactive analytics.
Q9 What is Hive on Spark?
Answer:
Hive is a component of Hortonworks’ Data Platform (HDP). Hive provides an SQL-like interface to data stored in the HDP. With Hive on Spark, Spark users automatically get the complete set of Hive’s rich features, including any new features that Hive might introduce in the future.
The main work in implementing the Spark execution engine for Hive lies in query planning, where the Hive operator plans produced by the semantic analyzer are translated into a task plan that Spark can execute. It also includes query execution, where the generated Spark plan actually gets executed in the Spark cluster.
Q10 What is a “Parquet” in Spark?
Answer:
“Parquet” is a columnar format file supported by many data processing systems. Spark SQL performs both read and write operations with “Parquet” files.
Q11 What is Catalyst framework?
Answer:
The Catalyst framework is the optimization framework in Spark SQL. It allows Spark to automatically transform SQL queries, applying new optimizations to build a faster processing system.
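Catalyst's work can be inspected directly. A small sketch, assuming the Spark 1.x API and a placeholder input file:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("CatalystPeek"))
val sqlContext = new SQLContext(sc)
sqlContext.jsonFile("people.json").registerTempTable("people")

// Every Spark SQL query carries the plans Catalyst produced; printing
// queryExecution shows the analyzed, optimized, and physical plans.
val query = sqlContext.sql("SELECT name FROM people WHERE age > 30")
println(query.queryExecution)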
Q12 Why is BlinkDB used?
Answer:
BlinkDB is a query engine
for executing interactive SQL queries on huge volumes of data and renders query
results marked with meaningful error bars. BlinkDB helps users balance ‘query
accuracy’ with response time.
Q13 How can you compare Hadoop and Spark in terms of ease of use?
Answer:
Hadoop MapReduce requires programming in Java, which is difficult, though Pig and Hive make it considerably easier. Even so, learning Pig and Hive syntax takes time. Spark has interactive APIs for different languages like Java, Python, or Scala and also includes Shark, i.e. Spark SQL, for SQL lovers – making it comparatively easier to use than Hadoop.
Q14 What are the various data sources available in SparkSQL?
Answer:
· Parquet files
· JSON datasets
· Hive tables
SparkSQL is a Spark component that supports querying data either via SQL or via the Hive Query Language. It originated as the Apache Hive port to run on top of Spark (in place of MapReduce) and is now integrated with the Spark stack. In addition to providing support for various data sources, it makes it possible to weave SQL queries with code transformations, which results in a very powerful tool. Below is an example of a Hive-compatible query.
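The example itself was missing from the original post; the following sketch is patterned after the Spark SQL programming guide, assuming a Spark 1.x HiveContext and Spark's bundled kv1.txt sample file:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("HiveQuery"))
val hiveContext = new HiveContext(sc)

// HiveQL statements run unchanged against tables in the Hive metastore.
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
hiveContext.sql("SELECT key, value FROM src WHERE key < 10").collect().foreach(println)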
Q15 What are benefits of Spark over MapReduce?
Answer:
· Due to the availability of in-memory processing, Spark executes processing around 10-100x faster than Hadoop MapReduce, which relies on persistent storage for all of its data processing tasks.
· Unlike Hadoop, Spark provides built-in libraries to perform multiple tasks from the same core, such as batch processing, streaming, machine learning, and interactive SQL queries. Hadoop, in contrast, only supports batch processing.
· Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
· Spark is capable of performing computations multiple times on the same dataset. This is called iterative computation, and there is no iterative computing implemented by Hadoop (see the caching sketch after this list).
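The caching behind the last two points takes one call. A minimal sketch with a placeholder input path:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("CacheDemo"))

// cache() keeps the RDD in memory after its first computation, so
// repeated passes over the same dataset avoid re-reading from disk.
val data = sc.textFile("data.txt").cache()
val total = data.count()                               // first pass: reads and caches
val errors = data.filter(_.contains("ERROR")).count()  // second pass: served from memory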
Q16 How SparkSQL is different from HQL and SQL?
Answer:
SparkSQL is a special component on the Spark Core engine that supports SQL and the Hive Query Language without changing any syntax. It is possible to join a SQL table and an HQL table, as sketched below.
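A minimal sketch of such a join, assuming a Spark 1.x HiveContext, the Hive table src from the Q14 example, and illustrative table and column names:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("JoinDemo"))
val hiveContext = new HiveContext(sc)

// A temp table registered from a JSON file joined with a Hive metastore table.
hiveContext.jsonFile("people.json").registerTempTable("people")
val joined = hiveContext.sql(
  "SELECT p.name, s.value FROM people p JOIN src s ON p.age = s.key")
joined.collect().foreach(println)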