Apache Spark is a unified analytics engine that is used for large-scale data processing. It is a popular framework among data scientists due to its speed, ease of use, flexibility, and scalability. Developed in 2009, it has become one of the most rapidly-adopted data analytics frameworks by companies in different industries across the globe.
Apache Spark jobs are one of the most in-demand jobs today. With demand, there is also competition, and to get a job in the field, you need to be one of the best. While having the necessary Apache Spark skills is half the job done, acing the job interview is a different story altogether. To help you succeed in your next Spark interview, we have compiled this list of top Apache Spark interview questions and answers.
Here are some frequently asked questions to help you crack your upcoming Apache Spark interview and land your dream job.
Top Apache Spark Interview Questions & Answers
Q1. What is RDD?
Ans. An RDD (Resilient Distributed Dataset) is a fault-tolerant collection of elements that can be operated on in parallel. The partitioned data in an RDD is immutable and distributed.
Q2. Name the different types of RDD
Ans. There are primarily two types of RDD – parallelized collection and Hadoop datasets.
Q3. What are the methods of creating RDDs in Spark?
Ans. There are two methods –
- By parallelizing a collection in your Driver program.
- By loading an external dataset from external storage such as HDFS, HBase, or a shared file system.
Q4. What is a Sparse Vector?
Ans. A sparse vector has two parallel arrays –one for indices and the other for values.
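The two-array layout can be sketched in plain Python. This is a minimal illustration rather than Spark's own implementation: the `SparseVector` class and its `to_dense` method below are hypothetical names, and the explicit `size` field is an assumption added so the dense form can be rebuilt. In Spark itself, a comparable structure is available through MLlib's `Vectors.sparse`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SparseVector:
    """Hypothetical stand-in for a sparse vector: only non-zero entries are stored."""
    size: int             # logical length of the vector
    indices: List[int]    # positions of the non-zero entries
    values: List[float]   # values[i] is the entry at position indices[i]

    def to_dense(self) -> List[float]:
        # Start from all zeros and fill in the stored entries.
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return dense


# The dense vector [1.0, 0.0, 0.0, 3.0, 0.0] needs only two stored entries.
sv = SparseVector(size=5, indices=[0, 3], values=[1.0, 3.0])
dense = sv.to_dense()
```

The saving grows with the share of zeros: a vector with a million positions and ten non-zero entries stores only ten indices and ten values.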
Q5. Mention some of the areas where Spark outperforms Hadoop in processing
Ans. Sensor data processing, real-time querying of data, and stream processing.
Q6. What are the languages supported by Apache Spark and which is the most popular one?
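Ans. There are four languages supported by Apache Spark – Scala, Java, Python, and R. Spark itself is written in Scala, which is commonly cited as the most popular choice, although Python (PySpark) is also very widely used.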
Q7. What is Yarn?
Ans. YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop. Spark can run on YARN, which provides a central resource management platform to deliver scalable operations across the cluster.
Q8. Do you need to install Spark on all nodes of the Yarn cluster? Why?
Ans. No. Spark runs on top of YARN, so it does not need to be installed on every node of the cluster.
Q9. Is it possible to run Apache Spark on Apache Mesos?
Ans. Yes. Apache Mesos is one of the cluster managers on which Spark can run (see Q23).
Q10. What is the lineage graph?
Ans. RDDs in Spark depend on one or more other RDDs. The representation of the dependencies between RDDs is known as the lineage graph.
Q11. Define Partitions in Apache Spark
Ans. A partition is a smaller, logical division of data, similar to a ‘split’ in MapReduce. It is a logical chunk of a large distributed data set. Partitioning is the process of deriving logical units of data so that they can be processed in parallel, which speeds up processing.
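The idea of splitting a data set into logical chunks can be sketched in plain Python. This is only an illustration under simple assumptions: `split_into_partitions` is a hypothetical helper that slices a list into contiguous chunks, and it does not reproduce how Spark actually assigns records to partitions.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_partitions(data: Sequence[T], num_partitions: int) -> List[List[T]]:
    """Slice `data` into `num_partitions` contiguous chunks of near-equal size."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    n = len(data)
    return [
        list(data[i * n // num_partitions:(i + 1) * n // num_partitions])
        for i in range(num_partitions)
    ]


partitions = split_into_partitions(list(range(10)), 3)

# Each chunk could be handed to a separate worker and processed independently;
# the per-partition results are then combined.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)
```

Because each chunk is processed on its own, the work can be spread across as many workers as there are partitions.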
Q12. What is a DStream?
Ans. A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets (RDDs) that represents a stream of data.
Q13. What is a Catalyst framework?
Ans. Catalyst framework is an optimization framework present in Spark SQL. It allows Spark to automatically transform SQL queries by adding new optimizations to build a faster processing system.
Q14. What are the Actions in Spark?
Ans. An action helps in bringing back the data from RDD to the local machine. An action’s execution is the result of all previously created transformations.
Q15. What is a Parquet file?
Ans. Parquet is a columnar format file supported by many other data processing systems.
Q16. What is GraphX?
Ans. Spark uses GraphX for graph processing to build and transform interactive graphs.
Q17. What file systems does Spark support?
Ans. Hadoop distributed file system (HDFS), local file system, and Amazon S3.
Q18. What are the different types of transformations on DStreams? Explain.
Ans. There are two types of transformations on DStreams:
- Stateless Transformations – Processing of a batch does not depend on the output of the previous batch. Examples – map(), reduceByKey(), filter().
- Stateful Transformations – Processing of a batch depends on the intermediate results of previous batches. Examples – transformations that depend on sliding windows.
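The difference between the two kinds of transformation can be sketched in plain Python, treating a stream as a list of micro-batches. The data and the functions `stateless_upper` and `stateful_counts` are hypothetical and are not part of the Spark Streaming API.

```python
from typing import Dict, List

# Each inner list stands for one micro-batch of a stream (made-up data).
batches: List[List[str]] = [["a", "b", "a"], ["b", "c"], ["a"]]


def stateless_upper(batch: List[str]) -> List[str]:
    """Stateless: the output depends only on the current batch."""
    return [word.upper() for word in batch]


def stateful_counts(state: Dict[str, int], batch: List[str]) -> Dict[str, int]:
    """Stateful: running counts from earlier batches feed into this one."""
    updated = dict(state)
    for word in batch:
        updated[word] = updated.get(word, 0) + 1
    return updated


upper_batches = [stateless_upper(b) for b in batches]

running: Dict[str, int] = {}
history: List[Dict[str, int]] = []
for b in batches:
    running = stateful_counts(running, b)
    history.append(running)
```

Here `upper_batches` could be computed for any batch in isolation, whereas each entry of `history` can only be produced after all earlier batches have been processed.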
Q19. What is the difference between persist() and cache()?
Ans. persist() lets the user specify the storage level, whereas cache() always uses the default storage level (MEMORY_ONLY for RDDs).
Q20. What do you understand by SchemaRDD?
Ans. SchemaRDD is an RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column. SchemaRDD was renamed DataFrame in Spark 1.3.
Q21. What are the important components of the Spark ecosystem?
Ans. A Spark ecosystem consists of the following components:
- Spark SQL (the successor to Shark) – It performs structured data analysis on large data sets.
- Spark Streaming – It lets you build applications that process live streaming data.
- MLlib (Machine Learning) – It supports a variety of machine learning algorithms.
- GraphX – It is the API for graphs and graph-parallel execution.
- SparkR – It supports R programming on the Spark engine. SparkR is a package for the R language that lets you use Spark from the R shell.
Q22. What is the difference between Apache Spark and Hadoop MapReduce?
Ans. The differences between Apache Spark and Hadoop MapReduce are:
|Apache Spark|Hadoop MapReduce|
|It is a data analytics engine.|It is a data processing engine.|
|It does batch processing as well as real-time data processing.|It processes data in batches only.|
|It is up to 100 times faster in memory and up to 10 times faster on disk than MapReduce.|It is slower than Apache Spark for large-scale data processing because of disk I/O latency.|
|It is costlier than MapReduce because it needs a large amount of RAM.|It is less costly than Apache Spark.|
|Data is stored in RAM (in-memory) and is easier to retrieve.|Data is stored in HDFS and takes a long time to retrieve.|
|It processes every record only once and eliminates duplication.|It does not support this feature.|
Q23. What are the different cluster managers available in Apache Spark?
Ans. The different cluster managers in Apache Spark are:
- Standalone Mode: It is a simple cluster manager that comes included with Spark. It runs applications in FIFO order, and by default each application tries to use all the available nodes. The number of nodes can be limited per application, per user, or globally.
- Apache Mesos: Apache Mesos is a distributed systems kernel. It has master and slave processes. Apache Mesos manages computer clusters and is capable of running Hadoop applications.
- Hadoop YARN: Hadoop YARN is a distributed computing framework for job scheduling and cluster resource management.
Q24. What are the advantages of Spark over Hadoop MapReduce?
Ans. The following are the advantages of Spark over Hadoop MapReduce:
- Faster Speed: Spark is fast. It uses in-memory processing and runs programs up to 100x faster than MapReduce in memory and 10x faster while running on the disk.
- Multiple Tasks: Apache Spark has in-built libraries for performing multiple tasks from the same core. However, Hadoop supports only the batch processing task.
- Dependency on Disk: Spark uses caching and in-memory data storage while MapReduce is highly disk-dependent.
- Easily Switch Tasks: Spark lets you perform different tasks from a single application or console and gives immediate results, so you can easily switch between workloads running on the cluster.
Q25. What is a lazy evaluation in Spark?
Ans. For large data in Spark, multiple operations take place even for the execution of a basic transformation. When a transformation is called on an RDD, the operation does not occur immediately. Transformations in Spark are not evaluated until you trigger an action. This is known as lazy evaluation. It avoids unnecessary memory and CPU usage that could take place due to certain mistakes.
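The behaviour described above can be imitated in plain Python. The sketch below is a toy model, not Spark's implementation: `LazyDataset` and `traced_double` are hypothetical names, transformations only record what should happen, and the `collect` action is what finally runs the recorded steps.

```python
from typing import Any, Callable, Iterable, List


class LazyDataset:
    """Hypothetical toy dataset: transformations are recorded, actions run them."""

    def __init__(self, source: Iterable[Any], steps=()):
        self._source = source
        self._steps = tuple(steps)   # recorded transformations, not yet executed

    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self) -> List[Any]:
        """Action: only now is the recorded pipeline actually evaluated."""
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):       # kind == "filter" and the predicate failed
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def traced_double(x):
    calls.append(x)      # side effect shows when the work really happens
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(traced_double).filter(lambda x: x > 4)
calls_before_action = list(calls)   # still empty: nothing has run yet
result = ds.collect()               # the action triggers evaluation
```

Until `collect` is called, `traced_double` has not been invoked at all, which is the essence of lazy evaluation: a pipeline that is never used costs nothing.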
Q26. What is a Shuffle operation in Spark?
Ans. Typically, in Spark, a single task operates on elements in one partition. A Shuffle operation is used to re-distribute data across multiple partitions. It runs an operation on all elements of all partitions. Shuffle Operation is an expensive and complex operation.
Shuffle has two compression parameters:
- spark.shuffle.compress: controls whether shuffle output files are compressed.
- spark.shuffle.spill.compress: controls whether data spilled during shuffles is compressed.
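The redistribution step of a shuffle can be sketched in plain Python. This is a simplified model under stated assumptions: `shuffle_by_key` and `reduce_by_key` are hypothetical helpers, everything happens in one process with no network transfer, spilling, or compression, and the sum of character codes stands in for a real hash partitioner so that the result is deterministic.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, int]


def shuffle_by_key(partitions: List[List[Pair]], num_out: int) -> List[List[Pair]]:
    """Redistribute pairs so all records with the same key land in one output partition."""
    out: List[List[Pair]] = [[] for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            # Deterministic stand-in for a hash partitioner.
            target = sum(map(ord, key)) % num_out
            out[target].append((key, value))
    return out


def reduce_by_key(partition: List[Pair]) -> Dict[str, int]:
    """Sum the values per key within a single partition."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


# Records for key "a" start out scattered across all three input partitions.
before = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)], [("b", 1), ("a", 1)]]
after = shuffle_by_key(before, num_out=2)
totals = [reduce_by_key(p) for p in after]
```

Every input partition may contribute records to every output partition, which is why a shuffle touches all the data and is expensive compared with a per-partition operation.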
Q27. What are the main functions of Spark Core in Apache Spark?
Ans. The Spark Core is the heart of Spark and performs the following functions:
- Task Scheduling
- Fault Recovery
- Interacting with Storage Systems
- Memory Management
Q28. What is a worker node in Spark?
Ans. A worker node is any node that runs application code in the Spark cluster. The master node assigns tasks based on resource availability. The worker node, which acts as a slave node, performs the assigned work, processes the data stored on the node, keeps data in memory or on disk, and reports its resources to the master.
Q29. What is DStream in Spark?
Ans. In Spark, DStream (Discretized Stream) is the basic abstraction of Spark Streaming. It is a continuous stream of data that is either in the form of input from various sources or a data stream generated by transforming the input stream. It provides you with a high-level API for convenience.
Q30. What are the different levels of persistence in Spark?
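Ans. The different levels of persistence in Spark are:
- MEMORY_ONLY: stores the RDD as deserialized Java objects in the JVM; partitions that do not fit in memory are recomputed when needed.
- MEMORY_AND_DISK: stores the RDD as deserialized Java objects in the JVM; partitions that do not fit in memory are stored on disk.
- MEMORY_ONLY_SER: stores the RDD as serialized Java objects (one byte array per partition).
- MEMORY_AND_DISK_SER: similar to MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk.
- DISK_ONLY: stores the RDD partitions only on disk.
- OFF_HEAP: similar to MEMORY_ONLY_SER, but stores the data in off-heap memory.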
Q31. What does the map() function do?
Ans. map() is a transformation in Spark that applies a function to each element of an RDD and returns the results as a new RDD. For example, map() can iterate over every line in an RDD and split each one, producing a new RDD.
It takes one element as input, processes it according to the code provided by the developer, and returns one element. The map() function therefore transforms an RDD of length N into another RDD of length N.
Q32. What does the filter() function do?
Ans. The filter() function returns a new RDD formed by selecting those elements of the existing RDD for which the supplied function returns true.
In simple terms, filter() picks the elements that satisfy the condition (function) passed as an argument to the method.
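The contracts of the two operations described in Q31 and Q32 can be sketched on ordinary Python lists. The helpers `rdd_map` and `rdd_filter` are hypothetical stand-ins for illustration and operate eagerly on a single list, unlike the distributed, lazy versions in Spark.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def rdd_map(elements: List[T], fn: Callable[[T], U]) -> List[U]:
    """Apply fn to every element: N inputs always give N outputs."""
    return [fn(x) for x in elements]


def rdd_filter(elements: List[T], pred: Callable[[T], bool]) -> List[T]:
    """Keep only the elements for which pred returns True."""
    return [x for x in elements if pred(x)]


lines = ["spark is fast", "", "hadoop mapreduce", "spark streaming"]

lengths = rdd_map(lines, len)                              # one output per input
spark_lines = rdd_filter(lines, lambda s: "spark" in s)    # subset of the input
```

The mapped result always has the same length as the input, while the filtered result can be anywhere from empty to the full input.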
Q33. What is the use of the Sliding Window in Spark?
Ans. A sliding window in Spark Streaming defines which batches of the stream are processed together. As the window slides over a source DStream, the source RDDs that fall within the window are combined and operated on. It requires two parameters:
- Window length: it specifies the duration of the window
- Sliding interval: it refers to the interval at which the window operation is performed
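How the two parameters interact can be sketched in plain Python. This is a simplification: `sliding_windows` is a hypothetical helper, and both parameters are counted in batches here, whereas Spark Streaming expresses them as durations that must be multiples of the batch interval.

```python
from typing import List


def sliding_windows(
    batches: List[List[int]], window_length: int, slide_interval: int
) -> List[List[int]]:
    """Combine `window_length` consecutive batches, advancing `slide_interval` batches each time."""
    windows = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        # Flatten the batches that fall inside the current window.
        combined = [x for batch in batches[end - window_length:end] for x in batch]
        windows.append(combined)
    return windows


batches = [[1], [2, 3], [4], [5, 6], [7]]
windows = sliding_windows(batches, window_length=3, slide_interval=2)
window_sums = [sum(w) for w in windows]
```

With a window length of 3 and a sliding interval of 2, consecutive windows overlap by one batch, so the batch `[4]` is counted in both windows.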
Q34. What is RDD Lineage?
Ans. RDD lineage is the record of transformations used to build a Resilient Distributed Dataset (RDD), and it is what lets Spark reconstruct lost partitions. Since Spark does not replicate data in memory, some data can be lost, for example when a node fails. The RDD uses its lineage to recompute the lost data. RDD lineage therefore makes the system resilient while avoiding the overhead of replication.
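Recovery through lineage can be sketched with a toy class in plain Python. `ToyRDD` and its methods are hypothetical names, the model covers only one-to-one (map-style) dependencies, and a lost partition is simulated by setting it to `None`.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Hypothetical toy RDD that remembers its parent and the function that derived it."""

    def __init__(self, partitions, parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable] = None):
        self.partitions = partitions   # list of lists; None marks a lost partition
        self.parent = parent
        self.fn = fn

    def map(self, fn: Callable) -> "ToyRDD":
        return ToyRDD([[fn(x) for x in p] for p in self.partitions], parent=self, fn=fn)

    def lineage(self) -> List["ToyRDD"]:
        """Walk back through the parents to the original source."""
        node, chain = self, []
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain

    def recover(self, index: int) -> List:
        """Rebuild one partition by re-applying fn to the parent's partition."""
        if self.partitions[index] is None:
            if self.parent is None:
                raise RuntimeError("source partition lost; nothing to recompute from")
            parent_part = self.parent.recover(index)
            self.partitions[index] = [self.fn(x) for x in parent_part]
        return self.partitions[index]


source = ToyRDD([[1, 2], [3, 4]])
squared = source.map(lambda x: x * x)
shifted = squared.map(lambda x: x + 1)

shifted.partitions[1] = None      # simulate losing a partition
squared.partitions[1] = None      # the intermediate result is gone as well
recovered = shifted.recover(1)    # recomputed from the source via the lineage
depth = len(shifted.lineage())
```

Only the lost partition is recomputed; partition 0 is left untouched, which is what makes lineage-based recovery cheaper than keeping a second copy of all the data.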
Q35. Define Spark Driver.
Ans. The Spark Driver is the program that declares the transformations and actions on RDDs. It creates the SparkContext, which connects to a given Spark Master, and is responsible for launching parallel operations on the cluster. It typically runs on the master node. The driver submits the RDD graph for execution, splits the Spark application into tasks, and schedules them to run on executors.
The driver process runs the main() function and performs the following:
- Maintains information about the Spark application
- Responds to a program or input
- Analyzes and distributes work across the executors