Big-data introduction series – Apache Spark

What is Apache Spark?

Apache Spark is an open-source data processing framework for big data analytics. It is a unified, parallel processing framework designed to cover a wide range of big data workloads, such as batch processing, real-time processing, stream analytics, machine learning, and interactive SQL. Because it supports all these workloads in a single system, Spark also eases the burden of memory and tool management.

Apache Spark was initially developed at UC Berkeley’s AMPLab in 2009 and was open sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation in 2013 and became a top-level Apache project in 2014. It is now widely regarded as an alternative to older big data technologies such as Hadoop MapReduce.

Spark was originally written in the Scala programming language and runs on the Java Virtual Machine (JVM). It supports several languages for developing applications: Scala, Java, Python, SQL, and R.

Apache Spark – Framework Libraries



Features of Apache Spark

Apache Spark is a cluster computing technology designed for fast processing of large-scale data. It provides a unified, comprehensive solution for managing numerous big data workloads. With features such as near real-time processing and in-memory data storage, it can run several times faster than other big data technologies. The important and advanced features of Apache Spark are:

  • Unified Framework – Spark offers a unified framework packaged with higher-level libraries. It can process data sets from diverse sources (batch and real-time streaming) and of diverse types (text, graphics, audio, video, etc.). This unification improves both performance and developer productivity.


  • Speed – Spark allows applications in a Hadoop cluster to run up to 100 times faster in memory and about 10 times faster on disk. These are best-case figures and actual gains depend on the workload; some comparisons put Spark at about three times faster than Hadoop. The speedup is possible because Spark reduces the number of read/write operations on disk by holding intermediate results in memory rather than writing them out between steps.


  • Multiple Languages Support – Spark comes with built-in, consistent, and concise APIs in several languages, such as Java, Python, and Scala, so you can write applications in any of them.


  • Ease of Use – Spark provides easy-to-use APIs for processing large datasets. It has a built-in set of over 80 high-level operators for processing data, and it can also be used to query data interactively from the shell.


  • Runs Everywhere – Apache Spark runs almost anywhere: standalone, on Hadoop, on Mesos, or in the cloud. Spark has no storage layer of its own; it commonly uses HDFS for data storage, and it can also work with other Hadoop-compatible data sources such as HBase, Tachyon, and Cassandra.


  • Advanced Analytics – Spark's library stack consists of Spark Streaming, Spark SQL and DataFrames, MLlib for machine learning, and GraphX for graph computation. Developers can use these libraries separately or combine them in the same application. Spark thus supports streaming data, graph algorithms, SQL queries, and machine learning in addition to map and reduce operations.


  • Spark Core Engine – Apache Spark is built around an execution engine, Spark Core, that can work in memory as well as on disk. When the data size exceeds the available memory, the high-level Spark operators spill to disk and perform external operations. This lets the engine process data in a fast and expressive way.
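
The Speed feature in the list above rests on one idea: keep an intermediate result in memory so later steps do not have to rebuild or re-read it. The following is a minimal pure-Python sketch of that idea only. It does not use Spark, and `expensive_stage`, `count_words`, and `longest_line` are hypothetical names invented for illustration.

```python
# Pure-Python illustration only; none of these names are Spark APIs.
calls = {"expensive_stage": 0}

def expensive_stage(records):
    """Hypothetical stand-in for a costly intermediate step (parse and clean)."""
    calls["expensive_stage"] += 1
    return [r.strip().lower() for r in records]

def count_words(cleaned):
    """First downstream job: total number of words."""
    return sum(len(line.split()) for line in cleaned)

def longest_line(cleaned):
    """Second downstream job: the longest cleaned line."""
    return max(cleaned, key=len)

raw = ["  Spark keeps DATA in memory ", "Hadoop MapReduce writes to disk  "]

# Without reuse: each downstream job recomputes the intermediate result,
# which is analogous to rebuilding or re-reading it between jobs.
count_words(expensive_stage(raw))
longest_line(expensive_stage(raw))
recomputed_runs = calls["expensive_stage"]

# With reuse: compute once, hold the result in memory, feed both jobs.
calls["expensive_stage"] = 0
cached = expensive_stage(raw)
total_words = count_words(cached)
longest = longest_line(cached)
cached_runs = calls["expensive_stage"]

print(recomputed_runs, cached_runs, total_words, longest)
```

In this toy run the costly stage executes twice without reuse and once with it. On a real cluster the saving comes from avoiding disk and network I/O as well as recomputation, so the numbers here say nothing about actual Spark performance.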

Several other features set Apache Spark apart from other big data processing technologies. One is lazy evaluation of big data queries: transformations are not executed until a result is actually requested, which lets Spark optimize the number of steps in data processing. It also provides a higher-level API that gives big data analytics a consistent architectural model and improves developer productivity. Together, these features extend the MapReduce model and make it efficient for stream processing and interactive queries.
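
Lazy evaluation can be sketched in a few lines of plain Python. The toy `LazyDataset` class below is a hypothetical illustration, not Spark's API. Its `map` and `filter` methods only record what should happen, and `collect` runs all recorded steps in a single pass over the data. Spark's own planner is far more sophisticated.

```python
class LazyDataset:
    """Toy dataset that records transformations and runs them only on collect().

    Hypothetical class for illustration; not part of Spark.
    """

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Record the step; do no work yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Record the step; do no work yet.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # The "action": apply every recorded step in one pass per item.
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


evaluated = []  # tracks which inputs were actually processed

def square(x):
    evaluated.append(x)
    return x * x

pipeline = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
work_before_action = len(evaluated)  # nothing has run yet
result = pipeline.collect()          # the action triggers one fused pass
work_after_action = len(evaluated)

print(work_before_action, result, work_after_action)
```

Because building the pipeline does no work, the system sees the whole chain of steps before running any of them. That is what gives it room to combine or reorder steps.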