
Originally created at U.C. Berkeley’s AMPLab in 2009, Apache Spark is a “lightning-fast unified analytics engine” designed for large-scale data processing. It works with cluster computing platforms like Hadoop, Mesos, and Kubernetes, or it can run as a standalone cluster. It can also access data from many sources, including Hadoop Distributed File System (HDFS), Cassandra, and Hive.

In this article, we’ll dive into Spark, its libraries, and why it has grown into one of the most popular distributed processing frameworks in the industry. If you’re new to the world of Big Data, I highly recommend you read up on the Hadoop ecosystem first to get an idea of how Spark fits into a Big Data analytics stack.

Spark Core

Spark can run workloads up to 100x faster in memory and up to 10x faster on disk than the traditional Hadoop MapReduce paradigm. How is Spark so fast? Meet Spark Core, a distributed execution engine built from the ground up in the Scala programming language. Scala's concise, functional style and strong support for concurrent execution, an important trait for building distributed systems such as a compute cluster, make it a better fit for this job than the Java typically used for MapReduce.

Spark also gains a speed boost from the RDD (Resilient Distributed Dataset): a fault-tolerant data structure that handles data as an immutable, distributed collection of objects. RDDs make logical partitioning of datasets, parallel processing, and in-memory caching easy, providing a more efficient way to handle data than MapReduce's sequential, disk-write-heavy map and reduce operations.
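To make that concrete, here's a minimal Scala sketch of the RDD workflow just described, assuming a local Spark installation; the app name and the numbers are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Distribute a local collection across 4 logical partitions.
    val numbers = sc.parallelize(1 to 1000000, numSlices = 4)

    // Cache the transformed RDD in memory for reuse.
    val squares = numbers.map(n => n.toLong * n).cache()

    println(squares.sum()) // first action: computes and caches the RDD
    println(squares.max()) // second action: served from the in-memory cache

    sc.stop()
  }
}
```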

Finally, Spark gains a performance advantage through lazy evaluation. When transformations (operations such as map, join, and union) are called on an RDD, they don't execute immediately. Instead, Spark records the transformation and defers computation until an action actually needs the result, reducing computational overhead.
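Here's a small sketch of lazy evaluation in action, reusing the sc context from the previous example; the log file path is hypothetical:

```scala
// Transformations only record lineage; nothing runs yet.
val lines  = sc.textFile("data/logs.txt")          // hypothetical path
val errors = lines.filter(_.contains("ERROR"))     // lazy
val codes  = errors.map(_.split(" ")(0))           // lazy

// Only the action below triggers the whole pipeline.
val n = codes.count()
println(s"$n error lines")
```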

Spark SQL

If you're more used to dealing with data à la SQL, this Spark module has you covered. Spark SQL lets you process structured data via SQL and the DataFrame API. A DataFrame organizes data into named columns, much like a table in a relational database. Spark SQL supports the following (a brief code sketch follows the list):

  • Data formats such as Avro, CSV, JSON, and Parquet.
  • Storage systems such as HDFS, Hive, Cassandra, and MySQL.
  • APIs for Scala, Java, Python, and R.
  • Hive integration with HiveQL syntax, Hive SerDes, and UDFs.
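As a quick illustration, here's a minimal Scala sketch showing the two interchangeable ways of querying the same data; the people.csv file and its columns are assumptions:

```scala
import org.apache.spark.sql.SparkSession

object SqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-sketch")
      .master("local[*]")
      .getOrCreate()

    // Read a CSV file into a DataFrame (people.csv is a hypothetical file).
    val people = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/people.csv")

    // The DataFrame API and plain SQL are two views of the same data.
    people.filter(people("age") > 30).select("name", "age").show()

    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```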

A built-in cost-based optimizer, columnar storage, and code generation make queries fast. And because Spark SQL takes full advantage of the Spark Core engine, it scales to thousands of nodes and multi-hour queries with mid-query fault tolerance.

Spark Streaming

From social network analytics to video streaming, sensors, IoT devices, and online transactions, the demand for tools that can process fault-tolerant, high-throughput, live data streams keeps growing. The Spark Streaming module provides an API for receiving raw, unstructured input data streams and processing them with the Spark engine. Data can be ingested from many sources (a word-count sketch follows the list below):

  • HDFS/S3
  • Kafka
  • Flume
  • ZeroMQ
  • Kinesis
  • TCP sockets
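Below is a minimal Scala sketch of the classic Spark Streaming word count over a TCP socket; the host, port, and batch interval are arbitrary choices for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: streaming needs at least one thread beyond the receiver.
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")

    // Process the stream in 5-second micro-batches.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Listen on a local TCP socket (e.g. fed by `nc -lk 9999`).
    val lines = ssc.socketTextStream("localhost", 9999)

    // Count the words in each batch.
    val counts = lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```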

Industry examples of Spark Streaming are numerous. Spark Streaming has helped Uber handle the terabytes of event data streaming in from its users' mobile devices to provide real-time telemetry for drivers and passengers. Netflix uses a Kafka ETL pipeline with Spark Streaming to feed live analytics of user sessions, across a diverse ecosystem of devices, into a better recommendation engine.

MLlib

It turns out machine learning and cluster computing are a natural union, and Spark's MLlib is one way to make that happen. MLlib lets you apply machine learning algorithms, such as classification, clustering, and regression, on top of Spark's fast and efficient data processing engine.
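As a sketch of what that looks like in practice, here's a minimal k-means clustering example using MLlib's DataFrame-based API; the 2-D points are toy data invented for illustration:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MLlibSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mllib-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A tiny in-memory dataset of 2-D points (made up for illustration).
    val points = Seq(
      Vectors.dense(0.0, 0.1), Vectors.dense(0.2, 0.0),
      Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 8.9)
    ).map(Tuple1.apply).toDF("features")

    // Cluster the points into two groups with k-means.
    val model = new KMeans().setK(2).setSeed(1L).fit(points)
    model.clusterCenters.foreach(println)

    spark.stop()
  }
}
```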

GraphX

If a picture is worth a thousand words, a graph can say even more: graph analytics (a.k.a. network analytics) involves representing data as a graph of interconnected nodes. In graph theory, each node is a vertex and each connecting line is an edge, letting you visually represent information such as airline routes (edges) between cities (vertices). Spark itself uses graph theory internally, representing RDDs as vertices and operations as edges in a directed acyclic graph (DAG). GraphX extends this core feature with a full API for manipulating graphs and collections, including support for common graph algorithms such as SVD++, PageRank, and label propagation.
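To ground the airline-routes analogy, here's a minimal GraphX sketch in Scala that builds a tiny city graph and ranks the cities with the built-in PageRank algorithm; the cities and routes are toy data:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphXSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("graphx-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Cities as vertices, routes as edges (toy data for illustration).
    val cities = sc.parallelize(Seq(
      (1L, "Berkeley"), (2L, "Seattle"), (3L, "Austin")
    ))
    val routes = sc.parallelize(Seq(
      Edge(1L, 2L, "route"), Edge(2L, 3L, "route"), Edge(3L, 1L, "route")
    ))

    val graph = Graph(cities, routes)

    // Rank cities by connectivity with the built-in PageRank algorithm.
    val ranks = graph.pageRank(0.001).vertices
    ranks.join(cities).collect().foreach { case (_, (rank, city)) =>
      println(f"$city%-10s $rank%.3f")
    }

    sc.stop()
  }
}
```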

Kickstart your next analytics project with Apache Spark

From conventional structured data in archives to high-throughput data streams, machine learning, and graph analytics, Spark has grown from a fast and efficient data processing engine into a cluster computing platform that simplifies the processing of Big Data. Eager to learn more? Check out Spark's official programming guide for a quick tutorial to help you get up and running. Not a developer, but looking to use Spark's blazing-fast data processing engine for your Big Data needs? Consult with an Apache Spark specialist today.

Upwork is a freelancing website where businesses of all sizes can find talented professionals across multiple disciplines and categories. If you are a business and are looking to get projects done, consider signing up!
