Apache Spark Research Paper II

This follows up on the last post; this time I read the second Apache Spark paper, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, published by Matei Zaharia et al. in 2012.

Intro

  • Cluster computing frameworks like MapReduce and Dryad are widely adopted because they provide a high-level abstraction for parallel programming: users do not have to worry about work distribution or fault tolerance
  • Frameworks like MapReduce lack abstractions for leveraging distributed memory, which makes them inefficient for iterative algorithms
  • Resilient Distributed Datasets (RDDs) are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory and control their partitioning to optimize data placement
  • Existing abstractions for in-memory storage offer an interface based on fine-grained updates to mutable state. This leaves replicating the data across machines as the only way to provide fault tolerance, which is too data-intensive
  • RDDs provide an interface based on coarse-grained transformations (e.g., map, filter, join) that apply the same operation to many data items. This allows them to provide fault tolerance by logging the transformations used to build a dataset (its lineage) instead of replicating the actual data. If a partition of an RDD is lost, there is enough information about how it was derived from other RDDs to recompute it
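
A minimal sketch of what the lineage idea looks like in user code (assuming an existing SparkContext named sc; the HDFS path is hypothetical): each transformation returns a new RDD that remembers how it was derived, and toDebugString prints that chain.

```scala
// Coarse-grained transformations build a new RDD from an existing one;
// nothing is computed yet, only the derivation is recorded.
val lines  = sc.textFile("hdfs://logs/app.log")      // hypothetical input path
val errors = lines.filter(_.contains("ERROR"))
val fields = errors.map(_.split("\t"))

// The recorded lineage is enough to recompute a lost partition
// instead of restoring it from a replica.
println(fields.toDebugString)
```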

RDDs

Abstraction

  • Read-only, partitioned collection
  • Not materialized at all times
  • Users can control the persistence and partitioning of an RDD

Programming Interface

  • Each dataset is represented as an object and interacted with through an API
  • Spark computes RDDs lazily the first time they are used in an action, so that it can pipeline transformations
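
A small sketch of that lazy evaluation (again assuming a SparkContext sc and a hypothetical input path): the transformations only build up the RDD graph, and no work happens until the count action forces it, at which point filter and map are pipelined over each partition.

```scala
val lines   = sc.textFile("hdfs://logs/app.log")     // hypothetical path
val errors  = lines.filter(_.contains("ERROR"))      // transformation: lazy
val lengths = errors.map(_.length)                   // transformation: lazy

// The action triggers computation; filter and map run back-to-back
// on each partition rather than materializing intermediate datasets.
val n = lengths.count()
```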

Advantages of the RDD Model

  • Compare with distributed shared memory (DSM), where applications read and write to arbitrary locations (fine-grained updates) in a global address space
  • The main difference is that RDDs can only be created through coarse-grained transformations, which restricts RDDs to bulk writes but allows for more efficient fault tolerance
  • The immutable nature of RDDs lets the system mitigate slow nodes by running backup copies of slow tasks, just like MapReduce
  • In bulk operations on RDDs, the runtime can schedule tasks based on data locality, a.k.a. “move processing to the data”
  • RDDs degrade gracefully when there is not enough memory: partitions that do not fit in RAM are spilled to disk

Applications not suitable for RDDs

  • Not suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler
  • Spark aims to provide an efficient programming model for batch analytics and is not suitable for these asynchronous applications

Spark Programming Interface

  • Developers write a driver program that connects to a cluster of workers
  • The workers are long-lived processes that can store RDD partitions in RAM across operations
  • The driver also keeps track of the RDDs’ lineage
  • Scala represents closures as Java objects, which can be serialized and loaded on another node to pass a closure across the network
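
A sketch of a driver program in the spirit of the paper's log-mining example (the SparkConf settings, file path, and field layout are illustrative assumptions): the driver defines RDDs and invokes actions, while the closures passed to filter are serialized and shipped to the workers that run the tasks.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogMining {
  def main(args: Array[String]): Unit = {
    // The driver connects to a cluster of long-lived worker processes.
    val conf = new SparkConf().setAppName("log-mining")
    val sc   = new SparkContext(conf)

    val threshold = 3                                 // captured by the closure below
    val lines  = sc.textFile("hdfs://logs/app.log")   // hypothetical path
    val errors = lines.filter(_.contains("ERROR"))
    errors.persist()                                  // keep these partitions in worker RAM

    // The closure (including the captured `threshold`) is serialized as a
    // Java object and loaded on whichever node runs each task.
    val severe = errors.filter(_.split("\t")(0).toInt >= threshold)
    println(severe.count())

    sc.stop()
  }
}
```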

Representing RDDs

  • The paper gives a simple graph-based representation for RDDs that facilitates tracking lineage across a wide range of transformations
  • Each RDD is represented through five pieces of information (see the sketch after this list)
    • a set of partitions: atomic pieces of the dataset
    • a set of dependencies on parent RDDs
    • a function for computing the dataset based on its parents
    • metadata about its partitioning scheme
    • preferred locations for its partitions (data placement, e.g., which machines hold each HDFS block)
  • There is a distinction between narrow dependencies, where each partition of the parent RDD is used by at most one partition of the child RDD (e.g., map, filter), and wide dependencies, where multiple child partitions may depend on one parent partition (e.g., join on inputs that are not co-partitioned)
    • narrow dependencies allow pipelined execution on one node, while wide dependencies require shuffling partitions across the nodes
    • recovery after a node failure is more efficient with narrow dependencies, since only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes
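
A rough sketch of that per-RDD interface, transcribed into Scala (the names loosely follow the paper's Table 3; the types are simplified placeholders rather than Spark's actual internal classes):

```scala
// Illustrative, simplified version of the common interface each RDD exposes.
trait Partition { def index: Int }
trait Dependency                                       // narrow vs. wide modeled by subclasses

trait SimpleRDD[T] {
  def partitions: Seq[Partition]                       // atomic pieces of the dataset
  def dependencies: Seq[Dependency]                    // how this RDD relates to its parents
  def compute(split: Partition): Iterator[T]           // derive a partition from the parents
  def partitioner: Option[AnyRef]                      // metadata: hash/range partitioning, if any
  def preferredLocations(split: Partition): Seq[String] // data placement hints (e.g., HDFS block hosts)
}
```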

Implementation

Job scheduling

  • Whenever a user runs an action on an RDD, the scheduler examines that RDD’s lineage to build a DAG of stages to execute; each stage consists of pipelined transformations with narrow dependencies, and the stage boundaries are the shuffle operations required for wide dependencies (see the example after this list)
  • The scheduler assigns tasks to machines based on data locality using delay scheduling: if a partition is available in memory on a node, the task is sent to that node; otherwise, it is sent to the partition’s preferred locations
  • For wide dependencies, intermediate records are materialized on the nodes holding the parent partitions to simplify fault recovery
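
For example (a sketch with a hypothetical input path), the job below runs as two stages: textFile, flatMap, and map have narrow dependencies and are pipelined into one stage, while reduceByKey introduces a wide dependency, so its shuffle marks the boundary of the second stage.

```scala
val pairs = sc.textFile("hdfs://corpus/docs.txt")    // hypothetical path
  .flatMap(_.split(" "))                             // narrow dependency
  .map(word => (word, 1))                            // narrow dependency: pipelined with flatMap

val counts = pairs.reduceByKey(_ + _)                // wide dependency: shuffle, new stage

// The action below makes the scheduler build and run the two-stage DAG.
val result = counts.collect()
```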

Memory management

  • Three levels of persistence (see the sketch after this list):
    • in-memory storage as deserialized Java objects: fastest performance
    • in-memory storage as serialized Java objects: a more memory-efficient representation at the cost of lower performance
    • on-disk storage
    • refer to Understanding Spark Caching for more detail
  • When memory is limited and a newly computed RDD partition does not fit:
    • evict a partition from the least recently accessed RDD, unless that is the same RDD as the one with the new partition
    • if it is the same RDD, keep the old partition in memory, since the partition already in memory is likely to be needed again soon
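
In user code, these levels map onto storage levels passed to persist (a sketch; the paths and variable names are arbitrary, and an RDD's storage level can only be set once):

```scala
import org.apache.spark.storage.StorageLevel

val hot  = sc.textFile("hdfs://data/a.txt").persist(StorageLevel.MEMORY_ONLY)      // deserialized objects in RAM: fastest
val warm = sc.textFile("hdfs://data/b.txt").persist(StorageLevel.MEMORY_ONLY_SER)  // serialized bytes in RAM: denser, slower reads
val cold = sc.textFile("hdfs://data/c.txt").persist(StorageLevel.DISK_ONLY)        // on-disk storage
```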

Checkpointing

  • RDDs with long lineage chains containing wide dependencies are costly to recover, so checkpointing them may be worthwhile
  • Spark provides an API for checkpointing, but leaves the decision of which data to checkpoint to the user
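
A minimal sketch of that API (the checkpoint directory and input path are placeholders): the user picks which RDD to checkpoint, and Spark saves it to reliable storage so the long lineage does not have to be replayed after a failure.

```scala
sc.setCheckpointDir("hdfs://checkpoints")            // hypothetical reliable-storage directory

val counts = sc.textFile("hdfs://graph/edges.txt")   // hypothetical path
  .map(_.split("\t")(0))
  .map(node => (node, 1))
  .reduceByKey(_ + _)                                // wide dependency in the lineage

counts.checkpoint()                                  // the user decides what to checkpoint
counts.count()                                       // first action materializes and saves it
```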