# Beyond map reduce
## In memory processing
### Hadoop's problems
- Batch processing system
- Does not process streaming data, so results only arrive after the whole batch completes
- Designed to process very large data; not suited for many small files
- Efficient at the map stage, bad at IO:
    - Data is loaded from and written back to HDFS
    - Shuffle and sort generate a lot of network traffic
- Job startup and finish take seconds, regardless of job size
- Not a good fit for every case
- The structure is rigid: Map, Combiner, Shuffle and sort, Reduce
- No support for iterations
- Only one sync barrier
- Bad at e.g. graph processing
### Intro to in memory Processing
- Definition: load data into memory before processing starts
- Advantages:
    - More flexible computation
    - Iteration is supported
    - No slow IO between steps
- Disadvantages:
    - Data must fit in the distributed memory of the cluster
    - Additional measures are needed for persistence
    - Fault tolerance is mandatory
- Major frameworks:
    - Apache Spark
    - Graph-centric: Pregel
    - SQL-focused, read-only: Cloudera Impala
### Spark
- Open-source, general engine for large-scale distributed data processing
- A **cluster computing** platform that provides APIs for distributed programming
- In-memory processing and storage engine
- Loads data from HDFS, Cassandra, ...
- Resource management via Spark (standalone), EC2, YARN
- Can work with Hadoop or standalone
- Runs locally or on clusters
- Goal: to provide distributed datasets that users can use as if they were local
- Has the same strengths as MapReduce:
    - Fault tolerance
    - Data locality
    - Scalability
- Approach: augment the data flow with RDDs
### RDD: Resilient Distributed Datasets
- Basic level of abstraction in Spark
- Distributed memory model: RDDs
- Immutable collections of data **distributed** across the nodes of the cluster
- A new RDD is created by:
    - Loading data from input
    - Transforming an existing collection to generate a new one
- Can be saved to HDFS or handed to other programs via an action
- Operations (a short Scala sketch follows this list):
    - Transformations: define a new RDD from an existing one
        - `map`
        - `filter`
        - `sample`
        - `union`
        - `groupByKey`
        - `reduceByKey`
        - `join`
        - `cache`
    - Actions: take an RDD and return a result to the driver
        - `reduce`
        - `collect`
        - `count`
        - `save`
        - `lookupKey`
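
A minimal Scala sketch of how transformations and actions fit together, assuming an existing `SparkContext` named `sc`; the HDFS paths and the `ERROR` filter are made-up placeholders:

```scala
// Assumes an existing SparkContext `sc`; paths and field layout are placeholders.
val lines  = sc.textFile("hdfs:///data/input.txt")         // load input -> RDD[String]
val errors = lines.filter(line => line.contains("ERROR"))  // transformation: smaller RDD
val pairs  = errors.map(line => (line.split(" ")(0), 1))   // transformation: key by first field
val counts = pairs.reduceByKey((a, b) => a + b)            // transformation: combine per key
counts.cache()                                             // keep this RDD in memory for reuse

println(counts.count())                                    // action: returns a result to the driver
counts.saveAsTextFile("hdfs:///data/error-counts")         // action: persist the RDD to HDFS
```

The transformations only describe new RDDs; work is done when the two actions at the end run.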
### Scala (Not in test)
- Scala is the native language for Spark
- Similar syntax to Java, but has powerful type inference
### Spark application
- Consists of a **driver** that executes various parallel operations on **RDDs** partitioned across the cluster
- The driver may be on a different machine from where the RDDs are created
- Use an action to retrieve data from an RDD
- TODO: look at diagram at p20
### Components
- The driver program runs the user's main function and executes parallel operations on a cluster
- Runs as **independent** sets of processes, coordinated by a `SparkContext` in the driver (sketched below)
- The context connects to a cluster manager such as YARN, which allocates system resources
- Work on each cluster node is handled by an **executor**, which is managed by the `SparkContext`
- The **executor** is responsible for executing tasks and storing data
- Deployment depends on the cluster manager used, e.g. YARN or standalone Spark
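
A rough sketch of that anatomy; the application name, master URL, and the toy computation are placeholder assumptions, not part of the notes:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ExampleDriver {
  def main(args: Array[String]): Unit = {
    // The driver creates the SparkContext, which connects to the cluster
    // manager (YARN, standalone Spark, ...) named by the master URL.
    val conf = new SparkConf()
      .setAppName("example-driver")   // placeholder application name
      .setMaster("yarn")              // or "local[*]" / "spark://host:7077"
    val sc = new SparkContext(conf)

    // Parallel operations issued here are scheduled onto executors,
    // which run the tasks and hold any cached data.
    val data  = sc.parallelize(1 to 1000)
    val total = data.reduce(_ + _)
    println(total)

    sc.stop()                         // release executors and cluster resources
  }
}
```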
### Computation
- Functions passed to operations can be anonymous functions or named functions (see the sketch after this list)
- `map`: creates a new RDD in which each original value is replaced by the value returned from the map function
- `filter`: creates a new RDD with fewer values (only those for which the function returns true)
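
A small sketch contrasting the two styles, assuming an existing `SparkContext` `sc`; the numbers are made up for illustration:

```scala
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Anonymous function: each value is replaced by what the function returns.
val squares = nums.map(x => x * x)            // conceptually (1, 4, 9, 16, 25)

// Named function passed to filter: the new RDD keeps fewer values.
def isEven(x: Int): Boolean = x % 2 == 0
val evens = squares.filter(isEven)            // conceptually (4, 16)

println(evens.collect().mkString(", "))       // action to bring results to the driver
```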
### Deferred execution
- Transformations are only executed at the moment their results are needed
- Only the invocation of an action triggers the execution chain
- This allows internal optimization: operations can be combined (see the sketch below)
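
A sketch of the effect, assuming an existing `SparkContext` `sc` and a placeholder log path:

```scala
val logs    = sc.textFile("hdfs:///logs/app.log")   // nothing is read yet
val errors  = logs.filter(_.contains("ERROR"))      // still nothing executed
val lengths = errors.map(_.length)                  // still nothing executed

// Only this action triggers the whole chain; Spark can combine the
// filter and map into a single pass over the data.
val total = lengths.reduce(_ + _)
```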
### Spark performance
#### Issues
- Because Spark allows much more freedom in how computations are structured, task allocation is much more challenging
- Errors appear more often and are harder to debug
- Knowledge of the basics of MapReduce helps
#### Tuning
- Memory: Spark uses more memory
- Partitioning of RDDs
- Each operation has its own performance implications
### Spark ecosystem
- GraphX: graph processing with RDDs
- MLlib: machine learning
- Spark SQL
- Spark Streaming: **stream** processing with DStream RDDs
## Stream processing
### Information streams
- Data is continuously generated from various sources
- Unbounded; arrival times are not fixed
- Process the information the moment it is generated
- Apply a function to each new element
- Look for **real time** changes and respond to them
### Apache Storm
#### Intro
- Developed at BackType (acquired by Twitter); now an Apache project
- Real-time computation on streams
- Features:
    - Scalable
    - Guarantees no data loss
    - Extremely robust and fault tolerant
    - Programming language agnostic
- Distributed stream processing: tasks are distributed across the cluster
## Discretized Streams
- Unlike true stream processing, information is processed in micro-batches
- Input -> Spark Streaming -> batched input data -> Spark Engine -> processed batch data
- In Spark:
    - Reuses the Spark framework
    - Spark transformations can be used on the RDDs
    - Constructs an RDD every few seconds (a defined time interval) to manage the data stream
    - A new RDD is processed at each time slot
### DStream RDD
- Composed of a series of RDDs that represent the data over time
- Choosing the time interval:
    - A small interval gives quicker response time, at the cost of more frequent batching
### DStream Transformations
- A transformation is applied to each RDD in the stream
### DStream Streaming Context
- Create a `StreamingContext` to manage the stream and its transformations; an action (output operation) is needed to collect the results, as in the sketch below
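
A minimal sketch of such a pipeline; the socket source on `localhost:9999` and the 5-second batch interval are assumptions for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2]: the receiver occupies one core, so at least two are needed locally.
val conf = new SparkConf().setAppName("dstream-example").setMaster("local[2]")
// Batch interval: a new RDD of received data is constructed every 5 seconds.
val ssc = new StreamingContext(conf, Seconds(5))

val lines  = ssc.socketTextStream("localhost", 9999)    // DStream[String]
val words  = lines.flatMap(_.split(" "))                // transformation applied to every RDD
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)  // per-batch word counts

counts.print()          // output operation: the "action" that triggers execution
ssc.start()             // start receiving and processing the stream
ssc.awaitTermination()
```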
### DStream Sliding windows
- Some use cases require looking at a set of stream messages to perform a computation
- A sliding window stores a rolling list of the latest items from the stream
- The contents change over time, with new items added and old items dropped
- Use in Spark DStreams (a fuller sketch follows this list):
    - The API configures the size of the window (in seconds) and the frequency of computation (in seconds)
    - Example, on a pair DStream: `reduceByKeyAndWindow((a, b) => math.max(a, b), Seconds(60), Seconds(5))`
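
Continuing the StreamingContext sketch above (so `counts` is the per-batch word-count DStream and `Seconds` is already imported), a hedged example of the windowed reduce with the 60-second window and 5-second slide from the note:

```scala
// For every key, keep the maximum count seen in the last 60 seconds,
// recomputed every 5 seconds as the window slides forward.
val windowedMax = counts.reduceByKeyAndWindow(
  (a: Int, b: Int) => math.max(a, b),   // reduce function applied inside the window
  Seconds(60),                          // window length
  Seconds(5)                            // slide interval (how often it is computed)
)
windowedMax.print()
```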