EBU6502_cloud_computing_notes/4-1-beyond-map-reduce.md

Beyond map reduce

In memory processing

Hadoop's problems

  • Batch processing system
    • Does not process streamed data, so it is slow for real-time workloads
  • Designed to process very large data, not suited for many small files
  • Efficient at the map stage, but bad at I/O:
    • Data is loaded from and written back to HDFS
    • Shuffle and sort generate a lot of network traffic
  • Job startup and finish take seconds, regardless of job size
  • Not a good fit for every case
    • The structure is rigid: Map, Combiner, Shuffle and sort, Reduce
    • No support for iterations
    • Only one sync barrier
    • Bad at e.g. graph processing

Intro to in memory Processing

  • Definition: load data into memory before processing starts
  • Advantages:
    • More flexible computation
    • Iteration is supported
    • No slow IO required
  • Disadvantages:
    • Data must fit in the memory of the distributed storage
    • Additional measures are needed for persistence
    • Fault tolerance is mandatory, since in-memory data is lost on failure
  • Major frameworks:
    • Apache Spark
    • Graph-centric: Pregel
    • SQL-focused, read-only: Cloudera Impala

Spark

  • Open-source, general-purpose engine for large-scale distributed data processing
  • A cluster computing platform with an API for distributed programming
  • In memory processing and storage engine
    • Loads data from HDFS or Cassandra
    • Resource management via standalone Spark, EC2, or YARN
    • Can work with Hadoop, or standalone
  • Runs on local or clusters
  • Goal: to provide distributed datasets, that users can use as if they are local
  • Keeps the shiny bits of MapReduce:
    • Fault tolerance
    • Data locality
    • Scalability
  • Approach: augment data flow with RDDs

RDD: Resilient Distributed Datasets

  • The basic level of abstraction in Spark
  • Distributed memory model: RDDs
    • Immutable collections of data distributed across the nodes of a cluster
    • A new RDD is created by:
      • Loading data from input
      • Transforming an existing collection to generate a new one
    • Can be saved to HDFS or passed to other programs via an action
  • Operations (both kinds are shown in the sketch after this list):
    • Transformation: Define new RDD from existing one
      • map
      • filter
      • sample
      • union
      • groupByKey
      • reduceByKey
      • join
      • cache
    • Action: Take RDD and return a result to driver
      • reduce
      • collect
      • count
      • save
      • lookupKey
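
A minimal word-count sketch in Scala illustrating both kinds of operations; the app name, the local master, and the input path `input.txt` are illustrative assumptions, not from the notes:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddOpsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-ops").setMaster("local[*]"))

    // Transformations: each one defines a new RDD from an existing one
    val lines  = sc.textFile("input.txt")      // new RDD by loading input
    val words  = lines.flatMap(_.split(" "))
    val pairs  = words.map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)

    // Action: computes the result and returns it to the driver
    counts.collect().foreach(println)

    sc.stop()
  }
}
```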

Scala (Not in test)

  • Scala is the native language of Spark
  • Similar syntax to Java, but with powerful type inference

Spark application

  • Consists of a driver that executes various parallel operations on RDDs partitioned across the cluster
  • The driver may be on a different machine from where the RDDs are created
    • Use an action to retrieve data from an RDD
  • TODO: look at diagram at p20

Components

  • The driver program runs the user's main function and executes parallel operations on a cluster
    • Applications run as independent sets of processes, coordinated by the SparkContext in the driver
    • The context runs under a cluster manager such as YARN, which allocates system resources
  • Work in the cluster is done by executors, which are managed by the SparkContext
    • Executors are responsible for executing tasks and storing data
  • Deployment depends on the cluster manager used, e.g. YARN or standalone Spark

Computation

  • Computations can be written with anonymous functions or named functions (both are shown in the sketch below)
  • map: creates a new RDD, with each original value replaced by the value returned by the mapping function
  • filter: creates a new RDD containing only the values that pass the filter
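
A minimal sketch of both styles, assuming a SparkContext `sc` like the one in the word-count sketch above:

```scala
// Named function: defined once, passed to map by name
def square(x: Int): Int = x * x

val numbers = sc.parallelize(1 to 10)   // RDD from a local collection
val squares = numbers.map(square)       // map with a named function

// Anonymous function: defined inline at the call site
val evens = squares.filter(x => x % 2 == 0)

println(evens.collect().mkString(", ")) // action: prints 4, 16, 36, 64, 100
```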

Deferred execution

  • Transformations are only executed the moment their results are needed
  • Only the invocation of an action triggers the execution chain (see the sketch below)
    • This allows internal optimization: operations can be combined
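
A small sketch of this laziness, reusing the `numbers` RDD assumed above:

```scala
// Nothing is computed here: map and filter only record the lineage
val pipeline = numbers.map(_ * 2).filter(_ > 5)

// The count() action triggers the whole chain; Spark is free to
// fuse the map and filter into a single pass over the data
val n = pipeline.count()
```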

Spark performance

Issues

  • Because Spark allows more freedom in how computations are expressed, task allocation is much more challenging
  • Errors appear more often and are harder to debug
  • Knowledge of the basics of MapReduce helps

Tuning

  • Memory: Spark uses more memory
  • Partitioning of RDDs (a small sketch follows this list)
  • Performance implications of each operation
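
A minimal sketch of two common tuning knobs; `pairs` is the key-value RDD assumed from the word-count sketch earlier, and the partition count 64 is an arbitrary illustration:

```scala
// Reshuffle the RDD into an explicit number of partitions
val repartitioned = pairs.repartition(64)

// Keep a frequently reused RDD in memory instead of recomputing it
val cached = repartitioned.cache()
```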

Spark ecosystem

  • GraphX: graph processing with RDDs
  • MLlib: machine learning
  • Spark SQL
  • Spark Streaming: stream processing with DStream RDDs

Stream processing

Information streams

  • Data continuously generated from various sources
    • Unbounded; arrival times are not fixed
  • Process the information the moment it's generated
    • Apply a function to each new element
    • Look for real-time changes and respond

Apache Storm

Intro

  • Developed at BackType; now an Apache project
  • Real time computation of streams
  • Features
    • Scalable
    • Guarantees no data loss
    • Extremely robust and fault tolerant
    • Programming language agnostic
    • Distributed Stream Processing: tasks distributed across cluster

Discretized Streams

  • Unlike true stream processing, Spark Streaming processes information in micro-batches
    • Input -> Spark Streaming -> batched input data -> Spark Engine -> Processed batch data
  • In spark:
    • reuse the spark framework
      • Can use spark transformations on RDDs
    • Construct an RDD every few seconds (a defined interval) to manage the data stream
      • A new RDD is processed at each time slot

DStream RDD

  • Composed of a series of RDDs representing data over time
  • Choosing timer interval:
    • Small interval: quicker response time, at the cost of frequent batching

DStream Transformations

  • Applied to each RDD in the stream

DStream Streaming Context

  • Create a StreamingContext to manage the stream and its transformations; an output action is needed to collect the results (see the sketch below)
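
A minimal Spark Streaming sketch in Scala; the socket source on localhost:9999 and the 5-second batch interval are illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    // Batch interval: a new RDD is constructed every 5 seconds
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hypothetical source: text lines arriving on a local socket
    val lines = ssc.socketTextStream("localhost", 9999)

    // DStream transformations apply to every RDD in the stream
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print()          // output action; without one, nothing runs

    ssc.start()             // start receiving and processing
    ssc.awaitTermination()  // block until the stream is stopped
  }
}
```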

DStream Sliding windows

  • Some use cases require looking at a set of stream messages to perform a computation
  • A sliding window stores a rolling list of the latest items from the stream
    • Its contents change over time, as new items are added and old items are dropped
  • Using in Spark DStream:
    • There is an API to configure the window size (in seconds) and the frequency of computation (in seconds)
    • Example: reduceByKeyAndWindow((a, b) => math.max(a, b), Seconds(60), Seconds(5)) (expanded in the sketch below)
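
A short sketch of that call in context, assuming the `counts` DStream from the streaming sketch above:

```scala
// Over a 60-second window, recomputed every 5 seconds,
// keep the maximum count observed for each key
val windowedMax = counts.reduceByKeyAndWindow(
  (a: Int, b: Int) => math.max(a, b), // reduce function applied within the window
  Seconds(60),                        // window length
  Seconds(5)                          // slide interval (computation frequency)
)
windowedMax.print()
```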