EBU6502_cloud_computing_notes/4-1-beyond-map-reduce.md

Beyond map reduce

In memory processing

Hadoop's problems

  • Batch processing system
    • Does not process streamed data, so it is slow for real-time workloads
  • Designed to process very large data, not suited for many small files
  • Efficient at the map stage, but bad at I/O:
    • Data is loaded from and written back to HDFS
    • Shuffle and sort generate a lot of network traffic
  • Job startup and finish take seconds, regardless of job size
  • Not a good fit for every case
    • The structure is rigid: Map, Combiner, Shuffle and sort, Reduce
    • No support for iterations
    • Only one sync barrier
    • Bad at e.g. graph processing

Intro to in memory Processing

  • Definition: load data into memory before processing starts
  • Advantages:
    • More flexible computation
    • Iteration is supported
    • No slow IO required
  • Disadvantages:
    • Data must fit in the memory of the distributed storage
    • Additional measures are needed for persistence
    • Fault tolerance is mandatory, since in-memory data is lost on failure
  • Major frameworks:
    • Apache Spark
    • Graph-centric: Pregel
    • SQL-focused, read-only: Cloudera Impala

Spark

  • Open-source, general-purpose engine for large-scale distributed data processing
  • A cluster computing platform with an API for distributed programming
  • In memory processing and storage engine
    • Loads data from HDFS or Cassandra
    • Resource management via standalone Spark, EC2, or YARN
    • Can work with Hadoop, or standalone
  • Runs on local or clusters
  • Goal: to provide distributed datasets, that users can use as if they are local
  • Keeps the shiny bits of MapReduce:
    • Fault tolerance
    • Data locality
    • Scalability
  • Approach: augment data flow with RDDs

RDD: Resilient Distributed Datasets

  • The basic level of abstraction in Spark
  • Distributed memory model: RDDs
    • Immutable collections of data distributed across the nodes of a cluster
    • A new RDD is created by:
      • Loading data from input
      • Transforming an existing collection to generate a new one
    • Can be saved to HDFS or passed to other programs via an action
  • Operations (both kinds are shown in the sketch after this list):
    • Transformation: Define new RDD from existing one
      • map
      • filter
      • sample
      • union
      • groupByKey
      • reduceByKey
      • join
      • cache
    • Action: Take RDD and return a result to driver
      • reduce
      • collect
      • count
      • save
      • lookupKey
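
A minimal word-count sketch in Scala illustrating both kinds of operations; the app name, the local master, and the input path `input.txt` are illustrative assumptions, not from the notes:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddOpsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-ops").setMaster("local[*]"))

    // Transformations: each one defines a new RDD from an existing one
    val lines  = sc.textFile("input.txt")      // new RDD by loading input
    val words  = lines.flatMap(_.split(" "))
    val pairs  = words.map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)

    // Action: computes the result and returns it to the driver
    counts.collect().foreach(println)

    sc.stop()
  }
}
```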

Scala (Not in test)

  • Scala is the native language of Spark
  • Similar syntax to Java, but with powerful type inference

Spark application

  • Consists of a driver that executes various parallel operations on RDDs partitioned across the cluster
  • The driver may be on a different machine from where the RDDs are created
    • Use an action to retrieve data from an RDD
  • TODO: look at diagram at p20

Components

  • The driver program runs the user's main function and executes parallel operations on a cluster
    • Applications run as independent sets of processes, coordinated by the SparkContext in the driver
    • The context runs under a cluster manager such as YARN, which allocates system resources
  • Work in the cluster is done by executors, which are managed by the SparkContext
    • Executors are responsible for executing tasks and storing data
  • Deployment depends on the cluster manager used, e.g. YARN or standalone Spark

Computation

  • Computations can be written with anonymous functions or named functions (both are shown in the sketch below)
  • map: creates a new RDD, with each original value replaced by the value returned by the mapping function
  • filter: creates a new RDD containing only the values that pass the filter
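
A minimal sketch of both styles, assuming a SparkContext `sc` like the one in the word-count sketch above:

```scala
// Named function: defined once, passed to map by name
def square(x: Int): Int = x * x

val numbers = sc.parallelize(1 to 10)   // RDD from a local collection
val squares = numbers.map(square)       // map with a named function

// Anonymous function: defined inline at the call site
val evens = squares.filter(x => x % 2 == 0)

println(evens.collect().mkString(", ")) // action: prints 4, 16, 36, 64, 100
```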

Deferred execution

  • Transformations are only executed the moment their results are needed
  • Only the invocation of an action triggers the execution chain (see the sketch below)
    • This allows internal optimization: operations can be combined
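
A small sketch of this laziness, reusing the `numbers` RDD assumed above:

```scala
// Nothing is computed here: map and filter only record the lineage
val pipeline = numbers.map(_ * 2).filter(_ > 5)

// The count() action triggers the whole chain; Spark is free to
// fuse the map and filter into a single pass over the data
val n = pipeline.count()
```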

Spark performance

Issues

  • Because Spark allows more freedom in how computations are expressed, task allocation is much more challenging
  • Errors appear more often and are harder to debug
  • Knowledge of the basics of MapReduce helps

Tuning

  • Memory: Spark uses more memory
  • Partitioning of RDDs (a small sketch follows this list)
  • Performance implications of each operation
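
A minimal sketch of two common tuning knobs; `pairs` is the key-value RDD assumed from the word-count sketch earlier, and the partition count 64 is an arbitrary illustration:

```scala
// Reshuffle the RDD into an explicit number of partitions
val repartitioned = pairs.repartition(64)

// Keep a frequently reused RDD in memory instead of recomputing it
val cached = repartitioned.cache()
```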

Spark ecosystem

  • GraphX: graph processing with RDDs
  • MLlib: machine learning
  • Spark SQL
  • Spark Streaming: stream processing with DStream RDDs

Stream processing

Information streams

  • Data continuously generated from various sources
    • Unbounded; arrival times are not fixed
  • Process the information the moment it's generated
    • Apply a function to each new element
    • Look for real-time changes and respond

Apache Storm

Intro

  • Developed at BackType; now an Apache project
  • Real time computation of streams
  • Features
    • Scalable
    • Guarantees no data loss
    • Extremely robust and fault tolerant
    • Programming language agnostic
    • Distributed Stream Processing: tasks distributed across cluster

Discretized Streams

  • Unlike true stream processing, Spark Streaming processes information in micro-batches
    • Input -> Spark Streaming -> batched input data -> Spark Engine -> Processed batch data
  • In spark:
    • reuse the spark framework
      • Can use spark transformations on RDDs
    • Construct an RDD every few seconds (a defined interval) to manage the data stream
      • A new RDD is processed at each time slot

DStream RDD

  • Composed of a series of RDDs representing data over time
  • Choosing timer interval:
    • Small interval: quicker response time, at the cost of frequent batching

DStream Transformations

  • Applied to each RDD in the stream

DStream Streaming Context

  • Create a StreamingContext to manage the stream and its transformations; an output action is needed to collect the results (see the sketch below)
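
A minimal Spark Streaming sketch in Scala; the socket source on localhost:9999 and the 5-second batch interval are illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    // Batch interval: a new RDD is constructed every 5 seconds
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hypothetical source: text lines arriving on a local socket
    val lines = ssc.socketTextStream("localhost", 9999)

    // DStream transformations apply to every RDD in the stream
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print()          // output action; without one, nothing runs

    ssc.start()             // start receiving and processing
    ssc.awaitTermination()  // block until the stream is stopped
  }
}
```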

DStream Sliding windows

  • Some use cases require looking at a set of stream messages to perform a computation
  • A sliding window stores a rolling list of the latest items from the stream
    • Its contents change over time, as new items are added and old items are dropped
  • Using in Spark DStream:
    • There is an API to configure the window size (in seconds) and the frequency of computation (in seconds)
    • Example: reduceByKeyAndWindow((a, b) => math.max(a, b), Seconds(60), Seconds(5)) (expanded in the sketch below)
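
A short sketch of that call in context, assuming the `counts` DStream from the streaming sketch above:

```scala
// Over a 60-second window, recomputed every 5 seconds,
// keep the maximum count observed for each key
val windowedMax = counts.reduceByKeyAndWindow(
  (a: Int, b: Int) => math.max(a, b), // reduce function applied within the window
  Seconds(60),                        // window length
  Seconds(5)                          // slide interval (computation frequency)
)
windowedMax.print()
```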