# Beyond MapReduce

## In memory processing

### Hadoop's problems

- Batch processing system
- Does not process streaming data, so results arrive slowly
- Designed to process very large files, not suited for many small files
- Efficient at the map stage, bad at IO:
  - Data is loaded from and written to HDFS
  - Shuffle and sort use a lot of network traffic
- Job startup and finish take seconds, regardless of job size
- Not a good fit for every case:
  - The structure is rigid: Map, Combiner, Shuffle and sort, Reduce
  - No support for iterations
  - Only one synchronization barrier
  - Bad at e.g. graph processing

### Intro to in-memory processing

- Definition: load the data into memory before starting to process it
- Advantages:
  - More flexible computation
  - Iteration is supported
  - No slow IO required
- Disadvantages:
  - Data must fit in the memory of the distributed storage
  - Additional measures are needed for persistence
  - Fault tolerance is mandatory (memory contents are lost on failure)
- Major frameworks:
  - Apache Spark
  - Graph-centric: Pregel
  - SQL-focused, read-only: Cloudera Impala

### Spark

- Open-source, general engine for large-scale distributed data processing
- A **cluster computing** platform that has an API for distributed programming
- In-memory processing and storage engine
- Loads data from HDFS, Cassandra, ...
- Resource management via standalone Spark, EC2, or YARN
- Can work with Hadoop, or standalone
- Runs locally or on clusters
- Goal: to provide distributed datasets that users can use as if they were local
- Has the same shiny bits as MapReduce:
  - Fault tolerance
  - Data locality
  - Scalability
- Approach: augment data flow with RDDs

### RDD: Resilient Distributed Datasets

- Basic level of abstraction in Spark
- Distributed memory model: RDDs
- Immutable collections of data, **distributed** across the nodes of the cluster
- A new RDD is created by:
  - Loading data from input
  - Transforming an existing collection to generate a new one
- Can be saved to HDFS or handed to other programs with an action
- Operations (see the sketch after this list):
  - Transformations: define a new RDD from an existing one
    - `map`
    - `filter`
    - `sample`
    - `union`
    - `groupByKey`
    - `reduceByKey`
    - `join`
    - `cache`
  - Actions: take an RDD and return a result to the driver
    - `reduce`
    - `collect`
    - `count`
    - `save`
    - `lookupKey`

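A minimal word count, chaining transformations and ending with an action (a sketch: it assumes an existing `SparkContext` named `sc`, and `input.txt` is a placeholder path):

```scala
// Load a text file into an RDD (nothing is computed yet).
val lines = sc.textFile("input.txt")

val counts = lines
  .flatMap(line => line.split(" "))   // transformation: words
  .map(word => (word, 1))             // transformation: (word, 1) pairs
  .reduceByKey((a, b) => a + b)       // transformation: per-word sums

counts.collect().foreach(println)     // action: bring results to the driver
```
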
### Scala (Not in test)

- Scala is the native language for Spark
- Similar syntax to Java, but has powerful type inference

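For example, types rarely need to be written out (a small illustration):

```scala
val n = 42                                   // inferred: Int
val squares = List(1, 2, 3).map(x => x * x)  // inferred: List[Int]
```
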
### Spark application

- Consists of a **driver** that executes various parallel operations on **RDDs**, partitioned across the cluster
- The driver can be on a different machine from where the RDDs are created
- Use an action to retrieve data from an RDD
- TODO: look at diagram at p20
- The driver program runs the user's main function and executes parallel operations on a cluster

### Components

- The driver program runs the user's main function and executes parallel operations on a cluster
- Applications run as **independent** sets of processes, coordinated by a `SparkContext` in the driver
- The context runs against a cluster manager such as YARN, which allocates system resources
- Work on the cluster is done by **executors**, which are managed by the `SparkContext`
- An **executor** is responsible for executing tasks and storing data
- Deployment is up to the cluster manager used, e.g. YARN or standalone Spark

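Wiring this together in code (a sketch of the classic RDD-era setup; the app name and master URL are placeholder choices):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The SparkContext in the driver coordinates the executors.
// "local[2]" runs locally with two threads; on a cluster this would be
// e.g. "yarn" or a spark://... standalone master URL.
val conf = new SparkConf().setAppName("example").setMaster("local[2]")
val sc   = new SparkContext(conf)
```
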
### Computation

- Using anonymous functions
- Or named functions
- `map`: creates a new RDD, with each original value replaced by the value returned by the mapped function
- `filter`: creates a new RDD with fewer values

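Both styles side by side (a sketch, assuming an existing `SparkContext` `sc`):

```scala
val nums = sc.parallelize(1 to 10)

// Anonymous function: every value is replaced by the returned value.
val squares = nums.map(x => x * x)

// Named function: keep only the even squares.
def isEven(n: Int): Boolean = n % 2 == 0
val evenSquares = squares.filter(isEven)

evenSquares.collect()   // action: Array(4, 16, 36, 64, 100)
```
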
### Deferred execution

- Transformations are only executed the moment their results are needed
- Only the invocation of an action triggers the execution chain
- This allows internal optimization: operations can be combined

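In code (a sketch, assuming `sc` again):

```scala
val data   = sc.parallelize(1 to 1000000)
val mapped = data.map(_ * 2)          // nothing runs yet: just a plan
val small  = mapped.filter(_ < 100)   // still nothing runs
small.count()                         // action: the whole chain executes now
```
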
### Spark performance

#### Issues

- Because Spark allows more freedom, task allocation is much more challenging
- Errors appear more often, and they are harder to debug
- Knowledge of the basics of MapReduce helps

#### Tuning

- Memory: Spark uses more memory
- Partitioning of RDDs
- Each operation has its own performance implications

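Two of these knobs in code (a sketch, assuming `sc`; the path is a placeholder):

```scala
import org.apache.spark.storage.StorageLevel

// Control partitioning at load time: more partitions, more parallelism.
val rdd = sc.textFile("input.txt", minPartitions = 8)

// Trade memory for recomputation: spill to disk when memory is tight.
val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)
```
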
### Spark ecosystem

- GraphX: graph processing with RDDs
- MLlib: machine learning
- Spark SQL
- Spark Streaming: **stream** processing with DStream RDDs

## Stream processing

### Information streams

- Data continuously generated from various sources
- Unbounded, and the arrival time is not fixed
- Process the information the moment it is generated
- Apply a function to each new element
- Look for **real-time** changes and respond to them

### Apache Storm

#### Intro

- Developed at BackType, now an Apache project
- Real-time computation on streams
- Features:
  - Scalable
  - Guarantees no data loss
  - Extremely robust and fault tolerant
  - Programming-language agnostic
- Distributed stream processing: tasks distributed across the cluster

## Discretized Streams

- Unlike true stream processing, information is processed in micro-batches
- Input -> Spark Streaming -> batched input data -> Spark Engine -> processed batch data
- In Spark:
  - Reuses the Spark framework
  - Spark transformations can be used on the RDDs
  - An RDD is constructed every few seconds (a defined time interval) to manage the data stream
  - The new RDD is processed at each time slot

### DStream RDD

- Composed of a series of RDDs that represent the data over time
- Choosing the timer interval:
  - Small interval: quicker response time, at the cost of more frequent batching

### DStream Transformations

- Transform each RDD in the stream

### DStream Streaming Context

- Create a `StreamingContext` to manage the stream and its transformations; an action is needed to collect the results

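A minimal end-to-end sketch (host, port, and batch interval are placeholder choices; assumes the spark-streaming dependency):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("stream").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))    // one RDD per 5 s batch

val lines = ssc.socketTextStream("localhost", 9999)  // DStream of text lines
lines.map(_.length).print()                          // transformation + output

ssc.start()              // start receiving and processing
ssc.awaitTermination()
```
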
### DStream Sliding windows

- Some uses require looking at a set of stream messages to perform a computation
- A sliding window stores a rolling list with the latest items from the stream
- Its contents change over time, with new items added and old items dropped
- Use in Spark DStreams (see the sketch below):
  - The API configures the size of the window (seconds) and the frequency of computation (seconds)
  - e.g. `reduceByKeyAndWindow((a, b) => math.max(a, b), Seconds(60), Seconds(5))`
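
In context, computing a per-key maximum over the last 60 seconds, recomputed every 5 seconds (a sketch: `ssc` is the `StreamingContext` from above, and the "key value" line format is an assumption):

```scala
import org.apache.spark.streaming.Seconds

// Parse "key value" text lines into (String, Int) pairs.
val pairs = ssc.socketTextStream("localhost", 9999)
  .map(_.split(" "))
  .map(a => (a(0), a(1).toInt))

val maxPerKey = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => math.max(a, b),  // reduce function
  Seconds(60),                         // window length
  Seconds(5))                          // slide (computation) interval

maxPerKey.print()
```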