# MapReduce Algorithms

## MapReduce Design Patterns

### Design Patterns

- Definition of design patterns: a template for solving a specific, recurring problem
- Used by programmers
- MapReduce patterns are tools for solving common MapReduce problems

### Decision Points

- Mapper's algorithm
- Mapper's output key-value pairs
- Reducer's algorithm
- Reducer's output key-value pairs
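
A minimal sketch of how these four decision points show up in code, as a pure-Python simulation of the framework rather than the Hadoop API (word count is used here as a stand-in problem; all names are illustrative):

```python
from itertools import groupby
from operator import itemgetter

# Decisions 1 & 2: the mapper's algorithm and its output key-value pairs.
def mapper(record):
    # word count as a stand-in problem: emit (word, 1) for every word
    for word in record.split():
        yield (word, 1)

# Decisions 3 & 4: the reducer's algorithm and its output key-value pairs.
def reducer(key, values):
    yield (key, sum(values))

def run_job(records):
    pairs = [kv for r in records for kv in mapper(r)]      # map phase
    pairs.sort(key=itemgetter(0))                          # shuffle and sort
    out = []
    for key, group in groupby(pairs, key=itemgetter(0)):   # reduce phase
        out.extend(reducer(key, [v for _, v in group]))
    return out

print(run_job(["big data", "big ideas"]))  # [('big', 2), ('data', 1), ('ideas', 1)]
```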
## Numerical Summarization Pattern

### Intro

- **Goal**: Calculate aggregate statistical values over a large dataset
- Extract **features** from the dataset, and compute the same **function** for each feature
- **Motivation**: To provide a **top-level** view of large input datasets, to identify trends or anomalies
- **Examples**:
  - Count occurrences
  - Max / min values
  - Average / median / standard deviation

### Examples

- Max PM2.5 for each location in the dataset
- Average AQI for each week
  - Average is **NOT** an associative operation, so it can't be computed in partial steps
  - Change the mapper output to solve this, e.g. emit a (sum, count) pair that can be merged safely (see the sketch after this list)
- Number of locations with PM2.5 exceeding 150
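
Averages of averages are wrong when group sizes differ, so the usual fix is for the mapper (and combiner) to emit a mergeable (sum, count) pair and perform the division only in the reducer. A minimal sketch in pure Python, with an assumed (week, aqi) record format:

```python
def mapper(record):
    # record format assumed: (week, aqi)
    week, aqi = record
    yield (week, (aqi, 1))  # (partial sum, count) is mergeable, unlike an average

def combiner(key, values):
    # safe to run on partial data: merging (sum, count) pairs is associative
    total = sum(s for s, _ in values)
    count = sum(c for _, c in values)
    yield (key, (total, count))

def reducer(key, values):
    total = sum(s for s, _ in values)
    count = sum(c for _, c in values)
    yield (key, total / count)  # divide only once, at the very end

# e.g. one mapper's partial output for week "w1" merges correctly with
# another mapper's single record:
partial = list(combiner("w1", [(50, 1), (70, 1)]))              # [('w1', (120, 2))]
print(list(reducer("w1", [kv for _, kv in partial] + [(90, 1)])))  # [('w1', 70.0)]
```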
### Writing

- Mapper:
  - Find the feature in the input, e.g. words
  - Emit a partial aggregate value
- Reducer:
  - Compute the final aggregate result
- Combiner:
  - Aggregates the partial values from the mappers **without** changing the outcome (see the sketch below)
  - Must be optional (the framework may run it zero or more times); its purpose is to save network traffic
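
A sketch of the pattern for the max-PM2.5-per-location example above (pure Python; the input line format is an assumption). Because max is associative, the combiner can reuse the reducer's logic without changing the outcome:

```python
from collections import defaultdict

def mapper(line):
    # input line format assumed: "location,pm25"
    location, pm25 = line.split(",")
    yield (location, float(pm25))  # partial aggregate value

def combine_or_reduce(pairs):
    # max is associative, so the same aggregation is safe to run on each
    # mapper's partial output (combiner) and on the grouped values (reducer)
    best = defaultdict(lambda: float("-inf"))
    for loc, value in pairs:
        best[loc] = max(best[loc], value)
    return list(best.items())

lines = ["beijing,120", "delhi,180", "beijing,95", "delhi,210"]
mapped = [kv for line in lines for kv in mapper(line)]
print(combine_or_reduce(mapped))  # [('beijing', 120.0), ('delhi', 210.0)]
```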
## Inverted Index Pattern

### Intro

- Goal: to generate an index from a dataset, to allow faster searches for specific features
- Motivation: improve search efficiency
- Examples:
  - Find all websites that match a search term

### Writing

- Mapper:
  - Find the feature in the input
  - Emit keyword-document_identifier pairs as output
- Reducer:
  - Identity function, since sorting and partitioning are done in the _shuffle and sort_ step (see the sketch below)
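
A sketch of the pattern (pure-Python simulation; the document ids and tokenization are illustrative). The reducer just materializes the grouped values, since _shuffle and sort_ already brings together all document ids for each keyword:

```python
from collections import defaultdict

def mapper(doc_id, text):
    for keyword in set(text.lower().split()):  # one posting per document
        yield (keyword, doc_id)

def shuffle_and_sort(pairs):
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return groups.items()

def reducer(keyword, doc_ids):
    # identity: the framework already grouped the doc ids by keyword
    return (keyword, doc_ids)

docs = {1: "big data systems", 2: "big ideas"}
pairs = [kv for doc_id, text in docs.items() for kv in mapper(doc_id, text)]
for keyword, doc_ids in shuffle_and_sort(pairs):
    print(reducer(keyword, doc_ids))
# ('big', [1, 2]) ('data', [1]) ('ideas', [2]) ('systems', [1])
```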
## Data Filtering

### Intro

- Goal: Filter out useless records
- Motivation: to speed up computation
- Examples:
  - Distributed text search over many documents
  - Tracking a thread of events
  - Data cleaning
- Can be mapper-only: doesn't even need the _shuffle and sort_ step
### Top Ten: a variant of filtering

- Get a small number of records, relative to a ranking function, like the top 10
- Focus on the most important records
- Writing (see the sketch after this list):
  - Mapper: emit <null, (ranking, record)> for each record; _null_ is used so that all data goes to one partition
  - Combiner: sort values by ranking, emit the top k
  - Reducer: same as the combiner, but can emit the rank integer as the key
- Performance depends on the number of elements; without a combiner, performance is worse
- Requirement:
  - Each split's ranking data must fit into the mapper's memory
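
A sketch of the top-k flow (pure Python; `K` and the record format are assumptions, and the mapper below folds the combiner's top-k trimming into the map step for brevity). The single reducer then only ever sees k records per input split:

```python
import heapq

K = 3

def mapper(records):
    # emit (None, (ranking, record)) so everything lands in one partition;
    # keep only the local top K to bound what is sent over the network
    top = heapq.nlargest(K, records, key=lambda r: r[0])
    for item in top:
        yield (None, item)

def reducer(values):
    # same logic as the combiner, but emits the rank as the key
    top = heapq.nlargest(K, values, key=lambda r: r[0])
    for rank, (score, record) in enumerate(top, start=1):
        yield (rank, record)

split_a = [(10, "a"), (50, "b"), (20, "c"), (70, "d")]
split_b = [(60, "e"), (5, "f")]
shuffled = [v for split in (split_a, split_b) for _, v in mapper(split)]
print(list(reducer(shuffled)))  # [(1, 'd'), (2, 'e'), (3, 'b')]
```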
### Writing

- Mapper: filter the data (see the sketch below)
- Combiner and reducer are optional, depending on the scenario
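
A sketch of a mapper-only filter in the spirit of distributed grep (pure Python; the pattern and log lines are illustrative). With no reducer, each mapper's output is the job's final output and no shuffle is needed:

```python
import re

PATTERN = re.compile(r"error", re.IGNORECASE)

def mapper(line):
    # emit the record unchanged iff it passes the filter; no reducer needed
    if PATTERN.search(line):
        yield line

log = ["INFO boot ok", "ERROR disk full", "WARN slow", "error: retrying"]
print([out for line in log for out in mapper(line)])
# ['ERROR disk full', 'error: retrying']
```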
## Data Joins

### Intro

- Goal: to combine related data together, and relate information
- Examples:
  - Relate purchase habits to demographics
  - Send reminders to inactive users
  - Recommendation systems
- Types of RDB joins:
  - Inner join: element is in both L and R
  - Outer join: element is in L or R (full outer)
  - TODO: review relational database joins in p47
- Types of Hadoop joins (TODO: review these in p52):
  - Replication join: outer _map-side_ join, useful when one dataset is small and the other is big
  - Re-partition join: _reduce-side_ join for joining two or more datasets; works with two or more big datasets
  - Semi join: _map-side_ join where one of several datasets is **filtered** so that it fits in memory; works with large datasets
### Replication join

- Replicate the smallest dataset to all the map hosts, using Hadoop's distributed cache
- Writing (see the sketch below):
  - Map:
    - Load the smallest dataset into a hashtable
    - Use the key from each input split to look up the hashtable
    - Join the dataset record with the matching hashtable value
  - No reducer needed
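
A sketch of the map-side flow (pure Python; the table layouts are illustrative, and the small `users` dict stands in for the dataset shipped to every mapper via the distributed cache):

```python
# small dataset, replicated to every map host (via the distributed cache in Hadoop)
users = {101: "alice", 102: "bob"}

def setup():
    # each mapper loads the replicated dataset into an in-memory hashtable
    return dict(users)

def mapper(purchase, lookup):
    user_id, item = purchase
    user = lookup.get(user_id)   # hashtable lookup on the join key
    if user is not None:         # inner semantics; emit unmatched rows too for an outer join
        yield (user_id, (user, item))

lookup = setup()
purchases = [(101, "laptop"), (102, "phone"), (999, "tablet")]
print([kv for p in purchases for kv in mapper(p, lookup)])
# [(101, ('alice', 'laptop')), (102, ('bob', 'phone'))] -- no reducer needed
```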
### Re-partition join

- Process both datasets in the mapper, then emit <join_id, value> pairs keyed on the join attribute
- Performs the join at the reducer, among all the elements that share a join_id
- This splits the load among **all** nodes
- Writing (see the sketch below):
  - Map:
    - Emit key-value pairs with join_id as the key, and the other columns as the value
  - Reducer:
    - Joins the values that arrive with the same key
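
A sketch of the reduce-side flow (pure Python; the tag names and record layouts are illustrative). Each mapper tags its records with the source dataset, and the reducer pairs up the values that share a join_id:

```python
from collections import defaultdict
from itertools import product

def map_users(rows):       # rows like (user_id, name)
    for user_id, name in rows:
        yield (user_id, ("U", name))

def map_purchases(rows):   # rows like (user_id, item)
    for user_id, item in rows:
        yield (user_id, ("P", item))

def reducer(join_id, tagged_values):
    # pair every user value with every purchase value for this key
    users = [v for tag, v in tagged_values if tag == "U"]
    purchases = [v for tag, v in tagged_values if tag == "P"]
    for name, item in product(users, purchases):
        yield (join_id, name, item)

pairs = list(map_users([(1, "alice"), (2, "bob")])) + \
        list(map_purchases([(1, "laptop"), (1, "phone")]))
groups = defaultdict(list)
for key, value in pairs:               # stands in for shuffle and sort
    groups[key].append(value)
print([row for k in sorted(groups) for row in reducer(k, groups[k])])
# [(1, 'alice', 'laptop'), (1, 'alice', 'phone')]
```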
### Semi join
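
Based on the definition in the Intro above, a minimal sketch of the idea (pure Python; names are illustrative): a first pass shrinks one dataset to its distinct join keys so that they fit in memory, then a map-side pass filters the other dataset against that key set:

```python
# step 1: a first pass extracts the distinct join keys from one dataset,
# shrinking it to a set small enough to fit in memory
def extract_keys(rows):            # rows like (user_id, ...)
    return {row[0] for row in rows}

# step 2: ship the key set to every mapper and filter the other dataset
def mapper(row, key_set):
    if row[0] in key_set:          # keep only rows that will find a join partner
        yield row

active_users = [(1, "alice"), (3, "carol")]
purchases = [(1, "laptop"), (2, "phone"), (3, "tablet")]
keys = extract_keys(active_users)
print([r for p in purchases for r in mapper(p, keys)])
# [(1, 'laptop'), (3, 'tablet')] -- now small enough for a replication join
```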