# MapReduce Algorithms

## MapReduce Design Patterns

### Design Patterns

- Definition of design patterns: reusable templates for solving specific
  problems
- Used by programmers
- MapReduce patterns are tools for solving common MapReduce problems

### Decision Points

- Mapper's algorithm
- Mapper's output key-value pairs
- Reducer's algorithm
- Reducer's output key-value pairs

## Numerical Summarization Pattern

### Intro

- **Goal**: Calculate aggregate statistical values over a large dataset
  - Extract **features** from the dataset, and compute the same **function**
    for each feature
- **Motivation**: To provide a **top-level** view of large input datasets, to
  identify trends or anomalies
- **Examples**:
  - Count occurrences
  - Max / min values
  - Average / median / standard deviation

### Examples

- Max PM2.5 for each location in the dataset
- Average AQI for each week
  - Average is **NOT** an associative operation, so it cannot be computed from
    partial averages
  - Change the mapper output to solve this, e.g. emit partial sums together
    with occurrence counts (sketched after the Data Filtering section)
- Number of locations with PM2.5 exceeding 150

### Writing

- Mapper:
  - Find the features in the input, like words
  - Emit a partial aggregate value for each feature
- Reducer:
  - Compute the final aggregate result
- Combiner:
  - Aggregates the partial values from the mapper **without** changing the
    outcome
  - Must be optional: the job must produce the same result whether or not the
    combiner runs; its purpose is only to save network traffic

## Inverted Index Pattern

### Intro

- Goal: to generate an index from a dataset, to allow faster searches for
  specific features
- Motivation: improve search efficiency
- Examples:
  - Find all websites that match a search term

### Writing

- Mapper
  - Find the features in the input
  - Emit (keyword, document_identifier) pairs as output
- Reducer
  - An identity function, since sorting and partitioning are already done in
    the _shuffle and sort_ step (sketched after the Data Filtering section)

## Data Filtering

### Intro

- Goal: Filter out useless records
- Motivation: to speed up computation
- Examples:
  - Distributed text search over many documents
  - Track a thread of events
  - Data cleaning
- Can be mapper-only: the _shuffle and sort_ step is not even needed (see the
  last sketch below)
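To make the average trick concrete, here is a minimal pure-Python simulation of the numerical summarization pattern; no Hadoop is involved, and a sort-and-group stands in for _shuffle and sort_. The week labels and AQI values are invented, and emitting `(sum, count)` partials is one common way to keep the combiner from changing the outcome.

```python
# Simulated numerical summarization: average AQI per week. The mapper emits
# (sum, count) partials because averaging itself is not associative; the
# combiner and reducer both just add partials, and only the reducer divides.
from itertools import groupby
from operator import itemgetter

rows = [("2024-W01", 50), ("2024-W01", 70), ("2024-W02", 90)]  # (week, AQI)

def mapper(week, aqi):
    yield week, (aqi, 1)               # partial (sum, count), not an average

def combine(partials):                 # associative merge: safe as a combiner
    return (sum(s for s, _ in partials), sum(n for _, n in partials))

def reducer(week, partials):
    total, count = combine(partials)
    yield week, total / count          # the final, non-associative step

# Simulate shuffle and sort: sort the mapper output, then group it by key.
mapped = [kv for week, aqi in rows for kv in mapper(week, aqi)]
for key, group in groupby(sorted(mapped, key=itemgetter(0)), key=itemgetter(0)):
    for week, avg in reducer(key, [v for _, v in group]):
        print(week, avg)               # 2024-W01 60.0, then 2024-W02 90.0
```

Because `combine` is associative, a framework is free to run it zero, one, or many times on each mapper's output without affecting the final division in the reducer, which is exactly the "optional combiner" requirement above.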
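In the same simulated style, a small sketch of the inverted index pattern. The documents are invented, and a dictionary of sets stands in for the grouping that _shuffle and sort_ would perform on a cluster, which is why the reducer has almost nothing left to do.

```python
# Simulated inverted index: the mapper emits (word, doc_id) pairs, and the
# reducer is close to an identity function that prints each grouped posting
# list, since grouping per word is exactly what shuffle and sort provides.
from collections import defaultdict

docs = {"d1": "cheap hotel in paris", "d2": "cheap flight", "d3": "paris flight"}

index = defaultdict(set)              # stands in for shuffle and sort
for doc_id, text in docs.items():     # mapper phase
    for word in text.split():
        index[word].add(doc_id)       # emit (word, doc_id)

for word in sorted(index):            # reducer phase: identity / de-dup
    print(word, sorted(index[word]))  # e.g. cheap ['d1', 'd2']
```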
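Finally, a sketch of mapper-only filtering. Each record is kept or dropped on its own, so no _shuffle and sort_ and no reducer are required; the `ERROR` pattern and the sample log lines are assumptions chosen for a grep-like flavor.

```python
# Simulated mapper-only filtering: a distributed-grep-style mapper that keeps
# only the records matching a pattern; nothing downstream is needed.
import re

PATTERN = re.compile(r"ERROR")

def mapper(record):
    if PATTERN.search(record):        # keep only matching records
        yield record                  # emitted as-is; no key is needed

log = ["boot ok", "ERROR disk full", "shutdown ok"]
for record in log:
    for kept in mapper(record):
        print(kept)                   # only "ERROR disk full" survives
```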
### Top Ten: a variant of filtering

- Get a small number of records according to a ranking function, e.g. the top
  10
- Focus on the most important records
- Writing
  - Mapper: emit a (null, record) pair for each record; the _null_ key is used
    so that all data go to one partition
  - Combiner: sort the values by the ranking function, emit the top k
  - Reducer: same as the combiner, but it can emit the rank integer as the key
- Performance depends on the number of elements; without a combiner, the
  performance is worse
- Requirement:
  - The ranking data of each split must fit into the memory of the mapper

### Writing

- Mapper: filter the data
- Combiner and reducer are optional, depending on the scenario

## Data Joins

### Intro

- Goal: to combine related data, and connect pieces of information
- Examples:
  - Relate purchase habits to demographics
  - Send reminders to inactive users
  - Recommendation systems
- Types of RDB joins
  - Inner join: element is in both L and R
  - Outer join: element is in L or R (full outer)
  - TODO: review relational database joins in p47
- Types of Hadoop joins (TODO: review these in p52)
  - Replication join: an outer _map-side_ join, useful when one dataset is
    small and the other is big
  - Re-partition join: a _reduce-side_ join for joining two or more datasets;
    works with two or more big datasets
  - Semi join: a _map-side_ join where one of several datasets is **filtered**
    so that it fits in memory; works with large datasets

### Replication join

- Replicate the smallest dataset to all the map hosts, using Hadoop's
  distributed cache
- Writing
  - Map:
    - Load the smallest dataset into a hash table
    - Use the join key from each input split to look up the hash table
    - Join the dataset record with the matching hash-table value
  - No reducer needed

### Re-partition join

- Process both datasets in the mapper, then emit (join key, record) pairs
- Performs the join at the reducer, among all the elements sharing a key
- This splits the load among **all** nodes
- Writing
  - Map:
    - Emit key-value pairs with the join id as the key and the other columns
      as the value
  - Reducer:
    - Joins the values for the same key

### Semi join
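A hedged sketch of the semi join as described in the join-types list above: one pass extracts the distinct join keys of the first dataset, a set that is usually small enough to fit in memory even when the dataset itself is not, and that key set is then replicated to every mapper (as in a replication join) to filter the second dataset map-side. The datasets and their layouts below are invented for illustration.

```python
# Simulated semi join between two datasets. Phase 1 would be its own
# MapReduce job on a cluster (distinct join keys); phase 2 is a map-side
# filter against the in-memory key set, so only matching records ever
# reach the real join that follows.
orders = [(1, "u7"), (2, "u9")]                        # (order_id, user_id)
users = [("u7", "Ada"), ("u8", "Max"), ("u9", "Eve")]  # (user_id, name)

# Phase 1: distinct join keys of `orders`, usually far smaller than the data.
order_keys = {user_id for _, user_id in orders}

# Phase 2: filter `users` map-side against the in-memory key set.
active_users = [u for u in users if u[0] in order_keys]
print(active_users)  # only u7 and u9 remain; a cheap join can now follow
```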