# MapReduce Algorithms

## MapReduce Design Patterns

### Design Patterns

- Definition of design patterns: a template for solving a specific, recurring problem
- Used by programmers
- MapReduce patterns are tools for solving common MapReduce problems

### Decision Points

- Mapper's algorithm
- Mapper's output key-value pairs
- Reducer's algorithm
- Reducer's output key-value pairs
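
A minimal sketch of how these four decision points show up in code, as a pure-Python simulation of the framework rather than the Hadoop API (word count is used here as a stand-in problem; all names are illustrative):

```python
from itertools import groupby
from operator import itemgetter

# Decisions 1 & 2: the mapper's algorithm and its output key-value pairs.
def mapper(record):
    # word count as a stand-in problem: emit (word, 1) for every word
    for word in record.split():
        yield (word, 1)

# Decisions 3 & 4: the reducer's algorithm and its output key-value pairs.
def reducer(key, values):
    yield (key, sum(values))

def run_job(records):
    pairs = [kv for r in records for kv in mapper(r)]      # map phase
    pairs.sort(key=itemgetter(0))                          # shuffle and sort
    out = []
    for key, group in groupby(pairs, key=itemgetter(0)):   # reduce phase
        out.extend(reducer(key, [v for _, v in group]))
    return out

print(run_job(["big data", "big ideas"]))  # [('big', 2), ('data', 1), ('ideas', 1)]
```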
## Numerical Summarization Pattern

### Intro

- **Goal**: Calculate aggregate statistical values over a large dataset
- Extract **features** from the dataset, and compute the same **function** for each feature
- **Motivation**: To provide a **top-level** view of large input datasets, to identify trends or anomalies
- **Examples**:
  - Count occurrences
  - Max / min values
  - Average / median / standard deviation

### Examples

- Max PM2.5 for each location in the dataset
- Average AQI for each week
  - Average is **NOT** an associative operation, so it can't be computed in partial steps
  - Change the mapper output to solve this, e.g. emit a (sum, count) pair that can be merged safely (see the sketch after this list)
- Number of locations with PM2.5 exceeding 150
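
Averages of averages are wrong when group sizes differ, so the usual fix is for the mapper (and combiner) to emit a mergeable (sum, count) pair and perform the division only in the reducer. A minimal sketch in pure Python, with an assumed (week, aqi) record format:

```python
def mapper(record):
    # record format assumed: (week, aqi)
    week, aqi = record
    yield (week, (aqi, 1))  # (partial sum, count) is mergeable, unlike an average

def combiner(key, values):
    # safe to run on partial data: merging (sum, count) pairs is associative
    total = sum(s for s, _ in values)
    count = sum(c for _, c in values)
    yield (key, (total, count))

def reducer(key, values):
    total = sum(s for s, _ in values)
    count = sum(c for _, c in values)
    yield (key, total / count)  # divide only once, at the very end

# e.g. one mapper's partial output for week "w1" merges correctly with
# another mapper's single record:
partial = list(combiner("w1", [(50, 1), (70, 1)]))              # [('w1', (120, 2))]
print(list(reducer("w1", [kv for _, kv in partial] + [(90, 1)])))  # [('w1', 70.0)]
```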
### Writing

- Mapper:
  - Find the feature in the input, e.g. words
  - Emit a partial aggregate value
- Reducer:
  - Compute the final aggregate result
- Combiner:
  - Aggregates the partial values from the mappers **without** changing the outcome (see the sketch below)
  - Must be optional (the framework may run it zero or more times); its purpose is to save network traffic
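
A sketch of the pattern for the max-PM2.5-per-location example above (pure Python; the input line format is an assumption). Because max is associative, the combiner can reuse the reducer's logic without changing the outcome:

```python
from collections import defaultdict

def mapper(line):
    # input line format assumed: "location,pm25"
    location, pm25 = line.split(",")
    yield (location, float(pm25))  # partial aggregate value

def combine_or_reduce(pairs):
    # max is associative, so the same aggregation is safe to run on each
    # mapper's partial output (combiner) and on the grouped values (reducer)
    best = defaultdict(lambda: float("-inf"))
    for loc, value in pairs:
        best[loc] = max(best[loc], value)
    return list(best.items())

lines = ["beijing,120", "delhi,180", "beijing,95", "delhi,210"]
mapped = [kv for line in lines for kv in mapper(line)]
print(combine_or_reduce(mapped))  # [('beijing', 120.0), ('delhi', 210.0)]
```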
## Inverted Index Pattern

### Intro

- Goal: to generate an index from a dataset, to allow faster searches for specific features
- Motivation: improve search efficiency
- Examples:
  - Find all websites that match a search term

### Writing

- Mapper:
  - Find the feature in the input
  - Emit keyword-document_identifier pairs as output
- Reducer:
  - Identity function, since sorting and partitioning are done in the _shuffle and sort_ step (see the sketch below)
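
A sketch of the pattern (pure-Python simulation; the document ids and tokenization are illustrative). The reducer just materializes the grouped values, since _shuffle and sort_ already brings together all document ids for each keyword:

```python
from collections import defaultdict

def mapper(doc_id, text):
    for keyword in set(text.lower().split()):  # one posting per document
        yield (keyword, doc_id)

def shuffle_and_sort(pairs):
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return groups.items()

def reducer(keyword, doc_ids):
    # identity: the framework already grouped the doc ids by keyword
    return (keyword, doc_ids)

docs = {1: "big data systems", 2: "big ideas"}
pairs = [kv for doc_id, text in docs.items() for kv in mapper(doc_id, text)]
for keyword, doc_ids in shuffle_and_sort(pairs):
    print(reducer(keyword, doc_ids))
# ('big', [1, 2]) ('data', [1]) ('ideas', [2]) ('systems', [1])
```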
## Data Filtering

### Intro

- Goal: Filter out useless records
- Motivation: to speed up computation
- Examples:
  - Distributed text search over many documents
  - Tracking a thread of events
  - Data cleaning
- Can be mapper-only: doesn't even need the _shuffle and sort_ step
### Top Ten: a variant of filtering

- Get a small number of records, relative to a ranking function, like the top 10
- Focus on the most important records
- Writing (see the sketch after this list):
  - Mapper: emit <null, (ranking, record)> for each record; _null_ is used so that all data goes to one partition
  - Combiner: sort values by ranking, emit the top k
  - Reducer: same as the combiner, but can emit the rank integer as the key
- Performance depends on the number of elements; without a combiner, performance is worse
- Requirement:
  - Each split's ranking data must fit into the mapper's memory
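
A sketch of the top-k flow (pure Python; `K` and the record format are assumptions, and the mapper below folds the combiner's top-k trimming into the map step for brevity). The single reducer then only ever sees k records per input split:

```python
import heapq

K = 3

def mapper(records):
    # emit (None, (ranking, record)) so everything lands in one partition;
    # keep only the local top K to bound what is sent over the network
    top = heapq.nlargest(K, records, key=lambda r: r[0])
    for item in top:
        yield (None, item)

def reducer(values):
    # same logic as the combiner, but emits the rank as the key
    top = heapq.nlargest(K, values, key=lambda r: r[0])
    for rank, (score, record) in enumerate(top, start=1):
        yield (rank, record)

split_a = [(10, "a"), (50, "b"), (20, "c"), (70, "d")]
split_b = [(60, "e"), (5, "f")]
shuffled = [v for split in (split_a, split_b) for _, v in mapper(split)]
print(list(reducer(shuffled)))  # [(1, 'd'), (2, 'e'), (3, 'b')]
```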
### Writing

- Mapper: filter the data (see the sketch below)
- Combiner and reducer are optional, depending on the scenario
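
A sketch of a mapper-only filter in the spirit of distributed grep (pure Python; the pattern and log lines are illustrative). With no reducer, each mapper's output is the job's final output and no shuffle is needed:

```python
import re

PATTERN = re.compile(r"error", re.IGNORECASE)

def mapper(line):
    # emit the record unchanged iff it passes the filter; no reducer needed
    if PATTERN.search(line):
        yield line

log = ["INFO boot ok", "ERROR disk full", "WARN slow", "error: retrying"]
print([out for line in log for out in mapper(line)])
# ['ERROR disk full', 'error: retrying']
```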
## Data Joins

### Intro

- Goal: to combine related data together, and relate information
- Examples:
  - Relate purchase habits to demographics
  - Send reminders to inactive users
  - Recommendation systems
- Types of RDB joins:
  - Inner join: element is in both L and R
  - Outer join: element is in L or R (full outer)
  - TODO: review relational database joins in p47
- Types of Hadoop joins (TODO: review these in p52):
  - Replication join: outer _map-side_ join, useful when one dataset is small and the other is big
  - Re-partition join: _reduce-side_ join for joining two or more datasets; works with two or more big datasets
  - Semi join: _map-side_ join where one of several datasets is **filtered** so that it fits in memory; works with large datasets
### Replication join

- Replicate the smallest dataset to all the map hosts, using Hadoop's distributed cache
- Writing (see the sketch below):
  - Map:
    - Load the smallest dataset into a hashtable
    - Use the key from each input split to look up the hashtable
    - Join the dataset record with the matching hashtable value
  - No reducer needed
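
A sketch of the map-side flow (pure Python; the table layouts are illustrative, and the small `users` dict stands in for the dataset shipped to every mapper via the distributed cache):

```python
# small dataset, replicated to every map host (via the distributed cache in Hadoop)
users = {101: "alice", 102: "bob"}

def setup():
    # each mapper loads the replicated dataset into an in-memory hashtable
    return dict(users)

def mapper(purchase, lookup):
    user_id, item = purchase
    user = lookup.get(user_id)   # hashtable lookup on the join key
    if user is not None:         # inner semantics; emit unmatched rows too for an outer join
        yield (user_id, (user, item))

lookup = setup()
purchases = [(101, "laptop"), (102, "phone"), (999, "tablet")]
print([kv for p in purchases for kv in mapper(p, lookup)])
# [(101, ('alice', 'laptop')), (102, ('bob', 'phone'))] -- no reducer needed
```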
### Re-partition join

- Process both datasets in the mapper, then emit <join_id, value> pairs keyed on the join attribute
- Performs the join at the reducer, among all the elements that share a join_id
- This splits the load among **all** nodes
- Writing (see the sketch below):
  - Map:
    - Emit key-value pairs with join_id as the key, and the other columns as the value
  - Reducer:
    - Joins the values that arrive with the same key
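
A sketch of the reduce-side flow (pure Python; the tag names and record layouts are illustrative). Each mapper tags its records with the source dataset, and the reducer pairs up the values that share a join_id:

```python
from collections import defaultdict
from itertools import product

def map_users(rows):       # rows like (user_id, name)
    for user_id, name in rows:
        yield (user_id, ("U", name))

def map_purchases(rows):   # rows like (user_id, item)
    for user_id, item in rows:
        yield (user_id, ("P", item))

def reducer(join_id, tagged_values):
    # pair every user value with every purchase value for this key
    users = [v for tag, v in tagged_values if tag == "U"]
    purchases = [v for tag, v in tagged_values if tag == "P"]
    for name, item in product(users, purchases):
        yield (join_id, name, item)

pairs = list(map_users([(1, "alice"), (2, "bob")])) + \
        list(map_purchases([(1, "laptop"), (1, "phone")]))
groups = defaultdict(list)
for key, value in pairs:               # stands in for shuffle and sort
    groups[key].append(value)
print([row for k in sorted(groups) for row in reducer(k, groups[k])])
# [(1, 'alice', 'laptop'), (1, 'alice', 'phone')]
```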
### Semi join
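
Based on the definition in the Intro above, a minimal sketch of the idea (pure Python; names are illustrative): a first pass shrinks one dataset to its distinct join keys so that they fit in memory, then a map-side pass filters the other dataset against that key set:

```python
# step 1: a first pass extracts the distinct join keys from one dataset,
# shrinking it to a set small enough to fit in memory
def extract_keys(rows):            # rows like (user_id, ...)
    return {row[0] for row in rows}

# step 2: ship the key set to every mapper and filter the other dataset
def mapper(row, key_set):
    if row[0] in key_set:          # keep only rows that will find a join partner
        yield row

active_users = [(1, "alice"), (3, "carol")]
purchases = [(1, "laptop"), (2, "phone"), (3, "tablet")]
keys = extract_keys(active_users)
print([r for p in purchases for r in mapper(p, keys)])
# [(1, 'laptop'), (3, 'tablet')] -- now small enough for a replication join
```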