# Map Reduce Algorithms
## Map Reduce Design Patterns
### Design Patterns
- Definition of design patterns: templates for solving specific problems
- Used by programmers
- MapReduce patterns are tools for solving MapReduce problems
### Decision Points
- Mapper's algorithm
- Mapper's output key-value pairs
- Reducer's algorithm
- Reducer's output key-value pairs
## Numerical Summarization Pattern
### Intro
- **Goal**: Calculate aggregate statistical values over a large dataset
- Extract **features** from the dataset, and compute the same **function** for each feature
- **Motivation**: To provide a **top-level** view of large input datasets, to identify trends or anomalies
- **Examples**:
- Count occurrences
- Max / min values
- Average / median / standard deviation
### Examples
- Max PM2.5 for each location in the dataset
- Average AQI for each week
- Average is **NOT** an associative operation, so it can't be computed by merging partial averages
- Change the mapper output to work around this, e.g. emit partial sums together with occurrence counts (see the sketch after this list)
- Number of locations with PM2.5 exceeding 150
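
A minimal sketch of the sum-and-count workaround, assuming Hadoop's Java MapReduce API and a hypothetical input layout of `week,aqi` text lines; class and field names are illustrative, not from the source.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Assumed input: text lines of the form "week,aqi" (hypothetical layout).
public class AverageAqi {

    // Mapper: emit <week, "sum \t count"> so partial results stay combinable.
    public static class AvgMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(",");
            if (f.length < 2) return;                       // drop malformed lines
            ctx.write(new Text(f[0]), new Text(f[1] + "\t1"));
        }
    }

    // Combiner: add up partial sums and counts, but do NOT divide yet.
    public static class AvgCombiner extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> vals, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0; long cnt = 0;
            for (Text v : vals) {
                String[] p = v.toString().split("\t");
                sum += Double.parseDouble(p[0]);
                cnt += Long.parseLong(p[1]);
            }
            ctx.write(key, new Text(sum + "\t" + cnt));
        }
    }

    // Reducer: only here is the division done, so running the combiner
    // any number of times cannot change the final average.
    public static class AvgReducer extends Reducer<Text, Text, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> vals, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0; long cnt = 0;
            for (Text v : vals) {
                String[] p = v.toString().split("\t");
                sum += Double.parseDouble(p[0]);
                cnt += Long.parseLong(p[1]);
            }
            ctx.write(key, new DoubleWritable(sum / cnt));
        }
    }
}
```

Sums and counts are both associative, so the combiner can run per map task (or not at all) without affecting the result.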
### Writing
- Mapper:
- Find the feature in the input, e.g. words
- Emit a partial aggregate value
- Reducer:
- Complete the final aggregate result
- Combiner:
- Aggregates the partial values on the mapper side **without** changing the final outcome
- Must be optional (the framework may run it zero or more times); used to save network traffic (see the max-value sketch below)
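
For contrast, a sketch of the max-PM2.5 example under the same assumptions (hypothetical `location,pm25` lines): max is associative, so the very same reducer class can be registered as the combiner without changing the outcome.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Assumed input: text lines of the form "location,pm25" (hypothetical layout).
public class MaxPm25 {

    // Mapper: the feature is the location, the partial aggregate is one reading.
    public static class MaxMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(",");
            if (f.length < 2) return;                       // drop malformed lines
            ctx.write(new Text(f[0]), new DoubleWritable(Double.parseDouble(f[1])));
        }
    }

    // Reducer: keep the maximum. Because max is associative, this same class
    // can also be registered as the combiner:
    //   job.setCombinerClass(MaxReducer.class);
    //   job.setReducerClass(MaxReducer.class);
    public static class MaxReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> vals, Context ctx)
                throws IOException, InterruptedException {
            double max = Double.NEGATIVE_INFINITY;
            for (DoubleWritable v : vals) max = Math.max(max, v.get());
            ctx.write(key, new DoubleWritable(max));
        }
    }
}
```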
## Inverted Index Pattern
### Intro
- Goal: generate an index from the dataset, to allow faster searches for specific features
- Motivation: improve search efficiency
- Examples:
- Find all websites that match the search term
### Writing
- Mapper
- Find feature in input
- Emit <keyword, document_identifier> pairs as output
- Reducer
- Identity function (or simply concatenate the document identifiers for each keyword), since grouping and sorting are already done in the _shuffle and sort_ step (see the sketch below)
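
A sketch of the pattern, assuming a hypothetical input layout where each line is `doc_id<TAB>document text`; the reducer shown folds the identifiers into a posting list, but an identity reducer works too, as noted above.

```java
import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Assumed input: lines of "doc_id \t document text" (hypothetical layout).
public class InvertedIndex {

    // Mapper: emit <keyword, doc_id> for every word in the document.
    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) return;                   // drop malformed lines
            Text docId = new Text(parts[0]);
            for (String word : parts[1].toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) ctx.write(new Text(word), docId);
            }
        }
    }

    // Reducer: shuffle-and-sort already grouped doc ids by keyword; here we
    // deduplicate and concatenate them into a posting list.
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> docIds, Context ctx)
                throws IOException, InterruptedException {
            Set<String> ids = new LinkedHashSet<>();
            for (Text id : docIds) ids.add(id.toString());
            ctx.write(word, new Text(String.join(",", ids)));
        }
    }
}
```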
## Data Filtering
### Intro
- Goal: Filter out useless records
- Motivation: to speed up computation
- Examples:
- Distributed text search for many documents
- Track a thread of events
- Data cleaning
- Can be map-only: the _shuffle and sort_ step is not even needed (see the sketch below)
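
A map-only sketch of a distributed text search; the search term is assumed to arrive through the job configuration under a hypothetical key `filter.term`.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only grep-style filter: only lines containing the search term survive.
// The term is read from the job configuration under the (hypothetical)
// key "filter.term".
public class GrepMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private String term;

    @Override
    protected void setup(Context ctx) {
        term = ctx.getConfiguration().get("filter.term", "");
    }

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        if (value.toString().contains(term)) {
            ctx.write(NullWritable.get(), value);   // keep matching records only
        }
    }
}
```

With `job.setNumReduceTasks(0)` the map output is written straight to the output directory, so no shuffle and sort takes place.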
### Top Ten: a variant of filtering
- Keep a small number of records according to a ranking function, e.g. the top 10
- Focus on the most important records
- Writing
- Mapper: emit <null, (ranking, record)> for each record; _null_ is used so that all data goes to a single partition (one reducer)
- Combiner: sort values by ranking and emit the top k
- Reducer: same as the combiner, but can emit the rank integer as the key (see the sketch after this list)
- Performance depends on the number of elements; without a combiner, performance is worse
- Requirement:
- The ranking data from each split must fit into the mapper's memory
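
A sketch of this top-k flow, assuming each input line is a hypothetical `score<TAB>record` pair and k = 10; the driver would set the map output classes to (NullWritable, Text) and the final output classes to (IntWritable, Text).

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Assumed input: text lines of the form "score \t record" (hypothetical layout).
public class TopTen {
    private static final int K = 10;

    // Min-heap ordered by score: the lowest-ranked of the current top K is evicted first.
    private static PriorityQueue<String> newHeap() {
        return new PriorityQueue<>(Comparator.comparingDouble(
                (String s) -> Double.parseDouble(s.split("\t", 2)[0])));
    }

    // Mapper: every record goes to the single null key, tagged with its ranking.
    public static class TopMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(NullWritable.get(), value);   // value already holds "score \t record"
        }
    }

    // Combiner: trim each map task's output to its local top K before the shuffle.
    public static class TopCombiner extends Reducer<NullWritable, Text, NullWritable, Text> {
        @Override
        protected void reduce(NullWritable key, Iterable<Text> vals, Context ctx)
                throws IOException, InterruptedException {
            PriorityQueue<String> heap = newHeap();
            for (Text v : vals) {
                heap.add(v.toString());
                if (heap.size() > K) heap.poll();   // drop the lowest-ranked record
            }
            for (String s : heap) ctx.write(NullWritable.get(), new Text(s));
        }
    }

    // Reducer: same trimming over the single partition, then emit <rank, record>.
    public static class TopReducer extends Reducer<NullWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(NullWritable key, Iterable<Text> vals, Context ctx)
                throws IOException, InterruptedException {
            PriorityQueue<String> heap = newHeap();
            for (Text v : vals) {
                heap.add(v.toString());
                if (heap.size() > K) heap.poll();
            }
            String[] top = heap.toArray(new String[0]);
            Arrays.sort(top, Comparator.comparingDouble(
                    (String s) -> Double.parseDouble(s.split("\t", 2)[0])).reversed());
            for (int i = 0; i < top.length; i++) {
                ctx.write(new IntWritable(i + 1), new Text(top[i]));
            }
        }
    }
}
```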
### Writing
- Mapper: filters the data
- Combiner and reducer are optional, depending on the scenario
## Data Joins
### Intro
- Goal: combine related data from different datasets and relate the information
- Examples:
- Relate purchase habits to demographics
- Send reminders to inactive users
- Recommendation system
- Types of RDB joins
- Inner join: element appears in both L and R
- Outer join: element appears in L or R (full outer)
- TODO: review Relational Database joins in p47
- Types of Hadoop joins TODO: review these in p52
- Replication join: a _map-side_ join, useful when one dataset is small and the other is big
- Re-partition join: a _reduce-side_ join for joining two or more datasets; works with two or more big datasets
- Semi join: a _map-side_ join where one of the datasets is **filtered** so that it fits in memory; works with large datasets
### Replication join
- Replicate the smallest dataset to all the map hosts, using Hadoop's distributed cache
- Writing:
- Map:
- Load the smallest dataset into a hashtable
- Use the key from each input split to look up the hashtable
- Join the input record with the matching hashtable value
- No reducer needed (see the sketch below).
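
A sketch of the map side, assuming hypothetical layouts of `user_id,demographics` for the small cached file and `user_id,purchase` for the large input; the driver would ship the small file with `job.addCacheFile(...)` and run with zero reducers.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side replication join. Assumed layouts (hypothetical):
//   small cached file : "user_id,demographics"
//   large input split : "user_id,purchase"
public class ReplicationJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> users = new HashMap<>();

    @Override
    protected void setup(Context ctx) throws IOException {
        URI[] cacheFiles = ctx.getCacheFiles();
        if (cacheFiles == null || cacheFiles.length == 0) return;
        // The cached file is localized into the task's working directory
        // under its base name, so it can be read as a local file.
        String localName = new Path(cacheFiles[0].getPath()).getName();
        try (BufferedReader br = new BufferedReader(new FileReader(localName))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] f = line.split(",", 2);
                if (f.length == 2) users.put(f[0], f[1]);   // user_id -> demographics
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(",", 2);
        if (f.length < 2) return;
        String demographics = users.get(f[0]);
        if (demographics != null) {                         // inner join: drop unmatched rows
            ctx.write(new Text(f[0]), new Text(f[1] + "," + demographics));
        }
    }
}
// No reducer: run with job.setNumReduceTasks(0).
```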
### Re-partition join
- Process both datasets in the mapper, then emit <join_id, value> pairs keyed on the join key
- Perform the join at the reducer, among all the elements that share a join_id
- This splits the load among **all** nodes
- Writing
- Map:
- Emit key-value pairs with join_id as the key and the other columns as the value
- Reducer:
- Join the values that share the same key (see the sketch below)
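
A sketch of both sides, assuming two hypothetical inputs `users.csv` (`user_id,name`) and `orders.csv` (`user_id,order`) that are both fed to the same mapper; the input file name is used to tag each record with its source.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Reduce-side re-partition join. Assumed inputs (hypothetical):
//   users.csv  : "user_id,name"
//   orders.csv : "user_id,order"
public class RepartitionJoin {

    public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
            String tag = file.startsWith("users") ? "U" : "O";  // tag the source dataset
            String[] f = value.toString().split(",", 2);
            if (f.length < 2) return;
            // join_id is the key; the value carries the source tag plus the other columns.
            ctx.write(new Text(f[0]), new Text(tag + "\t" + f[1]));
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text joinId, Iterable<Text> vals, Context ctx)
                throws IOException, InterruptedException {
            List<String> users = new ArrayList<>();
            List<String> orders = new ArrayList<>();
            for (Text v : vals) {
                String[] p = v.toString().split("\t", 2);
                if (p[0].equals("U")) users.add(p[1]); else orders.add(p[1]);
            }
            // Inner join: emit the cross product of the two sides for this key.
            for (String u : users)
                for (String o : orders)
                    ctx.write(joinId, new Text(u + "," + o));
        }
    }
}
```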
### Semi join