MapReduce Algorithms
MapReduce Design Patterns
Design Patterns
- Definition of design patterns: templates for solving recurring classes of problems
- Used by programmers to share proven solutions
- MapReduce (MR) patterns are tools to solve MapReduce problems
Decision Points
- Mapper's algorithm
- Mapper's output key-value pairs
- Reducer's algorithm
- Reducer's output key-value pairs
Numerical Summarization pattern
Intro
- Goal: Calculate aggregate statistical values over a large dataset
- Extract features from the dataset, and compute the same function for each feature
- Motivation: to provide a top-level view of large input datasets to identify trends or anomalies
- Examples:
- Count occurrences
- Max / min values
- Average / median / standard deviation
Examples
- Max PM2.5 for each location in the dataset
- Average AQI for each week
- Average is NOT an associative operation, so it can't be computed from partial averages
- Change what the mapper emits to solve this, e.g. emit (sum, count) pairs instead of averages
- Number of locations with PM2.5 exceeding 150
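The non-associativity of average can be seen with two splits of different sizes; the readings below are made up for illustration:

```python
# Sketch (values are hypothetical): averaging partial averages gives the
# wrong answer, but carrying (sum, count) pairs keeps the work mergeable.
part1 = [10, 20, 30]        # readings from one input split
part2 = [40, 50]            # readings from another split

# Naive: average the partial averages -> wrong when splits differ in size
avg_of_avgs = (sum(part1) / len(part1) + sum(part2) / len(part2)) / 2

# Correct: each mapper emits (sum, count); the reducer combines them
s1, c1 = sum(part1), len(part1)
s2, c2 = sum(part2), len(part2)
true_avg = (s1 + s2) / (c1 + c2)

print(avg_of_avgs)  # 32.5
print(true_avg)     # 30.0
```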
Writing
- Mapper:
- Find feature in input, like words
- Set partial aggregate value
- Reducer:
- Complete the final aggregate result
- Combiner:
- Aggregates the partial aggregate values on the mapper side without changing the outcome
- Must be optional (the framework may run it zero or more times); it exists to save network traffic
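The mapper/combiner/reducer steps above can be sketched in plain Python (no Hadoop; the splits and PM2.5 values are made up) for the "max PM2.5 per location" example, where max is associative and thus combiner-safe:

```python
from collections import defaultdict

# Sketch of the numerical summarization pattern: max PM2.5 per location.
def mapper(record):
    location, pm25 = record
    yield location, pm25          # partial aggregate value

def combiner(key, values):
    yield key, max(values)        # max is associative, so pre-aggregating is safe

def reducer(key, values):
    yield key, max(values)        # final aggregate result

def run_map_task(split):
    # simulate one map task followed by its combiner
    grouped = defaultdict(list)
    for rec in split:
        for k, v in mapper(rec):
            grouped[k].append(v)
    return {k: v for key, vals in grouped.items() for k, v in combiner(key, vals)}

split1 = [("A", 120), ("A", 180), ("B", 90)]
split2 = [("A", 150), ("B", 200)]

# shuffle: merge combiner outputs per key, then reduce
shuffled = defaultdict(list)
for out in (run_map_task(split1), run_map_task(split2)):
    for k, v in out.items():
        shuffled[k].append(v)

result = {k: v for key, vals in shuffled.items() for k, v in reducer(key, vals)}
print(result)  # {'A': 180, 'B': 200}
```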
Inverted Index Pattern
Intro
- Goal: to generate an index from the dataset, allowing faster searches for specific features
- Motivation: improve search efficiency
- Examples:
- Find all websites that match the search term
Writing:
- Mapper
- Find feature in input
- Emit <keyword, document_identifier> pairs as output
- Reducer
- Identity function, since sorting and partitioning are done in the shuffle-and-sort step
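A minimal sketch of the pattern (the document names and contents below are hypothetical): the mapper tokenizes and emits <keyword, document_identifier>, and the grouping step plays the role of shuffle-and-sort.

```python
from collections import defaultdict

# Sketch of the inverted-index pattern with made-up documents.
docs = {
    "doc1": "hadoop map reduce",
    "doc2": "map side join",
}

def mapper(doc_id, text):
    for word in text.split():
        yield word, doc_id        # keyword -> document identifier

# shuffle-and-sort groups pairs by keyword; the reducer is the identity
index = defaultdict(list)
for doc_id, text in docs.items():
    for word, d in mapper(doc_id, text):
        index[word].append(d)

print(index["map"])  # ['doc1', 'doc2']
```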
Data Filtering
Intro
- Goal: Filter out useless records
- Motivation: to speed up computation
- Examples:
- Distributed text search for many documents
- Track a thread of events
- Data cleaning
- Can be mapper-only: doesn't even need the shuffle-and-sort step
Top Ten: a variant of filtering
- Get a small number of records according to a ranking function, e.g. the top 10
- Focus on the most important records
- Writing
- Mapper: emit <null, (ranking, record)> for each record; the null key sends all data to one partition
- Combiner: sort values by ranking, emit the top k
- Reducer: same as the combiner, but can emit the rank integer as the key
- Performance depends on the number of elements; without a combiner, performance is worse
- Requirement:
- The ranking data for each split must fit into the mapper's memory
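The Top Ten steps can be sketched as follows (k = 3 here; records and scores are made up, and the per-mapper top-k step stands in for the combiner):

```python
import heapq

# Sketch of the Top Ten pattern: each record is a (ranking, record) pair.
K = 3

def map_with_combiner(records):
    # the mapper emits <null, (ranking, record)>; the combiner keeps
    # only this map task's local top k before anything hits the network
    return heapq.nlargest(K, records, key=lambda r: r[0])

def reducer(partial_tops):
    # the single reducer (null key -> one partition) merges the
    # per-mapper top-k lists into the global top k
    merged = [r for part in partial_tops for r in part]
    return heapq.nlargest(K, merged, key=lambda r: r[0])

split1 = [(7, "a"), (3, "b"), (9, "c"), (1, "d")]
split2 = [(8, "e"), (2, "f")]

top = reducer([map_with_combiner(split1), map_with_combiner(split2)])
print(top)  # [(9, 'c'), (8, 'e'), (7, 'a')]
```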
Writing
- Mapper: filter the data
- Combiner and reducer are optional, depending on the scenario
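A mapper-only filter is short enough to sketch directly; the log lines and the ERROR pattern below are hypothetical stand-ins for a distributed text search:

```python
import re

# Sketch of a mapper-only filter: keep records matching a pattern.
# No reducer, so no shuffle-and-sort step is needed.
pattern = re.compile(r"ERROR")

def mapper(lines):
    for line in lines:
        if pattern.search(line):
            yield line            # emit the matching record unchanged

log = ["INFO start", "ERROR disk full", "INFO done", "ERROR timeout"]
print(list(mapper(log)))  # ['ERROR disk full', 'ERROR timeout']
```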
Data Joins
Intro
- Goal: to combine related data and relate information across datasets
- Examples:
- Relate purchase habits to demographics
- Send reminder to inactive user
- Recommendation system
- Types of RDB joins
- Inner join: element is in both L and R
- Outer join: element is in L or R (full outer)
- TODO: review Relational Database joins in p47
- Types of Hadoop joins TODO: review these in p52
- Replication join: map-side join, useful when one dataset is small and the other is big
- Re-partition join: reduce-side join that works with two or more big datasets
- Semi join: map-side join where one of the datasets is filtered so it fits in memory; works with large datasets
Replication join
- Replicate the smallest dataset to all the map hosts, using Hadoop's distributed cache
- Writing:
- Map:
- Load the smallest dataset into a hashtable
- Use the key from each input split to look up the hashtable
- Join the dataset record with the hashtable value
- No reducer needed.
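The replication join above can be sketched in a few lines; "users" (the small, replicated dataset) and "orders" (the large one streamed through mappers) are hypothetical:

```python
# Sketch of a replication (map-side) join.
users = {                        # small dataset, loaded into a hashtable
    "u1": "alice",
    "u2": "bob",
}

def mapper(order_records):
    for user_id, amount in order_records:
        name = users.get(user_id)      # look up the join key
        if name is not None:           # inner-join semantics
            yield user_id, name, amount

orders = [("u1", 10), ("u2", 25), ("u3", 5)]
print(list(mapper(orders)))  # [('u1', 'alice', 10), ('u2', 'bob', 25)]
```

Because each mapper holds the whole small dataset, the join completes on the map side and no reducer (or shuffle) is needed.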
Re-partition join
- Process both datasets in the mapper, then emit <join_id, value> pairs keyed on the join key
- Performs the join at the reducer, among all elements sharing a key
- This splits the load among all nodes
- Writing
- Map:
- Emit key-value pairs with join_id as the key, and the other columns as the value
- Reducer:
- Joins the values for the same key
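A sketch of the re-partition join with two made-up datasets; each mapper tags its records so the reducer can tell the two sides apart before taking their cross product per key:

```python
from collections import defaultdict

# Sketch of a re-partition (reduce-side) join.
left = [("u1", "alice"), ("u2", "bob")]          # e.g. users
right = [("u1", 10), ("u2", 25), ("u1", 7)]      # e.g. orders

def mapper(dataset, tag):
    for join_id, value in dataset:
        yield join_id, (tag, value)              # tag records by side

# shuffle: group all tagged values by join_id
groups = defaultdict(list)
for join_id, tagged in list(mapper(left, "L")) + list(mapper(right, "R")):
    groups[join_id].append(tagged)

def reducer(join_id, tagged):
    lefts = [v for t, v in tagged if t == "L"]
    rights = [v for t, v in tagged if t == "R"]
    for l in lefts:                              # cross product of both sides
        for r in rights:
            yield join_id, l, r

joined = [row for k, vals in groups.items() for row in reducer(k, vals)]
print(joined)  # [('u1', 'alice', 10), ('u1', 'alice', 7), ('u2', 'bob', 25)]
```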