MapReduce Algorithms
MapReduce Design Patterns
Design Patterns
- Definition of design patterns: templates for solving recurring classes of problems
- Used by programmers to share proven solutions
- MapReduce (MR) patterns are tools to solve MapReduce problems
Decision Points
- Mapper's algorithm
- Mapper's output key-value pairs
- Reducer's algorithm
- Reducer's output key-value pairs
Numerical Summarization pattern
Intro
- Goal: Calculate aggregate statistical values over a large dataset
- Extract features from the dataset, and compute the same function for each feature
- Motivation: to provide a top-level view of large input datasets to identify trends or anomalies
- Examples:
- Count occurrences
- Max / min values
- Average / median / standard deviation
Examples
- Max PM2.5 for each location in the dataset
- Average AQI for each week
- Average is NOT an associative operation, so it can't be computed from partial averages
- Change what the mapper emits to solve this, e.g. emit (sum, count) pairs instead of averages
- Number of locations with PM2.5 exceeding 150
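The non-associativity of average can be seen with two splits of different sizes; the readings below are made up for illustration:

```python
# Sketch (values are hypothetical): averaging partial averages gives the
# wrong answer, but carrying (sum, count) pairs keeps the work mergeable.
part1 = [10, 20, 30]        # readings from one input split
part2 = [40, 50]            # readings from another split

# Naive: average the partial averages -> wrong when splits differ in size
avg_of_avgs = (sum(part1) / len(part1) + sum(part2) / len(part2)) / 2

# Correct: each mapper emits (sum, count); the reducer combines them
s1, c1 = sum(part1), len(part1)
s2, c2 = sum(part2), len(part2)
true_avg = (s1 + s2) / (c1 + c2)

print(avg_of_avgs)  # 32.5
print(true_avg)     # 30.0
```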
Writing
- Mapper:
- Find feature in input, like words
- Set partial aggregate value
- Reducer:
- Complete the final aggregate result
- Combiner:
- Aggregates the partial aggregate values on the mapper side without changing the outcome
- Must be optional (the framework may run it zero or more times); it exists to save network traffic
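The mapper/combiner/reducer steps above can be sketched in plain Python (no Hadoop; the splits and PM2.5 values are made up) for the "max PM2.5 per location" example, where max is associative and thus combiner-safe:

```python
from collections import defaultdict

# Sketch of the numerical summarization pattern: max PM2.5 per location.
def mapper(record):
    location, pm25 = record
    yield location, pm25          # partial aggregate value

def combiner(key, values):
    yield key, max(values)        # max is associative, so pre-aggregating is safe

def reducer(key, values):
    yield key, max(values)        # final aggregate result

def run_map_task(split):
    # simulate one map task followed by its combiner
    grouped = defaultdict(list)
    for rec in split:
        for k, v in mapper(rec):
            grouped[k].append(v)
    return {k: v for key, vals in grouped.items() for k, v in combiner(key, vals)}

split1 = [("A", 120), ("A", 180), ("B", 90)]
split2 = [("A", 150), ("B", 200)]

# shuffle: merge combiner outputs per key, then reduce
shuffled = defaultdict(list)
for out in (run_map_task(split1), run_map_task(split2)):
    for k, v in out.items():
        shuffled[k].append(v)

result = {k: v for key, vals in shuffled.items() for k, v in reducer(key, vals)}
print(result)  # {'A': 180, 'B': 200}
```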
Inverted Index Pattern
Intro
- Goal: to generate an index from the dataset, allowing faster searches for specific features
- Motivation: improve search efficiency
- Examples:
- Find all websites that match the search term
Writing:
- Mapper
- Find feature in input
- Emit <keyword, document_identifier> pairs as output
- Reducer
- Identity function, since sorting and partitioning are done in the shuffle-and-sort step
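A minimal sketch of the pattern (the document names and contents below are hypothetical): the mapper tokenizes and emits <keyword, document_identifier>, and the grouping step plays the role of shuffle-and-sort.

```python
from collections import defaultdict

# Sketch of the inverted-index pattern with made-up documents.
docs = {
    "doc1": "hadoop map reduce",
    "doc2": "map side join",
}

def mapper(doc_id, text):
    for word in text.split():
        yield word, doc_id        # keyword -> document identifier

# shuffle-and-sort groups pairs by keyword; the reducer is the identity
index = defaultdict(list)
for doc_id, text in docs.items():
    for word, d in mapper(doc_id, text):
        index[word].append(d)

print(index["map"])  # ['doc1', 'doc2']
```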
Data Filtering
Intro
- Goal: Filter out useless records
- Motivation: to speed up computation
- Examples:
- Distributed text search for many documents
- Track a thread of events
- Data cleaning
- Can be mapper-only: doesn't even need the shuffle-and-sort step
Top Ten: a variant of filtering
- Get a small number of records according to a ranking function, e.g. the top 10
- Focus on the most important records
- Writing
- Mapper: emit <null, (ranking, record)> for each record; the null key sends all data to one partition
- Combiner: sort values by ranking, emit the top k
- Reducer: same as the combiner, but can emit the rank integer as the key
- Performance depends on the number of elements; without a combiner, performance is worse
- Requirement:
- The ranking data for each split must fit into the mapper's memory
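The Top Ten steps can be sketched as follows (k = 3 here; records and scores are made up, and the per-mapper top-k step stands in for the combiner):

```python
import heapq

# Sketch of the Top Ten pattern: each record is a (ranking, record) pair.
K = 3

def map_with_combiner(records):
    # the mapper emits <null, (ranking, record)>; the combiner keeps
    # only this map task's local top k before anything hits the network
    return heapq.nlargest(K, records, key=lambda r: r[0])

def reducer(partial_tops):
    # the single reducer (null key -> one partition) merges the
    # per-mapper top-k lists into the global top k
    merged = [r for part in partial_tops for r in part]
    return heapq.nlargest(K, merged, key=lambda r: r[0])

split1 = [(7, "a"), (3, "b"), (9, "c"), (1, "d")]
split2 = [(8, "e"), (2, "f")]

top = reducer([map_with_combiner(split1), map_with_combiner(split2)])
print(top)  # [(9, 'c'), (8, 'e'), (7, 'a')]
```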
Writing
- Mapper: filter the data
- Combiner and reducer are optional, depending on the scenario
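A mapper-only filter is short enough to sketch directly; the log lines and the ERROR pattern below are hypothetical stand-ins for a distributed text search:

```python
import re

# Sketch of a mapper-only filter: keep records matching a pattern.
# No reducer, so no shuffle-and-sort step is needed.
pattern = re.compile(r"ERROR")

def mapper(lines):
    for line in lines:
        if pattern.search(line):
            yield line            # emit the matching record unchanged

log = ["INFO start", "ERROR disk full", "INFO done", "ERROR timeout"]
print(list(mapper(log)))  # ['ERROR disk full', 'ERROR timeout']
```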
Data Joins
Intro
- Goal: to combine related data and relate information across datasets
- Examples:
- Relate purchase habits to demographics
- Send reminder to inactive user
- Recommendation system
- Types of RDB joins
- Inner join: element is in both L and R
- Outer join: element is in L or R (full outer)
- TODO: review Relational Database joins in p47
- Types of Hadoop joins TODO: review these in p52
- Replication join: map-side join, useful when one dataset is small and the other is big
- Re-partition join: reduce-side join that works with two or more big datasets
- Semi join: map-side join where one of the datasets is filtered so it fits in memory; works with large datasets
Replication join
- Replicate the smallest dataset to all the map hosts, using Hadoop's distributed cache
- Writing:
- Map:
- Load the smallest dataset into a hashtable
- Use the key from each input split to look up the hashtable
- Join the dataset record with the hashtable value
- No reducer needed.
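The replication join above can be sketched in a few lines; "users" (the small, replicated dataset) and "orders" (the large one streamed through mappers) are hypothetical:

```python
# Sketch of a replication (map-side) join.
users = {                        # small dataset, loaded into a hashtable
    "u1": "alice",
    "u2": "bob",
}

def mapper(order_records):
    for user_id, amount in order_records:
        name = users.get(user_id)      # look up the join key
        if name is not None:           # inner-join semantics
            yield user_id, name, amount

orders = [("u1", 10), ("u2", 25), ("u3", 5)]
print(list(mapper(orders)))  # [('u1', 'alice', 10), ('u2', 'bob', 25)]
```

Because each mapper holds the whole small dataset, the join completes on the map side and no reducer (or shuffle) is needed.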
Re-partition join
- Process both datasets in the mapper, then emit <join_id, value> pairs keyed on the join key
- Performs the join at the reducer, among all elements sharing a key
- This splits the load among all nodes
- Writing
- Map:
- Emit key-value pairs with join_id as the key, and the other columns as the value
- Reducer:
- Joins the values for the same key
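A sketch of the re-partition join with two made-up datasets; each mapper tags its records so the reducer can tell the two sides apart before taking their cross product per key:

```python
from collections import defaultdict

# Sketch of a re-partition (reduce-side) join.
left = [("u1", "alice"), ("u2", "bob")]          # e.g. users
right = [("u1", 10), ("u2", 25), ("u1", 7)]      # e.g. orders

def mapper(dataset, tag):
    for join_id, value in dataset:
        yield join_id, (tag, value)              # tag records by side

# shuffle: group all tagged values by join_id
groups = defaultdict(list)
for join_id, tagged in list(mapper(left, "L")) + list(mapper(right, "R")):
    groups[join_id].append(tagged)

def reducer(join_id, tagged):
    lefts = [v for t, v in tagged if t == "L"]
    rights = [v for t, v in tagged if t == "R"]
    for l in lefts:                              # cross product of both sides
        for r in rights:
            yield join_id, l, r

joined = [row for k, vals in groups.items() for row in reducer(k, vals)]
print(joined)  # [('u1', 'alice', 10), ('u1', 'alice', 7), ('u2', 'bob', 25)]
```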