EBU6502_cloud_computing_notes/3-3-map-reduce-algorithms.md
2024-12-30 16:27:46 +08:00

Map Reduce Algorithms

Map Reduce Design Patterns

Design Patterns

  • Definition of design patterns: a template for solving a specific, recurring problem
  • Used by programmers
  • MapReduce patterns are tools for solving MapReduce problems

Decision Points

  • Mapper's algorithm
  • Mapper's output key-value pairs
  • Reducer's algorithm
  • Reducer's output key-value pairs

Numerical Summarization pattern

Intro

  • Goal: calculate aggregate statistical values over a large dataset
    • Extract features from the dataset, and compute the same function for each feature
  • Motivation: to provide a top-level view of large input datasets, to identify trends or anomalies
  • Examples:
    • Count occurrences
    • Max / min values
    • Average / median / standard deviation

Examples

  • Max PM2.5 for each location in the dataset
  • Average AQI for each week
    • Average is NOT an associative operation, so it can't be computed from partial averages
    • Change the mapper output to solve this, e.g. emit (sum, count) pairs that can be merged
  • Number of locations with PM2.5 exceeding 150
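The average workaround can be sketched in plain Python (a minimal simulation of the MapReduce flow; the weekly AQI records below are made up for illustration):

```python
from collections import defaultdict

def mapper(record):
    # record is a (week, aqi) pair; emit a mergeable (sum, count) pair
    week, aqi = record
    yield week, (aqi, 1)

def reducer(week, pairs):
    # (sum, count) pairs merge associatively, unlike raw averages
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return week, total / count

records = [("w1", 100), ("w1", 50), ("w2", 80)]
# simulate the shuffle-and-sort step
groups = defaultdict(list)
for r in records:
    for k, v in mapper(r):
        groups[k].append(v)
results = dict(reducer(k, v) for k, v in groups.items())
print(results)  # {'w1': 75.0, 'w2': 80.0}
```

Because (sum, count) pairs merge associatively, the same merge can also run in a combiner without changing the final averages.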

Writing

  • Mapper:
    • Find the feature in the input, e.g. words
    • Emit a partial aggregate value
  • Reducer:
    • Compute the final aggregate result
  • Combiner:
    • Aggregates the mappers' partial values without changing the final outcome
    • Must be optional (correctness must not depend on it); its purpose is to save network traffic
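As a sketch of the mapper / combiner / reducer split (plain Python standing in for Hadoop; the PM2.5 readings are hypothetical), max is a safe combiner because it is associative:

```python
from collections import defaultdict

def mapper(record):
    loc, pm25 = record
    yield loc, pm25

def combiner(values):
    # max is associative, so pre-aggregating on each map host cannot
    # change the final result; it only shrinks network traffic
    return [max(values)]

readings = [("A", 120), ("A", 95), ("B", 60), ("A", 150), ("B", 72)]
shuffled = defaultdict(list)
# two simulated map hosts, each running the combiner on its own output
for split in (readings[:3], readings[3:]):
    local = defaultdict(list)
    for r in split:
        for k, v in mapper(r):
            local[k].append(v)
    for k, vals in local.items():
        shuffled[k].extend(combiner(vals))
# reducer: compute the final max per location
result = {k: max(vals) for k, vals in shuffled.items()}
print(result)  # {'A': 150, 'B': 72}
```

Dropping the combiner (shipping all raw values to the reducer) yields the same result, which is exactly the optionality requirement.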

Inverted Index Pattern

Intro

  • Goal: to generate an index from the dataset, to allow faster searches for specific features
  • Motivation: improve search efficiency
  • Examples:
    • Find all websites that match the search term

Writing:

  • Mapper
    • Find the feature in the input
    • Emit <keyword, document_identifier> pairs as output
  • Reducer
    • Identity function, since grouping and sorting are done in the shuffle-and-sort step
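A minimal sketch of the pattern in plain Python (the documents are invented; a real job would write the grouped lists out via the identity reducer rather than into a dict):

```python
from collections import defaultdict

def mapper(doc_id, text):
    # emit <keyword, document_identifier> for every distinct word
    for word in set(text.lower().split()):
        yield word, doc_id

docs = {"d1": "cloud computing notes", "d2": "map reduce notes"}
# shuffle-and-sort groups the doc ids under each keyword;
# the reducer is just the identity function
index = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    for word, d in mapper(doc_id, text):
        index[word].append(d)
print(index["notes"])  # ['d1', 'd2']
```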

Data Filtering

Intro

  • Goal: filter out useless records
  • Motivation: to speed up later computation
  • Examples:
    • Distributed text search over many documents
    • Tracking a thread of events
    • Data cleaning
  • Can be mapper-only: the shuffle-and-sort step isn't needed at all
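A mapper-only filter can be sketched as follows (plain Python; the search term and input lines are hypothetical):

```python
def mapper(line, term="PM2.5"):
    # mapper-only job: surviving records go straight to the output,
    # with no shuffle-and-sort step and no reducer at all
    if term in line:
        yield line

lines = ["PM2.5 reading high", "humidity normal", "PM2.5 sensor offline"]
kept = [out for line in lines for out in mapper(line)]
print(kept)  # ['PM2.5 reading high', 'PM2.5 sensor offline']
```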

Top Ten: a variant of filtering

  • Get a small number of records, relative to a ranking function, e.g. the top 10
  • Focus on the most important records
  • Writing
    • Mapper: emit <null, (ranking, record)> for each record; the null key sends all records to a single partition
    • Combiner: sort values by ranking, emit the top k
    • Reducer: same as the combiner, but can emit the rank integer as the key
  • Performance depends on the number of elements; without a combiner, performance is worse
  • Requirement:
    • Each mapper's split of the ranking data must fit into the mapper's memory
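A small simulation of the pattern (plain Python; `heapq.nlargest` stands in for the sort-and-truncate step, and k = 2 keeps the hypothetical data short):

```python
import heapq

def mapper(record, rank_fn):
    # the null key routes every record to the same partition
    yield None, (rank_fn(record), record)

def top_k(pairs, k=10):
    # used as both combiner (per map host) and reducer (global):
    # keep only the k highest-ranked pairs
    return heapq.nlargest(k, pairs)

records = [("A", 120), ("B", 60), ("C", 150), ("D", 90)]
pairs = [v for r in records for _, v in mapper(r, lambda rec: rec[1])]
# run the combiner on two map-side splits, then the reducer globally
partials = top_k(pairs[:2], k=2) + top_k(pairs[2:], k=2)
result = [rec for _, rec in top_k(partials, k=2)]
print(result)  # [('C', 150), ('A', 120)]
```

The combiner matters here: it shrinks each map host's output to at most k records before everything funnels into the single reducer.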

Writing

  • Mapper: filter the data
  • Combiner and reducer are optional, depending on the scenario

Data Joins

Intro

  • Goal: to combine related data from different datasets
  • Examples:
    • Relate purchase habits to demographics
    • Send reminders to inactive users
    • Recommendation systems
  • Types of RDB joins
    • Inner join: rows whose join key appears in both L and R
    • Outer join: rows whose join key appears in L or R (full outer)
    • TODO: review Relational Database joins in p47
  • Types of Hadoop joins TODO: review these in p52
    • Replication join: outer map-side join, useful when one dataset is small and the other is big
    • Re-partition join: reduce-side join for joining two or more datasets; works with two or more big datasets
    • Semi join: map-side join where one of the datasets is filtered so that it fits in memory; works with large datasets

Replication join

  • Replicate the smallest dataset to all the map hosts, using Hadoop's distributed cache
  • Writing:
    • Map:
      • Load the smallest dataset into a hashtable
      • Use the key from each input split to look up the hashtable
      • Join the dataset record with the matching hashtable value
    • No reducer needed.
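The map-side lookup can be sketched as follows (plain Python; the user/purchase data is hypothetical, and a dict stands in for the table that Hadoop's distributed cache would replicate to each map host):

```python
# small dataset (user -> city), replicated to every map host in the
# real pattern via the distributed cache; here it is just a dict
small = {"u1": "London", "u2": "Beijing"}

def mapper(record, lookup):
    # look up the join key in the replicated hashtable, join in place
    user, amount = record
    if user in lookup:
        yield user, (amount, lookup[user])

purchases = [("u1", 30), ("u2", 12), ("u3", 7)]  # the big dataset
joined = [kv for rec in purchases for kv in mapper(rec, small)]
print(joined)  # [('u1', (30, 'London')), ('u2', (12, 'Beijing'))]
```

Since each mapper emits fully joined records, no shuffle or reducer is required.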

Re-partition join

  • Both datasets are processed in the mapper, which emits <join_id, value> pairs
  • The join is performed at the reducer, among all elements sharing a key
  • This splits the load among all the nodes
  • Writing
    • Map:
      • Emit key-value pairs with join_id as the key and the remaining columns as the value
    • Reducer:
      • Joins the values for the same key
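The reduce-side join can be simulated in plain Python (hypothetical users/orders datasets; the source tag on each value is one common way to let the reducer tell the two sides apart):

```python
from collections import defaultdict

def mapper(source, record):
    # both datasets flow through the mapper; tag each value with its
    # source so the reducer can separate the two sides
    join_id, value = record
    yield join_id, (source, value)

users = [("u1", "Alice"), ("u2", "Bob")]
orders = [("u1", 30), ("u1", 12), ("u2", 7)]
# shuffle-and-sort: group the tagged values by join_id
groups = defaultdict(list)
for src, dataset in (("users", users), ("orders", orders)):
    for rec in dataset:
        for k, v in mapper(src, rec):
            groups[k].append(v)

# reducer: join the two sides for each join_id
joined = []
for join_id, values in groups.items():
    names = [v for s, v in values if s == "users"]
    amounts = [v for s, v in values if s == "orders"]
    joined += [(join_id, n, a) for n in names for a in amounts]
print(joined)
```

Each reducer only joins the records for its own keys, which is how the load gets split across the nodes.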

Semi join