# Map Reduce Algorithms

## Map Reduce Design Patterns

### Design Patterns

- Definition of design patterns: templates for solving specific, recurring problems
- Used by programmers
- MapReduce patterns are tools for solving MapReduce problems

### Decision Points

- Mapper's algorithm
- Mapper's output key-value pairs
- Reducer's algorithm
- Reducer's output key-value pairs

## Numerical Summarization Pattern

### Intro

- **Goal**: calculate aggregate statistical values over a large dataset
- Extract **features** from the dataset, and compute the same **function** for each feature
- **Motivation**: provide a **top-level** view of large input datasets to identify trends or anomalies
- **Examples**:
  - Count occurrences
  - Max / min values
  - Average / median / standard deviation

### Examples

- Max PM2.5 for each location in the dataset
- Average AQI for each week
  - Average is **not** an associative operation, so it cannot be computed from partial averages
  - Change the mapper's output to solve this, e.g. emit counts and sums instead of averages (see the sketch below)
- Number of locations with PM2.5 exceeding 150

### Writing

- Mapper:
  - Find the feature in the input, e.g. words
  - Emit a partial aggregate value
- Reducer:
  - Compute the final aggregate result
- Combiner:
  - Aggregates the partial values from the mapper **without** changing the final outcome
  - Must be optional: the result has to be correct whether or not it runs; its purpose is to save network traffic

## Inverted Index Pattern

### Intro

- Goal: generate an index from a dataset, to allow faster searches for specific features
- Motivation: improve search efficiency
- Examples:
  - Find all websites that match a search term

### Writing

- Mapper
  - Find the feature in the input
  - Emit keyword-document_identifier pairs as output
- Reducer
  - Identity function, since sorting and partitioning are done in the _shuffle and sort_ step (see the sketch below)

## Data Filtering

### Intro

- Goal: filter out useless records
- Motivation: speed up computation
- Examples:
  - Distributed text search across many documents
  - Tracking a thread of events
  - Data cleaning
- Can be mapper-only: the _shuffle and sort_ step is not even needed (see the sketch below)
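To make the "average is not associative" point concrete, here is a minimal, self-contained Python sketch of the numerical summarization pattern. It simulates the mapper / combiner / reducer flow in a single process; the weeks, AQI values, and the `shuffle` helper are made up for illustration and are not part of any Hadoop API. The key idea is that the mapper emits `(sum, count)` partials, so the combiner can merge them without changing the final average.

```python
from collections import defaultdict

def mapper(record):
    # Input record is assumed to be a (week, aqi) pair. Emit a partial
    # aggregate (sum, count) instead of an average, so partials can be merged.
    week, aqi = record
    yield week, (aqi, 1)

def combiner(key, values):
    # Merge partial aggregates locally without changing the final outcome.
    total = sum(s for s, _ in values)
    count = sum(c for _, c in values)
    yield key, (total, count)

def reducer(key, values):
    # Compute the final aggregate (the average) from the merged partials.
    total = sum(s for s, _ in values)
    count = sum(c for _, c in values)
    yield key, total / count

def shuffle(pairs):
    # Stand-in for Hadoop's shuffle-and-sort step: group values by key.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return sorted(groups.items())

if __name__ == "__main__":
    data = [("2024-W01", 40), ("2024-W01", 60), ("2024-W02", 90), ("2024-W02", 110)]
    mapped = [kv for rec in data for kv in mapper(rec)]
    combined = [kv for k, vs in shuffle(mapped) for kv in combiner(k, vs)]
    for k, vs in shuffle(combined):
        for week, avg in reducer(k, vs):
            print(week, avg)  # 2024-W01 50.0, 2024-W02 100.0
```

If the mapper emitted averages directly, a combiner averaging averages of unequal-sized groups would change the result; with `(sum, count)` pairs the combiner can run any number of times, or not at all, and the answer stays the same.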
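The inverted index pattern can be sketched the same way. The documents and the `shuffle` helper below are illustrative stand-ins; the point is that the mapper emits keyword-document_identifier pairs and the reducer is essentially the identity, because the shuffle-and-sort step has already grouped the document ids by keyword.

```python
from collections import defaultdict

def mapper(doc_id, text):
    # Emit one <keyword, document_identifier> pair per distinct word.
    for word in set(text.lower().split()):
        yield word, doc_id

def reducer(word, doc_ids):
    # Identity-style reducer: the grouping was already done by shuffle-and-sort.
    yield word, sorted(doc_ids)

def shuffle(pairs):
    # Stand-in for Hadoop's shuffle-and-sort step.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return sorted(groups.items())

if __name__ == "__main__":
    docs = {
        "doc1": "hadoop map reduce",
        "doc2": "map reduce patterns",
        "doc3": "hadoop design patterns",
    }
    mapped = [kv for doc_id, text in docs.items() for kv in mapper(doc_id, text)]
    for word, ids in shuffle(mapped):
        for term, postings in reducer(word, ids):
            print(term, postings)
```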
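For the data filtering pattern a mapper-only job is enough: with no reducer there is no shuffle-and-sort at all. Below is a minimal sketch of a distributed-grep-style filter; the log lines and the hard-coded pattern are made-up examples (in a real job the pattern would come from the job configuration).

```python
import re

# Hypothetical search pattern, used here only for illustration.
PATTERN = re.compile(r"error", re.IGNORECASE)

def mapper(line):
    # Emit the record only if it matches. With no reducer, the mapper's
    # output is the job's final output.
    if PATTERN.search(line):
        yield line

if __name__ == "__main__":
    log = [
        "2024-01-01 INFO  service started",
        "2024-01-01 ERROR disk full",
        "2024-01-02 WARN  slow response",
        "2024-01-02 error connection reset",
    ]
    for line in log:
        for kept in mapper(line):
            print(kept)
```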
### Top Ten: a variant of filtering

- Get a small number of records according to a ranking function, e.g. the top 10
- Focus on the most important records
- Writing (see the sketch at the end of these notes):
  - Mapper: emit <_null_, record> for each record; the _null_ key is used so that all data goes to a single partition
  - Combiner: sort values by rank and emit the top k
  - Reducer: same as the combiner, but it can emit the rank integer as the key
- Performance depends on the number of elements; without a combiner, performance is worse
- Requirement:
  - The ranking data of each split must fit into the mapper's memory

### Writing

- Mapper: filter the data
- Combiner and reducer are optional, depending on the scenario

## Data Joins

### Intro

- Goal: combine related data together, and relate information
- Examples:
  - Relate purchase habits to demographics
  - Send reminders to inactive users
  - Recommendation systems
- Types of RDB joins
  - Inner join: element is in both L and R
  - Outer join: element is in L or R (full outer)
  - TODO: review relational database joins in p47
- Types of Hadoop joins (TODO: review these in p52)
  - Replication join: a _map-side_ join, useful when one dataset is small and the other is big
  - Re-partition join: a _reduce-side_ join for joining two or more datasets; works with two or more big datasets
  - Semi join: a _map-side_ join where one of several datasets is **filtered** so that it fits in memory; works with large datasets

### Replication join

- Replicate the smallest dataset to all the map hosts, using Hadoop's distributed cache
- Writing (see the sketch at the end of these notes):
  - Map:
    - Load the smallest dataset into a hashtable
    - Use the key from each input split to look up the hashtable
    - Join the dataset record with the matching hashtable value
  - No reducer needed

### Re-partition join

- Process both datasets in the mapper, then emit <join_key, record> pairs
- Perform the join at the reducer, among all the elements sharing a key
- This splits the load among **all** nodes
- Writing (see the sketch at the end of these notes):
  - Map:
    - Emit key-value pairs with the join_id as the key and the other columns as the value
  - Reducer:
    - Join the values that share the same key

### Semi join
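The notes stop at this heading. Based only on the one-line description in the intro above (one of the datasets is filtered until it fits in memory, then a map-side join is performed), a plausible sketch of the idea might look like the following; the three phases, the field names, and the in-memory data are all assumptions made for illustration, not the author's worked example.

```python
def distinct_keys(records, key_field):
    # Phase 1: one pass over the first large dataset keeps only its distinct
    # join keys; the key set is assumed to be small enough to fit in memory.
    return {r[key_field] for r in records}

def filter_mapper(record, wanted_keys):
    # Phase 2: drop records of the second dataset whose key never occurs in
    # the first, so the surviving records fit in memory.
    if record["user_id"] in wanted_keys:
        yield record

def join_mapper(record, lookup):
    # Phase 3: an ordinary replicated (map-side) join against the filtered,
    # now-small dataset held in a hashtable.
    match = lookup.get(record["user_id"])
    if match is not None:
        yield record["user_id"], {**record, **match}

if __name__ == "__main__":
    activity = [{"user_id": "u1", "clicks": 12}, {"user_id": "u3", "clicks": 4}]
    profiles = [{"user_id": "u1", "city": "Oslo"},
                {"user_id": "u2", "city": "Lima"},
                {"user_id": "u3", "city": "Pune"}]
    keys = distinct_keys(activity, "user_id")
    small = {f["user_id"]: f for p in profiles for f in filter_mapper(p, keys)}
    for rec in activity:
        for key, joined in join_mapper(rec, small):
            print(key, joined)
```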
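Referenced from the "Top Ten" section above: a minimal in-process sketch of the pattern. The value of K, the `score` field, and the synthetic splits are illustrative assumptions; the flow follows the notes (mapper emits under the null key, combiner keeps the local top K, reducer repeats the ranking and emits rank integers as keys).

```python
import heapq

K = 10  # size of the "top N" result; an illustrative choice

def mapper(record):
    # Emit every record under the null key so all data ends up in one partition.
    yield None, record

def combiner(key, records):
    # Rank locally and keep only the top K, to cut network traffic.
    for record in heapq.nlargest(K, records, key=lambda r: r["score"]):
        yield key, record

def reducer(key, records):
    # Same as the combiner, but emit the final rank integer as the key.
    ranked = heapq.nlargest(K, records, key=lambda r: r["score"])
    for rank, record in enumerate(ranked, start=1):
        yield rank, record

if __name__ == "__main__":
    split_a = [{"id": i, "score": (i * 37) % 101} for i in range(0, 100)]
    split_b = [{"id": i, "score": (i * 53) % 97} for i in range(100, 200)]
    candidates = []
    for split in (split_a, split_b):  # one mapper + combiner per split
        mapped = [kv for rec in split for kv in mapper(rec)]
        candidates += [rec for _k, rec in combiner(None, [v for _k, v in mapped])]
    for rank, record in reducer(None, candidates):
        print(rank, record)
```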
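Referenced from the "Replication join" section above: a sketch of the map-side join. In a real Hadoop job the small dataset would reach every map host through the distributed cache; here it is simply an in-memory dict, and the field names are made up.

```python
# Small dataset (e.g. user demographics), assumed to fit in memory.
SMALL = {
    "u1": {"age": 34, "city": "Oslo"},
    "u2": {"age": 27, "city": "Lima"},
}

def mapper(record, lookup):
    # Look up the join key of each big-dataset record in the hashtable and
    # join the two records; unmatched records are simply dropped here.
    match = lookup.get(record["user_id"])
    if match is not None:
        yield record["user_id"], {**record, **match}
    # No reducer is needed: the mapper output is already the joined result.

if __name__ == "__main__":
    purchases = [                            # the big dataset, read split by split
        {"user_id": "u1", "item": "laptop"},
        {"user_id": "u2", "item": "phone"},
        {"user_id": "u9", "item": "desk"},   # no match in the small dataset
    ]
    for key, joined in (kv for p in purchases for kv in mapper(p, SMALL)):
        print(key, joined)
```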
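Referenced from the "Re-partition join" section above: a sketch of the reduce-side join. Each mapper tags its records with the dataset they came from and emits the join key; the reducer joins all values that meet under that key. The datasets, the source tags, and the `shuffle` helper are illustrative assumptions.

```python
from collections import defaultdict

def map_users(record):
    # Tag each record with its source dataset so the reducer can tell them apart.
    yield record["user_id"], ("users", record)

def map_purchases(record):
    yield record["user_id"], ("purchases", record)

def reducer(join_key, tagged_values):
    # All records sharing the join key arrive here; cross the two sides.
    users = [v for tag, v in tagged_values if tag == "users"]
    purchases = [v for tag, v in tagged_values if tag == "purchases"]
    for u in users:
        for p in purchases:
            yield join_key, {**u, **p}

def shuffle(pairs):
    # Stand-in for Hadoop's shuffle-and-sort step: group values by join key.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return sorted(groups.items())

if __name__ == "__main__":
    users = [{"user_id": "u1", "city": "Oslo"}, {"user_id": "u2", "city": "Lima"}]
    purchases = [{"user_id": "u1", "item": "laptop"}, {"user_id": "u1", "item": "phone"}]
    mapped = [kv for r in users for kv in map_users(r)]
    mapped += [kv for r in purchases for kv in map_purchases(r)]
    for key, values in shuffle(mapped):
        for out_key, joined in reducer(key, values):
            print(out_key, joined)
```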