# Principles of Map Reduce ## Big Data - Definitions: collecting and processing (interpretation) a large amount of data, to our own benefits. - Key Features: - Volume: scale of petabytes - Variety: Structured, semi-structured, unstructured - Velocity: Social media, sensor, high throughput - Veracity: Unclean, Imprecise, Unclear ## Distributed System - Simple definition: any system that works across many computers - Features: - Concurrency - No guarantee on sequence of events - No guarantee on reliability - Price Performance ratio is better - Challenges - Assigning workers - Failure handling - Exchanging results - Synchronization - Examples: [link](/1-1-intro.md#examples) ## Map Reduce ### Parallelism #### Intro - number of processors working together to calculate or solve a problem [earlier slides](/1-4-scalability.md) - Calculation will be divided into tasks, and sent to different processors - Can be different cores or machines - Challenges - Hard to divide - The subtasks may use results from each other - To solve this, we can rank task based on the difficulty to parallelize - Easy: Image processing - Hard: Path Search #### Benefits - Solve larger or complex problems - Serial computation will take too long: too large or complex - Also cheaper, since not less powerful computer it requires - Serial computation, a retrospective: - FDE cycle: one instruction is fetched, decoded and executed, based on Von Neumann - As processor speed increases, more instructions can be executed at the same time - We have reached the limit of single processor processing, where it's too hard to make a processor much faster - We now have more cores #### Real life examples - Simulation: where directly testing is difficult - Prediction: challenging in algorithm design and implementation - Data Analysis: Google search, biologist on DNA, Marketers on twitter #### Platforms for parallel computing - Spreadsheets, but can't handle true big data - R language: powerful statistical features - Python: General purpose with R like addons - Databases - RDBMS, Relational Database Management Systems - NoSQL: Key value storage, like Amazon DynamoDB - Document databases: MongoDB - Graph databases: Neo4j, Giraph - Specialized language - MapReduce / Hadoop - Spark - Data preparation and visualization ### Introduction to the Map Reduce Model #### Example problem - Counting words - input: text - output: key-pair lists: word - count #### Serial solution - Split into lines - Iterate over each line, and count each word #### Parallelized solution using Divide and Conquer - Split sentences or lines into words - Count words on separate machines, in parallel - Partition the problem and combine the result - Challenges - Assigning work units to workers - Having more work units than workers - Shared partial results may be needed by workers - Aggregate partial results - Knowing workers finished or stopped ### Map Reduce #### Definitions - A parallel programming model, and associated implementation, that scales well and has auto parallelization - Features: - Used by big companies - User specify the map function and reduce function - Runtime automatically parallelize the computation - Runtime handles failures, communication and perf issues - **Not** suitable for every possible algorithms - Can also refer to the runtime (execution framework), or the specific code implementation #### Pattern - Input: usually from file - Map task: aggregate key-value pairs - Reduce task: Process and reduce the key-value pairs - Output: can write to file system #### Usage - Google: Index building for search, Article clustering, Statistical machine translation - Yahoo: Index building for search, spam detection for mail - Facebook: Data Mining, Advertisement, Spam #### History - Inspired by functional programming like Lisp: - `map()`: apply function to each individual value of a array, to modify the array - `reduce()`: combine the values from an array to get a reduced number #### Model - Input data is partitioned into processable chunks - Map: Called on every item in input, and **emit** a series of **intermediate** pairs (list(ki, vi)) - One map job per chunk, can be parallelized - Reduce: Called on every unique key (ki, list(vi)), and it's value list, to emit a value to the output (ki, vi' or simply vi') - One reduce job for each distinct key emitted by mapper, can be parallelized - Procedure: nodes work on **map** job first, after map is completed, they **synchronize** to aggregate the intermediate values by output key, then run **reduce** #### Example - TODO: review the lab code on word count #### Benefits - High level abstraction - Framework is efficient, leading to good performance - Scalability is close to linear, since map and reduce is fully parallelized - Reduces the complexity of parallel programming #### Running - Shuffle, moving data from mappers to reducers and sort, ordering outputs before being processed by reducer - Shuffle: - Every intermediate key-value pairs generated by mapper are **collected** - Pairs are **partitioned** to a list of values, each partition sorted by key - Data is shuffled and sent to each reducer - Sort: - Reducer copy data from mapper - Downloaded data are **merged** and **sorted** as input for reducer, key-list of values, sorted by keys - Runtime - Partitions input - Schedules executions - Handles load balancing - Shuffle, partition, sort intermediate data - Handle failure - Manage IPC (Inter Process Communication) - TODO: see page 66 for diagram example #### Implementations - Google MapReduce: proprietary and private - Hadoop: Open Source, written in Java, first used by Yahoo, then merged into Apache project. - Used by most major companies: Amazon, Facebook, Google, IBM, last.fm, Yahoo - Custom implementation