6 KiB
6 KiB
Principles of Map Reduce
Big Data
- Definitions: collecting and processing (interpretation) a large amount of data, to our own benefits.
- Key Features:
- Volume: scale of petabytes
- Variety: Structured, semi-structured, unstructured
- Velocity: Social media, sensor, high throughput
- Veracity: Unclean, Imprecise, Unclear
Distributed System
- Simple definition: any system that works across many computers
- Features:
- Concurrency
- No guarantee on sequence of events
- No guarantee on reliability
- Price Performance ratio is better
- Challenges
- Assigning workers
- Failure handling
- Exchanging results
- Synchronization
- Examples: link
Map Reduce
Parallelism
Intro
- number of processors working together to calculate or solve a problem earlier slides
- Calculation will be divided into tasks, and sent to different processors
- Can be different cores or machines
- Challenges
- Hard to divide
- The subtasks may use results from each other
- To solve this, we can rank task based on the difficulty to parallelize
- Easy: Image processing
- Hard: Path Search
Benefits
- Solve larger or complex problems
- Serial computation will take too long: too large or complex
- Also cheaper, since not less powerful computer it requires
- Serial computation, a retrospective:
- FDE cycle: one instruction is fetched, decoded and executed, based on Von Neumann
- As processor speed increases, more instructions can be executed at the same time
- We have reached the limit of single processor processing, where it's too hard to make a processor much faster
- We now have more cores
Real life examples
- Simulation: where directly testing is difficult
- Prediction: challenging in algorithm design and implementation
- Data Analysis: Google search, biologist on DNA, Marketers on twitter
Platforms for parallel computing
- Spreadsheets, but can't handle true big data
- R language: powerful statistical features
- Python: General purpose with R like addons
- Databases
- RDBMS, Relational Database Management Systems
- NoSQL: Key value storage, like Amazon DynamoDB
- Document databases: MongoDB
- Graph databases: Neo4j, Giraph
- Specialized language
- MapReduce / Hadoop
- Spark
- Data preparation and visualization
Introduction to the Map Reduce Model
Example problem
- Counting words
- input: text
- output: key-pair lists: word - count
Serial solution
- Split into lines
- Iterate over each line, and count each word
Parallelized solution using Divide and Conquer
- Split sentences or lines into words
- Count words on separate machines, in parallel
- Partition the problem and combine the result
- Challenges
- Assigning work units to workers
- Having more work units than workers
- Shared partial results may be needed by workers
- Aggregate partial results
- Knowing workers finished or stopped
Map Reduce
Definitions
- A parallel programming model, and associated implementation, that scales well and has auto parallelization
- Features:
- Used by big companies
- User specify the map function and reduce function
- Runtime automatically parallelize the computation
- Runtime handles failures, communication and perf issues
- Not suitable for every possible algorithms
- Can also refer to the runtime (execution framework), or the specific code implementation
Pattern
- Input: usually from file
- Map task: aggregate key-value pairs
- Reduce task: Process and reduce the key-value pairs
- Output: can write to file system
Usage
- Google: Index building for search, Article clustering, Statistical machine translation
- Yahoo: Index building for search, spam detection for mail
- Facebook: Data Mining, Advertisement, Spam
History
- Inspired by functional programming like Lisp:
map()
: apply function to each individual value of a array, to modify the arrayreduce()
: combine the values from an array to get a reduced number
Model
- Input data is partitioned into processable chunks
- Map: Called on every item in input, and emit a series of intermediate
pairs (list(ki, vi))
- One map job per chunk, can be parallelized
- Reduce: Called on every unique key (ki, list(vi)), and it's value list, to
emit a value to the output (ki, vi' or simply vi')
- One reduce job for each distinct key emitted by mapper, can be parallelized
- Procedure: nodes work on map job first, after map is completed, they synchronize to aggregate the intermediate values by output key, then run reduce
Example
- TODO: review the lab code on word count
Benefits
- High level abstraction
- Framework is efficient, leading to good performance
- Scalability is close to linear, since map and reduce is fully parallelized
- Reduces the complexity of parallel programming
Running
- Shuffle, moving data from mappers to reducers and sort, ordering outputs
before being processed by reducer
- Shuffle:
- Every intermediate key-value pairs generated by mapper are collected
- Pairs are partitioned to a list of values, each partition sorted by key
- Data is shuffled and sent to each reducer
- Sort:
- Reducer copy data from mapper
- Downloaded data are merged and sorted as input for reducer, key-list of values, sorted by keys
- Shuffle:
- Runtime
- Partitions input
- Schedules executions
- Handles load balancing
- Shuffle, partition, sort intermediate data
- Handle failure
- Manage IPC (Inter Process Communication)
- TODO: see page 66 for diagram example
Implementations
- Google MapReduce: proprietary and private
- Hadoop: Open Source, written in Java, first used by Yahoo, then merged into
Apache project.
- Used by most major companies: Amazon, Facebook, Google, IBM, last.fm, Yahoo
- Custom implementation