map reduce intro, took 1.5 hr

This commit is contained in:
Ryan 2024-12-29 21:17:17 +08:00
parent 83254e49ec
commit dd6c4dad47
2 changed files with 194 additions and 4 deletions

View file

@ -50,6 +50,7 @@
- Internet - Internet
- Intranets - Intranets
- Domain name service
- Grid computing - Grid computing
- Peer to peer (p2p) computing - Peer to peer (p2p) computing
- Cloud computing - Cloud computing
@ -116,7 +117,6 @@
- Interacts with heterogeneous systems - Interacts with heterogeneous systems
- The above gave birth to cloud computing - The above gave birth to cloud computing
### Definition ### Definition
- A computing infrastructure, that consists of shared pool of **virtualized** - A computing infrastructure, that consists of shared pool of **virtualized**
@ -145,7 +145,7 @@
## Cloud Deployment Models ## Cloud Deployment Models
### Public Cloud ### Public Cloud
- Advantages: - Advantages:
- Flexible - Flexible
@ -158,6 +158,7 @@
- Not **customizable** - Not **customizable**
### Private cloud ### Private cloud
- Advantages: - Advantages:
- Highly **private** and **secured** - Highly **private** and **secured**
- More **control** - More **control**

189
3-1-map-reduce.md Normal file
View file

@ -0,0 +1,189 @@
# Principles of Map Reduce
## Big Data
- Definitions: collecting and processing (interpretation) a large amount of
data, to our own benefits.
- Key Features:
- Volume: scale of petabytes
- Variety: Structured, semi-structured, unstructured
- Velocity: Social media, sensor, high throughput
- Veracity: Unclean, Imprecise, Unclear
## Distributed System
- Simple definition: any system that works across many computers
- Features:
- Concurrency
- No guarantee on sequence of events
- No guarantee on reliability
- Price Performance ratio is better
- Challenges
- Assigning workers
- Failure handling
- Exchanging results
- Synchronization
- Examples: [link](/1-1-intro.md#examples)
## Map Reduce
### Parallelism
#### Intro
- number of processors working together to calculate or solve a problem
[earlier slides](/1-4-scalability.md)
- Calculation will be divided into tasks, and sent to different processors
- Can be different cores or machines
- Challenges
- Hard to divide
- The subtasks may use results from each other
- To solve this, we can rank task based on the difficulty to parallelize
- Easy: Image processing
- Hard: Path Search
#### Benefits
- Solve larger or complex problems
- Serial computation will take too long: too large or complex
- Also cheaper, since not less powerful computer it requires
- Serial computation, a retrospective:
- FDE cycle: one instruction is fetched, decoded and executed, based on Von
Neumann
- As processor speed increases, more instructions can be executed at the
same time
- We have reached the limit of single processor processing, where it's too
hard to make a processor much faster
- We now have more cores
#### Real life examples
- Simulation: where directly testing is difficult
- Prediction: challenging in algorithm design and implementation
- Data Analysis: Google search, biologist on DNA, Marketers on twitter
#### Platforms for parallel computing
- Spreadsheets, but can't handle true big data
- R language: powerful statistical features
- Python: General purpose with R like addons
- Databases
- RDBMS, Relational Database Management Systems
- NoSQL: Key value storage, like Amazon DynamoDB
- Document databases: MongoDB
- Graph databases: Neo4j, Giraph
- Specialized language
- MapReduce / Hadoop
- Spark
- Data preparation and visualization
### Introduction to the Map Reduce Model
#### Example problem
- Counting words
- input: text
- output: key-pair lists: word - count
#### Serial solution
- Split into lines
- Iterate over each line, and count each word
#### Parallelized solution using Divide and Conquer
- Split sentences or lines into words
- Count words on separate machines, in parallel
- Partition the problem and combine the result
- Challenges
- Assigning work units to workers
- Having more work units than workers
- Shared partial results may be needed by workers
- Aggregate partial results
- Knowing workers finished or stopped
### Map Reduce
#### Definitions
- A parallel programming model, and associated implementation, that scales well
and has auto parallelization
- Features:
- Used by big companies
- User specify the map function and reduce function
- Runtime automatically parallelize the computation
- Runtime handles failures, communication and perf issues
- **Not** suitable for every possible algorithms
- Can also refer to the runtime (execution framework), or the specific code
implementation
#### Pattern
- Input: usually from file
- Map task: aggregate key-value pairs
- Reduce task: Process and reduce the key-value pairs
- Output: can write to file system
#### Usage
- Google: Index building for search, Article clustering, Statistical machine
translation
- Yahoo: Index building for search, spam detection for mail
- Facebook: Data Mining, Advertisement, Spam
#### History
- Inspired by functional programming like Lisp:
- `map()`: apply function to each individual value of a array, to modify the
array
- `reduce()`: combine the values from an array to get a reduced number
#### Model
- Input data is partitioned into processable chunks
- Map: Called on every item in input, and **emit** a series of **intermediate**
<key-value> pairs (list(ki, vi))
- One map job per chunk, can be parallelized
- Reduce: Called on every unique key (ki, list(vi)), and it's value list, to
emit a value to the output (ki, vi' or simply vi')
- One reduce job for each distinct key emitted by mapper, can be
parallelized
- Procedure: nodes work on **map** job first, after map is completed, they
**synchronize** to aggregate the intermediate values by output key, then run
**reduce**
#### Example
- TODO: review the lab code on word count
#### Benefits
- High level abstraction
- Framework is efficient, leading to good performance
- Scalability is close to linear, since map and reduce is fully parallelized
- Reduces the complexity of parallel programming
#### Running
- Shuffle and sort
- Every intermediate key-value pairs generated by mapper are collected
- Same key items are grouped to a list of values
- Data is shuffled and sent to each reducer
- Data provided to each reducer is sorted by keys
- Runtime
- Partitions input
- schedules executions
- Handles load balancing
- Shuffle, partition, sort intermediate data
- Handle failure
- Manage IPC (Inter Process Communication)
- TODO: see page 66 for diagram example
#### Implementations
- Google MapReduce: proprietary and private
- Hadoop: Open Source, written in Java, first used by Yahoo, then merged into
Apache project.
- Used by most major companies: Amazon, Facebook, Google, IBM, last.fm,
Yahoo
- Custom implementation