From dd6c4dad474fae1b397051b464a7ba1fef8b2bd3 Mon Sep 17 00:00:00 2001
From: Ryan
Date: Sun, 29 Dec 2024 21:17:17 +0800
Subject: [PATCH] map reduce intro, took 1.5 hr

---
 1-1-intro.md      |   9 ++-
 3-1-map-reduce.md | 189 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 194 insertions(+), 4 deletions(-)
 create mode 100644 3-1-map-reduce.md

diff --git a/1-1-intro.md b/1-1-intro.md
index 09c18fa..cf979fe 100644
--- a/1-1-intro.md
+++ b/1-1-intro.md
@@ -50,6 +50,7 @@
 
 - Internet
 - Intranets
+- Domain name service
 - Grid computing
 - Peer to peer (p2p) computing
 - Cloud computing
@@ -116,7 +117,6 @@
 
 - Interacts with heterogeneous systems
 - The above gave birth to cloud computing
-
 ### Definition
 
 - A computing infrastructure, that consists of shared pool of **virtualized**
@@ -145,7 +145,7 @@
 
 ## Cloud Deployment Models
 
-### Public Cloud 
+### Public Cloud
 
 - Advantages:
   - Flexible
   - Inexpensive
   - Scalable
   - No maintenance
 - Disadvantages:
   - Lack of **security**?
   - Not **customizable**
 
 ### Private cloud
+
 - Advantages:
   - Highly **private** and **secured**
-  - More **control** 
+  - More **control**
 - Disadvantages:
   - Poor scalability
     - Scaled within hosted resources
-  - Costly: 
+  - Costly:
     - secured
     - More features
 - Inflexible pricing
diff --git a/3-1-map-reduce.md b/3-1-map-reduce.md
new file mode 100644
index 0000000..b91e453
--- /dev/null
+++ b/3-1-map-reduce.md
@@ -0,0 +1,189 @@
+# Principles of Map Reduce
+
+## Big Data
+
+- Definition: collecting and processing (interpreting) large amounts of data
+  for our own benefit
+- Key features:
+  - Volume: scale of petabytes
+  - Variety: structured, semi-structured, unstructured
+  - Velocity: social media, sensors, high-throughput sources
+  - Veracity: unclean, imprecise, unclear
+
+## Distributed System
+
+- Simple definition: any system that works across many computers
+- Features:
+  - Concurrency
+  - No guarantee on the sequence of events
+  - No guarantee of reliability
+  - Better price/performance ratio
+- Challenges:
+  - Assigning workers
+  - Failure handling
+  - Exchanging results
+  - Synchronization
+- Examples: [link](/1-1-intro.md#examples)
+
+## Map Reduce
+
+### Parallelism
+
+#### Intro
+
+- A number of processors working together to calculate or solve a problem
+  ([earlier slides](/1-4-scalability.md))
+- The calculation is divided into tasks and sent to different processors
+  (see the sketch after this list)
+  - Can be different cores or machines
+- Challenges:
+  - Hard to divide
+  - The subtasks may need each other's results
+- Because of this, tasks can be ranked by how difficult they are to
+  parallelize
+  - Easy: image processing
+  - Hard: path search
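+
+As a minimal sketch of this idea (my own illustration, not from the slides),
+the calculation below is divided into independent chunks that are summed on
+separate worker processes with Python's `multiprocessing`; the names
+`partial_sum` and `n_workers` are illustrative only:
+
+```python
+# Dividing one calculation into tasks and running them on separate cores.
+from multiprocessing import Pool
+
+
+def partial_sum(chunk):
+    """One task: sum a single chunk of the input, independently of the others."""
+    return sum(chunk)
+
+
+if __name__ == "__main__":
+    numbers = list(range(1_000_000))
+    n_workers = 4
+
+    # Partition the problem into roughly equal chunks, one per worker.
+    size = len(numbers) // n_workers + 1
+    chunks = [numbers[i:i + size] for i in range(0, len(numbers), size)]
+
+    with Pool(n_workers) as pool:
+        # Each chunk runs as a separate task, possibly on a different core.
+        partials = pool.map(partial_sum, chunks)
+
+    # Combine the partial results into the final answer.
+    print(sum(partials))
+```
+
+This is the "easy to parallelize" case: the chunks are independent, and only
+the final combination step needs the partial results.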
+
+#### Benefits
+
+- Solve larger or more complex problems
+- Useful when serial computation would take too long, because the problem is
+  too large or too complex
+- Also cheaper, since it does not require one very powerful computer
+- Serial computation, a retrospective:
+  - FDE cycle: one instruction is fetched, decoded and executed at a time,
+    based on the Von Neumann architecture
+  - As processor speed increased, more instructions could be executed in the
+    same amount of time
+  - We have reached the limit of single-processor speed: it is too hard to
+    make one processor much faster
+  - Instead, we now have more cores
+
+#### Real-life examples
+
+- Simulation: where testing directly is difficult
+- Prediction: challenging in algorithm design and implementation
+- Data analysis: Google search, biologists on DNA, marketers on Twitter
+
+#### Platforms for parallel computing
+
+- Spreadsheets, but they can't handle true big data
+- R language: powerful statistical features
+- Python: general purpose, with R-like add-ons
+- Databases
+  - RDBMS: Relational Database Management Systems
+  - NoSQL: key-value storage, like Amazon DynamoDB
+  - Document databases: MongoDB
+  - Graph databases: Neo4j, Giraph
+- Specialized languages and frameworks
+  - MapReduce / Hadoop
+  - Spark
+- Data preparation and visualization
+
+### Introduction to the Map Reduce Model
+
+#### Example problem
+
+- Counting words
+- Input: text
+- Output: a list of key-value pairs: word - count
+
+#### Serial solution
+
+- Split the text into lines
+- Iterate over each line and count each word
+
+#### Parallelized solution using Divide and Conquer
+
+- Split sentences or lines into words
+- Count words on separate machines, in parallel
+- Partition the problem and combine the results
+- Challenges:
+  - Assigning work units to workers
+  - Having more work units than workers
+  - Workers may need each other's partial results
+  - Aggregating partial results
+  - Knowing whether workers have finished or stopped
+
+### Map Reduce
+
+#### Definitions
+
+- A parallel programming model, and an associated implementation, that scales
+  well and parallelizes automatically
+- Features:
+  - Used by big companies
+  - The user specifies the map function and the reduce function
+  - The runtime automatically parallelizes the computation
+  - The runtime handles failures, communication and performance issues
+  - **Not** suitable for every possible algorithm
+- Can also refer to the runtime (execution framework), or to a specific code
+  implementation
+
+#### Pattern
+
+- Input: usually read from a file
+- Map task: produce intermediate key-value pairs
+- Reduce task: process and reduce the key-value pairs
+- Output: can be written to the file system
+
+#### Usage
+
+- Google: index building for search, article clustering, statistical machine
+  translation
+- Yahoo: index building for search, spam detection for mail
+- Facebook: data mining, advertisement, spam detection
+
+#### History
+
+- Inspired by functional programming languages like Lisp:
+  - `map()`: apply a function to each individual value of an array, producing
+    a modified array
+  - `reduce()`: combine the values of an array into a single reduced value
+
+#### Model
+
+- Input data is partitioned into processable chunks
+- Map: called on every item in the input, and **emits** a series of
+  **intermediate** pairs (list(ki, vi))
+  - One map job per chunk; these can run in parallel
+- Reduce: called on every unique key and its list of values (ki, list(vi)),
+  and emits a value to the output (ki, vi', or simply vi')
+  - One reduce job for each distinct key emitted by the mappers; these can
+    run in parallel
+- Procedure: nodes work on the **map** jobs first; after the map phase is
+  complete, they **synchronize** to aggregate the intermediate values by
+  output key, then run **reduce**
+
+#### Example
+
+- TODO: review the lab code on word count
+- A minimal sketch of the word-count pattern follows the Running subsection
+  below
+
+#### Benefits
+
+- High level of abstraction
+- The framework is efficient, leading to good performance
+- Scalability is close to linear, since map and reduce are fully parallelized
+- Reduces the complexity of parallel programming
+
+#### Running
+
+- Shuffle and sort
+  - All intermediate key-value pairs generated by the mappers are collected
+  - Items with the same key are grouped into a list of values
+  - The data is shuffled and sent to each reducer
+  - The data provided to each reducer is sorted by key
+- Runtime
+  - Partitions the input
+  - Schedules executions
+  - Handles load balancing
+  - Shuffles, partitions and sorts the intermediate data
+  - Handles failures
+  - Manages IPC (Inter-Process Communication)
+- TODO: see page 66 for diagram example
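+
+As a minimal in-memory sketch of the model above (my own illustration, not
+the lab code), `map_fn` and `reduce_fn` follow the (key, value) pattern
+described under Model, and the shuffle-and-sort step is simulated with a
+dictionary and a sort; a real runtime would distribute these steps across
+nodes and handle failures:
+
+```python
+# Word count expressed in the MapReduce pattern, simulated on one machine.
+from collections import defaultdict
+
+
+def map_fn(chunk):
+    """Map: emit an intermediate (word, 1) pair for every word in one chunk."""
+    for word in chunk.split():
+        yield (word.lower(), 1)
+
+
+def reduce_fn(word, counts):
+    """Reduce: combine the value list of one distinct key into a single value."""
+    return (word, sum(counts))
+
+
+chunks = ["the quick brown fox", "the lazy dog", "the quick dog"]
+
+# Map phase: one map call per chunk; these could run in parallel.
+intermediate = [pair for chunk in chunks for pair in map_fn(chunk)]
+
+# Shuffle and sort: group the intermediate values by key, then order by key.
+groups = defaultdict(list)
+for word, count in intermediate:
+    groups[word].append(count)
+
+# Reduce phase: one reduce call per distinct key; these could run in parallel.
+result = [reduce_fn(word, counts) for word, counts in sorted(groups.items())]
+print(result)
+# [('brown', 1), ('dog', 2), ('fox', 1), ('lazy', 1), ('quick', 2), ('the', 3)]
+```
+
+Everything outside `map_fn` and `reduce_fn` (partitioning, shuffling, sorting,
+scheduling, failure handling) is what the runtime described above takes care
+of in a real framework.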
+
+#### Implementations
+
+- Google MapReduce: proprietary and private
+- Hadoop: open source, written in Java; first used by Yahoo, later moved into
+  the Apache project
+  - Used by most major companies: Amazon, Facebook, Google, IBM, last.fm,
+    Yahoo
+- Custom implementations
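+
+For the Hadoop entry above, one common way to run word count without writing
+Java is Hadoop Streaming, where the mapper and reducer are ordinary scripts
+that read from stdin and write tab-separated key-value pairs to stdout. The
+two sketches below are hypothetical examples (the file names are illustrative,
+and the exact job-submission command depends on the installation, so it is
+omitted here):
+
+```python
+# mapper.py (illustrative name) -- Hadoop Streaming style mapper.
+# Reads raw text from stdin and emits one "word<TAB>1" line per word.
+import sys
+
+for line in sys.stdin:
+    for word in line.split():
+        print(f"{word.lower()}\t1")
+```
+
+```python
+# reducer.py (illustrative name) -- Hadoop Streaming style reducer.
+# The framework delivers the mapper output to each reducer sorted by key, so
+# consecutive lines with the same word can be summed with itertools.groupby.
+import sys
+from itertools import groupby
+
+pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
+for word, group in groupby(pairs, key=lambda kv: kv[0]):
+    total = sum(int(count) for _, count in group)
+    print(f"{word}\t{total}")
+```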