map reduce intro, took 1.5 hr

2024-12-29 21:17:17 +08:00 · 2024-12-29 21:17:17 +08:00 · dd6c4dad47
parent 83254e49ec
commit dd6c4dad47
2 changed files with 194 additions and 4 deletions
--- a/1-1-intro.md
+++ b/1-1-intro.md
@@ -50,6 +50,7 @@

 - Internet
 - Intranets
+- Domain name service
 - Grid computing
 - Peer to peer (p2p) computing
 - Cloud computing
@@ -116,7 +117,6 @@
        - Interacts with heterogeneous systems
 - The above gave birth to cloud computing

-
 ### Definition

 - A computing infrastructure, that consists of shared pool of **virtualized**
@@ -145,7 +145,7 @@

 ## Cloud Deployment Models

-###  Public Cloud
+### Public Cloud

 - Advantages:
    - Flexible
@@ -158,6 +158,7 @@
    - Not **customizable**

 ### Private cloud
+
 - Advantages:
    - Highly **private** and **secured**
    - More **control**
--- a/3-1-map-reduce.md
+++ b/3-1-map-reduce.md
@@ -0,0 +1,189 @@
+# Principles of Map Reduce
+
+## Big Data
+
+- Definition: collecting and processing (interpreting) large amounts of data
+  for our own benefit.
+- Key Features:
+    - Volume: scale of petabytes
+    - Variety: Structured, semi-structured, unstructured
+    - Velocity: Social media, sensor, high throughput
+    - Veracity: Unclean, Imprecise, Unclear
+
+## Distributed System
+
+- Simple definition: any system that works across many computers
+- Features:
+    - Concurrency
+    - No guarantee on sequence of events
+    - No guarantee on reliability
+    - Price Performance ratio is better
+- Challenges
+    - Assigning workers
+    - Failure handling
+    - Exchanging results
+    - Synchronization
+- Examples: [link](/1-1-intro.md#examples)
+
+## Map Reduce
+
+### Parallelism
+
+#### Intro
+
+- A number of processors working together to compute or solve a problem
+  ([earlier slides](/1-4-scalability.md))
+- Calculation will be divided into tasks, and sent to different processors
+    - Can be different cores or machines
+- Challenges
+    - Hard to divide
+    - The subtasks may use results from each other
+- To reason about this, we can rank tasks by how hard they are to parallelize
+    - Easy: Image processing
+    - Hard: Path Search
+
+#### Benefits
+
+- Solve larger or more complex problems
+- Serial computation would take too long when the problem is large or complex
+- Also cheaper, since it only requires less powerful computers
+- Serial computation, a retrospective:
+    - FDE cycle: one instruction is fetched, decoded and executed at a time,
+      following the von Neumann architecture
+    - As processor speed increases, more instructions can be executed in the
+      same amount of time
+    - We have reached the limit of single processor processing, where it's too
+      hard to make a processor much faster
+    - We now have more cores
+
+#### Real life examples
+
+- Simulation: where direct testing is difficult
+- Prediction: challenging in algorithm design and implementation
+- Data Analysis: Google search, biologists on DNA, marketers on Twitter
+
+#### Platforms for parallel computing
+
+- Spreadsheets, but can't handle true big data
+- R language: powerful statistical features
+- Python: General purpose with R like addons
+- Databases
+    - RDBMS, Relational Database Management Systems
+    - NoSQL: Key value storage, like Amazon DynamoDB
+    - Document databases: MongoDB
+    - Graph databases: Neo4j, Giraph
+- Specialized language
+    - MapReduce / Hadoop
+    - Spark
+- Data preparation and visualization
+
+### Introduction to the Map Reduce Model
+
+#### Example problem
+
+- Counting words
+- input: text
+- output: list of key-value pairs: (word, count)
+
+#### Serial solution
+
+- Split into lines
+- Iterate over each line, and count each word
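The serial approach can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and splitting on whitespace after lowercasing is an assumed definition of "word".

```python
from collections import Counter


def count_words_serial(text):
    """Count words by walking the text one line at a time."""
    counts = Counter()
    for line in text.splitlines():
        # Assumption: a "word" is any whitespace-separated token, lowercased.
        for word in line.lower().split():
            counts[word] += 1
    return counts


print(count_words_serial("the cat\nthe dog"))
```

A single loop like this is simple, but its running time grows with the full size of the input because only one line is handled at a time.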
+
+#### Parallelized solution using Divide and Conquer
+
+- Split sentences or lines into words
+- Count words on separate machines, in parallel
+- Partition the problem and combine the result
+- Challenges
+    - Assigning work units to workers
+    - Having more work units than workers
+    - Shared partial results may be needed by workers
+    - Aggregate partial results
+    - Knowing when workers have finished or stopped
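A hedged sketch of the divide and conquer idea, using a thread pool as a stand-in for separate machines. The names `count_chunk` and `count_words_parallel` and the chunk size are hypothetical. It touches several of the challenges listed: there are more work units (chunks) than workers, the pool assigns chunks to workers, and the partial results are aggregated at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_chunk(lines):
    """Count the words in one work unit (a small list of lines)."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def count_words_parallel(text, workers=3, chunk_size=2):
    """Partition the lines into chunks, count each chunk, then combine."""
    lines = text.splitlines()
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The pool queues chunks when there are more chunks than workers.
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)  # aggregate the partial results
    return total


print(count_words_parallel("a b a\nb c\nc c\na"))
```

Leaving the `with` block waits for every worker to finish, which is one simple answer to the "knowing when workers have finished" problem on a single machine.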
+
+### Map Reduce
+
+#### Definitions
+
+- A parallel programming model, and associated implementation, that scales well
+  and has auto parallelization
+- Features:
+    - Used by big companies
+    - The user specifies the map function and the reduce function
+    - The runtime automatically parallelizes the computation
+    - The runtime handles failures, communication and performance issues
+    - **Not** suitable for every possible algorithm
+- Can also refer to the runtime (execution framework), or the specific code
+  implementation
+
+#### Pattern
+
+- Input: usually from file
+- Map task: emit intermediate key-value pairs from the input
+- Reduce task: aggregate the values of each key into a reduced result
+- Output: can write to file system
+
+#### Usage
+
+- Google: Index building for search, Article clustering, Statistical machine
+  translation
+- Yahoo: Index building for search, spam detection for mail
+- Facebook: Data Mining, Advertisement, Spam
+
+#### History
+
+- Inspired by functional programming like Lisp:
+    - `map()`: apply a function to each individual value of an array, producing
+      a new array of results
+    - `reduce()`: combine the values of an array into a single reduced value
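For illustration, the same two ideas exist in Python as the built-in `map()` and `functools.reduce()`. The numbers below are arbitrary sample data.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# map(): apply a function to every element, producing a new sequence.
squares = list(map(lambda x: x * x, numbers))

# reduce(): fold the sequence into one value, starting from 0.
total = reduce(lambda acc, x: acc + x, squares, 0)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```

The input list is left unchanged by `map()`, and each element is handled independently, which is the property that makes the map step easy to spread across workers.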
+
+#### Model
+
+- Input data is partitioned into processable chunks
+- Map: Called on every item in input, and **emits** a series of **intermediate**
+  <key-value> pairs (list(ki, vi))
+    - One map job per chunk, can be parallelized
+- Reduce: Called on every unique key and its value list (ki, list(vi)), to
+  emit a value to the output ((ki, vi') or simply vi')
+    - One reduce job for each distinct key emitted by mapper, can be
+      parallelized
+- Procedure: nodes work on **map** job first, after map is completed, they
+  **synchronize** to aggregate the intermediate values by output key, then run
+  **reduce**
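The model above can be sketched as a tiny single-process simulation of word count. This is an assumption-laden illustration, not the lab code or any real framework's API: `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical names, and everything runs sequentially in one process.

```python
from collections import defaultdict


def map_fn(chunk):
    """Map: emit an intermediate (word, 1) pair for every word in a chunk."""
    for word in chunk.lower().split():
        yield (word, 1)


def reduce_fn(key, values):
    """Reduce: called once per unique key with its full list of values."""
    return (key, sum(values))


def run_mapreduce(chunks, map_fn, reduce_fn):
    # Map phase: one map job per chunk (could run in parallel).
    intermediate = []
    for chunk in chunks:
        intermediate.extend(map_fn(chunk))
    # Synchronize: group the intermediate values by output key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)
    # Reduce phase: one reduce job per distinct key (could run in parallel).
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


result = run_mapreduce(["the cat", "the dog the end"], map_fn, reduce_fn)
print(result)  # {'cat': 1, 'dog': 1, 'end': 1, 'the': 3}
```

The user only writes `map_fn` and `reduce_fn`; everything inside `run_mapreduce` is the part a real runtime would distribute across nodes.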
+
+#### Example
+
+- TODO: review the lab code on word count
+
+#### Benefits
+
+- High level abstraction
+- Framework is efficient, leading to good performance
+- Scalability is close to linear, since map and reduce are fully parallelized
+- Reduces the complexity of parallel programming
+
+#### Running
+
+- Shuffle and sort
+    - All intermediate key-value pairs generated by the mappers are collected
+    - Items with the same key are grouped into a list of values
+    - Data is shuffled and sent to each reducer
+    - Data provided to each reducer is sorted by key
+- Runtime
+    - Partitions input
+    - Schedules executions
+    - Handles load balancing
+    - Shuffles, partitions and sorts intermediate data
+    - Handles failures
+    - Manages IPC (Inter Process Communication)
+- TODO: see page 66 for diagram example
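A minimal sketch of the shuffle and sort step, assuming string keys and a simple hash partitioner. The names `partition` and `shuffle_and_sort` are hypothetical, and `zlib.crc32` is used only because it gives the same result on every run (Python's built-in `hash()` for strings varies between processes).

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick the reducer for a key; the same key always gets the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(pairs, num_reducers):
    """Group intermediate pairs by key and route each key to one reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        # Same-key items end up in the same list on the same reducer.
        buckets[partition(key, num_reducers)][key].append(value)
    # Each reducer receives its (key, values) input sorted by key.
    return [sorted(bucket.items()) for bucket in buckets]


pairs = [("b", 1), ("a", 1), ("b", 1), ("c", 1)]
for reducer_id, reducer_input in enumerate(shuffle_and_sort(pairs, 2)):
    print(reducer_id, reducer_input)
```

Which reducer a given key lands on depends on the hash, but every occurrence of a key is guaranteed to reach the same reducer, which is what lets each reducer work independently.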
+
+#### Implementations
+
+- Google MapReduce: proprietary and private
+- Hadoop: Open Source, written in Java, first used by Yahoo, then merged into
+  Apache project.
+    - Used by most major companies: Amazon, Facebook, Google, IBM, last.fm,
+      Yahoo
+- Custom implementation