add 3-2, took 2hr and more

2024-12-30 15:14:36 +08:00 · 2024-12-30 15:14:36 +08:00 · b56bb2278e
parent dd6c4dad47
commit b56bb2278e
2 changed files with 226 additions and 6 deletions
--- a/3-1-map-reduce.md
+++ b/3-1-map-reduce.md
@ -165,14 +165,23 @@

 #### Running

- Shuffle and sort
-    - Every intermediate key-value pairs generated by mapper are collected
-    - Same key items are grouped to a list of values
-    - Data is shuffled and sent to each reducer
-    - Data provided to each reducer is sorted by keys
+- Shuffle, moving data from mappers to reducers and sort, ordering outputs
+  before being processed by reducer
+    - Shuffle:
+        - Every intermediate key-value pairs generated by mapper are
+          **collected**
+        - Pairs are **partitioned** to a list of values, each partition sorted
+          by key
+        - Combiner run on each partition, to combine the key-value pairs with
+          key- list of values
+        - Data is shuffled and sent to each reducer
+    - Sort:
+        - Reducer copy data from mapper
+        - Downloaded data are **merged** and **sorted** as input for reducer, key-list
+          of values, sorted by keys
 - Runtime
    - Partitions input
-    - schedules executions
+    - Schedules executions
    - Handles load balancing
    - Shuffle, partition, sort intermediate data
    - Handle failure
--- a/3-2-hadoop.md
+++ b/3-2-hadoop.md
@ -0,0 +1,211 @@
+# Hadoop Architecture
+
+## Features
+
+- Designed to run in clusters of pcs
+- Scales up linearly
+- Suitable for local networks or data centers
+- Design principles
+    - Data is distributed around the network
+    - Computation is sent to data: code is sent to run on nodes
+    - Basic architecture is mater / worker
+- Offers the following:
+    - Redundant, fault tolerant data storage
+    - Parallel computation framework
+    - Job coordination
+
+## Structure of a MapReduce job
+
+- **Job**: a program to be executed across the **entire** dataset
+    - Packaged as a jar file, with all the code needed
+    - Job is assigned a cluster-unique ID
+    - Data attached to job is replicated over the entire internet
+- **Task**: an execution on a slice of data
+- **Task Attempt**: An instance of a local execution
+
+## Job execution flow
+
+- Split data into computing chunks
+- Assign a chunk to node manager
+- Run many mappers
+- [Shuffle and sort](/3-1-map-reduce.md#running)
+- Run many reducers
+- Result from reducers create the job output
+
+## The Optional Combiner
+
+- The bottleneck in map-reduce frameworks:
+    - Map and reduce jobs scale close to linearly, so they are not
+    - Potential bottleneck at the shuffle and sort operations (between map and
+      reduce)
+        - Data need to be copied over network
+        - Lots of keys are emitted by mapper, and sorting they are costly
+- Combiner can be used to execute before shffling and sorting
+- Reason
+    - Acts as a preliminary reducer
+    - Executed at each mapper node, just before sending all the pairs for
+      shuffling
+    - **Reduces** amount of data emitted by mapper, to improve efficiency
+- Restrictions and rules:
+    - **Cannot** be mandatory, the job should work the correctly without it
+    - **Idempotent**: The number of time the combiner is applied shouldn't
+      change the output
+        - No **Side effect**, or they won't be idempotent
+    - **Preserve** the keys: can't change the keys to disrupt the **sort**
+      order, or changing the **partitioning**
+- Example of using reducer code as the combiner:
+    ```java
+    public void Combine(String key, List<Integer> values) {
+        int sum = 0;
+        for (Integer count: values){
+            sum+=count;
+        }
+        emit(key, sum);
+    }
+    ```
+- TODO review the combiner diagram
+
+## Apache Hadoop
+
+### Architecture of Hadoop
+
+- Executes on nodes connected by network
+- Each node runs a set of daemons
+    - Computing:
+        - `ResourceManager`
+        - `NodeManager`
+    - Storage:
+        - `NameNode`
+        - `SecondaryNameNode` as backup
+        - `DataNode`
+- Nodes are in Master Slave architecture
+    - Master node: `NameNode`, `ResourceManager`
+        - Aware of slave nodes
+        - Receives external requests
+        - Decide the work split of slaves
+        - Notify slaves
+    - Slave node, also called Worker node: `DataNode`, `NodeManager`
+        - Executes the tasks, received from master
+
+### What Hadoop Does
+
+- Resource Management: the existence and availability of resources
+- Job Allocation: needed resources for job, and the split of work
+- Job Execution: Run job, make sure it's completed, deal with failures
+
+## Job execution: YARN
+
+### Intro
+
+- Estimate how many map and reduce tasks are needed for a job, based on _input
+  dataset_ and _job definition_
+- Ideally, one different node for each map / reduce tasks
+
+### Deciding the number of workers
+
+#### Mapper Parallelization
+
+- Different input split are processed on each mapper
+- Input data size is known
+- Number of mappers: Input size $/$ Split Size
+    - If input size is small, and has many files, they won't be splitted and
+      will use more mappers
+
+#### Reducer
+
+- Number of reducer is **user defined**, since it's hard to figure out
+  automatically
+    - Keys are partitioned, partitioning too much lead to overhead in shuffle
+      and sort
+
+### Execution Daemons
+
+#### `ResourceManager`
+
+- On master: one per cluster
+- Responsibility:
+    - Receive job requests from client
+    - Create a `ApplicationMaster` per job to manage it
+    - Allocate **container** in slave nodes, with the assigned resources
+    - Provision the health of `NodeManager` nodes
+
+#### `NodeManager`
+
+- Responsibility:
+    - Coordinate the execution of tasks on node
+    - Send health information to `ResourceManager`
+
+#### `ApplicationMaster`
+
+- Only one per job
+- Responsibility: Job allocation and job execution
+    - Implements specific framework, for example MapReduce
+    - Negitiates with `ResourceManager` on the resources required
+    - Decides which node will run which job, in the container, given by
+      `ResourceManager`
+    - Destroyed when job completed
+
+## Storage: HDFS
+
+### Definition
+
+- Hadoop Distributed File System
+- This is the storage for Hadoop's input and output
+- Features:
+    - Tailored for MapReduce jobs
+    - Large Block size (64MB)
+    - Not a POSIX compliant file system
+
+### Data distribution: key element of map reduce
+
+- Job code (jars) moved to where data is stored
+- Blocks are replicated on the cluster, by default **three** times, to ensure
+  **reliability**
+
+### Storage daemon
+
+- `DataNode`: many per cluster
+    - Stores block from HDFS
+    - Report the blocks stored to `NameNode`
+- `NameNode`: one per cluster
+    - Keep index and location for every block
+    - Don't do computation, because this is heavy
+    - Single point of failure
+- `SecondaryNameNode`
+    - Communicates directly with a `NameNode`
+    - Store backup of index table
+
+### Data Replication
+
+- Format: csv, example:
+
+    | `Filename` | `numReplicas` | `block-ids` |
+    | ---------- | ------------- | ----------- |
+    | part-0     | r:2           | {1,3}       |
+    | part-1     | r:3           | {2,4,5}     |
+
+- Definition: Creating and maintaining multiple copies of data, across different
+  nodes in the HDFS
+- Significance
+    - Fault tolerance
+    - Data availability
+    - System Reliability
+    - Support Parallel Processing
+
+### Failure recovery
+
+- Identifying failure: by missing the hearbeat signal
+- Replication: NameNode initializes replication once a node failure is detected
+- Maintaining Integrity: use a predetermined replication factor
+- Mitigating Potential Disruptions: Dynamic data management
+
+### Operation
+
+- TODO: See the graph in p59
+- Input data:
+    - Mappers are assigned input splits from HDFS input path (default 64MB)
+    - Data locality: `ApplicationMaster` try to assign mapper where data is
+      stored
+- Output data:
+    - Copied to HDFS, one file per reducer
+    - Replicate