add 3-2, took 2hr and more
Commit b56bb2278e (parent dd6c4dad47)

Changes in `3-1-map-reduce.md`:

#### Running

- Shuffle, moving data from mappers to reducers, and sort, ordering the outputs
  - Shuffle:
    - Every intermediate key-value pair generated by a mapper is **collected**
    - Pairs are **partitioned**, and each partition is sorted by key (a
      hash-partitioning sketch follows this section)
    - A combiner runs on each partition, combining the pairs of each key into a
      key - list of values pair
    - Data is shuffled and sent to each reducer
  - Sort:
    - Each reducer copies data from the mappers
    - The downloaded data are **merged** and **sorted** as input for the
      reducer: key - list of values, sorted by key
- Runtime
  - Partitions input
  - Schedules executions
  - Handles load balancing
  - Shuffles, partitions, and sorts intermediate data
  - Handles failures
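
As a rough illustration of the partitioning step, the sketch below mirrors the
hash-partitioning scheme Hadoop uses by default; the class and method names
here are illustrative, not Hadoop's API:

```java
// Sketch: every intermediate key is mapped to one of numReduceTasks
// partitions, so all pairs with the same key reach the same reducer.
public class PartitionSketch {
    static int getPartition(String key, int numReduceTasks) {
        // Mask the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(getPartition("apple", 4));  // same key ...
        System.out.println(getPartition("apple", 4));  // ... same partition
        System.out.println(getPartition("banana", 4)); // may differ
    }
}
```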

New file `3-2-hadoop.md` (211 lines):

# Hadoop Architecture

## Features

- Designed to run on clusters of PCs
- Scales up linearly
- Suitable for local networks or data centers
- Design principles:
  - Data is distributed around the network
  - Computation is sent to the data: code is shipped to run on the nodes
    holding the data
  - Basic architecture is master / worker
- Offers the following:
  - Redundant, fault-tolerant data storage
  - Parallel computation framework
  - Job coordination

## Structure of a MapReduce job

- **Job**: a program to be executed across the **entire** dataset
  - Packaged as a jar file, with all the code needed
  - Each job is assigned a cluster-unique ID
  - Data attached to the job is replicated across the entire cluster
- **Task**: an execution on a slice of data
- **Task Attempt**: an instance of a local execution

## Job execution flow

- Split data into computing chunks
- Assign each chunk to a node manager
- Run many mappers
- [Shuffle and sort](/3-1-map-reduce.md#running)
- Run many reducers
- Results from the reducers form the job output (the driver sketch below wires
  these steps together)
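
A minimal job driver, as a sketch of this flow; it assumes the standard
`org.apache.hadoop.mapreduce` API, with the identity `Mapper` / `Reducer` base
classes standing in for a real job's implementations. Paths are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobFlowSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "job-flow-sketch");
        job.setJarByClass(JobFlowSketch.class); // the jar shipped to the nodes

        // Identity map / reduce; a real job sets its own subclasses here.
        job.setMapperClass(Mapper.class);   // "run many mappers"
        job.setReducerClass(Reducer.class); // "run many reducers"

        // Input is split into chunks; output is one file per reducer.
        FileInputFormat.addInputPath(job, new Path("/input"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));

        // Shuffle and sort happen between the map and reduce phases.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```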

## The Optional Combiner

- The bottleneck in map-reduce frameworks:
  - Map and reduce jobs scale close to linearly, so they are not the bottleneck
  - The potential bottleneck is the shuffle and sort step (between map and
    reduce)
    - Data needs to be copied over the network
    - Mappers emit lots of keys, and sorting them is costly
- A combiner can be set to execute before shuffling and sorting
- Reason:
  - Acts as a preliminary reducer
  - Executed at each mapper node, just before sending all the pairs for
    shuffling
  - **Reduces** the amount of data emitted by the mapper, to improve efficiency
- Restrictions and rules:
  - **Cannot** be mandatory: the job should work correctly without it
  - **Idempotent**: the number of times the combiner is applied shouldn't
    change the output
  - No **side effects**, or it won't be idempotent
  - **Preserves** the keys: it can't change keys, which would disrupt the
    **sort** order or the **partitioning**
- Example of using reducer code as the combiner:

  ```java
  // Same logic as the reducer: sums the partial counts for one key.
  public void combine(String key, List<Integer> values) {
      int sum = 0;
      for (Integer count : values) {
          sum += count;
      }
      emit(key, sum); // emit: the framework output call, as used in these notes
  }
  ```
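
To register a combiner, the driver points the job at a reducer class; a minimal
sketch, assuming the standard `org.apache.hadoop.mapreduce` API (the identity
`Reducer` stands in for a real implementation like the one above):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class CombinerWiringSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "with-combiner");
        // Reuse the reduce logic as the optional combiner: it must be
        // idempotent, side-effect free, and key-preserving.
        job.setCombinerClass(Reducer.class);
        job.setReducerClass(Reducer.class);
    }
}
```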

- TODO review the combiner diagram

## Apache Hadoop

### Architecture of Hadoop

- Executes on nodes connected by a network
- Each node runs a set of daemons
  - Computing:
    - `ResourceManager`
    - `NodeManager`
  - Storage:
    - `NameNode`
      - `SecondaryNameNode` as backup
    - `DataNode`
- Nodes are in a master / slave architecture
  - Master node: `NameNode`, `ResourceManager`
    - Aware of the slave nodes
    - Receives external requests
    - Decides the work split among slaves
    - Notifies slaves
  - Slave node, also called worker node: `DataNode`, `NodeManager`
    - Executes the tasks received from the master

### What Hadoop Does

- Resource management: tracks the existence and availability of resources
- Job allocation: decides the resources needed for a job, and the split of work
- Job execution: runs the job, makes sure it completes, deals with failures

## Job execution: YARN

### Intro

- Estimates how many map and reduce tasks are needed for a job, based on the
  _input dataset_ and the _job definition_
- Ideally, a different node for each map / reduce task

### Deciding the number of workers

#### Mapper Parallelization

- A different input split is processed by each mapper
- The input data size is known
- Number of mappers: input size $/$ split size (see the sketch after this list)
- If the input is small but spread across many files, the files won't be split
  and more mappers will be used, one per file
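
A back-of-the-envelope sketch of the mapper-count rule; pure illustration, not
a Hadoop API. Splits are computed per file, which is why many small files mean
many mappers:

```java
// Number of mappers = sum over files of ceil(file size / split size),
// with at least one mapper per file.
public class MapperCountSketch {
    static long mappersFor(long[] fileSizes, long splitSize) {
        long mappers = 0;
        for (long size : fileSizes) {
            mappers += Math.max(1, (size + splitSize - 1) / splitSize);
        }
        return mappers;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // One 640 MB file, 64 MB splits -> 10 mappers.
        System.out.println(mappersFor(new long[] {640 * mb}, 64 * mb));
        // Ten 1 MB files -> also 10 mappers, one per file.
        long[] small = new long[10];
        java.util.Arrays.fill(small, mb);
        System.out.println(mappersFor(small, 64 * mb));
    }
}
```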

#### Reducer

- The number of reducers is **user defined**, since it's hard to figure out
  automatically (a sketch follows this list)
- Keys are partitioned among the reducers; partitioning too much leads to
  overhead in shuffle and sort
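
Since the reducer count is user defined, the driver sets it explicitly; a
minimal sketch using the standard `Job.setNumReduceTasks`:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reducer-count");
        // Too many partitions adds shuffle-and-sort overhead;
        // too few limits parallelism.
        job.setNumReduceTasks(4);
    }
}
```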

### Execution Daemons

#### `ResourceManager`

- On the master: one per cluster
- Responsibility:
  - Receives job requests from clients
  - Creates an `ApplicationMaster` per job to manage it
  - Allocates **containers** on slave nodes, with the assigned resources
  - Monitors the health of the `NodeManager` nodes

#### `NodeManager`

- Responsibility:
  - Coordinates the execution of tasks on its node
  - Sends health information to the `ResourceManager`

#### `ApplicationMaster`

- Only one per job
- Responsibility: job allocation and job execution
  - Implements a specific framework, for example MapReduce
  - Negotiates with the `ResourceManager` on the resources required
  - Decides which node will run which task, in the containers given by the
    `ResourceManager`
  - Destroyed when the job completes

## Storage: HDFS

### Definition

- Hadoop Distributed File System
- This is the storage for Hadoop's input and output
- Features:
  - Tailored for MapReduce jobs
  - Large block size (64 MB; see the sketch below)
  - Not a POSIX-compliant file system
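
The defaults a client sees can be read through the `FileSystem` API; a small
sketch (the path is a placeholder, and the printed values come from the
cluster's configuration files):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDefaultsSketch {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        FileSystem fs = FileSystem.get(new Configuration());
        Path root = new Path("/");
        System.out.println("block size:  " + fs.getDefaultBlockSize(root));
        System.out.println("replication: " + fs.getDefaultReplication(root));
    }
}
```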

### Data distribution: key element of map reduce

- Job code (jars) is moved to where the data is stored
- Blocks are replicated on the cluster, by default **three** times, to ensure
  **reliability**

### Storage daemon

- `DataNode`: many per cluster
  - Stores blocks from HDFS
  - Reports the blocks it stores to the `NameNode`
- `NameNode`: one per cluster
  - Keeps the index and location of every block
  - Does no computation, because that load would be too heavy
  - Single point of failure
- `SecondaryNameNode`
  - Communicates directly with the `NameNode`
  - Stores a backup of the index table

### Data Replication

- Format: CSV; example:

  | `Filename` | `numReplicas` | `block-ids` |
  | ---------- | ------------- | ----------- |
  | part-0     | r:2           | {1,3}       |
  | part-1     | r:3           | {2,4,5}     |

- Definition: creating and maintaining multiple copies of data across
  different nodes in HDFS (a per-file sketch follows this section)
- Significance:
  - Fault tolerance
  - Data availability
  - System reliability
  - Supports parallel processing
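
The replication factor can also be set per file; a minimal sketch using
`FileSystem.setReplication` (the path is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask HDFS to keep three copies of this file's blocks,
        // matching the default replication factor.
        boolean accepted = fs.setReplication(new Path("/data/part-0"), (short) 3);
        System.out.println("replication change accepted: " + accepted);
    }
}
```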

### Failure recovery

- Identifying failure: by a missed heartbeat signal
- Replication: the `NameNode` initiates re-replication once a node failure is
  detected
- Maintaining integrity: uses the predetermined replication factor
- Mitigating potential disruptions: dynamic data management

### Operation

- TODO: See the graph on p59
- Input data:
  - Mappers are assigned input splits from the HDFS input path (default 64 MB)
  - Data locality: the `ApplicationMaster` tries to assign each mapper to where
    its data is stored
- Output data:
  - Copied to HDFS, one file per reducer
  - Replicated