From b56bb2278e36b8c938634e3cd31519de1513d96b Mon Sep 17 00:00:00 2001 From: Ryan Date: Mon, 30 Dec 2024 15:14:36 +0800 Subject: [PATCH] add 3-2, took 2hr and more --- 3-1-map-reduce.md | 21 +++-- 3-2-hadoop.md | 211 ++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 226 insertions(+), 6 deletions(-) create mode 100644 3-2-hadoop.md diff --git a/3-1-map-reduce.md b/3-1-map-reduce.md index b91e453..baa1ec8 100644 --- a/3-1-map-reduce.md +++ b/3-1-map-reduce.md @@ -165,14 +165,23 @@ #### Running -- Shuffle and sort - - Every intermediate key-value pairs generated by mapper are collected - - Same key items are grouped to a list of values - - Data is shuffled and sent to each reducer - - Data provided to each reducer is sorted by keys +- Shuffle, moving data from mappers to reducers and sort, ordering outputs + before being processed by reducer + - Shuffle: + - Every intermediate key-value pairs generated by mapper are + **collected** + - Pairs are **partitioned** to a list of values, each partition sorted + by key + - Combiner run on each partition, to combine the key-value pairs with + key- list of values + - Data is shuffled and sent to each reducer + - Sort: + - Reducer copy data from mapper + - Downloaded data are **merged** and **sorted** as input for reducer, key-list + of values, sorted by keys - Runtime - Partitions input - - schedules executions + - Schedules executions - Handles load balancing - Shuffle, partition, sort intermediate data - Handle failure diff --git a/3-2-hadoop.md b/3-2-hadoop.md new file mode 100644 index 0000000..b0c49d4 --- /dev/null +++ b/3-2-hadoop.md @@ -0,0 +1,211 @@ +# Hadoop Architecture + +## Features + +- Designed to run in clusters of pcs +- Scales up linearly +- Suitable for local networks or data centers +- Design principles + - Data is distributed around the network + - Computation is sent to data: code is sent to run on nodes + - Basic architecture is mater / worker +- Offers the following: + - Redundant, fault tolerant data storage + - Parallel computation framework + - Job coordination + +## Structure of a MapReduce job + +- **Job**: a program to be executed across the **entire** dataset + - Packaged as a jar file, with all the code needed + - Job is assigned a cluster-unique ID + - Data attached to job is replicated over the entire internet +- **Task**: an execution on a slice of data +- **Task Attempt**: An instance of a local execution + +## Job execution flow + +- Split data into computing chunks +- Assign a chunk to node manager +- Run many mappers +- [Shuffle and sort](/3-1-map-reduce.md#running) +- Run many reducers +- Result from reducers create the job output + +## The Optional Combiner + +- The bottleneck in map-reduce frameworks: + - Map and reduce jobs scale close to linearly, so they are not + - Potential bottleneck at the shuffle and sort operations (between map and + reduce) + - Data need to be copied over network + - Lots of keys are emitted by mapper, and sorting they are costly +- Combiner can be used to execute before shffling and sorting +- Reason + - Acts as a preliminary reducer + - Executed at each mapper node, just before sending all the pairs for + shuffling + - **Reduces** amount of data emitted by mapper, to improve efficiency +- Restrictions and rules: + - **Cannot** be mandatory, the job should work the correctly without it + - **Idempotent**: The number of time the combiner is applied shouldn't + change the output + - No **Side effect**, or they won't be idempotent + - **Preserve** the keys: can't change the keys to disrupt the **sort** + order, or changing the **partitioning** +- Example of using reducer code as the combiner: + ```java + public void Combine(String key, List values) { + int sum = 0; + for (Integer count: values){ + sum+=count; + } + emit(key, sum); + } + ``` +- TODO review the combiner diagram + +## Apache Hadoop + +### Architecture of Hadoop + +- Executes on nodes connected by network +- Each node runs a set of daemons + - Computing: + - `ResourceManager` + - `NodeManager` + - Storage: + - `NameNode` + - `SecondaryNameNode` as backup + - `DataNode` +- Nodes are in Master Slave architecture + - Master node: `NameNode`, `ResourceManager` + - Aware of slave nodes + - Receives external requests + - Decide the work split of slaves + - Notify slaves + - Slave node, also called Worker node: `DataNode`, `NodeManager` + - Executes the tasks, received from master + +### What Hadoop Does + +- Resource Management: the existence and availability of resources +- Job Allocation: needed resources for job, and the split of work +- Job Execution: Run job, make sure it's completed, deal with failures + +## Job execution: YARN + +### Intro + +- Estimate how many map and reduce tasks are needed for a job, based on _input + dataset_ and _job definition_ +- Ideally, one different node for each map / reduce tasks + +### Deciding the number of workers + +#### Mapper Parallelization + +- Different input split are processed on each mapper +- Input data size is known +- Number of mappers: Input size $/$ Split Size + - If input size is small, and has many files, they won't be splitted and + will use more mappers + +#### Reducer + +- Number of reducer is **user defined**, since it's hard to figure out + automatically + - Keys are partitioned, partitioning too much lead to overhead in shuffle + and sort + +### Execution Daemons + +#### `ResourceManager` + +- On master: one per cluster +- Responsibility: + - Receive job requests from client + - Create a `ApplicationMaster` per job to manage it + - Allocate **container** in slave nodes, with the assigned resources + - Provision the health of `NodeManager` nodes + +#### `NodeManager` + +- Responsibility: + - Coordinate the execution of tasks on node + - Send health information to `ResourceManager` + +#### `ApplicationMaster` + +- Only one per job +- Responsibility: Job allocation and job execution + - Implements specific framework, for example MapReduce + - Negitiates with `ResourceManager` on the resources required + - Decides which node will run which job, in the container, given by + `ResourceManager` + - Destroyed when job completed + +## Storage: HDFS + +### Definition + +- Hadoop Distributed File System +- This is the storage for Hadoop's input and output +- Features: + - Tailored for MapReduce jobs + - Large Block size (64MB) + - Not a POSIX compliant file system + +### Data distribution: key element of map reduce + +- Job code (jars) moved to where data is stored +- Blocks are replicated on the cluster, by default **three** times, to ensure + **reliability** + +### Storage daemon + +- `DataNode`: many per cluster + - Stores block from HDFS + - Report the blocks stored to `NameNode` +- `NameNode`: one per cluster + - Keep index and location for every block + - Don't do computation, because this is heavy + - Single point of failure +- `SecondaryNameNode` + - Communicates directly with a `NameNode` + - Store backup of index table + +### Data Replication + +- Format: csv, example: + + | `Filename` | `numReplicas` | `block-ids` | + | ---------- | ------------- | ----------- | + | part-0 | r:2 | {1,3} | + | part-1 | r:3 | {2,4,5} | + +- Definition: Creating and maintaining multiple copies of data, across different + nodes in the HDFS +- Significance + - Fault tolerance + - Data availability + - System Reliability + - Support Parallel Processing + +### Failure recovery + +- Identifying failure: by missing the hearbeat signal +- Replication: NameNode initializes replication once a node failure is detected +- Maintaining Integrity: use a predetermined replication factor +- Mitigating Potential Disruptions: Dynamic data management + +### Operation + +- TODO: See the graph in p59 +- Input data: + - Mappers are assigned input splits from HDFS input path (default 64MB) + - Data locality: `ApplicationMaster` try to assign mapper where data is + stored +- Output data: + - Copied to HDFS, one file per reducer + - Replicate