add 3-2, took 2hr and more

This commit is contained in:
Ryan 2024-12-30 15:14:36 +08:00
parent dd6c4dad47
commit b56bb2278e
2 changed files with 226 additions and 6 deletions

View file

@ -165,14 +165,23 @@
#### Running #### Running
- Shuffle and sort - Shuffle, moving data from mappers to reducers and sort, ordering outputs
- Every intermediate key-value pairs generated by mapper are collected before being processed by reducer
- Same key items are grouped to a list of values - Shuffle:
- Data is shuffled and sent to each reducer - Every intermediate key-value pairs generated by mapper are
- Data provided to each reducer is sorted by keys **collected**
- Pairs are **partitioned** to a list of values, each partition sorted
by key
- Combiner run on each partition, to combine the key-value pairs with
key- list of values
- Data is shuffled and sent to each reducer
- Sort:
- Reducer copy data from mapper
- Downloaded data are **merged** and **sorted** as input for reducer, key-list
of values, sorted by keys
- Runtime - Runtime
- Partitions input - Partitions input
- schedules executions - Schedules executions
- Handles load balancing - Handles load balancing
- Shuffle, partition, sort intermediate data - Shuffle, partition, sort intermediate data
- Handle failure - Handle failure

211
3-2-hadoop.md Normal file
View file

@ -0,0 +1,211 @@
# Hadoop Architecture
## Features
- Designed to run in clusters of pcs
- Scales up linearly
- Suitable for local networks or data centers
- Design principles
- Data is distributed around the network
- Computation is sent to data: code is sent to run on nodes
- Basic architecture is mater / worker
- Offers the following:
- Redundant, fault tolerant data storage
- Parallel computation framework
- Job coordination
## Structure of a MapReduce job
- **Job**: a program to be executed across the **entire** dataset
- Packaged as a jar file, with all the code needed
- Job is assigned a cluster-unique ID
- Data attached to job is replicated over the entire internet
- **Task**: an execution on a slice of data
- **Task Attempt**: An instance of a local execution
## Job execution flow
- Split data into computing chunks
- Assign a chunk to node manager
- Run many mappers
- [Shuffle and sort](/3-1-map-reduce.md#running)
- Run many reducers
- Result from reducers create the job output
## The Optional Combiner
- The bottleneck in map-reduce frameworks:
- Map and reduce jobs scale close to linearly, so they are not
- Potential bottleneck at the shuffle and sort operations (between map and
reduce)
- Data need to be copied over network
- Lots of keys are emitted by mapper, and sorting they are costly
- Combiner can be used to execute before shffling and sorting
- Reason
- Acts as a preliminary reducer
- Executed at each mapper node, just before sending all the pairs for
shuffling
- **Reduces** amount of data emitted by mapper, to improve efficiency
- Restrictions and rules:
- **Cannot** be mandatory, the job should work the correctly without it
- **Idempotent**: The number of time the combiner is applied shouldn't
change the output
- No **Side effect**, or they won't be idempotent
- **Preserve** the keys: can't change the keys to disrupt the **sort**
order, or changing the **partitioning**
- Example of using reducer code as the combiner:
```java
public void Combine(String key, List<Integer> values) {
int sum = 0;
for (Integer count: values){
sum+=count;
}
emit(key, sum);
}
```
- TODO review the combiner diagram
## Apache Hadoop
### Architecture of Hadoop
- Executes on nodes connected by network
- Each node runs a set of daemons
- Computing:
- `ResourceManager`
- `NodeManager`
- Storage:
- `NameNode`
- `SecondaryNameNode` as backup
- `DataNode`
- Nodes are in Master Slave architecture
- Master node: `NameNode`, `ResourceManager`
- Aware of slave nodes
- Receives external requests
- Decide the work split of slaves
- Notify slaves
- Slave node, also called Worker node: `DataNode`, `NodeManager`
- Executes the tasks, received from master
### What Hadoop Does
- Resource Management: the existence and availability of resources
- Job Allocation: needed resources for job, and the split of work
- Job Execution: Run job, make sure it's completed, deal with failures
## Job execution: YARN
### Intro
- Estimate how many map and reduce tasks are needed for a job, based on _input
dataset_ and _job definition_
- Ideally, one different node for each map / reduce tasks
### Deciding the number of workers
#### Mapper Parallelization
- Different input split are processed on each mapper
- Input data size is known
- Number of mappers: Input size $/$ Split Size
- If input size is small, and has many files, they won't be splitted and
will use more mappers
#### Reducer
- Number of reducer is **user defined**, since it's hard to figure out
automatically
- Keys are partitioned, partitioning too much lead to overhead in shuffle
and sort
### Execution Daemons
#### `ResourceManager`
- On master: one per cluster
- Responsibility:
- Receive job requests from client
- Create a `ApplicationMaster` per job to manage it
- Allocate **container** in slave nodes, with the assigned resources
- Provision the health of `NodeManager` nodes
#### `NodeManager`
- Responsibility:
- Coordinate the execution of tasks on node
- Send health information to `ResourceManager`
#### `ApplicationMaster`
- Only one per job
- Responsibility: Job allocation and job execution
- Implements specific framework, for example MapReduce
- Negitiates with `ResourceManager` on the resources required
- Decides which node will run which job, in the container, given by
`ResourceManager`
- Destroyed when job completed
## Storage: HDFS
### Definition
- Hadoop Distributed File System
- This is the storage for Hadoop's input and output
- Features:
- Tailored for MapReduce jobs
- Large Block size (64MB)
- Not a POSIX compliant file system
### Data distribution: key element of map reduce
- Job code (jars) moved to where data is stored
- Blocks are replicated on the cluster, by default **three** times, to ensure
**reliability**
### Storage daemon
- `DataNode`: many per cluster
- Stores block from HDFS
- Report the blocks stored to `NameNode`
- `NameNode`: one per cluster
- Keep index and location for every block
- Don't do computation, because this is heavy
- Single point of failure
- `SecondaryNameNode`
- Communicates directly with a `NameNode`
- Store backup of index table
### Data Replication
- Format: csv, example:
| `Filename` | `numReplicas` | `block-ids` |
| ---------- | ------------- | ----------- |
| part-0 | r:2 | {1,3} |
| part-1 | r:3 | {2,4,5} |
- Definition: Creating and maintaining multiple copies of data, across different
nodes in the HDFS
- Significance
- Fault tolerance
- Data availability
- System Reliability
- Support Parallel Processing
### Failure recovery
- Identifying failure: by missing the hearbeat signal
- Replication: NameNode initializes replication once a node failure is detected
- Maintaining Integrity: use a predetermined replication factor
- Mitigating Potential Disruptions: Dynamic data management
### Operation
- TODO: See the graph in p59
- Input data:
- Mappers are assigned input splits from HDFS input path (default 64MB)
- Data locality: `ApplicationMaster` try to assign mapper where data is
stored
- Output data:
- Copied to HDFS, one file per reducer
- Replicate