|
|
|
# Map Reduce Reliability and Performance
|
|
|
|
|
|
|
|
## Hadoop Performance
|
|
|
|
|
|
|
|
### The concept of speedup:
|
|
|
|
|
|
|
|
- $$S(n) = \frac{\text{time taken with 1 processor}}{\text{time taken with } n \text{ processors}}$$
|
|
|
|
- Speedup is problem dependent, as well as architecture dependent
|
|
|
|
- If a job is fully parallelized, the speedup is linear:
|
|
|
|
[Amdahl's law](/1-4-scalability.md#amdahls-law-important)
|
|
|
|
|
|
|
|
### Calculating speedup
|
|
|
|
|
|
|
|
- Calculating the maximum possible theoretical speedup (ignoring IPC and workload imbalance):
|
|
|
|
- According to Amdahl's law, letting $n$ go to infinity, the speedup becomes: $$S = \frac{1}{\alpha}$$ where $\alpha$ is the fraction of the job that must run serially (see the worked example below)
|
|
|
|
- This means that if most of the job is serial, the speedup is low
|
|
|
|
- The real speedup is lower, because of the communication overhead and workload imbalance
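
A worked example, assuming the standard form of Amdahl's law with a serial fraction $\alpha = 0.1$ (10% of the job cannot be parallelized):

$$S(n) = \frac{1}{\alpha + \frac{1-\alpha}{n}} \implies S(10) = \frac{1}{0.1 + 0.09} \approx 5.3, \qquad S(\infty) = \frac{1}{0.1} = 10$$

So even with 10 processors the job runs only about 5 times faster, and no amount of hardware can push it past a 10x speedup.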
|
|
|
|
|
|
|
|
### Hadoop performance metrics
|
|
|
|
|
|
|
|
- Latency: the time between starting a job and delivering the output (execution time)
|
|
|
|
- Job setup
|
|
|
|
- Reading from HDFS
|
|
|
|
- Lock contention from concurrent tasks
|
|
|
|
- Throughput: the amount of data processed per second (bytes per second)
|
|
|
|
- HDFS throughput (IO or network bound, usually not CPU bound)
|
|
|
|
- Disk throughput
|
|
|
|
|
|
|
|
### Optimizing for performance
|
|
|
|
|
|
|
|
- Focus on bottlenecks
|
|
|
|
- Reduce the number of key-value pairs emitted by the mapper
|
|
|
|
- Reduce network traffic
|
|
|
|
- Simplify [shuffle and sort](/3-1-map-reduce.md#running)
|
|
|
|
- Sorting in the _shuffle and sort_ stage
|
|
|
|
- Avoid unnecessary Java object creation; reuse `Writable` objects when possible (see the sketch below)
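
A minimal sketch of a word-count style mapper that reuses its `Writable` instances instead of allocating new objects per record; the class and field names here are illustrative, not from the source.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: emits (token, 1) pairs while reusing the same Writable objects
public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Reused across all map() calls: no per-record object allocation
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken()); // overwrite the existing Text instead of new Text(...)
            context.write(word, ONE);
        }
    }
}
```

Registering a combiner on the job (`job.setCombinerClass(...)`) on top of this pre-aggregates counts on the map side, which cuts the number of key-value pairs that go through the shuffle and sort stage and over the network.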
|
|
|
|
|
|
|
|
### Problems in load balancing
|
|
|
|
|
|
|
|
- Data skew
|
|
|
|
- Not every task processes the same amount of data
|
|
|
|
- In general, mappers are less likely to have skew, since the input splits should be balanced
|
|
|
|
- Reducers are more likely to have skew, since the number of values for a key is hard to predict
|
|
|
|
- The partitioner can be trained to provide a balanced load across reducers
|
|
|
|
- For example, use an initial sample of the data to train it (see the sketch below)
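
A hedged driver sketch of the sampling idea using Hadoop's `InputSampler` and `TotalOrderPartitioner`: a random sample of the input keys is used to compute partition cut points so that each reducer receives a roughly equal share of keys. The paths, reducer count, and sampling parameters are illustrative assumptions, and the map output key type must match the input key type for the sampler to work.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class SampledPartitioningDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sampled-partitioning");
        job.setJarByClass(SampledPartitioningDriver.class);

        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setNumReduceTasks(8); // hypothetical reducer count
        FileInputFormat.addInputPath(job, new Path(args[0]));   // hypothetical input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // hypothetical output path

        // Sample ~10% of the records (at most 10,000, from at most 10 splits)
        // and derive cut points that balance the keys across the reducers
        InputSampler.Sampler<Text, Text> sampler =
                new InputSampler.RandomSampler<>(0.1, 10000, 10);
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                new Path(args[2]));                              // hypothetical partition file path
        job.setPartitionerClass(TotalOrderPartitioner.class);
        InputSampler.writePartitionFile(job, sampler);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that this balances the key ranges between reducers; a single extremely hot key would still need further handling (for example, salting the key in the mapper).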
|
|
|
|
|
|
|
|
### Summary:
|
|
|
|
|
|
|
|
- Input dataset: size and number of records
|
|
|
|
- Mapper: number of records, effect of combiner
|
|
|
|
- Reducer: data skew, e.g. a key with too many records
|
|
|
|
|
|
|
|
## Hadoop Reliability
|
|
|
|
|
|
|
|
### High Availability (HA)
|
|
|
|
|
|
|
|
- A characteristic of a system
|
|
|
|
- To ensure an agreed level of operational performance for a higher than normal period
|
|
|
|
- Features
|
|
|
|
- Fault tolerance: the system continues to operate correctly in the event of failure
|
|
|
|
- No single point of failure
|
|
|
|
- Graceful degradation: when some components fail, the system temporarily operates with degraded performance
|
|
|
|
- Calculating availability: [Calculation](/1-4-scalability.md#availaility-important)
|
|
|
|
- Testing: Chaos Monkey: an open-source tool that simulates real-world failures to test the resilience of IT infrastructure
|
|
|
|
- Helps to surface potential problems before they become actual problems
|
|
|
|
|
|
|
|
### Error Management
|
|
|
|
|
|
|
|
- Errors are a normal part of job execution
|
|
|
|
- Examples:
|
|
|
|
- Data integrity
|
|
|
|
- `DataNode` verifies the checksum before writing
|
|
|
|
- Clients verify the checksum of read blocks
|
|
|
|
- When an error is found, it is reported to the `NameNode`:
|
|
|
|
- mark the block as corrupt
|
|
|
|
- Clients redirected to other replicas
|
|
|
|
- Replication scheduled
|
|
|
|
- Task failure
|
|
|
|
- Environment problems, e.g. a mismatched Java version
|
|
|
|
- When errors or hangs occur, the task is marked as failed and reported back to the `NodeManager`
|
|
|
|
- The `ApplicationMaster` or `ResourceManager` tries to re-schedule the task on a different node
|
|
|
|
- If the maximum number of retries is reached, the job is declared failed (see the configuration sketch at the end of this section)
|
|
|
|
- `NodeManager` failure
|
|
|
|
- The `NodeManager` checks the health of its node and reports to the `ResourceManager`
|
|
|
|
- When it fails, the RM stops receiving heartbeats from it
|
|
|
|
- It is marked as killed and reported back to the AM
|
|
|
|
- The AM tries to re-run all the hosted containers on other nodes, after negotiating with the RM
|
|
|
|
- Completed map tasks are also rescheduled, since map results are not stored in HDFS
|
|
|
|
- `ResourceManager` or `NameNode` failures
|
|
|
|
- Single point of failure
|
|
|
|
- No automatic recovery
|
|
|
|
- Need manual intervention:
|
|
|
|
- The secondary `NameNode` has a backup copy of the index table
|
|
|
|
- The `ResourceManager` stores the state of jobs, and can re-launch them when restarted
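
A minimal sketch of how the per-task retry limit mentioned above can be set on a job. The property keys `mapreduce.map.maxattempts` and `mapreduce.reduce.maxattempts` are standard MapReduce configuration names; the surrounding driver code is illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryConfigDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // A failed task attempt is re-scheduled on another node up to this many
        // times before the whole job is declared failed
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        Job job = Job.getInstance(conf, "retry-config-example");
        // ... set mapper, reducer, input and output paths as usual ...
    }
}
```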
|