From 40f6f6b4361aa2d047da9c6f24cf944bd34ddc48 Mon Sep 17 00:00:00 2001
From: Ryan
Date: Mon, 30 Dec 2024 17:26:57 +0800
Subject: [PATCH] 3-4, 1hr

---
 3-1-map-reduce.md                  |   2 -
 3-4-map-reduce-reliability-perf.md | 110 +++++++++++++++++++++++++++++
 2 files changed, 110 insertions(+), 2 deletions(-)
 create mode 100644 3-4-map-reduce-reliability-perf.md

diff --git a/3-1-map-reduce.md b/3-1-map-reduce.md
index baa1ec8..b1819ec 100644
--- a/3-1-map-reduce.md
+++ b/3-1-map-reduce.md
@@ -172,8 +172,6 @@
   **collected**
   - Pairs are **partitioned** to a list of values, each partition sorted by key
-  - Combiner run on each partition, to combine the key-value pairs with
-    key- list of values
   - Data is shuffled and sent to each reducer
 - Sort:
   - Reducer copy data from mapper
diff --git a/3-4-map-reduce-reliability-perf.md b/3-4-map-reduce-reliability-perf.md
new file mode 100644
index 0000000..2dd09aa
--- /dev/null
+++ b/3-4-map-reduce-reliability-perf.md
@@ -0,0 +1,110 @@
+# Map Reduce Reliability and Performance
+
+## Hadoop Performance
+
+### The concept of speedup
+
+- $$S(n) = \frac{\text{time taken with 1 processor}}{\text{time taken with } n \text{ processors}}$$
+- Speedup is problem dependent as well as architecture dependent
+- If a job is fully parallelized, the speedup is linear:
+  [Amdahl's law](/1-4-scalability.md#amdahls-law-important)
+
+### Calculating speedup
+
+- Calculating the maximum possible theoretical speedup (ignoring IPC and
+  workload imbalance):
+  - Amdahl's law gives $$S(n) = \frac{1}{\alpha + \frac{1 - \alpha}{n}}$$
+    where $\alpha$ is the serial fraction of the job; setting $n$ to
+    infinity, the speedup becomes: $$S = \frac{1}{\alpha}$$
+  - This means that if most of the job is serial, the speedup is low; e.g.
+    $\alpha = 0.1$ caps the speedup at 10, no matter how many processors are
+    added
+- The real speedup is lower, because of communication overhead and workload
+  imbalance
+
+### Hadoop performance metrics
+
+- Latency: the time between starting a job and delivering the output
+  (execution time), including:
+  - Job setup
+  - Reading from HDFS
+  - Lock contention from concurrent tasks
+- Throughput: the amount of data processed per second (bytes per second)
+  - HDFS throughput (IO- or network-bound, usually not CPU-bound)
+  - Disk throughput
+
+### Optimizing for performance
+
+- Focus on bottlenecks:
+  - Reduce the number of key-value pairs emitted by the mapper
+  - Reduce network traffic
+  - Simplify [shuffle and sort](/3-1-map-reduce.md#running)
+    - Sorting happens in the _shuffle and sort_ stage
+  - Avoid unnecessary Java object creation; reuse `Writable` objects when
+    possible (see the mapper sketch below)
+
+### Problem in load balancing
+
+- Data skew
+  - Not every task processes the same amount of data
+  - In general, mappers are less likely to suffer from skew, since the input
+    splits should be balanced
+  - Reducers are more likely to suffer from skew, since the number of values
+    for a key is hard to predict
+  - The partitioner can be trained to balance the load across reducers
+    - For example, use an initial sampling of the data to train it (see the
+      partitioner sketch below)
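+
+A minimal sketch of the `Writable`-reuse tip from the optimization list above.
+`TokenMapper` is a hypothetical word-count mapper, not part of these notes;
+the point is that `word` and `ONE` are allocated once per task and overwritten
+for every record, rather than creating new objects for every emitted pair.
+
+```java
+import java.io.IOException;
+import java.util.StringTokenizer;
+
+import org.apache.hadoop.io.IntWritable;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.Mapper;
+
+public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
+
+  // Allocated once per task, reused for every emitted pair.
+  private static final IntWritable ONE = new IntWritable(1);
+  private final Text word = new Text();
+
+  @Override
+  protected void map(LongWritable key, Text value, Context context)
+      throws IOException, InterruptedException {
+    StringTokenizer itr = new StringTokenizer(value.toString());
+    while (itr.hasMoreTokens()) {
+      word.set(itr.nextToken()); // overwrite the buffer, no new Text()
+      context.write(word, ONE);  // the framework serializes the pair right
+                                 // away, so reusing the objects is safe
+    }
+  }
+}
+```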
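+
+The skew notes above mention training the partitioner on a sample. Below is
+one hedged illustration: `SampledRangePartitioner` and its hard-coded split
+points are assumptions made for this example; real split points would come
+from a sampling pass over the input. Hadoop ships the same idea as
+`TotalOrderPartitioner` fed by `InputSampler`.
+
+```java
+import org.apache.hadoop.io.IntWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.Partitioner;
+
+public class SampledRangePartitioner extends Partitioner<Text, IntWritable> {
+
+  // Boundary keys derived from an initial sampling pass (hard-coded here
+  // for illustration; in practice they would be loaded from a file
+  // shipped with the job).
+  private static final String[] SPLIT_POINTS = { "g", "p" };
+
+  @Override
+  public int getPartition(Text key, IntWritable value, int numPartitions) {
+    // Route each key by its sampled range instead of raw hashing, so one
+    // reducer no longer receives a disproportionate share of the keys.
+    int i = 0;
+    while (i < SPLIT_POINTS.length
+        && key.toString().compareTo(SPLIT_POINTS[i]) >= 0) {
+      i++;
+    }
+    return i % numPartitions; // guard in case there are fewer reducers
+  }
+}
+```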
+
+### Summary
+
+- Input dataset: size and number of records
+- Mapper: number of records, effect of the combiner
+- Reducer: data skew, i.e. a key with too many records
+
+## Hadoop Reliability
+
+### High Availability (HA)
+
+- A characteristic of a system
+- Ensures an agreed level of operational performance for a higher-than-normal
+  period
+- Features:
+  - Fault tolerance: the system continues to operate correctly in the event
+    of a failure
+  - No single point of failure
+  - Graceful degradation: when some components fail, the system temporarily
+    works with degraded performance
+- Calculating availability:
+  [Calculation](/1-4-scalability.md#availaility-important)
+- Testing: Chaos Monkey, an open-source tool that simulates real-world
+  failures to test the resilience of the infrastructure
+  - Helps surface potential problems before they become actual problems
+
+### Error Management
+
+- Errors are part of a job execution
+- Examples:
+  - Data integrity
+    - The `DataNode` verifies the checksum before writing
+    - Clients verify the checksum of the blocks they read
+    - When an error is found, it is reported to the `NameNode`:
+      - The block is marked as corrupt
+      - Clients are redirected to other replicas
+      - Re-replication is scheduled
+  - Task failure
+    - Environment problems, e.g. a wrong Java version
+    - On an error or a hanging task, the task is marked as failed and
+      reported back to the `NodeManager`
+    - The `ApplicationMaster` or `ResourceManager` tries to re-schedule the
+      task on a different node
+    - The job is declared failed once a task reaches its maximum number of
+      retries (see the configuration sketch below)
+  - `NodeManager` failure
+    - The `NodeManager` checks the health of its node and reports to the
+      `ResourceManager`
+    - When it fails, the RM stops receiving its heartbeat:
+      - The node is marked as killed and reported back to the AM
+      - The AM tries to re-run all the hosted containers on other nodes,
+        after negotiating with the RM
+      - Completed map tasks are also rescheduled, since map results are not
+        stored in HDFS
+  - `ResourceManager` or `NameNode` failures
+    - Single point of failure
+    - No automatic recovery; manual intervention is needed:
+      - The secondary `NameNode` keeps a backup copy of the index table
+      - The `ResourceManager` stores the state of jobs, and can re-launch
+        them when restarted
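+
+As a follow-up to the task-failure retries above, here is a minimal sketch of
+tuning the retry budget before a job is declared failed. The two property
+names are real Hadoop settings (both default to 4 attempts); the job name and
+the value 6 are only illustrative.
+
+```java
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.mapreduce.Job;
+
+public class RetryConfigSketch {
+  public static void main(String[] args) throws Exception {
+    Configuration conf = new Configuration();
+    // A failed task attempt is rescheduled on another node; once a task
+    // exhausts this budget, the whole job is declared failed.
+    conf.setInt("mapreduce.map.maxattempts", 6);
+    conf.setInt("mapreduce.reduce.maxattempts", 6);
+    Job job = Job.getInstance(conf, "retry-demo");
+    // ... set the mapper, reducer, input and output paths as usual ...
+    System.exit(job.waitForCompletion(true) ? 0 : 1);
+  }
+}
+```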