|
|
|
# Map Reduce Reliability and Performance
|
|
|
|
|
|
|
|
## Hadoop Performance
|
|
|
|
|
|
|
|
### The concept of speedup:
|
|
|
|
|
|
|
|
- $$S(n) = \frac{\text{time taken with 1 processor}}{\text{time taken with } n \text{ processors}}$$
|
|
|
|
- Speedup is problem dependent, as well as architecture dependent
|
|
|
|
- If a job is fully parallelized, the speedup is linear:
|
|
|
|
[Amdahl's law](/1-4-scalability.md#amdahls-law-important)
|
|
|
|
|
|
|
|
### Calculating speedup
|
|
|
|
|
|
|
|
- Calculating the maximum possible theoretical speedup (ignoring IPC and workload imbalance):
|
|
|
|
- According to Amdahl's law, letting $n$ go to infinity, the speedup becomes: $$S = \frac{1}{\alpha}$$ where $\alpha$ is the fraction of the job that must run serially (see the worked example below)
|
|
|
|
- This means that if most of the job is serial, the speedup is low
|
|
|
|
- The real speedup is lower, because of the communication overhead and workload imbalance
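
A worked example, assuming the standard form of Amdahl's law with a serial fraction $\alpha = 0.1$ (10% of the job cannot be parallelized):

$$S(n) = \frac{1}{\alpha + \frac{1-\alpha}{n}} \implies S(10) = \frac{1}{0.1 + 0.09} \approx 5.3, \qquad S(\infty) = \frac{1}{0.1} = 10$$

So even with 10 processors the job runs only about 5 times faster, and no amount of hardware can push it past a 10x speedup.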
|
|
|
|
|
|
|
|
### Hadoop performance metrics
|
|
|
|
|
|
|
|
- Latency: the time between starting a job and delivering the output (execution time)
|
|
|
|
- Job setup
|
|
|
|
- Reading from HDFS
|
|
|
|
- Lock contention from concurrent tasks
|
|
|
|
- Throughput: the amount of data processed per second (bytes per second)
|
|
|
|
- HDFS throughput (IO or network bound, usually not CPU bound)
|
|
|
|
- Disk throughput
|
|
|
|
|
|
|
|
### Optimizing for performance
|
|
|
|
|
|
|
|
- Focus on bottlenecks
|
|
|
|
- Reduce the number of key-value pairs emitted by the mapper
|
|
|
|
- Reduce network traffic
|
|
|
|
- Simplify [shuffle and sort](/3-1-map-reduce.md#running)
|
|
|
|
- Sorting in the _shuffle and sort_ stage
|
|
|
|
- Avoid unnecessary Java object creation; reuse `Writable` objects when possible (see the sketch below)
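
A minimal sketch of a word-count style mapper that reuses its `Writable` instances instead of allocating new objects per record; the class and field names here are illustrative, not from the source.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: emits (token, 1) pairs while reusing the same Writable objects
public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Reused across all map() calls: no per-record object allocation
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken()); // overwrite the existing Text instead of new Text(...)
            context.write(word, ONE);
        }
    }
}
```

Registering a combiner on the job (`job.setCombinerClass(...)`) on top of this pre-aggregates counts on the map side, which cuts the number of key-value pairs that go through the shuffle and sort stage and over the network.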
|
|
|
|
|
|
|
|
### Problems in load balancing
|
|
|
|
|
|
|
|
- Data skew
|
|
|
|
- Not every task processes the same amount of data
|
|
|
|
- In general, mappers are less likely to have skew, since the input splits should be balanced
|
|
|
|
- Reducers are more likely to have skew, since the number of values for a key is hard to predict
|
|
|
|
- The partitioner can be trained to provide a balanced load across reducers
|
|
|
|
- For example, use an initial sample of the data to train it (see the sketch below)
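
A hedged driver sketch of the sampling idea using Hadoop's `InputSampler` and `TotalOrderPartitioner`: a random sample of the input keys is used to compute partition cut points so that each reducer receives a roughly equal share of keys. The paths, reducer count, and sampling parameters are illustrative assumptions, and the map output key type must match the input key type for the sampler to work.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class SampledPartitioningDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sampled-partitioning");
        job.setJarByClass(SampledPartitioningDriver.class);

        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setNumReduceTasks(8); // hypothetical reducer count
        FileInputFormat.addInputPath(job, new Path(args[0]));   // hypothetical input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // hypothetical output path

        // Sample ~10% of the records (at most 10,000, from at most 10 splits)
        // and derive cut points that balance the keys across the reducers
        InputSampler.Sampler<Text, Text> sampler =
                new InputSampler.RandomSampler<>(0.1, 10000, 10);
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                new Path(args[2]));                              // hypothetical partition file path
        job.setPartitionerClass(TotalOrderPartitioner.class);
        InputSampler.writePartitionFile(job, sampler);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that this balances the key ranges between reducers; a single extremely hot key would still need further handling (for example, salting the key in the mapper).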
|
|
|
|
|
|
|
|
### Summary:
|
|
|
|
|
|
|
|
- Input dataset: size and number of records
|
|
|
|
- Mapper: number of records, effect of combiner
|
|
|
|
- Reducer: data skew, e.g. a key with too many records
|
|
|
|
|
|
|
|
## Hadoop Reliability
|
|
|
|
|
|
|
|
### High Availability (HA)
|
|
|
|
|
|
|
|
- A characteristic of a system
|
|
|
|
- To ensure an agreed level of operational performance for a higher than normal period
|
|
|
|
- Features
|
|
|
|
- Fault tolerance: the system continues to operate correctly in the event of failure
|
|
|
|
- No single point of failure
|
|
|
|
- Graceful degradation: when some components fail, the system temporarily operates with degraded performance
|
|
|
|
- Calculating availability: [Calculation](/1-4-scalability.md#availaility-important)
|
|
|
|
- Testing: Chaos Monkey: an open-source tool that simulates real-world failures to test the resilience of IT infrastructure
|
|
|
|
- Helps to surface potential problems before they become actual problems
|
|
|
|
|
|
|
|
### Error Management
|
|
|
|
|
|
|
|
- Errors are a normal part of job execution
|
|
|
|
- Examples:
|
|
|
|
- Data integrity
|
|
|
|
- `DataNode` verifies the checksum before writing
|
|
|
|
- Clients verify the checksum of read blocks
|
|
|
|
- When an error is found, it is reported to the `NameNode`:
|
|
|
|
- mark the block as corrupt
|
|
|
|
- Clients redirected to other replicas
|
|
|
|
- Replication scheduled
|
|
|
|
- Task failure
|
|
|
|
- Environment problems, e.g. a mismatched Java version
|
|
|
|
- When errors or hangs occur, the task is marked as failed and reported back to the `NodeManager`
|
|
|
|
- The `ApplicationMaster` or `ResourceManager` tries to re-schedule the task on a different node
|
|
|
|
- If the maximum number of retries is reached, the job is declared failed (see the configuration sketch at the end of this section)
|
|
|
|
- `NodeManager` failure
|
|
|
|
- The `NodeManager` checks the health of its node and reports to the `ResourceManager`
|
|
|
|
- When it fails, the RM stops receiving heartbeats from it
|
|
|
|
- It is marked as killed and reported back to the AM
|
|
|
|
- The AM tries to re-run all the hosted containers on other nodes, after negotiating with the RM
|
|
|
|
- Completed map tasks are also rescheduled, since map results are not stored in HDFS
|
|
|
|
- `ResourceManager` or `NameNode` failures
|
|
|
|
- Single point of failure
|
|
|
|
- No automatic recovery
|
|
|
|
- Need manual intervention:
|
|
|
|
- The secondary `NameNode` has a backup copy of the index table
|
|
|
|
- The `ResourceManager` stores the state of jobs, and can re-launch them when restarted
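
A minimal sketch of how the per-task retry limit mentioned above can be set on a job. The property keys `mapreduce.map.maxattempts` and `mapreduce.reduce.maxattempts` are standard MapReduce configuration names; the surrounding driver code is illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryConfigDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // A failed task attempt is re-scheduled on another node up to this many
        // times before the whole job is declared failed
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        Job job = Job.getInstance(conf, "retry-config-example");
        // ... set mapper, reducer, input and output paths as usual ...
    }
}
```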
|