# Map Reduce Reliability and Performance
## Hadoop Performance
### The concept of speedup:
- $$S(n) = \frac{\text{time taken with 1 processor}}{\text{time taken with } n \text{ processors}}$$
- Speedup is problem dependent, as well as architecture dependent
- If a job is fully parallelized, the speedup is linear:
[Amdahl's law](/1-4-scalability.md#amdahls-law-important)
### Calculating speedup
- Calculating the maximum possible theoretical speedup (ignoring IPC and
  workload imbalance):
    - According to Amdahl's law, let $n$ go to infinity; the speedup then
      becomes: $$S = \frac{1}{\alpha}$$ where $\alpha$ is the serial
      (non-parallelizable) fraction of the job
    - This means that if most of the job is serial, the speedup is low (see
      the worked example below)
- The real speedup is lower, because of communication overhead and workload
  imbalance
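
A short worked example (illustrative numbers, not from the lecture): take a job
whose serial fraction is $\alpha = 0.1$. Amdahl's law gives

$$S(n) = \frac{1}{\alpha + \frac{1 - \alpha}{n}}, \qquad
S(10) = \frac{1}{0.1 + 0.09} \approx 5.3, \qquad
S(\infty) = \frac{1}{0.1} = 10$$

so even with an unlimited number of processors, the job can never run more than
10 times faster.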
### Hadoop performance metrics
- Latency: time between starting a job and delivering the output (the execution
  time), which includes:
- Job setup
- Reading from HDFS
- Lock contention from concurrent tasks
- Throughput: the amount of data per second (byte per second)
    - HDFS throughput (IO- or network-bound, usually not CPU-bound)
    - Disk throughput
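
As an illustration (numbers made up for this note): a job that reads 1 TB from
HDFS and finishes after 2,000 s has a latency of 2,000 s and an average
throughput of

$$\frac{10^{12}\ \text{bytes}}{2000\ \text{s}} = 500\ \text{MB/s}$$

Two jobs can have the same throughput and still have very different latencies.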
### Optimizing for performance
- Focus on bottlenecks
- Reduce the number of key-value pairs emitted by the mapper
- Reduce network traffic
- Simplify [shuffle and sort](/3-1-map-reduce.md#running)
- Sorting in the _shuffle and sort_ stage
- Avoid unnecessary Java object creation; reuse `Writable` objects when possible
  (see the sketch below)
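
A minimal sketch of the object-reuse idea in a word-count style mapper (class
and field names here are illustrative, not from the lecture):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper that reuses its Writable instances instead of allocating
// new objects for every emitted key-value pair.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Created once per task and reused on every call to map()
    private final Text word = new Text();
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            word.set(token);          // overwrite the existing object
            context.write(word, ONE); // no new Text/IntWritable per record
        }
    }
}
```

Registering the reducer as a combiner (`job.setCombinerClass(...)`) is the
usual way to cut down the number of pairs that cross the network in the
shuffle.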
### Problems in load balancing
- Data skew
    - Not every task processes the same amount of data
    - In general, mappers are less likely to suffer from skew, since the input
      splits should be roughly balanced
    - Reducers are more likely to suffer from skew, since the number of values
      for a key is hard to predict
        - The partitioner can be trained to produce a balanced spread of keys
          across reducers
            - For example, use an initial sample of the data to train it (see
              the sketch below)
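
A sketch of the sampling idea using Hadoop's built-in `InputSampler` and
`TotalOrderPartitioner` (driver fragment only; the path and sampling parameters
are illustrative, and this assumes the map output keys follow the same
distribution as the sampled input keys):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class SampledPartitioningDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sampled-partitioning");
        // ... set input/output formats, mapper, reducer, key/value classes,
        // and the number of reducers as usual ...

        // Sample the input: pick each record with probability 0.01, keep at
        // most 10,000 samples from at most 10 splits, then write the
        // resulting key cut points to a partition file.
        InputSampler.Sampler<Text, NullWritable> sampler =
                new InputSampler.RandomSampler<>(0.01, 10_000, 10);
        Path partitionFile = new Path("/tmp/partitions.lst"); // illustrative
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
        InputSampler.writePartitionFile(job, sampler);

        // Route keys to reducers according to the sampled boundaries, so
        // each reducer gets a roughly equal share of the key space.
        job.setPartitionerClass(TotalOrderPartitioner.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```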
### Summary:
- Input dataset: size and number of records
- Mapper: number of records, effect of the combiner
- Reducer: data skew: keys with too many records
## Hadoop Reliability
### High Availability (HA)
- A characteristic of a system
- To ensure an agreed level of operational performance for a higher-than-normal
  period
- Features
    - Fault tolerance: the system continues to operate correctly in the event
      of a failure
- No single point of failure
    - Graceful degradation: when some components fail, the system temporarily
      operates with reduced performance
- Calculating availability:
  [Calculation](/1-4-scalability.md#availaility-important) (see the worked
  example below)
- Testing: Chaos Monkey: an open-source tool that simulates real-world failures
  to test the resilience of IT infrastructure
    - Helps to expose potential problems before they become actual problems
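
A minimal worked availability example (assuming the usual uptime-based
definition; numbers chosen for illustration):

$$A = \frac{\text{uptime}}{\text{uptime} + \text{downtime}}$$

A service that is down for a total of 8.76 hours out of a year (8,760 hours)
has $A = \frac{8760 - 8.76}{8760} = 99.9\%$, i.e. "three nines" of availability.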
### Error Management
- Errors are a normal part of job execution
- Examples:
    - Data integrity
        - `DataNode` verifies the checksum before writing
        - Clients verify the checksum of the blocks they read
        - When an error is found, it is reported to the `NameNode`:
            - The block is marked as corrupt
            - Clients are redirected to other replicas
            - Re-replication of the block is scheduled
    - Task failure
        - Environment problems, e.g. a mismatched Java version
        - When an error or hang occurs, the task is marked as failed and
          reported back to the `NodeManager`
            - The `ApplicationMaster` or `ResourceManager` tries to re-schedule
              the task on a different node
            - If the maximum number of retries is reached, the job is declared
              failed (retry limits are configurable, see the sketch at the end
              of this section)
    - `NodeManager` failure
        - The `NodeManager` checks the health of its node and reports to the
          `ResourceManager`
        - When it fails, the RM no longer receives heartbeats from it
            - The node is marked as killed and reported back to the AM
            - The AM tries to re-run all the containers it hosted on other
              nodes, after negotiating with the RM
                - Completed map tasks are also rescheduled, since map results
                  are not stored in HDFS
    - `ResourceManager` or `NameNode` failures
        - Single point of failure
        - No automatic recovery
        - Manual intervention is needed:
            - The secondary `NameNode` keeps a backup copy of the index table
            - The `ResourceManager` stores the state of jobs and can re-launch
              them when restarted
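
As referenced under _Task failure_ above, retry limits are ordinary job
configuration. A small sketch, assuming the standard
`mapreduce.map.maxattempts` / `mapreduce.reduce.maxattempts` properties (the
values here are illustrative, not recommendations):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // How many times a failed map/reduce task is retried (usually on a
        // different node) before the whole job is declared failed.
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        // Optionally tolerate a small percentage of failed map tasks without
        // failing the whole job.
        conf.setInt("mapreduce.map.failures.maxpercent", 5);

        Job job = Job.getInstance(conf, "retry-config-example");
        // ... set mapper, reducer, and input/output paths as usual ...
    }
}
```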