root/EBU6502_cloud_computing_notes

Ryan 6eb0c0c058 Try to fix mathjax formatting

2025-01-06 13:42:47 +08:00

4.1 KiB

Raw Blame History

Map Reduce Reliability and Performance

Hadoop Performance

The concept of speedup:

S(n) = \frac{TimeTakenWith1Processor}{TimeTakenWithNProcessor}

Speedup is problem dependent, as well as architecture dependent
If a job is fully parallelized, it the speed up is linear: Amdahl's law

Calculating speedup

Calculating the maximum possible theoretical speedup(ignore IPC and workload imbalance):
- According to Amdahl's law, we set n to infinity, thus the speedup becomes: $$S = \frac{1}{\alpha}
- Which means, if most of the parts are serialized, the speedup is low
The real speedup is lower, because of the communication overhead and workload imbalance

Hadoop performane metrics

Latency: time between starting a job and delivering output: execution time
- Job setup
- Reading from HDFS
- Lock contention from concurrent tasks
Throughput: the amount of data per second (byte per second)
- HDFS throughput(IO or network bound, usually not CPU)
- Disk throughput

Optimizing for the performance

Focus on bottlenecks
- Reduce number of key-value pairs, emitted by mapper
  - Reduce network traffic
  - Simplify shuffle and sort
- Sorting in the shuffle and sort stage
- Avoid unnecessary java object creation, reuse Writable when possible

Problem in load balancing

Data skew
- Not every task proceess the same amount of data
- In general, mappers are less likely to have skew, the splits should be balanced
- Reducer are more likely to have skew, since the number of values for a key is hard to predict.
  - The partition can be trained to provide a balanced speed
  - For example use a initial sampling of data to train it

Summary:

Input dataset: size, and number of records
Mapper: number of records, effect of combiner
Reducer: data skew: key with too many records

Hadoop Reliability

High Availability (HA)

A characteristic of a system
To ensure agreed level of operational performance for a higher than normal period
Features
- Fault tolerance: System continue to operate correctly on the event of failure
- No single point of failure
- Graceful degradation: When some component fail, the system temporarily works with worse performance
Calculating availability: Calculation
Testing: Chaos monkey: open source tool to simulate real-world failure, to test the resilience of it infra
- To help testing the potential problems before they become actual problems

Error Management

Errors are part of a job execution
Examples:
- Data integrity
  - DataNode verifies the checksum before writing
  - clients verify checksum of read blocks
  - When error found, report it to NameNode,
    - mark the block as corrupt
    - Clients redirected to other replicas
    - Replication scheduled
- Task failure
  - Environment problem: java version
  - Error and hanging occur, mark task as failed, report back to NodeManager
  - ApplicationMaster or ResourceManager try to re-schedule the task on a different node
  - Reach maximum number of retries before declaring job failure
- NodeManager failure
  - NodeManager check the health of node, and report to ResourceManager
  - When it fail, RM can't detect any heartbeat from it
    - mark as killed, report back to AM
    - AM try to re-run all the hosted containers in other nodes, after negotiating with RM
    - Completed map tasks are also rescheduled, since map results are not stored in HDFS
- ResourceManager or NameNode failures
  - Single point of failure
  - No way of automatic recovery
  - Need manual intervention:
    - Secondary NameNode has a backup copy of index table
    - ResourceManager Stores state of jobs, and can re-launch when restarted