commit 40f6f6b436 (parent 28b481f821): 3-4, 1hr
@@ -172,8 +172,6 @@
 **collected**
 - Pairs are **partitioned** to a list of values, each partition sorted
   by key
-- Combiner run on each partition, to combine the key-value pairs with
-  key- list of values
 - Data is shuffled and sent to each reducer
 - Sort:
   - Reducer copy data from mapper
3-4-map-reduce-reliability-perf.md (new file, 110 lines)
@@ -0,0 +1,110 @@
# Map Reduce Reliability and Performance

## Hadoop Performance

### The concept of speedup

- $$S(n) = \frac{T(1)}{T(n)}$$ where $T(n)$ is the time taken with $n$ processors
- Speedup is problem dependent, as well as architecture dependent
- If a job is fully parallelized, the speedup is linear:
  [Amdahl's law](/1-4-scalability.md#amdahls-law-important)
### Calculating speedup

- Calculating the maximum possible theoretical speedup (ignoring IPC and
  workload imbalance):
  - According to Amdahl's law, we let $n$ go to infinity, and the speedup
    becomes $$S = \frac{1}{\alpha}$$ where $\alpha$ is the serial fraction
    of the job
  - This means that if most of the work is serialized, the speedup is low
- The real speedup is lower, because of communication overhead and workload
  imbalance
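A quick worked example of this bound (numbers chosen purely for illustration):
with a serial fraction $\alpha = 0.05$,

$$S(n) = \frac{1}{\alpha + \frac{1 - \alpha}{n}}, \qquad
S(100) = \frac{1}{0.05 + 0.95/100} \approx 16.8, \qquad
S_{\max} = \frac{1}{0.05} = 20$$

so even with unlimited processors, the job can never run more than 20 times
faster.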
### Hadoop performance metrics

- Latency: the time between starting a job and delivering the output
  (execution time), which includes:
  - Job setup
  - Reading from HDFS
  - Lock contention from concurrent tasks
- Throughput: the amount of data processed per second (bytes per second)
  - HDFS throughput (IO or network bound, usually not CPU bound)
  - Disk throughput
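Since throughput is just a ratio, a one-line worked example (all numbers
hypothetical):

$$\text{throughput} = \frac{\text{data processed}}{\text{execution time}}
= \frac{1\ \text{TB}}{2000\ \text{s}} = 500\ \text{MB/s}$$

Note this is aggregate cluster throughput; latency can still be poor if job
setup or HDFS reads dominate those 2000 seconds.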
### Optimizing for performance

- Focus on bottlenecks
- Reduce the number of key-value pairs emitted by the mapper
- Reduce network traffic
- Simplify [shuffle and sort](/3-1-map-reduce.md#running)
  - Sorting in the _shuffle and sort_ stage
- Avoid unnecessary Java object creation; reuse `Writable` objects when
  possible (see the sketch below)
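A minimal word-count mapper sketching the last point (class and field names
are illustrative): the `Text` and `IntWritable` instances are allocated once
per task and reused across `map()` calls, instead of being created per record.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Allocated once per task, not once per record: Hadoop serializes the
    // Writable's contents at context.write() time, so reuse is safe.
    private final Text word = new Text();
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);           // reuse the same Text instance
            context.write(word, ONE);  // no new objects on the hot path
        }
    }
}
```

Pairing this with `job.setCombinerClass(...)` (typically the reducer itself,
when the operation is associative and commutative) is the usual way to cut the
number of pairs that reach the shuffle.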
### Problems in load balancing

- Data skew
  - Not every task processes the same amount of data
  - In general, mappers are less likely to have skew, since the input splits
    should be balanced
  - Reducers are more likely to have skew, since the number of values for a
    key is hard to predict
  - The partitioner can be trained to provide a balanced split (see the
    sketch below)
    - For example, use an initial sampling of the data to train it
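One way to act on that idea, as a hand-rolled sketch (the hot-key list is a
hypothetical example; in practice it would come from a sampling pass over the
input): route known hot keys to dedicated reducers and hash everything else
over the rest.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Skew-aware partitioner: dedicated reducers for hot keys,
// plain hashing for the remaining key space.
public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {

    // Hypothetical hot keys; in practice learned from an input sample.
    private static final List<String> HOT_KEYS =
            Arrays.asList("the", "of", "and");

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        int hash = key.hashCode() & Integer.MAX_VALUE; // force non-negative
        if (numPartitions <= HOT_KEYS.size()) {
            return hash % numPartitions; // too few reducers: plain hashing
        }
        int hot = HOT_KEYS.indexOf(key.toString());
        if (hot >= 0) {
            return hot; // one dedicated reducer per hot key
        }
        // Remaining keys share the reducers after the dedicated ones.
        return HOT_KEYS.size() + hash % (numPartitions - HOT_KEYS.size());
    }
}
```

It would be registered with `job.setPartitionerClass(SkewAwarePartitioner.class)`;
for full range partitioning, Hadoop's own `TotalOrderPartitioner` plus
`InputSampler` implement the sampling idea directly.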
### Summary

- Input dataset: size, and number of records
- Mapper: number of records, effect of the combiner
- Reducer: data skew: keys with too many records
## Hadoop Reliability

### High Availability (HA)

- A characteristic of a system
- Ensures an agreed level of operational performance for a higher-than-normal
  period
- Features
  - Fault tolerance: the system continues to operate correctly in the event
    of failure
  - No single point of failure
  - Graceful degradation: when some component fails, the system temporarily
    works with worse performance
- Calculating availability (a worked example follows this section):
  [Calculation](/1-4-scalability.md#availaility-important)
- Testing: Chaos Monkey: an open-source tool that simulates real-world
  failures, to test the resilience of the infrastructure
  - Helps surface potential problems before they become actual problems
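A quick worked example of the redundancy math (the standard
independent-failure model, not taken from the linked notes): if one node is
available $A = 0.99$ of the time, $n$ replicas are all unavailable only
$(1 - A)^n$ of the time:

$$A_{\text{total}} = 1 - (1 - A)^n, \qquad
n = 2: 99.99\%, \qquad n = 3: 99.9999\%$$

which is one reason HDFS's default replication factor of 3 holds up well
against independent node failures.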
### Error Management

- Errors are part of a job execution
- Examples:
  - Data integrity
    - `DataNode` verifies the checksum before writing
    - Clients verify the checksum of the blocks they read
    - When an error is found, it is reported to the `NameNode`:
      - The block is marked as corrupt
      - Clients are redirected to other replicas
      - Re-replication is scheduled
  - Task failure
    - Environment problem, e.g. the Java version
    - When an error or a hang occurs, the task is marked as failed and
      reported back to the `NodeManager`
    - The `ApplicationMaster` or `ResourceManager` tries to re-schedule the
      task on a different node
    - The job is declared failed once the maximum number of retries is
      reached (configurable, see the snippet below)
  - `NodeManager` failure
    - The `NodeManager` checks the health of its node and reports to the
      `ResourceManager`
    - When it fails, the RM can't detect any heartbeat from it
      - The node is marked as killed and reported back to the AM
    - The AM tries to re-run all the hosted containers on other nodes, after
      negotiating with the RM
      - Completed map tasks are also rescheduled, since map results are
        not stored in HDFS
  - `ResourceManager` or `NameNode` failures
    - Single point of failure
    - No way of automatic recovery
    - Need manual intervention:
      - The secondary `NameNode` has a backup copy of the index table
      - The `ResourceManager` stores the state of jobs, and can re-launch
        them when restarted
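The retry limit mentioned under task failure is an ordinary job setting; a
minimal sketch of tuning it through the `Job` API (the property names
`mapreduce.map.maxattempts` and `mapreduce.reduce.maxattempts` are standard
Hadoop configuration keys, defaulting to 4):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // How many times a failed map/reduce task is retried on another
        // node before the whole job is declared failed (default: 4).
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        // Tolerate a small percentage of failed map tasks without failing
        // the job; useful when a few bad records are acceptable.
        conf.setInt("mapreduce.map.failures.maxpercent", 0);

        Job job = Job.getInstance(conf, "retry-config-demo");
        // ... set mapper, reducer, input/output paths as usual ...
    }
}
```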