Map Reduce Reliability and Performance
Hadoop Performance
The concept of speedup:
- $$S(n) = \frac{\text{time taken with 1 processor}}{\text{time taken with } n \text{ processors}}$$
- Speedup is problem dependent, as well as architecture dependent
- If a job is fully parallelized, the speedup is linear in the number of processors; Amdahl's law covers the general case where part of the job is serial
Calculating speedup
- Calculating the maximum theoretical speedup (ignoring IPC and workload imbalance):
- According to Amdahl's law, where α is the serial (non-parallelizable) fraction of the job, we let n go to infinity, so the speedup becomes: $$S = \frac{1}{\alpha}$$
- This means that if most of the job is serialized, the speedup is low
- The real speedup is lower, because of the communication overhead and workload imbalance
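As a worked illustration of the bound above, here is a minimal Python sketch of Amdahl's law. It assumes α is the serial fraction of the job and ignores communication overhead and workload imbalance; the function names are illustrative only.

```python
def amdahl_speedup(alpha: float, n: int) -> float:
    """Speedup on n processors when a fraction alpha of the work is serial."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be >= 1")
    return 1.0 / (alpha + (1.0 - alpha) / n)


def max_speedup(alpha: float) -> float:
    """Limit of the speedup as n goes to infinity: 1 / alpha."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    return 1.0 / alpha


if __name__ == "__main__":
    # Speedup flattens out as processors are added (alpha = 0.1).
    for n in (1, 10, 100, 1000):
        print(n, round(amdahl_speedup(0.1, n), 2))
    print("limit:", max_speedup(0.1))
```

With α = 0.1 the speedup can never exceed 10, however many nodes are added; with α = 0 the formula reduces to the linear case S(n) = n.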
Hadoop performance metrics
- Latency: the time between starting a job and delivering its output, i.e. the execution time
- Job setup
- Reading from HDFS
- Lock contention from concurrent tasks
- Throughput: the amount of data processed per second (bytes per second)
- HDFS throughput (I/O or network bound, usually not CPU bound)
- Disk throughput
Optimizing for performance
- Focus on bottlenecks
- Reduce the number of key-value pairs emitted by the mapper
- Reduce network traffic
- Simplify shuffle and sort
- Sorting in the shuffle and sort stage
- Avoid unnecessary Java object creation; reuse Writable objects when possible
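To make the point about emitting fewer key-value pairs concrete, here is a small sketch in plain Python (standing in for a Hadoop mapper, not using any Hadoop API). It contrasts a word-count mapper that emits one pair per word with one that aggregates locally first, a technique often called in-mapper combining. Both function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def naive_mapper(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit one (word, 1) pair per word occurrence."""
    for line in lines:
        for word in line.split():
            yield word, 1


def combining_mapper(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Aggregate counts locally and emit one pair per distinct word."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    yield from counts.items()


if __name__ == "__main__":
    split = ["a b a", "b a c"]
    print(len(list(naive_mapper(split))))      # 6 pairs
    print(len(list(combining_mapper(split))))  # 3 pairs
```

Fewer emitted pairs means less data to serialize, shuffle across the network, and sort. The trade-off is that the mapper has to hold the partial counts in memory.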
Problems in load balancing
- Data skew
- Not every task processes the same amount of data
- In general, mappers are less likely to have skew, since the input splits should be balanced
- Reducers are more likely to have skew, since the number of values for a key is hard to predict
- The partitioner can be trained to balance the load across reducers
- For example, use an initial sample of the data to train it
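The sampling idea can be sketched as follows. This is an illustrative range partitioner in plain Python, not Hadoop's own implementation: it draws a random sample of keys, takes its quantiles as split points, and routes each key by binary search. The function names and the integer key type are assumptions made for the example.

```python
import bisect
import random
from typing import List, Sequence


def boundaries_from_sample(keys: Sequence[int], num_reducers: int,
                           sample_size: int, seed: int = 0) -> List[int]:
    """Pick num_reducers - 1 split points from the quantiles of a random sample."""
    if not keys:
        raise ValueError("need at least one key to sample")
    rng = random.Random(seed)
    sample = sorted(rng.sample(list(keys), min(sample_size, len(keys))))
    return [sample[(i * len(sample)) // num_reducers]
            for i in range(1, num_reducers)]


def partition(key: int, boundaries: List[int]) -> int:
    """Return the reducer index for a key, given the sorted split points."""
    return bisect.bisect_right(boundaries, key)


if __name__ == "__main__":
    keys = list(range(1000))
    bounds = boundaries_from_sample(keys, num_reducers=4, sample_size=100)
    loads = [0] * 4
    for k in keys:
        loads[partition(k, bounds)] += 1
    print(bounds, loads)
```

A larger sample gives split points closer to the true quantiles. Note that range boundaries cannot split a single very frequent key, so a hot key still lands on one reducer.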
Summary:
- Input dataset: size, and number of records
- Mapper: number of records, effect of combiner
- Reducer: data skew, i.e. keys with too many records
Hadoop Reliability
High Availability (HA)
- A characteristic of a system
- To ensure an agreed level of operational performance for a higher-than-normal period
- Features
- Fault tolerance: the system continues to operate correctly in the event of a failure
- No single point of failure
- Graceful degradation: when some components fail, the system temporarily keeps working with reduced performance
- Calculating availability
- Testing: Chaos Monkey, an open-source tool that simulates real-world failures to test the resilience of IT infrastructure
- It helps expose potential problems before they become actual problems
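The notes do not preserve the availability formula, so the sketch below uses one common definition as an assumption: availability is the mean time between failures (MTBF) divided by MTBF plus the mean time to repair (MTTR). The function names are illustrative.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, assuming MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf must be positive and mttr non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_per_year_hours(avail: float) -> float:
    """Expected downtime in hours over a 365-day year."""
    return (1.0 - avail) * 365 * 24


if __name__ == "__main__":
    a = availability(mtbf_hours=999, mttr_hours=1)
    print(a, downtime_per_year_hours(a))
```

Under this definition, "three nines" (99.9%) still allows a little under nine hours of downtime per year.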
Error Management
- Errors are a normal part of job execution
- Examples:
- Data integrity:
- DataNode verifies the checksum before writing
- Clients verify the checksum of the blocks they read
- When an error is found, it is reported to NameNode, which marks the block as corrupt
- Clients are redirected to other replicas
- Re-replication of the block is scheduled
- Task failure:
- Environment problem, e.g. a wrong Java version
- When an error or a hang occurs, the task is marked as failed and reported back to NodeManager
- ApplicationMaster or ResourceManager tries to re-schedule the task on a different node
- The task is retried up to a maximum number of attempts before the job is declared failed
- NodeManager failure:
- NodeManager checks the health of its node and reports to ResourceManager
- When it fails, RM no longer detects any heartbeat from it
- RM marks it as killed and reports back to AM
- AM tries to re-run all the hosted containers on other nodes, after negotiating with RM
- Completed map tasks are also rescheduled, since map results are not stored in HDFS
- ResourceManager or NameNode failures:
- Single point of failure
- No way of automatic recovery
- Need manual intervention:
- Secondary NameNode has a backup copy of the index table
- ResourceManager stores the state of jobs and can re-launch them when restarted
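The task-failure handling described above (retry elsewhere, give up after a maximum number of attempts) can be sketched in plain Python. This is a simplified model, not YARN's actual scheduling logic: `run_with_retries`, the node names, and the callable task are hypothetical. The default of 4 attempts is chosen here to mirror the default of Hadoop's `mapreduce.map.maxattempts` setting.

```python
from typing import Callable, List, Optional


def run_with_retries(task: Callable[[str], str], nodes: List[str],
                     max_attempts: int = 4) -> str:
    """Try the task on successive nodes until it succeeds or attempts run out."""
    if not nodes:
        raise ValueError("need at least one node")
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]  # rotate through the candidate nodes
        try:
            return task(node)
        except Exception as err:  # a failed attempt: remember why, then retry
            last_error = err
    raise RuntimeError(f"job failed after {max_attempts} attempts") from last_error


if __name__ == "__main__":
    def flaky(node: str) -> str:
        # Hypothetical task that only works where the environment is right.
        if node != "node3":
            raise IOError("wrong Java version on " + node)
        return "ok@" + node

    print(run_with_retries(flaky, ["node1", "node2", "node3"]))
```

Rotating through nodes reflects the idea that a failure caused by one node's environment may not recur elsewhere. A failure that persists on every node exhausts the attempts and fails the whole job.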