Hadoop Architecture
Features
- Designed to run on clusters of commodity PCs
- Scales nearly linearly as nodes are added
- Suitable for local networks or data centers
- Design principles
- Data is distributed around the network
- Computation is sent to data: code is sent to run on nodes
- Basic architecture is master / worker
- Offers the following:
- Redundant, fault tolerant data storage
- Parallel computation framework
- Job coordination
Structure of a MapReduce job
- Job: a program to be executed across the entire dataset
- Packaged as a jar file, with all the code needed
- Job is assigned a cluster-unique ID
- Data attached to the job is replicated across the cluster
- Task: an execution on a slice of data
- Task Attempt: one instance of a task executing locally on a node; failed tasks are retried as new attempts
Job execution flow
- Split the input data into chunks (input splits)
- Assign each chunk to a node (NodeManager)
- Run many mappers
- Shuffle and sort
- Run many reducers
- Results from the reducers form the job output
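To make this flow concrete, here is a minimal word-count sketch using Hadoop's Java MapReduce API (class names and paths are illustrative, not from these notes): the mappers emit (word, 1) pairs, the framework shuffles and sorts them by key, and the reducers sum the counts per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: receives all counts for one word (after shuffle and sort) and sums them
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: packaged into the job jar and submitted to the cluster
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input path
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```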
The Optional Combiner
- The bottleneck in map-reduce frameworks:
- Map and reduce tasks scale close to linearly, so they are not the bottleneck
- The potential bottleneck is the shuffle and sort phase (between map and reduce)
- Data needs to be copied over the network
- Mappers emit a large number of key-value pairs, and sorting them is costly
- A combiner can be configured to run before shuffling and sorting
- Reason
- Acts as a preliminary reducer
- Executed at each mapper node, just before the pairs are sent for shuffling
- Reduces the amount of data emitted by the mapper, improving efficiency
- Restrictions and rules:
- Cannot be mandatory: the job must produce correct results without it
- Idempotent: the number of times the combiner is applied must not change the output
- No side effects, otherwise it won't be idempotent
- Preserves the keys: it must not change keys, which would disrupt the sort order or the partitioning
- Example of using reducer code as the combiner (the driver wiring is sketched after this list):

      public void combine(String key, List<Integer> values) {
          int sum = 0;
          for (Integer count : values) {
              sum += count;
          }
          emit(key, sum);
      }
- TODO review the combiner diagram
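In Hadoop's Java API the wiring is a single line in the job driver; a minimal sketch, assuming the WordCount classes from the job-execution-flow example above:

```java
// Reuse the reducer as the combiner: valid here because summing counts is
// associative and commutative, so applying it zero or more times per key
// does not change the final output.
job.setCombinerClass(IntSumReducer.class);
```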
Apache Hadoop
Architecture of Hadoop
- Executes on nodes connected by network
- Each node runs a set of daemons
- Computing: ResourceManager, NodeManager
- Storage: NameNode, SecondaryNameNode (as backup), DataNode
- Computing: nodes are organized in a master / slave architecture
- Master node: NameNode, ResourceManager
- Aware of slave nodes
- Receives external requests
- Decides the work split among slaves
- Notifies slaves
- Slave node, also called worker node: DataNode, NodeManager
- Executes the tasks received from the master
What Hadoop Does
- Resource management: tracks the existence and availability of resources
- Job allocation: determines the resources needed for a job and how the work is split
- Job execution: runs the job, makes sure it completes, and deals with failures
Job execution: YARN
Intro
- Estimates how many map and reduce tasks are needed for a job, based on the input dataset and the job definition
- Ideally, a different node for each map / reduce task
Deciding the number of workers
Mapper Parallelization
- A different input split is processed by each mapper
- Input data size is known
- Number of mappers = input size / split size
- If the input consists of many small files, each file is not split further and gets its own mapper, which increases the number of mappers
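- Worked example (illustrative numbers): a 10 GB input with the default 64 MB split size gives roughly 10240 MB / 64 MB = 160 map tasks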
Reducer
- The number of reducers is user-defined, since it is hard to determine automatically
- Keys are partitioned across reducers; partitioning into too many reducers adds overhead in shuffle and sort
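Continuing the driver sketch from earlier (the values are illustrative, not recommendations): the reducer count is set explicitly, while the split size can be capped to influence how many mappers are created.

```java
job.setNumReduceTasks(8);  // illustrative: 8 reduce tasks, tuned to the cluster size

// Cap the split size (in bytes) to get more, smaller map tasks
job.getConfiguration().setLong(
    "mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024);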
Execution Daemons
ResourceManager
- On master: one per cluster
- Responsibility:
- Receives job requests from clients
- Creates an ApplicationMaster per job to manage it
- Allocates containers on slave nodes, with the assigned resources
- Monitors the health of NodeManager nodes
NodeManager
- Responsibility:
- Coordinates the execution of tasks on the node
- Sends health information to the ResourceManager
ApplicationMaster
- Only one per job
- Responsibility: Job allocation and job execution
- Implements a specific framework, for example MapReduce
- Negotiates with the ResourceManager for the resources required
- Decides which node will run which task, in the containers granted by the ResourceManager
- Destroyed when the job completes
Storage: HDFS
Definition
- Hadoop Distributed File System
- This is the storage for Hadoop's input and output
- Features:
- Tailored for MapReduce jobs
- Large block size (64 MB)
- Not a POSIX-compliant file system
Data distribution: a key element of MapReduce
- Job code (jars) moved to where data is stored
- Blocks are replicated on the cluster, by default three times, to ensure reliability
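As an illustration of how block size and replication surface to client code, a minimal sketch using HDFS's Java FileSystem API (the path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
  public static void main(String[] args) throws Exception {
    // Connects to the file system named in the local Hadoop configuration
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path("/user/hadoop/input/part-0")); // hypothetical path
    System.out.println("block size (bytes): " + st.getBlockSize());
    System.out.println("replication factor: " + st.getReplication());
  }
}
```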
Storage daemons
DataNode: many per cluster
- Stores HDFS blocks
- Reports the stored blocks to the NameNode
NameNode: one per cluster
- Keeps the index and location of every block
- Does not perform computation, since that load would be too heavy
- Single point of failure
SecondaryNameNode
- Communicates directly with the NameNode
- Stores a backup of the index table
Data Replication
- Format: CSV, example:

  | Filename | numReplicas | block-ids |
  | -------- | ----------- | --------- |
  | part-0   | r:2         | {1,3}     |
  | part-1   | r:3         | {2,4,5}   |

- Definition: creating and maintaining multiple copies of data across different nodes in HDFS
- Significance
- Fault tolerance
- Data availability
- System Reliability
- Support Parallel Processing
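The replication factor can also be changed per file after it is written; a minimal sketch, reusing the `fs` handle and the hypothetical path from the FileSystem example above:

```java
// Ask HDFS to keep 3 copies of this file; the NameNode creates or
// removes replicas asynchronously to reach the new factor.
fs.setReplication(new Path("/user/hadoop/input/part-0"), (short) 3);
```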
Failure recovery
- Identifying failures: detected by missing heartbeat signals
- Replication: the NameNode initiates re-replication once a node failure is detected
- Maintaining integrity: the predetermined replication factor is restored
- Mitigating Potential Disruptions: Dynamic data management
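A hedged sketch of the HDFS settings involved in failure detection (key names from hdfs-default.xml; the values shown are the shipped defaults, not recommendations):

```java
import org.apache.hadoop.conf.Configuration;

public class FailureDetectionDefaults {
  public static void main(String[] args) {
    // Heartbeat-related settings that govern how quickly a dead DataNode
    // is detected and re-replication of its blocks begins.
    Configuration conf = new Configuration();
    conf.setLong("dfs.heartbeat.interval", 3);                        // DataNode heartbeat period, seconds
    conf.setLong("dfs.namenode.heartbeat.recheck-interval", 300000);  // NameNode recheck period, milliseconds
  }
}
```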
Operation
- TODO: See the graph in p59
- Input data:
- Mappers are assigned input splits from HDFS input path (default 64MB)
- Data locality: the ApplicationMaster tries to assign each mapper to a node where its data is stored
- Output data:
- Copied to HDFS, one file per reducer
- Replicated like any other HDFS data, per the configured replication factor
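As a concrete illustration (file names follow Hadoop's default conventions): a job with two reducers typically leaves part-r-00000 and part-r-00001 in the output directory, plus an empty _SUCCESS marker once the job completes, and each of these files is replicated like any other HDFS file.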