Hadoop Architecture
Features
- Designed to run on clusters of commodity PCs
- Scales nearly linearly as nodes are added
- Suitable for local networks or data centers
- Design principles
- Data is distributed around the network
- Computation is sent to data: code is sent to run on nodes
- Basic architecture is master / worker
- Offers the following:
- Redundant, fault tolerant data storage
- Parallel computation framework
- Job coordination
Structure of a MapReduce job
- Job: a program to be executed across the entire dataset
- Packaged as a jar file, with all the code needed
- Job is assigned a cluster-unique ID
- Data attached to the job is replicated across the cluster
- Task: an execution on a slice of data
- Task Attempt: one instance of a task executing locally on a node; failed tasks are retried as new attempts
Job execution flow
- Split the input data into chunks (input splits)
- Assign each chunk to a node (NodeManager)
- Run many mappers
- Shuffle and sort
- Run many reducers
- Results from the reducers form the job output
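To make this flow concrete, here is a minimal word-count sketch using Hadoop's Java MapReduce API (class names and paths are illustrative, not from these notes): the mappers emit (word, 1) pairs, the framework shuffles and sorts them by key, and the reducers sum the counts per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: receives all counts for one word (after shuffle and sort) and sums them
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: packaged into the job jar and submitted to the cluster
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input path
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```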
The Optional Combiner
- The bottleneck in map-reduce frameworks:
- Map and reduce tasks scale close to linearly, so they are not the bottleneck
- The potential bottleneck is the shuffle and sort phase (between map and reduce)
- Data needs to be copied over the network
- Mappers emit a large number of key-value pairs, and sorting them is costly
- A combiner can be configured to run before shuffling and sorting
- Reason
- Acts as a preliminary reducer
- Executed at each mapper node, just before the pairs are sent for shuffling
- Reduces the amount of data emitted by the mapper, improving efficiency
- Restrictions and rules:
- Cannot be mandatory: the job must produce correct results without it
- Idempotent: the number of times the combiner is applied must not change the output
- No side effects, otherwise it won't be idempotent
- Preserves the keys: it must not change keys, which would disrupt the sort order or the partitioning
- Example of using reducer code as the combiner (the driver wiring is sketched after this list):

      public void combine(String key, List<Integer> values) {
          int sum = 0;
          for (Integer count : values) {
              sum += count;
          }
          emit(key, sum);
      }
- TODO review the combiner diagram
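In Hadoop's Java API the wiring is a single line in the job driver; a minimal sketch, assuming the WordCount classes from the job-execution-flow example above:

```java
// Reuse the reducer as the combiner: valid here because summing counts is
// associative and commutative, so applying it zero or more times per key
// does not change the final output.
job.setCombinerClass(IntSumReducer.class);
```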
Apache Hadoop
Architecture of Hadoop
- Executes on nodes connected by network
- Each node runs a set of daemons
- Computing: ResourceManager, NodeManager
- Storage: NameNode, SecondaryNameNode (as backup), DataNode
- Computing: nodes are organized in a master / slave architecture
- Master node: NameNode, ResourceManager
- Aware of slave nodes
- Receives external requests
- Decides the work split among slaves
- Notifies slaves
- Slave node, also called worker node: DataNode, NodeManager
- Executes the tasks received from the master
What Hadoop Does
- Resource management: tracks the existence and availability of resources
- Job allocation: determines the resources needed for a job and how the work is split
- Job execution: runs the job, makes sure it completes, and deals with failures
Job execution: YARN
Intro
- Estimates how many map and reduce tasks are needed for a job, based on the input dataset and the job definition
- Ideally, a different node for each map / reduce task
Deciding the number of workers
Mapper Parallelization
- A different input split is processed by each mapper
- Input data size is known
- Number of mappers = input size / split size
- If the input consists of many small files, each file is not split further and gets its own mapper, which increases the number of mappers
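- Worked example (illustrative numbers): a 10 GB input with the default 64 MB split size gives roughly 10240 MB / 64 MB = 160 map tasks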
Reducer
- The number of reducers is user-defined, since it is hard to determine automatically
- Keys are partitioned across reducers; partitioning into too many reducers adds overhead in shuffle and sort
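Continuing the driver sketch from earlier (the values are illustrative, not recommendations): the reducer count is set explicitly, while the split size can be capped to influence how many mappers are created.

```java
job.setNumReduceTasks(8);  // illustrative: 8 reduce tasks, tuned to the cluster size

// Cap the split size (in bytes) to get more, smaller map tasks
job.getConfiguration().setLong(
    "mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024);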
Execution Daemons
ResourceManager
- On master: one per cluster
- Responsibility:
- Receives job requests from clients
- Creates an ApplicationMaster per job to manage it
- Allocates containers on slave nodes, with the assigned resources
- Monitors the health of NodeManager nodes
NodeManager
- Responsibility:
- Coordinates the execution of tasks on the node
- Sends health information to the ResourceManager
ApplicationMaster
- Only one per job
- Responsibility: Job allocation and job execution
- Implements a specific framework, for example MapReduce
- Negotiates with the ResourceManager for the resources required
- Decides which node will run which task, in the containers granted by the ResourceManager
- Destroyed when the job completes
Storage: HDFS
Definition
- Hadoop Distributed File System
- This is the storage for Hadoop's input and output
- Features:
- Tailored for MapReduce jobs
- Large block size (64 MB)
- Not a POSIX-compliant file system
Data distribution: a key element of MapReduce
- Job code (jars) moved to where data is stored
- Blocks are replicated on the cluster, by default three times, to ensure reliability
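As an illustration of how block size and replication surface to client code, a minimal sketch using HDFS's Java FileSystem API (the path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
  public static void main(String[] args) throws Exception {
    // Connects to the file system named in the local Hadoop configuration
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path("/user/hadoop/input/part-0")); // hypothetical path
    System.out.println("block size (bytes): " + st.getBlockSize());
    System.out.println("replication factor: " + st.getReplication());
  }
}
```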
Storage daemons
DataNode: many per cluster
- Stores HDFS blocks
- Reports the stored blocks to the NameNode
NameNode: one per cluster
- Keeps the index and location of every block
- Does not perform computation, since that load would be too heavy
- Single point of failure
SecondaryNameNode
- Communicates directly with the NameNode
- Stores a backup of the index table
Data Replication
- Format: CSV, example:

  | Filename | numReplicas | block-ids |
  | -------- | ----------- | --------- |
  | part-0   | r:2         | {1,3}     |
  | part-1   | r:3         | {2,4,5}   |

- Definition: creating and maintaining multiple copies of data across different nodes in HDFS
- Significance
- Fault tolerance
- Data availability
- System Reliability
- Support Parallel Processing
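The replication factor can also be changed per file after it is written; a minimal sketch, reusing the `fs` handle and the hypothetical path from the FileSystem example above:

```java
// Ask HDFS to keep 3 copies of this file; the NameNode creates or
// removes replicas asynchronously to reach the new factor.
fs.setReplication(new Path("/user/hadoop/input/part-0"), (short) 3);
```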
Failure recovery
- Identifying failures: detected by missing heartbeat signals
- Replication: the NameNode initiates re-replication once a node failure is detected
- Maintaining integrity: the predetermined replication factor is restored
- Mitigating Potential Disruptions: Dynamic data management
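A hedged sketch of the HDFS settings involved in failure detection (key names from hdfs-default.xml; the values shown are the shipped defaults, not recommendations):

```java
import org.apache.hadoop.conf.Configuration;

public class FailureDetectionDefaults {
  public static void main(String[] args) {
    // Heartbeat-related settings that govern how quickly a dead DataNode
    // is detected and re-replication of its blocks begins.
    Configuration conf = new Configuration();
    conf.setLong("dfs.heartbeat.interval", 3);                        // DataNode heartbeat period, seconds
    conf.setLong("dfs.namenode.heartbeat.recheck-interval", 300000);  // NameNode recheck period, milliseconds
  }
}
```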
Operation
- TODO: See the graph in p59
- Input data:
- Mappers are assigned input splits from HDFS input path (default 64MB)
- Data locality: the ApplicationMaster tries to assign each mapper to a node where its data is stored
- Output data:
- Copied to HDFS, one file per reducer
- Replicated like any other HDFS data, per the configured replication factor
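As a concrete illustration (file names follow Hadoop's default conventions): a job with two reducers typically leaves part-r-00000 and part-r-00001 in the output directory, plus an empty _SUCCESS marker once the job completes, and each of these files is replicated like any other HDFS file.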