# Principles of Map Reduce

## Big Data

- Definition: collecting and processing (interpreting) a large amount of data for our own benefit.
- Key features:
  - Volume: petabyte scale
  - Variety: structured, semi-structured, unstructured
  - Velocity: social media, sensors, high-throughput streams
  - Veracity: unclean, imprecise, unclear

## Distributed System

- Simple definition: any system that works across many computers
- Features:
  - Concurrency
  - No guarantee on the sequence of events
  - No guarantee on reliability
  - Better price/performance ratio
- Challenges:
  - Assigning workers
  - Failure handling
  - Exchanging results
  - Synchronization
- Examples: [link](/1-1-intro.md#examples)

## Map Reduce

### Parallelism

#### Intro

- Parallelism: a number of processors working together to calculate or solve a problem
  (see the [earlier slides](/1-4-scalability.md))
- The calculation is divided into tasks and sent to different processors
  - These can be different cores or different machines
- Challenges:
  - The problem can be hard to divide
  - The subtasks may need results from each other
  - To reason about this, we can rank tasks by how hard they are to parallelize
    - Easy: image processing (see the sketch below)
    - Hard: path search

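As an illustration of the "easy" case, here is a minimal sketch (my own example, not from the slides) that brightens chunks of pixel values on a `multiprocessing` pool; the chunks are independent, so no worker needs another worker's result:

```python
# An "easy to parallelize" task: brighten a flat list of pixel values by
# processing independent chunks on a pool of worker processes.
from multiprocessing import Pool


def brighten(chunk):
    """Each worker brightens its own chunk; chunks do not depend on each other."""
    return [min(value + 40, 255) for value in chunk]


if __name__ == "__main__":
    pixels = list(range(256)) * 100          # stand-in for image data
    chunk_size = 1024
    chunks = [pixels[i:i + chunk_size] for i in range(0, len(pixels), chunk_size)]

    with Pool(processes=4) as pool:
        results = pool.map(brighten, chunks)  # one task per chunk, run in parallel

    brightened = [value for chunk in results for value in chunk]
    print(len(brightened), max(brightened))
```
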
#### Benefits

- Solve larger or more complex problems
  - Problems where serial computation would take too long, because they are too large or too complex
  - Also cheaper, since it does not require a single very powerful computer
- Serial computation, a retrospective:
  - FDE cycle: one instruction at a time is fetched, decoded and executed, following the Von Neumann model
  - As processor speed increases, more instructions can be executed in the same amount of time
  - We have reached the limit of single-processor speed, where it is too hard to make one processor much faster
  - So we now add more cores instead

#### Real life examples

- Simulation: where testing directly is difficult
- Prediction: challenging in algorithm design and implementation
- Data analysis: Google search, biologists working on DNA, marketers working on Twitter

#### Platforms for parallel computing

- Spreadsheets, but they can't handle true big data
- R language: powerful statistical features
- Python: general purpose, with R-like add-ons
- Databases
  - RDBMS: Relational Database Management Systems
  - NoSQL key-value stores, like Amazon DynamoDB
  - Document databases: MongoDB
  - Graph databases: Neo4j, Giraph
- Specialized languages and frameworks
  - MapReduce / Hadoop
  - Spark
- Data preparation and visualization

### Introduction to the Map Reduce Model

#### Example problem

- Counting words
  - Input: text
  - Output: a list of key-value pairs: word - count

#### Serial solution

- Split the text into lines
- Iterate over each line and count each word (see the sketch below)

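A minimal serial sketch (my own illustration, not the lab code), assuming the whole text fits in memory:

```python
# Serial word count: a single loop over all lines, one shared dictionary.
from collections import Counter


def count_words(text):
    counts = Counter()
    for line in text.splitlines():
        for word in line.lower().split():
            counts[word] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = "the quick brown fox\njumps over the lazy dog\nthe end"
    print(count_words(sample))
    # e.g. {'the': 3, 'quick': 1, 'brown': 1, ...}
```
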
#### Parallelized solution using Divide and Conquer

- Split sentences or lines into words
- Count words on separate machines, in parallel
- Partition the problem, then combine the partial results (see the sketch below)
- Challenges
  - Assigning work units to workers
  - Having more work units than workers
  - Workers may need partial results shared by other workers
  - Aggregating partial results
  - Knowing when workers have finished or stopped

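A minimal divide-and-conquer sketch (my own illustration), assuming the partial counts fit in memory on one machine:

```python
# Divide and conquer word count: partition the lines, count each partition
# in parallel, then combine (reduce) the partial results.
from collections import Counter
from functools import reduce
from multiprocessing import Pool


def count_chunk(lines):
    """Each worker counts the words in its own chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


if __name__ == "__main__":
    text = "the quick brown fox\njumps over the lazy dog\nthe end"
    lines = text.splitlines()

    # Partition the problem: one chunk of lines per worker.
    n_workers = 2
    chunks = [lines[i::n_workers] for i in range(n_workers)]

    with Pool(processes=n_workers) as pool:
        partial_counts = pool.map(count_chunk, chunks)

    # Combine the partial results into one final count.
    total = reduce(lambda a, b: a + b, partial_counts, Counter())
    print(dict(total))
```
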
### Map Reduce

#### Definitions

- A parallel programming model, and its associated implementation, that scales well and parallelizes automatically
- Features:
  - Used by big companies
  - The user specifies the map function and the reduce function
  - The runtime automatically parallelizes the computation
  - The runtime handles failures, communication and performance issues
  - **Not** suitable for every possible algorithm
- "MapReduce" can also refer to the runtime (execution framework), or to a specific code implementation

#### Pattern

- Input: usually read from files
- Map task: emit intermediate key-value pairs
- Reduce task: process and reduce the key-value pairs
- Output: can be written to the file system

#### Usage

- Google: index building for search, article clustering, statistical machine translation
- Yahoo: index building for search, spam detection for mail
- Facebook: data mining, advertisement, spam

#### History

- Inspired by functional programming languages like Lisp (see the sketch below):
  - `map()`: apply a function to each individual value of an array, producing a modified array
  - `reduce()`: combine the values of an array into a single reduced value

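For intuition, the same two ideas with Python's built-in `map()` and `functools.reduce()` (my own example, standing in for the Lisp originals):

```python
# map(): apply a function to every element; reduce(): fold the results into one value.
from functools import reduce

numbers = [1, 2, 3, 4]
squared = list(map(lambda x: x * x, numbers))    # [1, 4, 9, 16]
total = reduce(lambda acc, x: acc + x, squared)  # 30
print(squared, total)
```
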
#### Model

- Input data is partitioned into processable chunks
- Map: called on every item in the input, and **emits** a series of **intermediate** key-value pairs, i.e. list(k_i, v_i)
  - One map job per chunk; these can run in parallel
- Reduce: called on every unique key and its list of values (k_i, list(v_i)), and emits a value to the output ((k_i, v_i'), or simply v_i')
  - One reduce job per distinct key emitted by the mappers; these can run in parallel
- Procedure: nodes work on the **map** jobs first; after map is complete, they **synchronize** to aggregate the intermediate values by output key, then run **reduce** (see the sketch below)

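A minimal single-process sketch of this model (my own illustration, not a real framework): apply the map function to every chunk, group the intermediate pairs by key, then apply the reduce function to each key's value list.

```python
# Toy MapReduce engine: map -> group by key -> reduce, all in one process.
from collections import defaultdict


def run_mapreduce(chunks, map_fn, reduce_fn):
    # Map phase: one map call per chunk, each emitting (key, value) pairs.
    intermediate = defaultdict(list)
    for chunk in chunks:
        for key, value in map_fn(chunk):
            intermediate[key].append(value)   # aggregate values by key

    # Reduce phase: one reduce call per distinct key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


def map_words(line):
    """Emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.lower().split()]


def reduce_counts(word, counts):
    """Sum the 1s emitted for this word."""
    return sum(counts)


if __name__ == "__main__":
    lines = ["the quick brown fox", "jumps over the lazy dog", "the end"]
    print(run_mapreduce(lines, map_words, reduce_counts))
    # e.g. {'the': 3, 'quick': 1, ...}
```
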
#### Example

- TODO: review the lab code on word count

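Until the lab code is reviewed, here is a sketch in the style of Hadoop Streaming (my own illustration, not the lab code): the mapper emits one `word\t1` line per word, and the reducer sums the counts from its key-sorted input.

```python
# mapper_reducer.py -- word count in the Hadoop Streaming style (sketch).
# Local emulation of the shuffle/sort step:
#   cat input.txt | python3 mapper_reducer.py map | sort | python3 mapper_reducer.py reduce
import sys


def mapper():
    # Emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.lower().split():
            print(f"{word}\t1")


def reducer():
    # Input is sorted by key, so all counts for a word arrive together.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Locally, the shuffle/sort step can be emulated with the shell pipeline in the header comment; on a cluster, the streaming runtime would drive the two phases itself.
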
#### Benefits

- High-level abstraction
- The framework is efficient, leading to good performance
- Scalability is close to linear, since map and reduce are fully parallelized
- Reduces the complexity of parallel programming

#### Running

- Shuffle (moving data from the mappers to the reducers) and sort (ordering the outputs before they are processed by a reducer); a sketch follows after this list
  - Shuffle:
    - Every intermediate key-value pair generated by a mapper is **collected**
    - Pairs are **partitioned**, and each partition is sorted by key
    - The data is shuffled and sent to each reducer
  - Sort:
    - Each reducer copies data from the mappers
    - The downloaded data is **merged** and **sorted** as input for the reducer: per key, a list of values, sorted by key
- Runtime
  - Partitions the input
  - Schedules executions
  - Handles load balancing
  - Shuffles, partitions and sorts the intermediate data
  - Handles failures
  - Manages IPC (Inter-Process Communication)
- TODO: see page 66 for a diagram example

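A minimal sketch of shuffle and sort (my own illustration; a real framework does this across the network and on disk): intermediate pairs are partitioned among reducers by hashing the key, and each partition is then sorted and grouped by key.

```python
# Shuffle and sort, in miniature: hash-partition intermediate pairs across
# reducers, then sort each partition by key and group the values.
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 3

# Intermediate (key, value) pairs as emitted by the mappers.
intermediate = [("the", 1), ("fox", 1), ("the", 1), ("dog", 1), ("the", 1)]

# Shuffle: assign each pair to a reducer partition based on its key.
partitions = [[] for _ in range(NUM_REDUCERS)]
for key, value in intermediate:
    partitions[hash(key) % NUM_REDUCERS].append((key, value))

# Sort: each reducer sorts its partition by key and groups the values,
# so every reduce call sees (key, [values...]).
for reducer_id, partition in enumerate(partitions):
    partition.sort(key=itemgetter(0))
    for key, pairs in groupby(partition, key=itemgetter(0)):
        values = [value for _, value in pairs]
        print(f"reducer {reducer_id}: {key} -> {values}")
```
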
#### Implementations

- Google MapReduce: proprietary and private
- Hadoop: open source, written in Java, first used by Yahoo, then merged into the Apache project
  - Used by most major companies: Amazon, Facebook, Google, IBM, last.fm, Yahoo
- Custom implementations