EBU6502_cloud_computing_notes/3-1-map-reduce.md
2024-12-30 17:26:57 +08:00

6 KiB

Principles of Map Reduce

Big Data

  • Definitions: collecting and processing (interpretation) a large amount of data, to our own benefits.
  • Key Features:
    • Volume: scale of petabytes
    • Variety: Structured, semi-structured, unstructured
    • Velocity: Social media, sensor, high throughput
    • Veracity: Unclean, Imprecise, Unclear

Distributed System

  • Simple definition: any system that works across many computers
  • Features:
    • Concurrency
    • No guarantee on sequence of events
    • No guarantee on reliability
    • Price Performance ratio is better
  • Challenges
    • Assigning workers
    • Failure handling
    • Exchanging results
    • Synchronization
  • Examples: link

Map Reduce

Parallelism

Intro

  • number of processors working together to calculate or solve a problem earlier slides
  • Calculation will be divided into tasks, and sent to different processors
    • Can be different cores or machines
  • Challenges
    • Hard to divide
    • The subtasks may use results from each other
  • To solve this, we can rank task based on the difficulty to parallelize
    • Easy: Image processing
    • Hard: Path Search

Benefits

  • Solve larger or complex problems
  • Serial computation will take too long: too large or complex
  • Also cheaper, since not less powerful computer it requires
  • Serial computation, a retrospective:
    • FDE cycle: one instruction is fetched, decoded and executed, based on Von Neumann
    • As processor speed increases, more instructions can be executed at the same time
    • We have reached the limit of single processor processing, where it's too hard to make a processor much faster
    • We now have more cores

Real life examples

  • Simulation: where directly testing is difficult
  • Prediction: challenging in algorithm design and implementation
  • Data Analysis: Google search, biologist on DNA, Marketers on twitter

Platforms for parallel computing

  • Spreadsheets, but can't handle true big data
  • R language: powerful statistical features
  • Python: General purpose with R like addons
  • Databases
    • RDBMS, Relational Database Management Systems
    • NoSQL: Key value storage, like Amazon DynamoDB
    • Document databases: MongoDB
    • Graph databases: Neo4j, Giraph
  • Specialized language
    • MapReduce / Hadoop
    • Spark
  • Data preparation and visualization

Introduction to the Map Reduce Model

Example problem

  • Counting words
  • input: text
  • output: key-pair lists: word - count

Serial solution

  • Split into lines
  • Iterate over each line, and count each word

Parallelized solution using Divide and Conquer

  • Split sentences or lines into words
  • Count words on separate machines, in parallel
  • Partition the problem and combine the result
  • Challenges
    • Assigning work units to workers
    • Having more work units than workers
    • Shared partial results may be needed by workers
    • Aggregate partial results
    • Knowing workers finished or stopped

Map Reduce

Definitions

  • A parallel programming model, and associated implementation, that scales well and has auto parallelization
  • Features:
    • Used by big companies
    • User specify the map function and reduce function
    • Runtime automatically parallelize the computation
    • Runtime handles failures, communication and perf issues
    • Not suitable for every possible algorithms
  • Can also refer to the runtime (execution framework), or the specific code implementation

Pattern

  • Input: usually from file
  • Map task: aggregate key-value pairs
  • Reduce task: Process and reduce the key-value pairs
  • Output: can write to file system

Usage

  • Google: Index building for search, Article clustering, Statistical machine translation
  • Yahoo: Index building for search, spam detection for mail
  • Facebook: Data Mining, Advertisement, Spam

History

  • Inspired by functional programming like Lisp:
    • map(): apply function to each individual value of a array, to modify the array
    • reduce(): combine the values from an array to get a reduced number

Model

  • Input data is partitioned into processable chunks
  • Map: Called on every item in input, and emit a series of intermediate pairs (list(ki, vi))
    • One map job per chunk, can be parallelized
  • Reduce: Called on every unique key (ki, list(vi)), and it's value list, to emit a value to the output (ki, vi' or simply vi')
    • One reduce job for each distinct key emitted by mapper, can be parallelized
  • Procedure: nodes work on map job first, after map is completed, they synchronize to aggregate the intermediate values by output key, then run reduce

Example

  • TODO: review the lab code on word count

Benefits

  • High level abstraction
  • Framework is efficient, leading to good performance
  • Scalability is close to linear, since map and reduce is fully parallelized
  • Reduces the complexity of parallel programming

Running

  • Shuffle, moving data from mappers to reducers and sort, ordering outputs before being processed by reducer
    • Shuffle:
      • Every intermediate key-value pairs generated by mapper are collected
      • Pairs are partitioned to a list of values, each partition sorted by key
      • Data is shuffled and sent to each reducer
    • Sort:
      • Reducer copy data from mapper
      • Downloaded data are merged and sorted as input for reducer, key-list of values, sorted by keys
  • Runtime
    • Partitions input
    • Schedules executions
    • Handles load balancing
    • Shuffle, partition, sort intermediate data
    • Handle failure
    • Manage IPC (Inter Process Communication)
  • TODO: see page 66 for diagram example

Implementations

  • Google MapReduce: proprietary and private
  • Hadoop: Open Source, written in Java, first used by Yahoo, then merged into Apache project.
    • Used by most major companies: Amazon, Facebook, Google, IBM, last.fm, Yahoo
  • Custom implementation