finish map reduce and cdn

2024-12-31 20:24:24 +08:00 · 2024-12-31 20:24:24 +08:00 · 3b7ee26a8c
parent 17f4b90596
commit 3b7ee26a8c
2 changed files with 486 additions and 0 deletions
--- a/4-1-beyond-map-reduce.md
+++ b/4-1-beyond-map-reduce.md
@ -0,0 +1,198 @@
 # Beyond map reduce
 ## In memory processing
 ### Hadoop's problems
 - Batch processing system
    - Does not process streamed data, so results are slower to arrive
 - Designed to process very large data, not suited for many small files
 - Efficient at map stage, bad at IO:
    - Data loaded and written from HDFS
    - Shuffle and sort use a lot of net traffic
 - Job startup and teardown take seconds, regardless of the job size
 - Not a good fit for every case
    - The structure is rigid: Map, Combiner, Shuffle and sort, Reduce
    - No support for iterations
    - Only one sync barrier
    - Bad at e.g. graph processing
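
 The rigid stage order listed above can be illustrated with a minimal single-process sketch of a word count. This is plain Python, not Hadoop's API: the names `map_phase`, `shuffle_and_sort`, `reduce_phase`, and `word_count` are hypothetical, and the combiner stage is left out for brevity.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    # Shuffle and sort: group all values by key, then order the keys.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    # Reduce: collapse all values of one key into a single result.
    return key, sum(values)


def word_count(lines):
    # The stages always run in this fixed order, with a single barrier
    # between map and reduce; the job cannot loop back for another pass.
    mapped = [pair for line in lines for pair in map_phase(line)]
    return [reduce_phase(k, vs) for k, vs in shuffle_and_sort(mapped)]


counts = word_count(["spark and hadoop", "hadoop and hdfs"])
```

 An iterative algorithm would have to run `word_count`-style jobs repeatedly, writing and re-reading intermediate data each time, which is the cost the notes attribute to Hadoop.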
 ### Intro to in memory Processing
 - Definition: load the data into memory before processing starts
 - Advantages:
    - More flexible computation
    - Iteration is supported
    - No slow IO required
 - Disadvantages:
    - Data must fit in the memory of the distributed storage
    - Additional measures are needed for persistence
    - Fault tolerance is mandatory
 - Major frameworks:
    - Apache Spark
    - Graph-centric: Pregel
    - SQL focused read only: Cloudera Impala
 ### Spark
 - Open-source, general-purpose engine for large-scale distributed data
  processing
 - A **cluster computing** platform with an API for distributed programming
 - In memory processing and storage engine
    - Loads data from HDFS or Cassandra
    - Resource management via Spark, EC2, YARN
    - Can work with Hadoop, or standalone
 - Runs locally or on clusters
 - Goal: provide distributed datasets that users can work with as if they were
  local
 - Keeps the strengths of MapReduce:
    - Fault tolerance
    - Data locality
    - Scalability
 - Approach: augment the data flow model with RDDs
 ### RDD: Resilient Distributed Datasets
 - Basic level of abstraction in Spark
 - Distributed memory model: RDDs
    - Immutable collections of data **distributed** across the nodes of the
      cluster
    - A new RDD is created by:
        - Loading data from input
        - Transforming an existing collection to generate a new one
    - Can be saved to HDFS or passed to other programs with an action
 - Operations:
    - Transformation: defines a new RDD from an existing one
        - `map`
        - `filter`
        - `sample`
        - `union`
        - `groupByKey`
        - `reduceByKey`
        - `join`
        - `cache`
    - Action: takes an RDD and returns a result to the driver
        - `reduce`
        - `collect`
        - `count`
        - `save`
        - `lookupKey`
 ### Scala (Not in test)
 - Scala is the native language of Spark
 - Its syntax is similar to Java, with powerful type inference
 ### Spark application
 - Consists of a **driver** that executes various parallel operations on
  **RDDs**, partitioned across cluster
 - The driver runs on a different machine from where the RDDs are created
    - Use an action to retrieve data from an RDD
 - TODO: look at diagram at p20
 - The driver program runs the user's main function and executes parallel
  operations on a cluster
 ### Components
 - The driver program runs the user's main function and executes parallel
  operations on a cluster
    - Applications run as **independent** sets of processes, coordinated by a
      `SparkContext` in the driver
    - The context connects to a cluster manager such as YARN, which allocates
      system resources
 - Work in the cluster is done by **executors**, which are coordinated by the
  `SparkContext`
    - An **executor** is responsible for executing tasks and storing data
 - Deployment depends on the cluster manager used, such as YARN or standalone
  Spark
 ### Computation
 - Using anonymous functions
 - Named functions
 - `map`: creates a new RDD in which each original value is replaced by the
  value the mapping function returns
 - `filter`: creates a new RDD that contains only the values satisfying a
  predicate
 ### Deferred execution
 - Transformations are executed only at the moment their results are needed
 - Only the invocation of an action triggers the execution chain
    - This allows internal optimization, such as combining operations
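
 Deferred execution can be sketched with a toy class that only records transformations and runs them when an action is called. This is plain Python and not Spark's implementation: `LazyDataset` and `square` are hypothetical names, and fusing `map` and `filter` into one pass merely illustrates the kind of optimization the notes mention.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source        # input data, never modified
        self._steps = tuple(steps)   # pending transformations, in order

    def map(self, fn):
        # Transformation: returns a NEW dataset; nothing is computed yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also deferred.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: only now is the whole chain executed, in a single pass.
        result = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                result.append(item)
        return result


calls = []


def square(x):
    # Named function; it logs each call so deferred execution is visible.
    calls.append(x)
    return x * x


numbers = LazyDataset([1, 2, 3, 4])
pipeline = numbers.map(square).filter(lambda v: v > 4)  # anonymous function
calls_before_action = len(calls)  # still 0: no transformation has run
result = pipeline.collect()
```

 Because every transformation returns a new object, `numbers` itself stays unchanged, which mirrors the immutability of RDDs.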
 ### Spark performance
 #### Issues
 - Because Spark allows more freedom, task allocation is much more challenging
 - Errors appear more often and are hard to debug
 - Knowledge of basics of map reduce helps
 #### Tuning
 - Memory: Spark uses more memory
 - Partitioning for RDD
 - Performance implication for each operation
 ### Spark ecosystem
 - GraphX: Graph processing RDD
 - MLlib: machine learning
 - Spark SQL
 - Spark Streaming: **Stream** processing with D-Stream RDDs
 ## Stream processing
 ### Information streams
 - Data continuously generated from various sources
    - Unbounded, and the arrival time is not fixed
 - Process the information the moment it's generated
    - Apply a function to each new element
    - Look for **real time** changes and respond
 ### Apache Storm
 #### Intro
 - Developed by BackType, now an Apache project
 - Real time computation of streams
 - Features
    - Scalable
    - Guarantees no data loss
    - Extremely robust and fault tolerant
    - Programming language agnostic
    - Distributed Stream Processing: tasks distributed across cluster
 ## Discretized Streams
 - Unlike true stream processing, information is processed in micro batches
    - Input -> Spark Streaming -> batched input data -> Spark Engine ->
      Processed batch data
 - In spark:
    - reuse the spark framework
        - Can use spark transformations on RDDs
    - Construct an RDD every few seconds (defined time) to manage data streams
        - New RDD processed at each time slot
 ### DStream RDD
 - Composed of a series of RDDs that represent data over time
 - Choosing the time interval:
    - Small interval: quicker response time, at the cost of more frequent
      batching
 ### DStream Transformations
 - Change each RDD in the stream
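
 The micro-batch idea can be sketched without Spark: timestamped events are grouped into fixed time slots, and a transformation is then applied to every batch. The function name `discretize`, the event tuples, and the 5-second interval are assumptions made for illustration; each inner list stands in for one RDD of the stream.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp_seconds, value) events into fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Every event falls into exactly one time slot.
        slot = int(timestamp // batch_interval)
        batches[slot].append(value)
    # One list per non-empty time slot, oldest first.
    return [batches[slot] for slot in sorted(batches)]


events = [(0.5, "a"), (1.2, "b"), (4.9, "c"), (5.0, "d"), (11.3, "e")]
micro_batches = discretize(events, batch_interval=5)

# A stream transformation is the same operation applied to each batch.
upper_batches = [[value.upper() for value in batch] for batch in micro_batches]
```

 A smaller `batch_interval` yields more, smaller batches, which is the trade-off between response time and batching overhead.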
 ### DStream Streaming Context
 - Create a `StreamingContext` to manage the stream and its transformations; an
  action is needed to collect the results
 ### DStream Sliding windows
 - Some use cases require looking at a set of stream messages to perform a
  computation
 - A sliding window stores a rolling list of the latest items from the stream
    - The contents change over time, with new items added and old items popped
 - Usage in Spark DStream:
    - The API configures the size of the window (seconds) and the frequency of
      computation (seconds)
    - Example: `reduceByKeyAndWindow((a: Int, b: Int) => math.max(a, b), Seconds(60), Seconds(5))`
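
 The window semantics can be sketched in plain Python: a 60-second window is recomputed every 5 seconds, keeping the maximum value per key. This is not Spark's API; `max_by_key_over_window`, `sliding_results`, and the sample events are hypothetical, and the window is assumed to cover the interval `(now - window, now]`.

```python
def max_by_key_over_window(events, now, window_seconds):
    """Return the max value per key among events in (now - window, now]."""
    result = {}
    for timestamp, key, value in events:
        if now - window_seconds < timestamp <= now:
            result[key] = max(value, result.get(key, value))
    return result


def sliding_results(events, window_seconds, slide_seconds, end_time):
    # Recompute the window every slide_seconds; old events drop out of
    # the window as `now` advances.
    return {
        now: max_by_key_over_window(events, now, window_seconds)
        for now in range(slide_seconds, end_time + 1, slide_seconds)
    }


# (timestamp_seconds, key, value)
events = [(2, "cpu", 40), (4, "cpu", 75), (58, "cpu", 60), (63, "mem", 30)]
results = sliding_results(events, window_seconds=60, slide_seconds=5, end_time=65)
```

 At `now = 65` the two early `cpu` readings have left the window, so the maximum for `cpu` falls from 75 to 60.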
--- a/4-2-cdn.md
+++ b/4-2-cdn.md
@ -0,0 +1,288 @@
 # Content Delivery Networks
 ## DNS
 ### Definition:
 - Domain name system
 - Intended use: to translate domain name to IP addresses
 - Other uses: load distribution. A replicated web server has many IPs, and DNS
  can direct each client to the closest replica
 - Distributed system of interconnected servers
    - Centralizing is hard because of the huge traffic volume, the distance to
      clients, and the single point of failure
 - Many applications rely on DNS
 ### Hierarchy
 - Root DNS Server: Root name server
    - First point of contact
    - Directly query authoritative name server
    - Get Domain-name - IP mapping
    - Queried for the IP addresses of the TLD DNS servers
 - TLD (Top Level Domain) DNS servers, for `.com`, `.org`, `.edu`
    - Queried for the IP addresses of the authoritative DNS servers
 - Authoritative DNS Server: Owned by site owner like `amazon.com`
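
 A lookup through the hierarchy can be sketched as three table lookups: root, then TLD, then authoritative. The tables below are toy data; every server name and address is invented for illustration, and real resolution involves DNS messages rather than dictionaries.

```python
# Toy name-server tables; all names and addresses are made up.
ROOT = {"com": "tld-com-server", "org": "tld-org-server"}
TLD = {
    "tld-com-server": {"amazon.com": "ns.amazon.example"},
    "tld-org-server": {},
}
AUTHORITATIVE = {
    "ns.amazon.example": {"www.amazon.com": "192.0.2.10"},
}


def resolve(hostname):
    """Walk root -> TLD -> authoritative; return (ip, servers_contacted)."""
    labels = hostname.split(".")
    tld, domain = labels[-1], ".".join(labels[-2:])
    path = ["root"]
    # The root server only knows which server handles each TLD.
    tld_server = ROOT.get(tld)
    if tld_server is None:
        return None, path
    path.append(tld_server)
    # The TLD server only knows the authoritative server of each domain.
    auth_server = TLD[tld_server].get(domain)
    if auth_server is None:
        return None, path
    path.append(auth_server)
    # The authoritative server holds the actual name-to-address mapping.
    return AUTHORITATIVE[auth_server].get(hostname), path


ip, path = resolve("www.amazon.com")
```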
 ### Local DNS Server:
 - Acts as a client; it is not part of the hierarchy
 - Each ISP (Internet Service Provider) has one
 - Workings:
    - When a host makes a DNS query, it is sent to the local DNS server
    - The server may have a local cache of name-to-address pairs
    - Otherwise it forwards the query to the DNS hierarchy
 ### DNS Caching
 - Once a server learns a mapping, it **caches** it
 - Cache entries time out after a set time (TTL); until then, a cached entry may
  be out of date
 - TLD server addresses are typically cached in local DNS servers, so root name
  servers are not visited often
 - Benefits
    - Reduces network traffic on the **root servers** and **across the internet**
    - Improves performance, because cached DNS responses are much faster
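
 A TTL-based cache can be sketched as follows. `DnsCache`, `lookup`, `fake_upstream`, and the 300-second TTL are assumptions for illustration; the current time is passed in explicitly so the behaviour is easy to follow.

```python
class DnsCache:
    """Name-to-address cache whose entries expire after their TTL."""

    def __init__(self):
        self._entries = {}  # hostname -> (address, expiry_time)

    def put(self, hostname, address, ttl, now):
        self._entries[hostname] = (address, now + ttl)

    def get(self, hostname, now):
        entry = self._entries.get(hostname)
        if entry is None:
            return None
        address, expires_at = entry
        if now >= expires_at:
            # Entry timed out: drop it so the next lookup goes upstream.
            del self._entries[hostname]
            return None
        return address


def lookup(cache, hostname, now, upstream, ttl=300):
    # Serve from the cache when possible; otherwise ask the hierarchy
    # (represented by `upstream`) and cache the answer.
    address = cache.get(hostname, now)
    if address is None:
        address = upstream(hostname)
        cache.put(hostname, address, ttl, now)
    return address


upstream_calls = []


def fake_upstream(hostname):
    # Hypothetical stand-in for a query to the DNS hierarchy.
    upstream_calls.append(hostname)
    return "192.0.2.10"


cache = DnsCache()
first = lookup(cache, "www.amazon.com", now=0, upstream=fake_upstream)
second = lookup(cache, "www.amazon.com", now=100, upstream=fake_upstream)  # hit
third = lookup(cache, "www.amazon.com", now=400, upstream=fake_upstream)   # expired
```

 Three lookups cause only two upstream queries, which is the traffic reduction caching provides. Between `now=0` and `now=300` the cache would keep returning the old address even if the real mapping changed.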
 ## P2P
 ### Definition
 - A **Distributed** network architecture
 - Every node is both the **Client** and the **Server**
 - Advantages:
    - Scalable:
        - As the number of clients increases, the number of servers also
          increases
        - Every node both consumes and donates resources
    - Lower cost: costs are borne at the edge of the network
    - More privacy: No centralized source of data
    - Reliability:
        - Distributed geographically
        - Has Replicas
        - No single point of failure
    - All of the above make it easy to share content
 ### Categories
 - Unstructured:
    - No restriction on overlay structures and data placement
    - Examples:
        - Napster, BitTorrent, FreeNet
 - Structured
    - Uses a Distributed Hash Table, which exposes an interface like
      `put(k, v)` and `get(k)`
    - Has restrictions on overlay structure and data placement
    - Examples:
        - Chord, Pastry and CAN
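
 The `put(k, v)` / `get(k)` interface and the restriction on data placement can be sketched with a simplified hash ring. This is not Chord, Pastry, or CAN: `ToyDht` and the node names are hypothetical, all nodes live in one process, and there is no routing, replication, or node churn.

```python
import hashlib
from bisect import bisect_left


class ToyDht:
    """Each key is stored on the first node clockwise from the key's hash."""

    def __init__(self, node_names):
        # Every node gets a position on the ring derived from its name.
        self._ring = sorted((self._hash(name), name) for name in node_names)
        self._stores = {name: {} for name in node_names}

    @staticmethod
    def _hash(text):
        return int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        # Placement is dictated by the hash, not chosen freely by peers.
        positions = [position for position, _ in self._ring]
        index = bisect_left(positions, self._hash(key)) % len(self._ring)
        return self._ring[index][1]

    def put(self, key, value):
        self._stores[self.node_for(key)][key] = value

    def get(self, key):
        return self._stores[self.node_for(key)].get(key)


dht = ToyDht(["node-a", "node-b", "node-c"])
dht.put("song.mp3", "peer 10.0.0.7")
owner = dht.node_for("song.mp3")
```

 Any peer that knows the hash function can compute `owner` for a key, so no central index is needed to find the data.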
 ### Server Selection
 - For BitTorrent, a Tracker is used, which informs the clients about the peers
  available
    - TODO: See diagram at page 26
 ### Issues with P2P
 - Reliability
 - Performance
 - Control: much of the shared content is copyrighted
 ## Content Delivery Networks
 ### History of Content Delivery
 - Web 1.0: Pre-CDN, Infrastructure development
 - CDN 1.0: First generation of CDN, replication, intelligent routing, edge
  computing
 - CDN 2.0: P2P, Cloud Computing, Energy Awareness
 - CDN 3.0: Autonomic composition
 ### Web Caches
 - The precursor to CDN
 - Improve efficiency by caching
 - Caching proxy:
    - Receive HTTP request from client
    - If object in cache, then send cached content
    - Otherwise request the object from origin server
 - Works as both client and server:
    - Client: request content from origin
    - Server: serve content to downstream client
 - Usually installed by ISP
 - Reason:
    - Reduce response time for client request
    - Reduce traffic across network
 - Problem:
    - Can't serve all web users, since the web is too large
    - Web content is dynamic and customized, so much of it is not cacheable
    - Origin web servers shouldn't rely on downstream caching proxies
    - Origin web servers can't see the real statistics of their site, since
      the user data is not sent to their servers
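
 The caching-proxy behaviour described above can be sketched as follows. `CachingProxy` and `origin` are hypothetical names; the sketch ignores cache expiry, cache size limits, and HTTP cache-control headers.

```python
class CachingProxy:
    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin  # acts as a client towards the origin
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def handle_request(self, url):
        # Acts as a server towards the downstream client.
        if url in self._cache:
            self.hits += 1
            return self._cache[url]
        # Not cached: request the object from the origin and keep a copy.
        self.misses += 1
        body = self._fetch(url)
        self._cache[url] = body
        return body


origin_requests = []


def origin(url):
    # Hypothetical origin server; it only sees requests that miss the cache.
    origin_requests.append(url)
    return f"<content of {url}>"


proxy = CachingProxy(origin)
responses = [proxy.handle_request("/index.html") for _ in range(3)]
```

 The origin sees one request although three were served, which illustrates both the reduced traffic and the statistics problem listed above.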
 ### Definition
 - Also called _Content Distribution Network_
 - **Infra**: large distributed system of servers deployed in multiple data
  centers across the internet
 - **Goal**: distribute content to end users on a large scale with high
  **availability** and high **performance**
 - A mechanism to **replicate** content on multiple servers across the internet,
  giving clients a way to choose a server that can deliver content quickly
 - Content providers are the CDN customers:
    - They pay CDN companies to deliver their content
    - CDN pays ISPs, carriers, and network operators for hosting their servers
 - Usually used by large web platforms
 ### What CDNs do
 - Serve a large fraction of internet content
    - Web objects (Text, JavaScript, graphics)
    - Downloadable objects
    - Applications
    - Streaming media
 - Most of the web uses CDNs
 ### The model
 - TODO: See the slide p41
 ### CDN Deployment
 - A CDN company deploys hundreds of servers around the world, often inside ISP
  networks, so that they are close to users
 - CDN customer side:
    - The CDN replicates the customer's content on its servers
    - When the provider updates content, the CDN updates its servers with the
      new content
 - User side:
    - The user sends a request to the origin server
    - The request is intercepted by the redirection service
    - The redirection service forwards the request to the best CDN server
    - Content is served from the CDN server
 ### Companies
 - Akamai
 - Limelight
 - ChinaCache
 - Edgecast
 ### Benefits
 - Reduce latency to users
 - Reduce load on original server
    - Increase security against Denial of Service Attacks
 - Scalability
 - Cheaper, easier to manage
 - Bypass traffic jams on the web:
    - Requested data is close to clients
    - Avoid bottleneck links
 ### Optimizations in CDN side
 - Content is cached at various locations, for faster access
 - Use data compression
 - Use load balancing to reduce traffic
 - Security features like DDoS protection
 - Use network peering, for shorter data paths
 ### Examples and Usage
 - Netflix:
    - Low latency and high definition media can be played
    - Handles peak traffic
    - Content has consistent quality
 - Alibaba:
    - Rapid page loads for product listing
    - Support large scale events
    - Stability and scalability
 ### CDN Routing
 #### Server Selection
 - Load: To balance load
 - Performance: improve client performance, based on:
    - Geography
    - RTT
    - Throughput
    - Load
 - Any Node Alive: provide fault tolerance
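
 One way the three criteria could be combined is sketched below: drop dead servers, prefer servers that are not overloaded, then pick the lowest RTT. The `Server` fields, the `max_load` threshold of 0.8, and the server names are assumptions for illustration; a real CDN would also weigh geography and throughput.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    rtt_ms: float   # measured round-trip time to the client
    load: float     # fraction of capacity in use, 0.0 to 1.0
    alive: bool = True


def select_server(servers, max_load=0.8):
    # Any Node Alive: dead servers are never candidates.
    candidates = [s for s in servers if s.alive]
    if not candidates:
        return None
    # Load: prefer servers below the threshold, else fall back to any live one.
    not_overloaded = [s for s in candidates if s.load < max_load]
    pool = not_overloaded or candidates
    # Performance: among the remaining servers, pick the lowest RTT.
    return min(pool, key=lambda s: s.rtt_ms)


servers = [
    Server("edge-london", rtt_ms=12, load=0.95),
    Server("edge-paris", rtt_ms=20, load=0.40),
    Server("edge-tokyo", rtt_ms=8, load=0.10, alive=False),
]
chosen = select_server(servers)
```

 Here the nearest server is down and the second nearest is overloaded, so the client is sent to `edge-paris`.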
 #### Ways of redirecting
 - As a part of routing: anycast (a single IP address is shared by many devices
  in multiple locations), cluster, load balancing
    - Pros: transparent to clients, works even when browsers cache failed
      addresses, circumvents many routing problems
    - Cons: little control over server selection, complex, limited scalability,
      and can't recover TCP
 - Part of application: HTTP redirect
    - Pros: application level, so it has more control
    - Cons: adds load and an extra RTT, and is hard to cache
 - Part of naming: DNS
    - Pros: suitable for caching; DNS can redirect to any IP
    - Cons: implemented in the resolver, so the request is for a domain and not
      a URL, and the load behind each resolver's client population is hidden
        - The stats can be estimated
 #### More on DNS redirection
 - DNS redirection is used to redirect client to a nearby server.
 - Based on:
    - Latency to client
    - Load balancing
        - Try to balance client across many servers to avoid hotspot
    - Available servers
 - Process:
    - The client's DNS request arrives at the CDN's name server (see below for
      how it is reached)
    - The DNS request is resolved to a nearby server by the CDN-controlled name
      servers
    - The CDN measures the state of the network in its infrastructure
 - Two types of DNS redirection
    - Full:
        - The origin server is controlled by the CDN
        - Pros: all requests are automatically redirected
        - Cons: may send a lot of traffic to the CDN, so it's expensive
    - Partial:
        - The content provider marks which content the CDN should serve
        - Usually larger objects
        - Refer to images as `<img src=http://cdn.com/foo/bar/img.gif>`
        - When the website is accessed, the CDN serves the marked data
        - Pros: better control
        - Cons: content has to be marked
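
 Marking content for partial redirection can be sketched as a URL rewrite: objects above a size threshold get a CDN hostname, the rest stay on the origin. The function `rewrite_for_cdn`, the 100 kB threshold, and the object sizes are assumptions for illustration; in practice the provider decides what to mark, not necessarily by size.

```python
def rewrite_for_cdn(objects, cdn_base, min_size_bytes):
    """Map each object path to the URL the page should reference."""
    urls = {}
    for path, size in objects.items():
        if size >= min_size_bytes:
            # Marked for the CDN: the hostname is now resolved by the CDN's DNS.
            urls[path] = f"{cdn_base}{path}"
        else:
            # Small objects stay on the origin server (relative URL).
            urls[path] = path
    return urls


page_objects = {"/foo/bar/img.gif": 2_000_000, "/style.css": 4_000}
urls = rewrite_for_cdn(page_objects, "http://cdn.com", min_size_bytes=100_000)
```

 Only the rewritten URL triggers a DNS lookup for the CDN's domain, so only that object is served by a CDN server.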
 ## Deployment
 ### Hosting your stuff
 - Where: rely on measurements
    - Sample popular hostnames on alexa.com
    - Ask DNS from multiple vantage points
    - Categorize by type:
        - Hostnames
        - Files
        - Unpopular
 ### Examples
 - ChinaCache
 ## Future
 ### Challenges
 - Mobile networks: latency to the cell is higher, and the internal network
  structure is opaque
 - Video: large bandwidth
    - 16-30 Mbps compressed
    - Combined, this can reach 25K TBps
    - Even data centers don't have that much
    - Multicast from end systems is a potential solution
 ### CDN 2.0
 - Hybrid CDN: Akamai
 - Cloud Based Video: Netflix
 - Meta CDN: Conviva
 - Virtual CDN: ISP micro-datacenters