finish map reduce and cdn

2024-12-31 20:24:24 +08:00 · 2024-12-31 20:24:24 +08:00 · 3b7ee26a8c
parent 17f4b90596
commit 3b7ee26a8c
2 changed files with 486 additions and 0 deletions
--- a/4-1-beyond-map-reduce.md
+++ b/4-1-beyond-map-reduce.md
@@ -0,0 +1,198 @@
+# Beyond map reduce
+
+## In memory processing
+
+### Hadoop's problems
+
+- Batch processing system
+    - Does not process streamed data, so results arrive more slowly than with
+      stream processing
+- Designed to process very large data, not suited for many small files
+- Efficient at map stage, bad at IO:
+    - Data is loaded from and written to HDFS
+    - Shuffle and sort use a lot of network traffic
+- Job startup and finish take seconds, regardless of the job size
+- Not a good fit for every case
+    - The structure is rigid: Map, Combiner, Shuffle and sort, Reduce
+    - No support for iterations
+    - Only one sync barrier
+    - Bad at e.g. graph processing
+
+### Intro to in memory Processing
+
+- Definition: load the data into memory before processing starts
+- Advantages:
+    - More flexible computation
+    - Iteration is supported
+    - No slow IO required
+- Disadvantages:
+    - Data must fit in the distributed memory of the cluster
+    - Need additional measures for persistence
+    - Fault tolerance is mandatory
+- Major frameworks:
+    - Apache Spark
+    - Graph-centric: Pregel
+    - SQL-focused, read-only: Cloudera Impala
+
+### Spark
+
+- Open-source, general-purpose engine for large-scale distributed data
+  processing
+- A **cluster computing** platform with an API for distributed programming
+- In-memory processing and storage engine
+    - Loads data from HDFS, Cassandra
+    - Resource management via Spark, EC2, YARN
+    - Can work with Hadoop, or standalone
+- Runs locally or on clusters
+- Goal: to provide distributed datasets that users can use as if they were
+  local
+- Keeps the strengths of MapReduce:
+    - Fault tolerance
+    - Data locality
+    - Scalability
+- Approach: augment the data flow model with RDDs
+
+### RDD: Resilient Distributed Datasets
+
+- Basic level of abstraction in Spark
+- Distributed memory model: RDDs
+    - Immutable collections of data **distributed** across the nodes of the
+      cluster
+    - A new RDD is created by:
+        - Loading data from input
+        - Transforming an existing collection to generate a new one
+    - Can be saved to HDFS or passed to other programs with an action
+- Operations:
+    - Transformation: Define new RDD from existing one
+        - `map`
+        - `filter`
+        - `sample`
+        - `union`
+        - `groupByKey`
+        - `reduceByKey`
+        - `join`
+        - `cache`
+    - Action: Take RDD and return a result to driver
+        - `reduce`
+        - `collect`
+        - `count`
+        - `save`
+        - `lookupKey`
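The split between transformations and actions can be illustrated without a cluster. The sketch below is plain Python over local lists, not the Spark API: `reduce_by_key` is a hypothetical helper that only mimics what `reduceByKey` does to key-value pairs, and the input lines are made up.

```python
from functools import reduce


def reduce_by_key(pairs, fn):
    """Merge the values of each key with fn, mimicking `reduceByKey`."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return list(merged.items())


lines = ["spark makes rdds", "rdds are immutable", "spark is fast"]
words = [w for line in lines for w in line.split()]

# Transformation-style steps: each one builds a new collection from an old one.
pairs = list(map(lambda w: (w, 1), words))
counts = reduce_by_key(pairs, lambda a, b: a + b)

# Action-style step: collapse the collection into one value for the driver.
total = reduce(lambda a, b: a + b, (n for _, n in counts))
```

In Spark the intermediate collections would be RDDs partitioned across the cluster, and only the final value would travel back to the driver.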
+
+### Scala (Not in test)
+
+- Scala is the native language for Spark
+- Similar syntax to Java, but has powerful type inference
+
+### Spark application
+
+- Consists of a **driver** that executes various parallel operations on
+  **RDDs**, partitioned across the cluster
+- The driver is on a different machine from the ones where the RDDs are created
+    - Use an action to retrieve data from an RDD
+- TODO: look at diagram at p20
+- The driver program runs the user's main function and executes parallel
+  operations on a cluster
+
+### Components
+
+- The driver program runs the user's main function and executes parallel
+  operations on a cluster
+    - Applications run as **independent** sets of processes, coordinated by a
+      `SparkContext` in the driver
+    - The context connects to a cluster manager such as YARN, which allocates
+      system resources
+- Work on the cluster is done by **executors**, which are managed by the
+  `SparkContext`
+    - Each **executor** is responsible for executing tasks and storing data
+- Deployment depends on the cluster manager used, such as YARN or standalone
+  Spark
+
+### Computation
+
+- Using anonymous functions
+- Named functions
+- `map`: creates a new RDD in which each value is replaced by the value the
+  function returns
+- `filter`: creates a new RDD that keeps only the values for which the function
+  returns true, so it has fewer values
+
+### Deferred execution
+
+- Transformations are executed only at the moment their results are needed
+- Only the invocation of an action triggers the execution chain
+    - This allows internal optimization: operations can be combined
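Deferred execution can be sketched with a toy class. This is an illustration only: `LazyDataset` is a hypothetical stand-in for an RDD, not Spark code. Transformations just record a step, and the `collect` action runs all recorded steps in one pass over the data.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self.source = source   # iterable holding the input data
        self.steps = steps     # tuple of pending (kind, function) pairs

    def map(self, fn):
        # Transformation: returns a new dataset, nothing is computed yet.
        return LazyDataset(self.source, self.steps + (("map", fn),))

    def filter(self, fn):
        # Transformation: also only recorded.
        return LazyDataset(self.source, self.steps + (("filter", fn),))

    def collect(self):
        # Action: every pending step is applied in a single pass.
        out = []
        for item in self.source:
            keep = True
            for kind, fn in self.steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def square(x):
    calls.append(x)   # record when the function really runs
    return x * x


pipeline = LazyDataset([1, 2, 3, 4]).map(square).filter(lambda x: x > 4)
calls_before_action = list(calls)   # still empty: no action has run yet
result = pipeline.collect()         # the action triggers the whole chain
```

Because `map` and `filter` are fused into one loop, no intermediate list is built, which is the kind of optimization that deferring execution makes possible.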
+
+### Spark performance
+
+#### Issues
+
+- Because Spark gives more freedom, task allocation is much more challenging
+- Errors appear more often, and are hard to debug
+- Knowledge of basics of map reduce helps
+
+#### Tuning
+
+- Memory: Spark uses more memory
+- Partitioning for RDD
+- Performance implication for each operation
+
+### Spark ecosystem
+
+- GraphX: Graph processing RDD
+- MLlib: machine learning
+- Spark SQL
+- Spark Streaming: **Stream** processing with D-Stream RDDs
+
+## Stream processing
+
+### Information streams
+
+- Data continuously generated from various sources
+    - Unbounded, and the arrival time is not fixed
+- Process the information the moment it's generated
+    - Apply a function to each new element
+    - Look for **real time** changes and respond
+
+### Apache Storm
+
+#### Intro
+
+- Developed by BackType, now an Apache project
+- Real time computation of streams
+- Features
+    - Scalable
+    - Guarantees no data loss
+    - Extremely robust and fault tolerant
+    - Programming language agnostic
+    - Distributed Stream Processing: tasks distributed across cluster
+
+## Discretized Streams
+
+- Unlike true stream processing, information is processed in micro-batches
+    - Input -> Spark Streaming -> batched input data -> Spark Engine ->
+      Processed batch data
+- In Spark:
+    - Reuses the Spark framework
+        - Can use Spark transformations on RDDs
+    - Constructs an RDD every few seconds (defined time) to manage data streams
+        - New RDD processed at each time slot
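Micro-batching can be sketched in plain Python. This is not Spark Streaming itself: `micro_batches` is a hypothetical helper, the events are made-up `(timestamp, value)` pairs, and each resulting list stands in for the RDD built for one time slot.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into batches of `interval` seconds."""
    batches = defaultdict(list)
    for timestamp, value in events:
        slot = int(timestamp // interval)   # index of the time slot
        batches[slot].append(value)
    # One batch per non-empty time slot, in time order.
    return [batches[slot] for slot in sorted(batches)]


events = [(0.5, 3), (1.2, 7), (2.1, 1), (2.9, 4), (5.0, 9)]
batches = micro_batches(events, interval=2)

# The same batch operation is applied to every time slot.
batch_sums = [sum(batch) for batch in batches]
```

One simplification to note: this sketch skips time slots in which nothing arrived, whereas a real micro-batch engine still ticks on every interval.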
+
+### DStream RDD
+
+- Composed of a series of RDDs that represent data over time
+- Choosing the time interval:
+    - Small interval: quicker response time, at the cost of frequent batching
+
+### DStream Transformations
+
+- A transformation is applied to each RDD in the stream
+
+### DStream Streaming Context
+
+- Create a `StreamingContext` to manage the stream and its transformations; an
+  action is needed to collect the results
+
+### DStream Sliding windows
+
+- Some use cases need to look at a set of stream messages to perform a computation
+- A sliding window stores a rolling list of the latest items from the stream
+    - The contents change over time, with new items added and old items dropped
+- Using in Spark DStream:
+    - The API configures the window size (seconds) and the frequency of computation (seconds)
+    - Code: `reduceByKeyAndWindow((a, b) => math.max(a, b), Seconds(60), Seconds(5))`
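The window and slide parameters can be illustrated with a small Python sketch. It is an assumption-laden simplification rather than the Spark API: `windowed_max` is a hypothetical helper, and the window and slide are counted in batches instead of seconds.

```python
from collections import deque


def windowed_max(batches, window, slide):
    """Return the max over the last `window` batches, computed every `slide` batches."""
    recent = deque(maxlen=window)   # the oldest batch drops off automatically
    results = []
    for i, batch in enumerate(batches, start=1):
        recent.append(batch)
        if i % slide == 0:
            values = [v for b in recent for v in b]
            results.append(max(values) if values else None)
    return results


batches = [[3, 7], [1, 4], [9], [2], [5, 6], [0]]
maxima = windowed_max(batches, window=3, slide=2)
```

With a window of 3 batches and a slide of 2, consecutive windows overlap by one batch, so an item can contribute to more than one result.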
--- a/4-2-cdn.md
+++ b/4-2-cdn.md
@@ -0,0 +1,288 @@
+# Content Delivery Networks
+
+## DNS
+
+### Definition:
+
+- Domain name system
+- Intended use: to translate domain names to IP addresses
+- Other uses: load distribution: a replicated web server has many IPs, and DNS
+  can direct each client to the closest replica
+- A distributed system of interconnected servers
+    - Centralizing is hard because of the huge traffic volume, the distance to
+      clients, and the single point of failure
+- Many applications rely on DNS
+
+### Hierarchy
+
+- Root DNS Server: Root name server
+    - First point of contact
+    - Directly query authoritative name server
+    - Gets the domain-name-to-IP mapping
+    - Is queried for the IP addresses of the TLD DNS servers
+- TLD (Top Level Domain) `.com`, `.org`, `.edu` DNS server
+    - Is queried for the IP address of the authoritative DNS server
+- Authoritative DNS Server: Owned by site owner like `amazon.com`
+
+### Local DNS Server:
+
+- Acts as a client, and is not a part of the hierarchy
+- Each ISP (Internet Service Provider) has one
+- Workings:
+    - When a host makes a DNS query, it's sent to the local DNS server
+    - The server may have a local cache of name-to-address pairs
+    - Otherwise forward the query to the DNS hierarchy
+
+### DNS Caching
+
+- Once the server knows about the mapping, it is **cached**
+- Cache entries time out after a set time (TTL); until then, a cached entry may
+  be out of date
+- TLD server addresses are typically cached in local DNS servers, so root name
+  servers are not visited often
+- Benefits
+    - Reduces network traffic on **root servers** and **across the internet**
+    - This increases network performance because DNS responses are much faster
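The effect of the TTL can be sketched with a toy resolver. Everything here is hypothetical: `CachingResolver` stands in for a local DNS server, a dictionary lookup stands in for the DNS hierarchy, and the clock is passed in explicitly so the behaviour is easy to follow.

```python
class CachingResolver:
    """Toy local DNS server: answers from cache until an entry's TTL expires."""

    def __init__(self, upstream, ttl):
        self.upstream = upstream    # function name -> address (stands in for the hierarchy)
        self.ttl = ttl              # seconds an entry stays valid
        self.cache = {}             # name -> (address, expiry time)
        self.upstream_queries = 0

    def resolve(self, name, now):
        entry = self.cache.get(name)
        if entry and now < entry[1]:
            return entry[0]                      # cache hit
        self.upstream_queries += 1               # miss or expired: ask upstream
        address = self.upstream(name)
        self.cache[name] = (address, now + self.ttl)
        return address


records = {"example.com": "192.0.2.10"}
resolver = CachingResolver(records.get, ttl=60)

first = resolver.resolve("example.com", now=0)     # miss, goes upstream
second = resolver.resolve("example.com", now=30)   # hit
records["example.com"] = "192.0.2.20"              # the mapping changes upstream
stale = resolver.resolve("example.com", now=59)    # still the old address
fresh = resolver.resolve("example.com", now=61)    # TTL expired, refreshed
```

Four lookups cause only two upstream queries, which is the traffic saving, and the `stale` lookup shows the cost: the old address is served until the TTL runs out.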
+
+## P2P
+
+### Definition
+
+- A **Distributed** network architecture
+- Every node is both the **Client** and the **Server**
+- Advantages:
+    - Scalable:
+        - As the number of clients increases, the number of servers also
+          increases
+        - Each node both consumes and donates resources
+    - Lower cost: costs are borne at the edge of the network
+    - More privacy: No centralized source of data
+    - Reliability:
+        - Distributed geographically
+        - Has Replicas
+        - No single point of failure
+    - All of the above make it easy to share content
+
+### Categories
+
+- Unstructured:
+    - No restriction on overlay structures and data placement
+    - Examples:
+        - Napster, BitTorrent, FreeNet
+- Structured
+    - Uses a Distributed Hash Table (DHT), with an interface like `put(k, v)`
+      and `get(k)`
+    - Has restrictions on overlay structure and data placement
+    - Examples:
+        - Chord, Pastry and CAN
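The `put(k, v)` / `get(k)` interface can be sketched as follows. This is a deliberately naive illustration, not how Chord, Pastry or CAN work: `ToyDHT` is a hypothetical class that picks the owning node with a hash modulo the node count, whereas real DHTs use an overlay with routing between nodes.

```python
import hashlib


class ToyDHT:
    """Toy DHT: each key is stored on the node chosen by hashing the key."""

    def __init__(self, node_names):
        self.names = sorted(node_names)
        self.nodes = {name: {} for name in self.names}   # per-node storage

    def _owner(self, key):
        # Data placement is restricted: the hash alone decides the node.
        digest = hashlib.sha256(key.encode()).hexdigest()
        return self.names[int(digest, 16) % len(self.names)]

    def put(self, key, value):
        self.nodes[self._owner(key)][key] = value

    def get(self, key):
        return self.nodes[self._owner(key)].get(key)


dht = ToyDHT(["node-a", "node-b", "node-c"])
dht.put("song.mp3", "peer 10.0.0.7")
dht.put("movie.mkv", "peer 10.0.0.9")
```

Any node can compute the owner of a key from the key alone, so a lookup needs no central index. That is the contrast with a tracker-based unstructured network.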
+
+### Server Selection
+
+- For BitTorrent, a Tracker is used, which informs the clients about the peers
+  available
+    - TODO: See diagram at page 26
+
+### Issues with P2P
+
+- Reliability
+- Performance
+- Control: the networks carry a lot of copyrighted content
+
+## Content Delivery Networks
+
+### History of Content Delivery
+
+- Web 1.0: Pre-CDN, Infrastructure development
+- CDN 1.0: First generation of CDN, replication, intelligent routing, edge
+  computing
+- CDN 2.0: P2P, Cloud Computing, Energy Awareness
+- CDN 3.0: Autonomic composition
+
+### Web Caches
+
+- The precursor to CDN
+- Improve efficiency by caching
+- Caching proxy:
+    - Receive HTTP request from client
+    - If object in cache, then send cached content
+    - Otherwise request the object from origin server
+- Works as both client and server:
+    - Client: request content from origin
+    - Server: serve content to downstream client
+- Usually installed by ISP
+- Reason:
+    - Reduce response time for client request
+    - Reduce traffic across network
+- Problem:
+    - Can't serve all web users, since the web is too large
+    - Web content is dynamic and customized, which means much of it is not
+      cacheable
+    - Origin upstream web servers shouldn't rely on downstream caching proxy
+    - Upstream web servers can't see the real statistics of their site, since
+      the user data is not sent to their servers
+
+### Definition
+
+- Also called _Content Distribution Network_
+- **Infra**: large distributed system of servers deployed in multiple data
+  centers across the internet
+- **Goal**: distribute content to end users on a large scale with high
+  **availability** and high **performance**
+- Is a mechanism to **replicate** content on multiple servers on the internet,
+  providing client a way to choose server that can provide content fast.
+- Content providers are the CDN customers:
+    - They pay CDN companies to deliver their content
+    - CDN pays ISPs, carriers, and network operators for hosting their servers
+- Usually used by large web platforms
+
+### What CDNs do
+
+- Serve a large fraction of internet content
+    - Web objects (Text, JavaScript, graphics)
+    - Downloadable objects
+    - Applications
+    - Stream media
+- Most of the web uses CDN
+
+### The model
+
+- TODO: See the slide p41
+
+### CDN Deployment
+
+- A CDN company deploys hundreds of servers around the world, often inside ISP
+  networks, so that they are close to users
+- CDN Customer side:
+    - The CDN replicates the customer's content on its servers
+    - When the provider updates content, the CDN updates its servers with it
+- User side:
+    - Send request to origin server
+    - Intercepted by redirection service
+    - Forward user's request to best CDN server
+    - Content served from CDN server
+
+### Companies
+
+- Akamai
+- Limelight
+- ChinaCache
+- Edgecast
+
+### Benefits
+
+- Reduce latency to users
+- Reduce load on original server
+    - Increase security against Denial of Service Attacks
+- Scalability
+- Cheaper, easier to manage
+- Bypass traffic jams on the web:
+    - Requested data is close to clients
+    - Avoid bottleneck links
+
+### Optimizations in CDN side
+
+- Content is cached at various locations, for faster access
+- Use data compression
+- Use load balancing to reduce traffic
+- Security features like DDoS protection
+- Use network peering, for shorter data paths
+
+### Examples and Usage
+
+- Netflix:
+    - Low latency and high definition media can be played
+    - Handles peak traffic
+    - Content has consistent quality
+- Alibaba:
+    - Rapid page loads for product listing
+    - Support large scale events
+    - Stability and scalability
+
+### CDN Routing
+
+#### Server Selection
+
+- Load: To balance load
+- Performance: improve client performance, based on:
+    - Geography
+    - RTT
+    - Throughput
+    - Load
+- Any Node Alive: provide fault tolerance
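The selection criteria above can be combined in a small sketch. The policy shown is only one plausible choice, assumed for illustration: `select_server`, the field names and the `max_load` threshold are hypothetical, and a real CDN would also weigh geography and throughput.

```python
def select_server(servers, max_load=0.8):
    """Pick the alive server with the lowest RTT among those below max_load."""
    alive = [s for s in servers if s["alive"]]    # any node alive: fault tolerance
    if not alive:
        return None
    # Prefer servers with spare capacity; fall back to any alive server.
    candidates = [s for s in alive if s["load"] < max_load] or alive
    best = min(candidates, key=lambda s: (s["rtt_ms"], s["load"]))
    return best["name"]


servers = [
    {"name": "edge-a", "alive": True, "rtt_ms": 12, "load": 0.95},
    {"name": "edge-b", "alive": True, "rtt_ms": 30, "load": 0.40},
    {"name": "edge-c", "alive": False, "rtt_ms": 5, "load": 0.10},
]
chosen = select_server(servers)
```

Here `edge-c` is closest but down and `edge-a` is overloaded, so the request goes to `edge-b` even though its RTT is higher.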
+
+#### Ways of redirecting
+
+- As a part of routing: anycast (Single IP address is shared by many devices in
+  multiple locations), cluster, load balancing
+    - Pros: transparent to clients, works even when browsers have cached failed
+      addresses, circumvents many routing problems
+    - Cons: little control over server selection, complex, scalability issues,
+      and TCP connections can't recover
+- Part of application: HTTP Redirect
+    - Pros: application level, has more control
+    - Cons: adds load and an extra RTT, and is hard to cache
+- Part of naming: DNS
+    - Pros: suitable for caching, DNS can redirect to any IP
+    - Cons: the request is made by the resolver and is for a domain, not a URL,
+      and the load behind each resolver's client population is hidden
+        - Can estimate the stats
+
+#### More on DNS redirection
+
+- DNS redirection is used to redirect client to a nearby server.
+- Based on:
+    - Latency to client
+    - Load balancing
+        - Try to balance client across many servers to avoid hotspot
+    - Available servers
+- Process:
+    - The client's DNS request comes to the CDN's nameserver (see below for how
+      it's accessed)
+    - The DNS request is resolved to a nearby server by the CDN-controlled name
+      servers
+    - CDN measures the state of network in the infrastructure
+- Two types of DNS redirection
+    - Full:
+        - the origin server is controlled by CDN
+        - Pro: All requests are automatically redirected
+        - Cons: May send a lot of traffic to CDN, so it's expensive
+    - Partial:
+        - The content provider marks which content the CDN provides
+        - Usually larger objects
+        - Refer to images as `<img src=http://cdn.com/foo/bar/img.gif>`
+        - When the website is accessed, the CDN serves the marked data
+        - Pros: Better control
+        - Cons: Have to mark content
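Marking content for partial redirection amounts to rewriting selected URLs so that their hostname belongs to the CDN. The sketch below assumes a simple rule, chosen only for illustration, that objects with certain file extensions are the ones handed to the CDN. `to_cdn_url`, the extension list and the `example.com` origin are hypothetical; `cdn.com` follows the example above.

```python
from urllib.parse import urlparse


def to_cdn_url(url, cdn_host="cdn.com", extensions=(".gif", ".jpg", ".png", ".mp4")):
    """Point marked objects at the CDN host; leave everything else on the origin."""
    parsed = urlparse(url)
    if parsed.path.lower().endswith(extensions):
        # Swap only the hostname; the path stays the same.
        return parsed._replace(netloc=cdn_host).geturl()
    return url


image = to_cdn_url("http://example.com/foo/bar/img.gif")
page = to_cdn_url("http://example.com/index.html")
```

The page itself is still fetched from the origin, while the browser's DNS lookup for the image now goes to the CDN's hostname.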
+
+## Deployment
+
+### Hosting your stuff
+
+- Where: rely on measurements
+    - Sample popular hostnames on alexa.com
+    - Ask DNS from multiple vantage points
+    - Categorize by type:
+        - Hostnames
+        - Files
+        - Unpopular
+
+### Examples
+
+- ChinaCache
+
+## Future
+
+### Challenges
+
+- Mobile networks: latency to the cell is higher, opaque internal network structure
+- Video: large bandwidth
+    - 16M - 30M bps compressed
+    - When combined can be 25K TBps
+    - Even data centers don't have that much bandwidth
+    - Using multicast from end systems as potential solution
+
+### CDN2.0
+
+- Hybrid CDN: Akamai
+- Cloud Based Video: Netflix
+- Meta CDN: Conviva
+- Virtual CDN: ISP micro-datacenters