diff --git a/4-1-beyond-map-reduce.md b/4-1-beyond-map-reduce.md new file mode 100644 index 0000000..d69f8e9 --- /dev/null +++ b/4-1-beyond-map-reduce.md @@ -0,0 +1,198 @@ +# Beyond map reduce + +## In memory processing + +### Hadoop's problems + +- Batch processing system + - Does not process streamed data, thus the performance is slower +- Designed to process very large data, not suited for many small files +- Efficient at map stage, bad at IO: + - Data loaded and written from HDFS + - Shuffle and sort use a lot of net traffic +- Job startup and finish takes seconds, regardless of the size +- Not a good fit for every case + - The structure is rigid: Map, Combiner, Shuffle and sort, Reduce + - No support for iterations + - Only one sync barrier + - Bad at e.g. graph processing + +### Intro to in memory Processing + +- Definition: Load data in memory, before starting process +- Advantages: + - More flexible computation + - Iteration is supported + - No slow IO required +- Disadvantages: + - Data must fit in memory of distributed storage + - Need additional measures for persistence + - Mandatory fault-tolerant +- Major frameworks: + - Apache spark + - Graph-centric: Pregel + - SQL focused read only: Cloudera Impala + +### Spark + +- Open source large and general engine for large scale distributed data + processing +- is a **cluster computing** platform that has API for distributed programming +- In memory processing and storage engine + - Load data from HDFS,Cassandra + - Resource management via Spark, EC2, YARN + - Can work with Hadoop, or standalone +- Runs on local or clusters +- Goal: to provide distributed datasets, that users can use as if they are local +- Has the shiny bits as MapReduce: + - Fault tolerance + - Data locality + - Scalability +- Approach: argument data flow with RDD + +### RDD: Resilient Distributed Datasets + +- Basic level of abstraction in spark +- Distributed memory model: RDDs + - Immutable collections of data **Distributed** across the nodes of cluster + - New RDD is created by: + - Loading data from input + - transform existing collection to generate a new one + - Can be saved to HDFS or other programs with action +- Operations: + - Transformation: Define new RDD from existing one + - `map` + - `filter` + - `sample` + - `union` + - `groupByKey` + - `reduceByKey` + - `join` + - `cache` + - Action: Take RDD and return a result to driver + - `reduce` + - `collect` + - `count` + - `save` + - `lookupKey` + +### Scala (Not in test) + +- Scala is the native language for spark +- Similar syntax to java, but has powerful type inference + +### Scala application + +- Consists of a **driver** that executes various parallel operations on + **RDDs**, partitioned across cluster +- Driver is on different machine where RDDs are created + - Use action to retrieve data from RDD +- TODO: look at diagram at p20 +- Driver program run the user's main function, executes parallel operations on a + cluster + +### Components + +- Driver program run the user's main functions, executes parallel operation on a + cluster + - Run as **independent** sets of processors, coordinated by a `SparkContext` + in driver + - Context run in a cluster manager like YARN, which allocates system + resources +- Working in cluster in managed by **executor**, which is managed by + `SparkContext` + - **Executor** responsible for executing task and store data +- Deploying is up to the cluster manager used, like YARN or standalone spark + +### Computation + +- Using anonymous functions +- Named functions +- `map`: create a new RDD, with the original value replaced by new value + returned in map +- `filter`: create a new RDD, with less values + +### Deferred execution + +- Only executes the transformation, the moment they are needed +- Only the invocation of action triggers the execution chain + - This allows internal optimization: combine the operations + +### Spark performance + +#### Issues + +- Because spark has freedom, task allocation is much more challenging +- Errors appear more often, and hard to debug +- Knowledge of basics of map reduce helps + +#### Tuning + +- Memory: Spark uses more memory +- Partitioning for RDD +- Performance implication for each operation + +### Spark ecosystem + +- GraphX: Graph processing RDD +- MLib: machine learning +- Spark SQL +- Spark Streaming: **Stream** processing with D-Stream RDDs + +## Stream processing + +### Information streams + +- Data continuously generated from various sources + - Unbound, the arrival time is not fixed +- Process the information the moment it's generated + - Apply a function to each new element + - Look for **real time** changes and response + +### Apache Storm + +#### Intro + +- Developed by BlackType, apache project +- Real time computation of streams +- Features + - Scalable + - No data loss guarantee + - Extremely robust and fault tolerant + - Programming language agnostic + - Distributed Stream Processing: tasks distributed across cluster + +## Discretized Streams + +- Unlike true streaming processing, we process information in micro batches + - Input -> Spark Streaming -> batched input data -> Spark Engine -> + Processed batch data +- In spark: + - reuse the spark framework + - Can use spark transformations on RDDs + - Construct a RDD every few seconds (defined time) to manage data streams + - New RDD processed at each time slot + +### DStream RDD + +- Composes of a series of RDDs, to represent data over time +- Choosing timer interval: + - Small interval: quicker response time, at the cost of frequent batching + +### DStream Transformations + +- Change each RDD in the stream + +### DStream Streaming Context + +- Create a `StreamingContext` to manage the stream and transformation, and need a action to + collect the results + +### DStream Sliding windows + +- Some usage require looking at a set of stream messages, to perform computation +- Sliding window stores a rolling list with latest items from stream + - The contents are changed over time, with new items added and old items popped +- Using in Spark DStream: + - has API to configure size of window (seconds) and frequency of computation (seconds) + - Code?: `reduceByWindowAndKey((a,b)=>math.max(a,b), Seconds(60), Seconds(5) )` diff --git a/4-2-cdn.md b/4-2-cdn.md new file mode 100644 index 0000000..81659be --- /dev/null +++ b/4-2-cdn.md @@ -0,0 +1,288 @@ +# Content Delivery Networks + +## DNS + +### Definition: + +- Domain name system +- Intended use: to translate domain name to IP addresses +- Other uses: load distribution: replicated web server has many IPs, use DNS to + redirect client to closest place +- Distributed system, that servers are interconnected + - Centralizing is hard, because of the huge traffic, and distance, and + single point of failure +- Many applications rely on DNS + +### Hierarchy + +- Root DNS Server: Root name server + - First point of contact + - Directly query authoritative name server + - Get Domain-name - IP mapping + - Query for IP address for TLD DNS servers +- TLD (Top Level Domain) `.com`, `.org`, `.edu` DNS server + - Query for IP address to Authoritative DNS Server +- Authoritative DNS Server: Owned by site owner like `amazon.com` + +### Local DNS Server: + +- Actually a client, not in a part of the Hierarchy +- Each ISP (Internet Service Provider) has one +- Workings: + - When host makes DNS query, it's sent to local DNS server + - The Server may have local cache of name-to-address pair + - Otherwise forward the query to the DNS hierarchy + +### DNS Caching + +- Once the server knows about the mapping, it is **cached** +- Cache entry timeout after time (TTL): on the other hand it may be out of date +- TLD servers are typically cached in local, since root names are not frequently + visited +- Benefits + - Reduce network traffic on: **Root servers**, **across the internet** + - This increases network performance because DNS response is much faster. + +## P2P + +### Definition + +- A **Distributed** network architecture +- Every node is both the **Client** and the **Server** +- Advantages: + - Scalable: + - As the number of clients increase, the number of servers also + increases + - Both consume and donate resource + - Less cost: Cost at the edge of network + - More privacy: No centralized source of data + - Reliability: + - Distributed geographically + - Has Replicas + - No single point of failure + - All of above made it easy to share content + +### Categories + +- Unstructured: + - No restriction on overlay structures and data placement + - Examples: + - Napster, BitTorrent, FreeNet +- Structured + - Uses Distributed Hash Table, that use an interface like `put(k, v)`, and + `get(k)` + - Has restriction on overlay structure, and data placement + - Examples: + - Chord, Pastery and CAN + +### Server Selection + +- For BitTorrent, a Tracker is used, which informs the clients about the peers + available + - TODO: See diagram at page 26 + +### Issues with P2P + +- Reliability +- Performance +- Control: have a lot of copyrighted content + +## Content Delivery Networks + +### History of Content Delivery + +- Web 1.0: Pre-CDN, Infrastructure development +- CDN 1.0: First generation of CDN, replication, intelligent routing, edge + computing +- CDN 2.0: P2P, Cloud Computing, Energy Awareness +- CDN 3.0: Autonomic composition + +### Web Caches + +- The precursor to CDN +- Improve efficiency by caching +- Caching proxy: + - Receive HTTP request from client + - If object in cache, then send cached content + - Otherwise request the object from origin server +- Works as both client and server: + - Client: request content from origin + - Server: serve content to downstream client +- Usually installed by ISP +- Reason: + - Reduce response time for client request + - Reduce traffic across network +- Problem: + - Can't serve all of the web users, since the web is too large, and + - Web content is dynamic and customized, which means many of them are not + cacheable + - Origin upstream web servers shouldn't rely on downstream caching proxy + - Upstream web servers can't see the real statistics of their site, since + the user data is not sent to their servers + +### Definition + +- Also called _Content Distribution Network_ +- **Infra**: large distributed system of servers deployed in multiple data + centers across the internet +- **Goal**: distribute content to end users on a large scale with high + **availability** and high **performance** +- Is a mechanism to **replicate** content on multiple servers on the internet, + providing client a way to choose server that can provide content fast. +- Content providers are the CDN customers: + - They pay CDN companies to deliver their content + - CDN pays ISPs, carriers, and network operators for hosting their servers +- Usually used by large web platforms + +### What CDN do + +- Serve a large fraction of internet content + - Web objects (Text, JavaScript, graphics) + - Downloadable objects + - Applications + - Stream media +- Most of the web uses CDN + +### The model + +- TODO: See the slide p41 + +### CDN Deployment + +- CDN company deploy hundreds of servers around the world, often inside ISP + networks, so that it's close to users +- CDN Customer side: + - Replicates customer's content in CDN servers + - When provider update content, CDN update server with their content +- User side: + - Send request to origin server + - Intercepted by redirection service + - Forward user's request to best CDN server + - Content served from CDN server + +### Companies + +- Akamai +- Limelight +- ChinaCache +- Edgecast + +### Benefits + +- Reduce latency to users +- Reduce load on original server + - Increase security against Denial of Service Attacks +- Scalability +- Cheaper, easier to manage +- Bypass traffic jams on the web: + - Requested data is close to clients + - Avoid bottleneck links + +### Optimizations in CDN side + +- Content is cached at various locations, for faster access +- Use data compression +- Use load balancing to reduce traffic +- Security features like DDoS protection +- Use network peering, for shorter data paths + +### Examples and Usage + +- Netflix: + - Low latency and high defiition media can be played + - Handles peak traffic + - Content has consistent quality +- Alibaba: + - Rapid page loads for product listing + - Support large scale events + - Stability and scalability + +### CDN Routing + +#### Server Selection + +- Load: To balance load +- Performance: improve client performance, based on: + - Geography + - RTT + - Throughput + - Load +- Any Node Alive: provide fault tolerance + +#### Ways of redirecting + +- As a part of routing: anycast (Single IP address is shared by many devices in + multiple locations), cluster, load balancing + - Pros: transparent to clients, works when browser cached failed addresses, + circumvents many routing problems + - Cons: Little control over selection of server, complex, scalability, and + can't recover TCP +- Part of application: HTTP Redirect + - Pros: Application level, has more control + - Cons: Has Additional load and RTT, and is hard to cache +- Part of naming: DNS + - Pros: Suitable for caching, dns redirect to any IP + - Cons: This is implemented in resolver, requesting for a domain not URL, + and hidden load factor for resolver's population + - Can estimate the stats + +#### More on DNS redirection + +- DNS redirection is used to redirect client to a nearby server. +- Based on: + - Latency to client + - Load balancing + - Try to balance client across many servers to avoid hotspot + - Available servers +- Process: + - Client's DNS request come to CDN's nameserver ( See below to how it's + accessed. ) + - DNS request is being resolved to a nearby server, by accessing CDN + controlled name servers + - CDN measures the state of network in the infrastructure +- Two types of DNS redirection + - Full: + - the origin server is controlled by CDN + - Pro: All requests are automatically redirected + - Cons: May send a lot of traffic to CDN, so it's expensive + - Partial: + - Content provider mark what to provide to CDN + - usually larger objects + - Refer to images as `` + - Accessing the website, CDN serve the data + - Pros: Better control + - Cons: Have to mark content + +## Deployment + +### Hosting your stuff + +- Where: rely on measures + - Sample popular hostnames on alexa.com + - Ask DNS from multiple vantage points + - Categorize by type: + - Hostnames + - Files + - Unpopular + +### Examples + +- ChinaCache + +## Future + +### Challenges + +- Mobile networks: latency to cell is higher, opaque internal network structure +- Video: Large bandwidth, + - 16M - 30M bps compressed + - When Combined can be 25K TBps + - Even data centers don't have that much + - Using multicast from end systems as potential solution + +### CDN2.0 + +- Hybrid CDN: Akamai +- Cloud Based Video: NetFlix +- Meta CDN: Conviva +- Virtual CDN: ISP micro-datacenters