finish map reduce and cdn
parent
17f4b90596
commit
3b7ee26a8c
198
4-1-beyond-map-reduce.md
Normal file
@@ -0,0 +1,198 @@
# Beyond map reduce

## In memory processing

### Hadoop's problems

- Batch processing system
  - Does not process streamed data, so performance is slower
  - Designed to process very large data sets, not suited to many small files
- Efficient at the map stage, bad at IO:
  - Data is loaded from and written to HDFS
  - Shuffle and sort use a lot of network traffic
  - Job startup and finish take seconds, regardless of the size of the job
- Not a good fit for every case
  - The structure is rigid: Map, Combiner, Shuffle and sort, Reduce
  - No support for iterations
  - Only one sync barrier
  - Bad at e.g. graph processing

### Intro to in memory Processing

- Definition: load the data into memory before starting to process it
- Advantages:
  - More flexible computation
  - Iteration is supported
  - No slow IO required
- Disadvantages:
  - Data must fit in the memory of the distributed storage
  - Need additional measures for persistence
  - Fault tolerance is mandatory
- Major frameworks:
  - Apache Spark
  - Graph-centric: Pregel
  - SQL-focused, read-only: Cloudera Impala

### Spark

- Open source, general-purpose engine for large-scale distributed data
  processing
- Is a **cluster computing** platform that has an API for distributed programming
- In-memory processing and storage engine
- Loads data from HDFS, Cassandra
- Resource management via Spark, EC2, YARN
- Can work with Hadoop, or standalone
- Runs locally or on clusters
- Goal: to provide distributed datasets that users can use as if they were local
- Has the same shiny bits as MapReduce:
  - Fault tolerance
  - Data locality
  - Scalability
- Approach: augment data flow with RDDs

### RDD: Resilient Distributed Datasets

- Basic level of abstraction in Spark
- Distributed memory model: RDDs
- Immutable collections of data **distributed** across the nodes of the cluster
- A new RDD is created by:
  - Loading data from input
  - Transforming an existing collection to generate a new one
- Can be saved to HDFS or passed to other programs with an action
- Operations:
  - Transformation: defines a new RDD from an existing one
    - `map`
    - `filter`
    - `sample`
    - `union`
    - `groupByKey`
    - `reduceByKey`
    - `join`
    - `cache`
  - Action: takes an RDD and returns a result to the driver
    - `reduce`
    - `collect`
    - `count`
    - `save`
    - `lookupKey`

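To make the key-based transformations listed above more concrete, here is a minimal sketch in plain Python. It is not Spark code: `group_by_key` and `reduce_by_key` are hypothetical helpers that only model what `groupByKey` and `reduceByKey` compute on a list of `(key, value)` pairs, on a single machine and without any partitioning.

```python
from collections import defaultdict
from functools import reduce


def group_by_key(pairs):
    """Model of groupByKey: (key, value) pairs -> {key: [values]}."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(func, pairs):
    """Model of reduceByKey: fold the values of each key with func."""
    return {key: reduce(func, values)
            for key, values in group_by_key(pairs).items()}


words = ["spark", "hadoop", "spark", "storm", "spark"]
pairs = [(word, 1) for word in words]              # like map(word => (word, 1))
counts = reduce_by_key(lambda a, b: a + b, pairs)  # word count per key
print(counts)
```

In Spark the same word count would be spread over the partitions of an RDD; the sketch only shows the per-key semantics.
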
### Scala (Not in test)

- Scala is the native language of Spark
- Similar syntax to Java, but has powerful type inference

### Spark application

- Consists of a **driver** that executes various parallel operations on
  **RDDs**, partitioned across the cluster
- The driver runs on a different machine from the ones where the RDDs are
  created
- Use an action to retrieve data from an RDD
- TODO: look at diagram at p20
- The driver program runs the user's main function and executes parallel
  operations on a cluster

### Components

- The driver program runs the user's main function and executes parallel
  operations on a cluster
- Applications run as **independent** sets of processes, coordinated by a
  `SparkContext` in the driver
- The context connects to a cluster manager like YARN, which allocates system
  resources
- Work in the cluster is done by **executors**, which are managed by the
  `SparkContext`
- An **executor** is responsible for executing tasks and storing data
- Deployment is up to the cluster manager used, like YARN or standalone Spark

### Computation

- Using anonymous functions
- Named functions
- `map`: creates a new RDD in which each original value is replaced by the new
  value returned by the mapping function
- `filter`: creates a new RDD with fewer values, keeping only those that pass
  the test

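As a small illustration of passing named and anonymous functions to `map` and `filter`, the sketch below uses Python's built-in `map` and `filter` on a local list. This is an analogy only: the RDD versions have the same shape but run on a cluster. The function name `square` and the sample data are made up for the example.

```python
def square(x):
    """Named function passed to map."""
    return x * x


numbers = [1, 2, 3, 4, 5, 6]

# map: same number of elements, each replaced by the returned value.
squares = list(map(square, numbers))

# filter with an anonymous function: fewer elements, values unchanged.
evens = list(filter(lambda x: x % 2 == 0, numbers))

print(squares, evens)
```
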
### Deferred execution

- Transformations are only executed at the moment they are needed
- Only the invocation of an action triggers the execution chain
- This allows internal optimization: operations can be combined

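Deferred execution can be sketched with a toy class that only records transformations and runs them when an action is called. `LazyDataset` is a hypothetical stand-in written for these notes, not part of Spark, and it runs on one machine; it does show how a recorded chain can be executed in a single combined pass.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, func):
        # Transformation: returns a new dataset, nothing is computed yet.
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, pred):
        # Transformation: also only recorded.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: run the whole chain in one pass over the source.
        result = []
        for item in self._source:
            keep = True
            for kind, func in self._steps:
                if kind == "map":
                    item = func(item)
                elif not func(item):
                    keep = False
                    break
            if keep:
                result.append(item)
        return result


calls = []


def traced_double(x):
    calls.append(x)          # lets us observe when the function really runs
    return 2 * x


pipeline = LazyDataset(range(5)).map(traced_double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # nothing has run yet
result = pipeline.collect()        # the action triggers the chain
print(calls_before_action, result)
```
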
### Spark performance

#### Issues

- Because Spark allows more freedom, task allocation is much more challenging
- Errors appear more often and are hard to debug
- Knowing the basics of MapReduce helps

#### Tuning

- Memory: Spark uses more memory
- Partitioning of RDDs
- Performance implications of each operation

### Spark ecosystem

- GraphX: graph processing on RDDs
- MLlib: machine learning
- Spark SQL
- Spark Streaming: **stream** processing with DStream RDDs

## Stream processing

### Information streams

- Data is continuously generated from various sources
- Unbounded, and the arrival time is not fixed
- Process the information the moment it's generated
- Apply a function to each new element
- Look for **real-time** changes and respond to them

### Apache Storm

#### Intro

- Developed by BackType, now an Apache project
- Real-time computation of streams
- Features:
  - Scalable
  - Guarantees no data loss
  - Extremely robust and fault tolerant
  - Programming language agnostic
- Distributed stream processing: tasks are distributed across the cluster

## Discretized Streams

- Unlike true stream processing, information is processed in micro-batches
- Input -> Spark Streaming -> batched input data -> Spark Engine ->
  processed batch data
- In Spark:
  - Reuses the Spark framework
  - Can use Spark transformations on RDDs
  - Constructs an RDD every few seconds (a defined interval) to manage data
    streams
  - A new RDD is processed at each time slot

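The micro-batching idea can be sketched as grouping timestamped events into fixed time slots and then processing each slot as an ordinary batch. The helper `micro_batches`, the event format `(timestamp, value)`, and the 2-second interval are assumptions for illustration; Spark Streaming does this internally and produces one RDD per slot.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into batches of `interval` seconds.

    Slots without any event are skipped in this sketch.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        slot = int(timestamp // interval)   # index of the time slot
        batches[slot].append(value)
    return [batches[slot] for slot in sorted(batches)]


events = [(0.5, "a"), (1.2, "b"), (2.1, "c"), (2.9, "d"), (5.0, "e")]
batches = micro_batches(events, interval=2)

# Each batch can now be processed with ordinary batch operations.
batch_sizes = [len(batch) for batch in batches]
print(batches, batch_sizes)
```
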
### DStream RDD

- Composed of a series of RDDs that represent data over time
- Choosing the timer interval:
  - Small interval: quicker response time, at the cost of more frequent
    batching

### DStream Transformations

- A transformation is applied to each RDD in the stream

### DStream Streaming Context

- Create a `StreamingContext` to manage the stream and its transformations; an
  action is needed to collect the results

### DStream Sliding windows

- Some uses require looking at a set of stream messages to perform a
  computation
- A sliding window stores a rolling list of the latest items from the stream
- The contents change over time, with new items added and old items popped
- Using it in Spark DStream:
  - There is an API to configure the size of the window (seconds) and the
    frequency of computation (seconds)
  - Example (60-second window, recomputed every 5 seconds):
    `reduceByKeyAndWindow((a, b) => math.max(a, b), Seconds(60), Seconds(5))`

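The window size and the computation frequency can be illustrated with a plain-Python model of a windowed maximum per key. This is a sketch of the semantics only, not the Spark API: `window_max_by_key`, `sliding_results`, the `(timestamp, key, value)` event format, and the small 10-second window with a 5-second slide are all assumptions chosen to keep the numbers readable.

```python
def window_max_by_key(events, window, end_time):
    """Max value per key among events with end_time - window < t <= end_time."""
    result = {}
    for timestamp, key, value in events:
        if end_time - window < timestamp <= end_time:
            result[key] = max(value, result.get(key, value))
    return result


def sliding_results(events, window, slide, until):
    """Evaluate the window every `slide` seconds, up to and including `until`."""
    return {end: window_max_by_key(events, window, end)
            for end in range(slide, until + 1, slide)}


events = [(1, "cpu", 40), (4, "cpu", 70), (7, "cpu", 55), (12, "mem", 30)]
results = sliding_results(events, window=10, slide=5, until=15)
print(results)
```

At time 15 the reading of 70 taken at time 4 has left the window, so the maximum for `cpu` drops to 55: old items are popped as the window slides.
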
288
4-2-cdn.md
Normal file
@@ -0,0 +1,288 @@
# Content Delivery Networks

## DNS

### Definition:

- Domain Name System
- Intended use: to translate domain names to IP addresses
- Other uses: load distribution. A replicated web server has many IPs, and DNS
  is used to redirect each client to the closest one
- A distributed system in which the servers are interconnected
  - Centralizing is hard because of the huge traffic, the distance, and the
    single point of failure
- Many applications rely on DNS

### Hierarchy

- Root DNS server (root name server)
  - First point of contact
  - May directly query the authoritative name server
    - Gets the domain-name-to-IP mapping
  - Queried for the IP addresses of the TLD DNS servers
- TLD (Top Level Domain) DNS server: `.com`, `.org`, `.edu`
  - Queried for the IP address of the authoritative DNS server
- Authoritative DNS server: owned by the site owner, like `amazon.com`

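The walk through the hierarchy (root, then TLD, then authoritative server) can be sketched with three lookup tables. Everything here is made up for illustration: the server names, the table layout, and the IP addresses, which come from the documentation range 192.0.2.0/24. Real resolution uses DNS messages over the network.

```python
# Root servers know which server handles each top-level domain.
ROOT = {"com": "tld-com", "org": "tld-org"}

# TLD servers know the authoritative server of each domain.
TLD = {
    "tld-com": {"amazon.com": "ns-amazon"},
    "tld-org": {"example.org": "ns-example"},
}

# Authoritative servers hold the actual name-to-address mappings.
AUTHORITATIVE = {
    "ns-amazon": {"www.amazon.com": "192.0.2.10"},
    "ns-example": {"www.example.org": "192.0.2.20"},
}


def resolve(hostname):
    """Walk root -> TLD -> authoritative; return (ip, servers_contacted)."""
    labels = hostname.split(".")
    tld_server = ROOT[labels[-1]]               # ask root for the TLD server
    domain = ".".join(labels[-2:])
    auth_server = TLD[tld_server][domain]       # ask TLD for the authoritative server
    ip = AUTHORITATIVE[auth_server][hostname]   # ask authoritative for the mapping
    return ip, ["root", tld_server, auth_server]


print(resolve("www.amazon.com"))
```
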
### Local DNS Server:

- Acts as a client of the hierarchy, and is not part of it
- Each ISP (Internet Service Provider) has one
- Workings:
  - When a host makes a DNS query, it is sent to the local DNS server
  - The server may have a local cache of name-to-address pairs
  - Otherwise it forwards the query to the DNS hierarchy

### DNS Caching

- Once a server learns a mapping, it **caches** it
- Cache entries time out after a set time (TTL); until then, an entry may be
  out of date
- TLD servers are typically cached in local name servers, so root name servers
  are not frequently visited
- Benefits:
  - Reduces network traffic on the **root servers** and **across the internet**
  - This improves network performance because DNS responses are much faster

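A TTL-based cache in a local DNS server might look roughly like the sketch below. `DnsCache`, its `upstream` callback, and the explicit `now` argument (used instead of a real clock so the behaviour is deterministic) are assumptions made for this example.

```python
class DnsCache:
    """Toy local DNS server cache with a fixed TTL per entry."""

    def __init__(self, upstream, ttl):
        self._upstream = upstream      # function: hostname -> ip (assumed)
        self._ttl = ttl                # seconds an entry stays valid
        self._entries = {}             # hostname -> (ip, expiry_time)
        self.upstream_queries = 0

    def lookup(self, hostname, now):
        entry = self._entries.get(hostname)
        if entry and now < entry[1]:
            return entry[0]                       # fresh cache hit
        self.upstream_queries += 1                # miss or expired entry
        ip = self._upstream(hostname)
        self._entries[hostname] = (ip, now + self._ttl)
        return ip


cache = DnsCache(upstream=lambda name: "192.0.2.10", ttl=30)
cache.lookup("www.amazon.com", now=0)     # miss: goes to the hierarchy
cache.lookup("www.amazon.com", now=10)    # hit: served from the cache
cache.lookup("www.amazon.com", now=45)    # expired: goes to the hierarchy again
print(cache.upstream_queries)
```

Between time 0 and time 30 the cached answer is returned even if the real mapping has changed, which is the staleness trade-off mentioned above.
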
## P2P

### Definition

- A **distributed** network architecture
- Every node is both a **client** and a **server**
- Advantages:
  - Scalable:
    - As the number of clients increases, the number of servers also
      increases
    - Nodes both consume and donate resources
  - Lower cost: the cost is borne at the edge of the network
  - More privacy: no centralized source of data
  - Reliability:
    - Distributed geographically
    - Has replicas
    - No single point of failure
  - All of the above make it easy to share content

### Categories

- Unstructured:
  - No restriction on overlay structure or data placement
  - Examples:
    - Napster, BitTorrent, Freenet
- Structured:
  - Uses a Distributed Hash Table (DHT), which offers an interface like
    `put(k, v)` and `get(k)`
  - Has restrictions on overlay structure and data placement
  - Examples:
    - Chord, Pastry and CAN

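The `put(k, v)` / `get(k)` interface of a structured overlay can be sketched with a small hash ring in which each key is stored on the first node at or after the key's position. This is loosely inspired by Chord but is not an implementation of it: `ToyDht` sees all nodes at once and does no routing between peers, and the 16-bit ring and node names are assumptions.

```python
import hashlib
from bisect import bisect_left


def ring_position(name, bits=16):
    """Hash a node name or key onto a ring of 2**bits positions."""
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    return int(digest, 16) % (2 ** bits)


class ToyDht:
    """Each key lives on the first node at or after its ring position."""

    def __init__(self, node_names):
        self._ring = sorted((ring_position(n), n) for n in node_names)
        self._storage = {name: {} for name in node_names}

    def node_for(self, key):
        positions = [pos for pos, _ in self._ring]
        # Wrap around to the first node when the key is past the last one.
        index = bisect_left(positions, ring_position(key)) % len(self._ring)
        return self._ring[index][1]

    def put(self, key, value):
        self._storage[self.node_for(key)][key] = value

    def get(self, key):
        return self._storage[self.node_for(key)].get(key)


dht = ToyDht(["node-a", "node-b", "node-c"])
dht.put("song.mp3", "peer-7")
print(dht.node_for("song.mp3"), dht.get("song.mp3"))
```

The restriction on data placement is visible in `node_for`: the hash of the key alone decides which node stores it.
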
### Server Selection

- For BitTorrent, a tracker is used, which informs clients about the
  available peers
- TODO: See diagram at page 26

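The role of the tracker can be sketched as a registry that, on each announce, records the peer and answers with the peers already known for that torrent. `ToyTracker` and its `announce` method are hypothetical names for this sketch; the real BitTorrent tracker protocol runs over HTTP or UDP and carries more fields.

```python
class ToyTracker:
    """Keeps, per torrent, the set of peers that have announced themselves."""

    def __init__(self):
        self._swarms = {}    # torrent id -> set of peer addresses

    def announce(self, torrent_id, peer):
        """Register `peer` and return the other peers known for the torrent."""
        swarm = self._swarms.setdefault(torrent_id, set())
        others = sorted(swarm - {peer})
        swarm.add(peer)
        return others


tracker = ToyTracker()
first = tracker.announce("ubuntu.iso", "10.0.0.1:6881")
second = tracker.announce("ubuntu.iso", "10.0.0.2:6881")
third = tracker.announce("ubuntu.iso", "10.0.0.3:6881")
print(first, second, third)
```
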
### Issues with P2P

- Reliability
- Performance
- Control: a lot of the shared content is copyrighted

## Content Delivery Networks

### History of Content Delivery

- Web 1.0: pre-CDN, infrastructure development
- CDN 1.0: first generation of CDNs, with replication, intelligent routing,
  edge computing
- CDN 2.0: P2P, cloud computing, energy awareness
- CDN 3.0: autonomic composition

### Web Caches

- The precursor to CDNs
- Improve efficiency by caching
- Caching proxy:
  - Receives an HTTP request from the client
  - If the object is in the cache, it sends the cached content
  - Otherwise it requests the object from the origin server
  - Works as both client and server:
    - Client: requests content from the origin
    - Server: serves content to the downstream client
  - Usually installed by an ISP
- Reason:
  - Reduce response time for client requests
  - Reduce traffic across the network
- Problem:
  - Can't serve all web users, since the web is too large
  - Web content is dynamic and customized, which means much of it is not
    cacheable
  - Origin (upstream) web servers shouldn't rely on downstream caching proxies
  - Upstream web servers can't see the real statistics of their site, since
    the user data is not sent to their servers

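The caching proxy logic described above can be sketched in a few lines. `CachingProxy` and the `fetch_from_origin` callback are assumptions for illustration; a real proxy also has to honour HTTP cache headers and expire entries, which is left out here.

```python
class CachingProxy:
    """Serves cached objects; fetches from the origin only on a miss."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin   # the proxy is a client of the origin
        self._cache = {}                  # url -> cached body
        self.origin_requests = 0

    def get(self, url):
        """Act as a server towards the downstream client."""
        if url not in self._cache:
            self.origin_requests += 1
            self._cache[url] = self._fetch(url)
        return self._cache[url]


proxy = CachingProxy(fetch_from_origin=lambda url: "body of " + url)
responses = [proxy.get("/index.html") for _ in range(3)]
print(responses[0], proxy.origin_requests)
```

Three client requests produce a single request at the origin, which is also why the origin cannot see the real access statistics of its site.
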
### Definition

- Also called _Content Distribution Network_
- **Infra**: a large distributed system of servers deployed in multiple data
  centers across the internet
- **Goal**: distribute content to end users on a large scale with high
  **availability** and high **performance**
- A mechanism to **replicate** content on multiple servers on the internet,
  giving clients a way to choose a server that can deliver the content fast
- Content providers are the CDN's customers:
  - They pay CDN companies to deliver their content
  - The CDN pays ISPs, carriers, and network operators for hosting its servers
- Usually used by large web platforms

### What CDNs do

- Serve a large fraction of internet content:
  - Web objects (text, JavaScript, graphics)
  - Downloadable objects
  - Applications
  - Streaming media
- Most of the web uses CDNs

### The model

- TODO: See the slide p41

### CDN Deployment

- A CDN company deploys hundreds of servers around the world, often inside ISP
  networks, so that they are close to users
- CDN customer side:
  - The CDN replicates the customer's content on its servers
  - When the provider updates content, the CDN updates its servers with the
    new content
- User side:
  - The user sends a request to the origin server
  - The request is intercepted by a redirection service
  - The service forwards the user's request to the best CDN server
  - The content is served from that CDN server

### Companies

- Akamai
- Limelight
- ChinaCache
- Edgecast

### Benefits

- Reduce latency for users
- Reduce load on the origin server
- Increase security against Denial of Service attacks
- Scalability
- Cheaper, easier to manage
- Bypass traffic jams on the web:
  - Requested data is close to clients
  - Avoid bottleneck links

### Optimizations in CDN side

- Content is cached at various locations for faster access
- Use data compression
- Use load balancing to spread traffic across servers
- Security features like DDoS protection
- Use network peering for shorter data paths

### Examples and Usage

- Netflix:
  - Low-latency, high-definition media can be played
  - Handles peak traffic
  - Content has consistent quality
- Alibaba:
  - Rapid page loads for product listings
  - Supports large-scale events
  - Stability and scalability

### CDN Routing

#### Server Selection

- Load: to balance load
- Performance: to improve client performance, based on:
  - Geography
  - RTT
  - Throughput
  - Load
- Any node alive: to provide fault tolerance

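One way the criteria above could be combined is sketched below: dead servers are dropped first, overloaded ones are avoided when possible, and the lowest RTT wins among the rest. The `Server` record, the `select_server` function, and the 0.8 load threshold are all assumptions; real CDNs use their own, far richer policies.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    alive: bool
    rtt_ms: float
    load: float      # fraction of capacity in use, 0.0 to 1.0


def select_server(servers, max_load=0.8):
    """Pick the alive server with the lowest RTT, preferring load < max_load."""
    alive = [s for s in servers if s.alive]          # fault tolerance
    if not alive:
        raise LookupError("no server alive")
    # Fall back to every alive server if all of them are overloaded.
    candidates = [s for s in alive if s.load < max_load] or alive
    return min(candidates, key=lambda s: s.rtt_ms)   # client performance


servers = [
    Server("near-but-busy", alive=True, rtt_ms=10, load=0.95),
    Server("mid", alive=True, rtt_ms=25, load=0.40),
    Server("closest-but-down", alive=False, rtt_ms=5, load=0.10),
]
chosen = select_server(servers)
print(chosen.name)
```
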
#### Ways of redirecting

- As a part of routing: anycast (a single IP address is shared by many devices
  in multiple locations), cluster, load balancing
  - Pros: transparent to clients, works when browsers cache failed addresses,
    circumvents many routing problems
  - Cons: little control over the selection of server, complex, scalability,
    and TCP connections can't recover
- As a part of the application: HTTP redirect
  - Pros: application level, has more control
  - Cons: adds load and RTTs, and is hard to cache
- As a part of naming: DNS
  - Pros: suitable for caching, DNS can redirect to any IP
  - Cons: the query comes from the resolver rather than the client, it asks
    for a domain rather than a URL, and the load behind each resolver's
    population is hidden
    - These stats can be estimated

#### More on DNS redirection

- DNS redirection is used to redirect a client to a nearby server.
- Based on:
  - Latency to the client
  - Load balancing
    - Try to balance clients across many servers to avoid hotspots
  - Available servers
- Process:
  - The client's DNS request comes to the CDN's name server (see below for how
    it's accessed)
  - The DNS request is resolved to a nearby server by accessing CDN-controlled
    name servers
  - The CDN measures the state of the network in its infrastructure
- Two types of DNS redirection:
  - Full:
    - The origin server is controlled by the CDN
    - Pros: all requests are automatically redirected
    - Cons: may send a lot of traffic to the CDN, so it's expensive
  - Partial:
    - The content provider marks which content to hand to the CDN
      - Usually larger objects
      - Refer to images as `<img src=http://cdn.com/foo/bar/img.gif>`
      - When the website is accessed, the CDN serves the data
    - Pros: better control
    - Cons: have to mark content

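Marking content for partial redirection amounts to rewriting the URLs of selected objects so that they point at the CDN's hostname. The sketch below does this for relative image sources. The prefix `http://cdn.com/foo/bar/` is taken from the `<img>` example in the notes; the function `mark_for_cdn`, the list of extensions, and the assumption that attributes are double-quoted are made up for illustration, and a real site would use an HTML parser or its build tooling rather than a regular expression.

```python
import re

CDN_PREFIX = "http://cdn.com/foo/bar/"   # from the <img> example in the notes


def mark_for_cdn(html, extensions=("gif", "jpg", "png")):
    """Rewrite relative image sources so they are fetched from the CDN host."""
    pattern = re.compile(
        r'src="(?!https?://)([^"]+\.(?:%s))"' % "|".join(extensions))

    def rewrite(match):
        path = match.group(1).lstrip("/")
        return 'src="%s%s"' % (CDN_PREFIX, path)

    return pattern.sub(rewrite, html)


page = '<img src="img.gif"><script src="app.js"></script>'
marked = mark_for_cdn(page)
print(marked)
```

Only the image is moved to the CDN; the script keeps its original URL and is still served by the origin, which is the extra control partial redirection gives the content provider.
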
## Deployment

### Hosting your stuff

- Where: rely on measurements
  - Sample popular hostnames on alexa.com
  - Ask DNS from multiple vantage points
- Categorize by type:
  - Hostnames
  - Files
  - Unpopular

### Examples

- ChinaCache

## Future

### Challenges

- Mobile networks: latency to the cell is higher, and the internal network
  structure is opaque
- Video: needs large bandwidth
  - 16 Mbps to 30 Mbps compressed
  - Combined, this can reach 25K TBps
  - Even data centers don't have that much
  - Using multicast from end systems is a potential solution

### CDN2.0

- Hybrid CDN: Akamai
- Cloud-based video: Netflix
- Meta CDN: Conviva
- Virtual CDN: ISP micro-datacenters