root/EBU6502_cloud_computing_notes

Fork 0

Ryan 3d4dd798b7 Update consistency level

2025-01-06 14:47:10 +08:00

9.1 KiB

Raw Blame History

Cloud Databases

Traditional database

Intro

Most traditional DB are relational
Relational data model has been developed to serve application that focus on consistency and reliability
Data consistency and reliability are guaranteed by ACID properties

ACID properties

Atomicity: No partial transaction, if it fail, rolls back
Consistency: Changes made by transaction respect all database integrity constraints, and database remain in consistent state
Isolation: Transactions can not see changes made by concurrent transactions
Durability: Changes stored permanently, persistent data

NOSQL Database System

Not relational database, called (Not Only SQL)
Features:
- Relax one of ACID properties
- Key-Value data model, for storing data
- schemeless, or scheme lacks the constraints
- Uses data partitioning, has data replication and distribution
- Has no joins, and use data sharding

NOSQL vs. relation DB

Comparison	NoSQL	Relational Database
Type of data	Mainly unstructured	Structured only
Volume of data	High volume	Low volume
Single point of failure	No	Yes
Data access	Many locations	Few locations

Advantages of NOSQL

Cheap
Data replication and partition, leading to no single point of failure
Easy to distribute
No schema reqired

Issues with relational database

Vertical scaling is hard
Horizontal scaling will break ACID

CAP

Theorem

Content

A distributed system can't have all of the three
Consistency of storage: having the same value for all copies, every read receive the most write or error
Availability: Request received from non-failing node, results in a response
Partitioning tolerance: System continues to operate despite a number of messages being dropped by network

NoSQL and CAP

Cloud databases's integrity are not expected
Availability is characterized by a small latency
No Clear border between availability and network partitions

Trade offs

SQL: Strong consistency and availability, but not partition tolerance
NoSQL, for example Dynamo: availability and partition tolerance, but weak consistency
- High latency losts revenues, so sacrificed strong consistency
- Now has a model of eventual consistency

Relaxing Availability:

Not possible, we want our website to be available at all time

Consistency

Definition

Strong consistency: replicas duplicate linearly in same total order
- No apparent conflicts
- The traditional way
- ACID SQL database are strongly consistent
Distributed and ACID: Need to form a concensus by synchronizing the replica states

Eventual Consistency

May not converge to same total order
Update accepted by local node, and propagated to other nodes
No synchronization needed
- Eventually all replicas are updated, if there's no update for a long enough time

Breakdown of a example

TODO: read the diagrams from P25

Example: Facebook

When checking out a story, it may not be available at first, because the consistency is not yet achieved
Because it's impossible without eventual consistency, with 2.7 billion users.
This reduces the load and improve the availability

Example: Dropbox

Based on personal syncing, not real time collaboration, so immediate consistency is not that important
Eventual consistency, your updates are propagated

Example: ATM

Event ATM uses eventual consistency
Because of higher availability needed
You can overdraw money, if the machined is partitioned from network
It has a limit on the amount of withdraw, and you will be charged

Strong consistency

Counterpart of Eventual consistency
All read must return data from latest write
Very hard to achieve with high availability and high partition tolerance

Replication of data

Data partitioning and replication

Reasons:
- Amount of data exceeds the capacity of a single machine
- Scaling for load balancing
- Ensure reliability and availability by replication
NoSQL databases usually maximize availability when partitioning
Techniques to achieve this
- Memory caches
- Separating RW
- Clustering and sharding

Caches

Memory caches are transient, partly partitioned and replicated in-memory databases
Replicates most requested parts to main memories of a number of servers
Advantages:
- Fast response to clients
- Off-loading of database servers

Separating Reads from Writes

Master server dedicated to writes
Some replicas servers (Slaves) dedicated to reads
Master replicates updates to slaves
Error handling: if the master crashes
- Before completing replication: Write operation is lost
- Otherwise: Most up to date slave take the master role

Clustering

HA (High availability) clusters are groups of computers that support server applications
Using redundant computers in groups to provide continued service, in case one of the components fail, to elimimate Single Point of Failure (SPOF)
Failover: When an HA cluster detects a fault, the application is immediately restarted on another system without requiring intervention

Sharding

Data partitioning scheme, that
- data that is accessed and modified frequently are stored on the same shard
- Each shard is on a different node
To spread load across different servers
This technique is not appropriate for relational database, since they are normalized
Popular with non-relational databases

Cloud DB Services

Types of NOSQL, and examples

Key value databases:
- Single value or row, indexed by a key
- Memcached, vertica
Column Family databases
- A sparse, distributed, persistent multidimensional sorted map
- Google BigTable, Apache Cassandra
Document databases
- Multi-field documents or objects with JSON access
- MongoDB
Graph databases
- Nodes, edges, properties
- Neo4J, sones

Comparing solutions based on CAP

CA: RDBMS
CP: BigTable, MongoDB
AP: Dynamo, Cassandra

Examples

Memcached

Features
- High performance distributed in-memory caching service
- Similar to hash table, manages objects
- Use standard get and set
To serve requests for read-heavy workloads
Used in Facebook, LinkedIn

Example pseudocode:

public String getFoo(String fooId)
    String foo = memcached.get(fooId)
    if foo != null return foo
    foo = fetchFooFromDatabase(fooId)
    memcached.set(fooId, foo)
    return foo

Size:
- Has set number of bits 128 bits
- Each value no more than 1MB
- LRU policy: when full, new one use old one
  - Least used item removed
  - This way, frequently accessed items are in cache, regardless of capacity
Clustering:
- Multiple nodes, each one of them has different keys
- Servers are not aware there are multiple of them (transparent): Smart Clients & Dumb Servers
  - Client has function that maps keys to servers, while
  - Servers are not aware there are multiple of them

Cassandra

Features:

Elastic scaling: read and write increases linearly
Replication, per node and per data center
Decentralized
Column based, multi dimensional hash table without join capabilities
Consistency is tunable

Consistent hashing: Point on a circle

Each database object is mapped to a point on circle, by hashing key
Node is also mapped to a point on the same circle
To find an object, it:
- Hashes the object's key to a point on the circle
- Walk clockwise, until it encounters the first node, the node will store the object
Benefits: reduce the impact of node entering and leaving the ring
Node changes: redistribute existing data objects
- Leave: node in the clockwise direction will store objects that belong to left code
- Add: mapped to a point, map objects that should belong to it to the node

Replication

Replication factor: how many copies of my data exists: data item replicated at N hosts, this is associated to a key. And replicated nodes are chosen clockwise
Coordinator node: replication of the data items
Consistency level: how recent and in-sync all replicas of a row of data are. It prefers availability over consistency, and let you tune how consistent it would be, replica consistency levels:
- ANY: one replica node, including handoff need to reply, in order for write to success
- ONE: one node
- QUORUM: N / 2 + 1
  - LOCAL_QUORUM: local datacenter
  - EACH_QUORUM: each datacenter
- ALL: all replicas
Consistency policies: Rack unaware, rack aware, datacenter aware

9.1 KiB Raw Blame History

Cloud Databases

Cloud Databases

Traditional database

Intro

ACID properties

NOSQL Database System

NOSQL vs. relation DB

Advantages of NOSQL

Issues with relational database

CAP

Theorem

Content

NoSQL and CAP

Trade offs

Relaxing Availability:

Consistency

Definition

Eventual Consistency

Breakdown of a example

Example: Facebook

Example: Dropbox

Example: ATM

Strong consistency

Replication of data

Data partitioning and replication

Caches

Separating Reads from Writes

Clustering

Sharding

Cloud DB Services

Types of NOSQL, and examples

Comparing solutions based on CAP

Examples

Memcached

Cassandra

Features:

Consistent hashing: Point on a circle

Replication

Data Model

Example

EBay

9.1 KiB

Raw Blame History