root/EBU6502_cloud_computing_notes

Fork 0

Ryan 066f0ff358 Fill in some parts

2025-01-04 20:10:11 +08:00

5.7 KiB

Raw Permalink Blame History

Graphs

Intro

A graph is a pair G = (V, E), where
- V is the set of vertices
- E is the set of edges
- They may contain additional information
Types:
- Directed vs. un-directed
- Presence or absence of cycles
Examples
- Internet map
- Food web
- Friendship network
- Google PageRank
- Bipartite graph
- Epidemic
Used to model interactions
The size grows with input data

Graph Storage

The need of graph DBs

Graph storage: traditional DB and NoSQL can store graphs, but their query language doesn't natively support it
We need languages or abstractions for finding the patterns
This lead to graph DBs:
- Neo4j
- Titan

Graph DBs:

Use graph structures with nodes, edges or properties to store data
Index free adjacency, since each node is a pointer to its adjacent element
Edges hold information, and connect to nodes

Graph queries

Filter edges and nodes
List sub-graphs
Possibility of said nodes to reach
Shortest path between two nodes

neo4j

intro

Is a java based graph database
Follows ACID
Schema-less modeling
- A graph has nodes and (multiple) edges
- Nodes and edges has properties attached
- Can also have labels
Performance is good for smaller datasets
Use Cypher

Cypher query language

Becoming a standard
Match queries, return elements that satisfies
No joins, simpler than SQL

features

Consistency: runs on single server, which guarantees consistency (no distribution)
ACID compliant: have to start a transaction before any changes
HA: Can optionally use replication, that use a master slave structure
- Master will propagate changes to slaves (writing to master is fast)
- When slave is written to, will first update the master Synchronously , then master update others

Graph processing

Sample problem

To find the shortest path: SSSP (Single Source Shortest Path) problem, from one start node to some target nodes
- Usually done with Dijkstra, on one single machine

Representation

Record the structure, attributes of edges and vertices
An example is edge list
Basic structure:
- Vertices: single, individual, discrete entities in a dataset (person)
- Edges: relationship between vertices, used to enhance the understanding and dynamics within the network (interactions, relationships)
- Attributes attached to edges and vertices
Example:
- Social network:
  - Users are vertices
  - Friendship are edges
  - Timestamp of friendship, personal interests are attributes
Adjacency Matrices
- Represented with a n \times n matrix M:
- Advantage:
  - Encapsulates the iteration over nodes
  - Rows and Columns correspond to in links and out links
- Disadvantages:
  - The graph is a sparse graph, if |E| is much smaller than |V|^2
  - When the graph is sparse, the matrix wastes space.

MapReduce and Graph

Approach: parallel processing of each vertex
- Each function has access to local vertex info: node and the links
- Iterative execution: the output of reducer is the input of mapper in the next iteration
Example: Equal weight SSSP
- Problem: from orogin node, find shortest distance to every other node of graph, with all link having same weight (distance)
- Intuition: Use BFS
  - start with distanceTo(node) = 0
  - Set directly reachable node's distanct to 1
  - For other nodes accessible to the current set of nodes, set distance to 1 plus min(distance to accessible node)
Problem using map reduce:
- Using map reduce will write data to HDFS every iteration, which is inefficient
- A lot of communication over the net, inefficient
Should use a in-memory system

Pregel model

Intro

In every iteration, a function is executed at the vertex: iterative computation
- vertices can send messages to neighbors
- messages arrive at next iteration
Parallel computation
- Vertex is independent, is dependent on only the neighbors (They have to be on same machine)
- Use message for synchronization
Good for
- Computing shortest path
- Ranking pages
- BFS

Graph partitioning

When the graph is too big, it can't fit on one single machine
We can split the graph across machines
A partition defines, which edge and vertices are allocated to which machine
Performance impact:
- Large Overhead for sending messages to neighboring vertices on different machines over the network.

Influences

Pregel style graph processing systems:
- Pregel (original)
- Apach Giraph (Written in Java)
- Apache Spark GraphX (Extension of Spark)

GraphX

Intro

Spark library for graph processing
Specialized RDD for graph representation and information
Methods for creating, transforming and implementing multiple graph metrics and algorithms

GraphX RDD

Hold graph data, and provide methods
- VertexRDD: VertexId, VertexData
- EdgeRDD EdgeData
- Triplets: Source vertex, edge, dest vertex

Predefined methods

Provides access to information, and operation:
- Graph.vertices, graph.edges, graph.triples: access property
- Graph.degrees: Access connected edges: Provides a tuple with (vertexId, degree of each vertex)
- Graph.connectedComponents: each of the connected components of the graph
Pregel-like iterative graph traversals:
- Number of iterations are needed
- mergeMsg: Combine incoming messages to a single one
- vprog: update vertex properties
- sendMsg: Send messages to neighbors
- New graph is returned, with updated vertex values

5.7 KiB Raw Permalink Blame History

Graphs

Intro

Graph Storage

The need of graph DBs

Graph DBs:

Graph queries

neo4j

intro

Cypher query language

features

Graph processing

Sample problem

Representation

MapReduce and Graph

Pregel model

Intro

Graph partitioning

Influences

GraphX

Intro

GraphX RDD

Predefined methods

5.7 KiB

Raw Permalink Blame History