EBU6502_cloud_computing_notes/4-4-graph.md
2025-01-04 20:10:11 +08:00

188 lines
5.7 KiB
Markdown

# Graphs
## Intro
- A graph is a pair `G = (V, E)`, where
- V is the set of vertices
- E is the set of edges
- They may contain additional information
- Types:
- Directed vs. un-directed
- Presence or absence of cycles
- Examples
- Internet map
- Food web
- Friendship network
- Google PageRank
- Bipartite graph
- Epidemic
- Used to model **interactions**
- The size grows with input data
## Graph Storage
### The need of graph DBs
- Graph storage: traditional DB and NoSQL can store graphs, but their query
language doesn't natively support it
- We need languages or abstractions for finding the patterns
- This lead to graph DBs:
- Neo4j
- Titan
### Graph DBs:
- Use graph structures with nodes, edges or properties to store data
- Index free adjacency, since each node is a pointer to its adjacent element
- Edges hold information, and connect to nodes
### Graph queries
- Filter edges and nodes
- List sub-graphs
- Possibility of said nodes to reach
- Shortest path between two nodes
### neo4j
#### intro
- Is a java based graph database
- Follows ACID
- Schema-less modeling
- A graph has nodes and (multiple) edges
- Nodes and edges has properties attached
- Can also have labels
- Performance is good for smaller datasets
- Use Cypher
#### Cypher query language
- Becoming a standard
- Match queries, return elements that satisfies
- No joins, simpler than SQL
#### features
- Consistency: runs on single server, which guarantees consistency (no
distribution)
- ACID compliant: have to start a transaction before any changes
- HA: Can optionally use replication, that use a master slave structure
- Master will propagate changes to slaves (writing to master is fast)
- When slave is written to, will first update the master **Synchronously** ,
then master update others
## Graph processing
### Sample problem
- To find the shortest path: SSSP (Single Source Shortest Path) problem, from
one start node to some target nodes
- Usually done with Dijkstra, on one single machine
### Representation
- Record the structure, attributes of edges and vertices
- An example is edge list
- Basic structure:
- Vertices: single, individual, discrete entities in a dataset (person)
- Edges: relationship between vertices, used to enhance the understanding
and dynamics within the network (interactions, relationships)
- Attributes attached to edges and vertices
- Example:
- Social network:
- Users are vertices
- Friendship are edges
- Timestamp of friendship, personal interests are attributes
- Adjacency Matrices
- Represented with a $n \times n$ matrix $M$:
- Advantage:
- Encapsulates the iteration over nodes
- Rows and Columns correspond to in links and out links
- Disadvantages:
- The graph is a sparse graph, if $|E|$ is much smaller than $|V|^2$
- When the graph is sparse, the matrix wastes space.
### MapReduce and Graph
- Approach: parallel processing of each vertex
- Each function has access to local vertex info: node and the links
- Iterative execution: the output of reducer is the input of mapper in the
next iteration
- Example: Equal weight SSSP
- Problem: from orogin node, find shortest distance to every other node of
graph, with all link having same weight (distance)
- Intuition: Use BFS
- start with `distanceTo(node) = 0`
- Set directly reachable node's distanct to 1
- For other nodes accessible to the current set of nodes, set distance
to 1 plus min(distance to accessible node)
- Problem using map reduce:
- Using map reduce will write data to HDFS every iteration, which is
inefficient
- A lot of communication over the net, inefficient
- Should use a in-memory system
### Pregel model
#### Intro
- In every iteration, a function is executed at the vertex: iterative
computation
- vertices can send messages to neighbors
- messages arrive at next iteration
- Parallel computation
- Vertex is independent, is dependent on only the neighbors (They have to be
on same machine)
- Use message for synchronization
- Good for
- Computing shortest path
- Ranking pages
- BFS
#### Graph partitioning
- When the graph is too big, it can't fit on one single machine
- We can split the graph across machines
- A partition defines, which edge and vertices are allocated to which machine
- Performance impact:
- Large Overhead for sending messages to neighboring vertices on different
machines over the network.
#### Influences
- Pregel style graph processing systems:
- Pregel (original)
- Apach Giraph (Written in Java)
- Apache Spark GraphX (Extension of Spark)
### GraphX
#### Intro
- Spark library for graph processing
- Specialized RDD for graph representation and information
- Methods for creating, transforming and implementing multiple graph metrics and
algorithms
#### GraphX RDD
- Hold graph data, and provide methods
- `VertexRDD`: VertexId, VertexData
- `EdgeRDD` EdgeData
- Triplets: Source vertex, edge, dest vertex
#### Predefined methods
- Provides access to information, and operation:
- `Graph.vertices`, `graph.edges`, `graph.triples`: access property
- `Graph.degrees`: Access connected edges: Provides a tuple with
(`vertexId`, `degree` of each vertex)
- `Graph.connectedComponents`: each of the connected components of the graph
- Pregel-like iterative graph traversals:
- Number of iterations are needed
- `mergeMsg`: Combine incoming messages to a single one
- `vprog`: update vertex properties
- `sendMsg`: Send messages to neighbors
- New graph is returned, with updated vertex values