From 53d7884dd5d6254faf03cad23d9e5666575d052f Mon Sep 17 00:00:00 2001 From: Ryan Date: Sat, 4 Jan 2025 20:08:31 +0800 Subject: [PATCH] Finish 4-4 --- 4-4-graph.md | 183 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 183 insertions(+) create mode 100644 4-4-graph.md diff --git a/4-4-graph.md b/4-4-graph.md new file mode 100644 index 0000000..49db779 --- /dev/null +++ b/4-4-graph.md @@ -0,0 +1,183 @@ +# Graphs + +## Intro + +- A graph is a pair `G = (V, E)`, where + - V is the set of vertices + - E is the set of edges + - They may contain additional information +- Types: + - Directed vs. un-directed + - Presence or absence of cycles +- Examples + - Internet map + - Food web + - Friendship network + - Google PageRank + - Bipartite graph + - Epidemic +- Used to model **interactions** +- The size grows with input data + +## Graph Storage + +### The need of graph DBs + +- Graph storage: traditional DB and NoSQL can store graphs, but their query + language doesn't natively support it +- We need languages or abstractions for finding the patterns +- This lead to graph DBs: + - Neo4j + - Titan + +### Graph DBs: + +- Use graph structures with nodes, edges or properties to store data +- Index free adjacency, since each node is a pointer to its adjacent element +- Edges hold information, and connect to nodes + +### Graph queries + +- Filter edges and nodes +- List sub-graphs +- Possibility of said nodes to reach +- Shortest path between two nodes + +### neo4j + +#### intro + +- Is a java based graph database +- Follows ACID +- Schema-less modeling + - A graph has nodes and (multiple) edges + - Nodes and edges has properties attached + - Can also have labels +- Performance is good for smaller datasets +- Use Cypher + +#### Cypher query language + +- Becoming a standard +- Match queries, return elements that satisfies +- No joins, simpler than SQL + +#### features + +- Consistency: runs on single server, which guarantees consistency (no + distribution) +- ACID compliant: have to start a transaction before any changes +- HA: Can optionally use replication, that use a master slave structure + - Master will propagate changes to slaves (writing to master is fast) + - When slave is written to, will first update the master **Synchronously** , + then master update others + +## Graph processing + +### Sample problem + +- To find the shortest path: SSSP (Single Source Shortest Path) problem, from + one start node to some target nodes + - Usually done with Dijkstra, on one single machine + +### Representation + +- Record the structure, attributes of edges and vertices +- An example is edge list +- Basic structure: + - Vertices: single, individual, discrete entities in a dataset (person) + - Edges: relationship between vertices, used to enhance the understanding + and dynamics within the network (interactions, relationships) + - Attributes attached to edges and vertices +- Example: + - Social network: + - Users are vertices + - Friendship are edges + - Timestamp of friendship, personal interests are attributes +- Adjacency Matrices + - Represented with a $n \times n$ matrix $M$: + - Advantage: + - Encapsulates the iteration over nodes + - Rows and Columns correspond to in links and out links + - Disadvantages: + - The graph is a sparse graph, if $|E|$ is much smaller than $|V|^2$ + - When the graph is sparse, the matrix wastes space. + +### MapReduce and Graph + +- Approach: parallel processing of each vertex + - Each function has access to local vertex info: node and the links + - Iterative execution: the output of reducer is the input of mapper in the + next iteration +- Example: Equal weight SSSP + - Problem: from orogin node, find shortest distance to every other node of + graph, with all link having same weight (distance) + - Intuition: Use BFS + - start with `distanceTo(node) = 0` + - Set directly reachable node's distanct to 1 + - For other nodes accessible to the current set of nodes, set distance + to 1 plus min(distance to accessible node) +- Problem using map reduce: + - Using map reduce will write data to HDFS every iteration, which is + inefficient + - A lot of communication over the net, inefficient +- Should use a in-memory system + +### Pregel model + +#### Intro + +- In every iteration, a function is executed at the vertex: iterative + computation + - vertices can send messages to neighbors + - messages arrive at next iteration +- Parallel computation + - Vertex is independent, is dependent on only the neighbors (They have to be + on same machine) + - Use message for synchronization +- Good for + - Computing shortest path + - Ranking pages + - BFS + +#### Graph partitioning + +- When the graph is too big, it can't fit on one single machine +- We can split the graph across machines +- A partition defines, which edge and vertices are allocated to which machine +- Performance impact: + - Large Overhead for sending messages to neighboring vertices on different + machines over the network. + +#### Influences + +- Pregel style graph processing systems: + - Pregel (original) + - Apach Giraph (Written in Java) + - Apache Spark GraphX (Extension of Spark) + +### GraphX + +#### Intro + +- Spark library for graph processing +- Specialized RDD for graph representation and information +- Methods for creating, transforming and implementing multiple graph metrics and + algorithms + +#### GraphX RDD + +- Hold graph data, and provide methods + - `VertexRDD`: VertexId, VertexData + - `EdgeRDD` EdgeData + - Triplets: Source vertex, edge, dest vertex + +#### Predefined methods + +- Provides access to information, and operation +- Pregel-like iterative graph traversals: + - Number of iterations are needed + - `mergeMsg`: Combine incoming messages to a single one + - `vprog`: update vertex properties + - `sendMsg`: Send messages to neighbors + - New graph is returned, with updated vertex values