Finish 4-4

2025-01-04 20:08:31 +08:00 · 2025-01-04 20:08:31 +08:00 · 53d7884dd5
parent 6b8ca497db
commit 53d7884dd5
1 changed files with 183 additions and 0 deletions
--- a/4-4-graph.md
+++ b/4-4-graph.md
@@ -0,0 +1,183 @@
+# Graphs
+
+## Intro
+
+- A graph is a pair `G = (V, E)`, where
+    - V is the set of vertices
+    - E is the set of edges
+    - Both vertices and edges may carry additional information (attributes)
+- Types:
+    - Directed vs. undirected
+    - Presence or absence of cycles
+- Examples
+    - Internet map
+    - Food web
+    - Friendship network
+    - Google PageRank
+    - Bipartite graph
+    - Epidemic
+- Used to model **interactions**
+- The size grows with input data
+
+## Graph Storage
+
+### The need for graph DBs
+
+- Graph storage: traditional relational and NoSQL databases can store graphs,
+  but their query languages don't natively support graph queries
+- We need languages or abstractions for finding patterns in the graph
+- This led to graph DBs:
+    - Neo4j
+    - Titan
+
+### Graph DBs
+
+- Use graph structures with nodes, edges and properties to store data
+- Index-free adjacency: each node holds direct pointers to its adjacent
+  elements, so traversals need no index lookups
+- Edges hold information and connect nodes
+
+### Graph queries
+
+- Filter edges and nodes
+- List sub-graphs
+- Reachability: whether a given node can be reached from another
+- Shortest path between two nodes
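To make two of the query types above concrete, here is a minimal plain-Python sketch of edge filtering and a reachability check. The toy graph, the property names (`type`, `since`) and the helper names `filter_edges` and `is_reachable` are all hypothetical and chosen only for illustration; a graph DB would express the same queries in its own query language.

```python
from collections import deque

# Hypothetical toy graph: each edge is (source, target, properties).
edges = [
    ("alice", "bob", {"type": "friend", "since": 2019}),
    ("bob", "carol", {"type": "friend", "since": 2021}),
    ("carol", "dave", {"type": "colleague", "since": 2020}),
    ("erin", "alice", {"type": "friend", "since": 2022}),
]


def filter_edges(edges, predicate):
    """Keep only the edges whose property dict satisfies the predicate."""
    return [e for e in edges if predicate(e[2])]


def is_reachable(edges, start, goal):
    """Return True if goal can be reached from start along directed edges."""
    adjacency = {}
    for src, dst, _ in edges:
        adjacency.setdefault(src, []).append(dst)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


friend_edges = filter_edges(edges, lambda props: props["type"] == "friend")
print([(s, d) for s, d, _ in friend_edges])
print(is_reachable(edges, "alice", "dave"))         # follows all edge types
print(is_reachable(friend_edges, "alice", "dave"))  # friend edges only
```

Filtering first and then traversing the result is one simple way to list a sub-graph and query it.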
+
+### Neo4j
+
+#### Intro
+
+- A Java-based graph database
+- Follows ACID
+- Schema-less modeling
+    - A graph has nodes and (multiple) edges
+    - Nodes and edges have properties attached
+    - Can also have labels
+- Performance is good for smaller datasets
+- Uses the Cypher query language
+
+#### Cypher query language
+
+- Becoming a standard
+- Queries match patterns and return the elements that satisfy them
+- No joins, simpler than SQL
+
+#### Features
+
+- Consistency: runs on a single server, which guarantees consistency (no
+  distribution)
+- ACID compliant: a transaction has to be started before any change
+- HA: can optionally use replication with a master-slave structure
+    - The master propagates changes to the slaves (writing to the master is
+      fast)
+    - When a slave is written to, it first updates the master
+      **synchronously**, then the master updates the other slaves
+
+## Graph processing
+
+### Sample problem
+
+- To find the shortest path: SSSP (Single Source Shortest Path) problem, from
+  one start node to every other node
+    - Usually solved with Dijkstra's algorithm, on one single machine
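As a reference point for the single-machine case, the following is a minimal sketch of Dijkstra's algorithm using a binary heap. The function name `dijkstra`, the example graph and its weights are assumptions made for illustration, and edge weights are assumed to be non-negative.

```python
import heapq


def dijkstra(graph, source):
    """Shortest distance from source to every reachable node.

    graph maps each node to a list of (neighbor, weight) pairs;
    weights must be non-negative.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbor, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Hypothetical weighted, directed graph.
graph = {
    "A": [("B", 4), ("C", 1)],
    "C": [("B", 2), ("D", 7)],
    "B": [("D", 3)],
    "D": [],
}
print(dijkstra(graph, "A"))
```

The priority queue is global state shared by the whole computation, which is why this approach fits a single machine better than a cluster.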
+
+### Representation
+
+- Record the structure, attributes of edges and vertices
+- An example is an edge list
+- Basic structure:
+    - Vertices: single, individual, discrete entities in a dataset (e.g. a
+      person)
+    - Edges: relationships between vertices, which help in understanding the
+      dynamics within the network (interactions, relationships)
+    - Attributes attached to edges and vertices
+- Example:
+    - Social network:
+        - Users are vertices
+        - Friendships are edges
+        - Timestamps of friendships and personal interests are attributes
+- Adjacency matrices
+    - Represented with an $n \times n$ matrix $M$, where $n = |V|$ and
+      $M_{ij} = 1$ if there is an edge from vertex $i$ to vertex $j$
+      ($0$ otherwise)
+    - Advantages:
+        - Encapsulates the iteration over nodes
+        - Rows correspond to out-links and columns to in-links
+    - Disadvantages:
+        - A graph is sparse if $|E|$ is much smaller than $|V|^2$
+        - When the graph is sparse, most entries are $0$, so the matrix
+          wastes space
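The trade-off between the representations can be seen on a tiny example. This sketch builds an adjacency matrix and an adjacency list from the same edge list, assuming the convention that `matrix[i][j] == 1` means an edge from vertex `i` to vertex `j`; the graph itself is hypothetical.

```python
# Hypothetical directed graph with n = 4 vertices, given as an edge list.
n = 4
edge_list = [(0, 1), (0, 2), (2, 3)]

# Adjacency matrix: matrix[i][j] == 1 iff there is an edge i -> j.
matrix = [[0] * n for _ in range(n)]
for src, dst in edge_list:
    matrix[src][dst] = 1

# Adjacency list: only the existing out-links are stored.
adjacency = {v: [] for v in range(n)}
for src, dst in edge_list:
    adjacency[src].append(dst)

out_links_of_0 = [j for j in range(n) if matrix[0][j]]  # scan row 0
in_links_of_3 = [i for i in range(n) if matrix[i][3]]   # scan column 3
stored_cells = n * n
stored_links = sum(len(targets) for targets in adjacency.values())
print(out_links_of_0, in_links_of_3, stored_cells, stored_links)
```

Here the matrix stores 16 cells for 3 edges, while the adjacency list stores only the 3 links; the gap grows quadratically with the number of vertices.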
+
+### MapReduce and Graph
+
+- Approach: parallel processing of each vertex
+    - Each function has access to local vertex info: node and the links
+    - Iterative execution: the output of reducer is the input of mapper in the
+      next iteration
+- Example: equal-weight SSSP
+    - Problem: from an origin node, find the shortest distance to every other
+      node of the graph, with all links having the same weight (distance)
+    - Intuition: use BFS
+        - Start with `distanceTo(origin) = 0`
+        - Set the distance of every directly reachable node to 1
+        - For any other node reachable from a set of nodes `M`, set its
+          distance to 1 plus the minimum of `distanceTo(m)` over all `m` in
+          `M`
+- Problems with MapReduce:
+    - MapReduce writes data to HDFS in every iteration, which is inefficient
+    - A lot of communication over the network, which is also inefficient
+- An in-memory system should be used instead
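The iterative scheme above can be simulated in plain Python to see why it needs several full passes over the graph. This is only a single-process sketch, not Hadoop code: `mapper`, `reducer`, `run_iteration` and `bfs_distances` are hypothetical names, and the in-memory "shuffle" stands in for what would be a write to HDFS plus network transfer in each round.

```python
from collections import defaultdict

INF = float("inf")


def mapper(node, value):
    """Emit the node's own record plus a tentative distance for each out-link."""
    distance, neighbors = value
    yield node, ("graph", neighbors)  # pass the graph structure along
    yield node, ("dist", distance)
    if distance != INF:
        for neighbor in neighbors:
            yield neighbor, ("dist", distance + 1)


def reducer(node, values):
    """Keep the smallest distance seen and restore the adjacency list."""
    neighbors, best = [], INF
    for kind, payload in values:
        if kind == "graph":
            neighbors = payload
        else:
            best = min(best, payload)
    return node, (best, neighbors)


def run_iteration(state):
    """One MapReduce round: map every node, shuffle by key, then reduce."""
    shuffled = defaultdict(list)
    for node, value in state.items():
        for key, emitted in mapper(node, value):
            shuffled[key].append(emitted)
    return dict(reducer(node, values) for node, values in shuffled.items())


def bfs_distances(adjacency, origin):
    """Repeat rounds until no distance changes; return distances and rounds."""
    state = {v: (0 if v == origin else INF, nbrs) for v, nbrs in adjacency.items()}
    iterations = 0
    while True:
        new_state = run_iteration(state)
        iterations += 1
        if new_state == state:  # no distance changed: converged
            break
        state = new_state
    return {v: d for v, (d, _) in state.items()}, iterations


# Hypothetical unweighted graph.
adjacency = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
distances, rounds = bfs_distances(adjacency, "a")
print(distances, rounds)
```

Note that the mapper has to re-emit the adjacency list in every round just to keep the graph structure alive, which is part of the overhead described above.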
+
+### Pregel model
+
+#### Intro
+
+- In every iteration, a function is executed at the vertex: iterative
+  computation
+    - vertices can send messages to neighbors
+    - messages arrive at next iteration
+- Parallel computation
+    - Each vertex computes independently and depends only on its neighbors
+      (which should ideally be on the same machine)
+    - Messages are used for synchronization
+- Good for
+    - Computing shortest path
+    - Ranking pages
+    - BFS
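A small single-process sketch can illustrate the vertex-centric idea: in each superstep a vertex reads the messages sent to it in the previous superstep, updates its value, and possibly sends new messages. The example propagates the maximum value through the graph; the function name `pregel_max` and the sample graph are assumptions for illustration, and a real Pregel system would distribute the vertices across workers.

```python
def pregel_max(adjacency, values):
    """Propagate the maximum vertex value through the graph in supersteps.

    adjacency maps each vertex to its out-neighbors; values maps each
    vertex to its initial number. Returns the final values and the number
    of supersteps executed.
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value to its neighbors.
    inbox = {v: [] for v in adjacency}
    for vertex, neighbors in adjacency.items():
        for neighbor in neighbors:
            inbox[neighbor].append(values[vertex])
    supersteps = 1

    while any(inbox.values()):  # stop when no messages are in flight
        outbox = {v: [] for v in adjacency}
        for vertex, messages in inbox.items():
            if not messages:
                continue  # inactive vertex: nothing received
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best
                # Value changed, so notify neighbors in the next superstep.
                for neighbor in adjacency[vertex]:
                    outbox[neighbor].append(best)
        inbox = outbox  # messages become visible only after the barrier
        supersteps += 1
    return values, supersteps


# Hypothetical chain-shaped graph with bidirectional links.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final, steps = pregel_max(adjacency, {"a": 3, "b": 6, "c": 2, "d": 1})
print(final, steps)
```

Swapping `inbox` for `outbox` only at the end of the loop body plays the role of the synchronization barrier: messages sent in one iteration arrive in the next.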
+
+#### Graph partitioning
+
+- When the graph is too big, it can't fit on one single machine
+- We can split the graph across machines
+- A partitioning defines which edges and vertices are allocated to which
+  machine
+- Performance impact:
+    - Large Overhead for sending messages to neighboring vertices on different
+      machines over the network.
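The cost of a partitioning can be approximated by counting the edges whose endpoints land on different machines, since each such edge means network messages. The sketch below compares a simple hash partitioning with a hand-picked one on a hypothetical graph made of two triangles joined by one edge; `hash_partition` and `count_cut_edges` are illustrative names, not part of any system mentioned here.

```python
def hash_partition(vertices, num_machines):
    """Assign each vertex to a machine by vertex id modulo machine count."""
    return {v: v % num_machines for v in vertices}


def count_cut_edges(edges, assignment):
    """Count edges whose endpoints live on different machines."""
    return sum(1 for src, dst in edges if assignment[src] != assignment[dst])


# Hypothetical graph: two triangles (0-1-2 and 3-4-5) joined by edge (2, 3).
vertices = range(6)
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]

by_hash = hash_partition(vertices, 2)
by_cluster = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}  # hand-picked assignment

print(count_cut_edges(edges, by_hash))
print(count_cut_edges(edges, by_cluster))
```

Hashing balances the load but ignores the structure of the graph, while an assignment that keeps each triangle together cuts far fewer edges.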
+
+#### Influences
+
+- Pregel style graph processing systems:
+    - Pregel (original)
+    - Apache Giraph (written in Java)
+    - Apache Spark GraphX (Extension of Spark)
+
+### GraphX
+
+#### Intro
+
+- Spark library for graph processing
+- Specialized RDDs for graph representation and information
+- Methods for creating and transforming graphs, plus implementations of
+  multiple graph metrics and algorithms
+
+#### GraphX RDD
+
+- Hold graph data, and provide methods
+    - `VertexRDD`: VertexId, VertexData
+    - `EdgeRDD`: source VertexId, destination VertexId, EdgeData
+    - Triplets: source vertex, edge, destination vertex
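The triplet view is essentially a join of the edge table with the vertex table on both endpoints. The following plain-Python sketch shows that join on hypothetical data; it does not use Spark, and `triplets` here is an illustrative helper rather than the GraphX member of the same name.

```python
# Hypothetical vertex and edge tables, mirroring (VertexId, data) and
# (srcId, dstId, data) records.
vertices = {1: "alice", 2: "bob", 3: "carol"}
edges = [(1, 2, "follows"), (2, 3, "follows"), (3, 1, "blocks")]


def triplets(vertices, edges):
    """Join each edge with the data of its source and destination vertex."""
    return [
        ((src, vertices[src]), attr, (dst, vertices[dst]))
        for src, dst, attr in edges
    ]


for (src_id, src_data), attr, (dst_id, dst_data) in triplets(vertices, edges):
    print(f"{src_data} -[{attr}]-> {dst_data}")
```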
+
+#### Predefined methods
+
+- Provide access to graph information and operations
+- Pregel-like iterative graph traversals:
+    - A (maximum) number of iterations is specified
+    - `mergeMsg`: combines incoming messages into a single one
+    - `vprog`: updates vertex properties
+    - `sendMsg`: sends messages to neighbors
+    - A new graph is returned, with updated vertex values
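To show how the three user functions cooperate, here is a plain-Python emulation of a Pregel-like loop, applied to SSSP. It is a sketch under simplifying assumptions and not the GraphX API (which is Scala): the `pregel` function, its parameter order and the rule that only vertices that just received a message send along their out-edges are all choices made for this illustration.

```python
INF = float("inf")


def pregel(vertices, edges, initial_msg, max_iterations, vprog, send_msg, merge_msg):
    """Plain-Python emulation of a Pregel-like loop (not the GraphX API).

    vertices: dict vertex_id -> value; edges: list of (src, dst, attr).
    Returns a new dict with the updated vertex values.
    """
    # Every vertex first processes the initial message.
    values = {v: vprog(v, val, initial_msg) for v, val in vertices.items()}
    active = set(values)
    for _ in range(max_iterations):
        inbox = {}
        for src, dst, attr in edges:
            if src not in active:
                continue  # only vertices that just received a message send
            for target, msg in send_msg(src, values[src], dst, values[dst], attr):
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:
            break  # no messages left: converged
        for v, msg in inbox.items():
            values[v] = vprog(v, values[v], msg)
        active = set(inbox)
    return values


# SSSP expressed with the three user functions.
def vprog(vertex_id, dist, msg):
    """Update the vertex value with the merged incoming message."""
    return min(dist, msg)


def send_msg(src, src_dist, dst, dst_dist, weight):
    """Send a shorter candidate distance to the destination, if any."""
    if src_dist + weight < dst_dist:
        yield dst, src_dist + weight


def merge_msg(a, b):
    """Combine two messages for the same vertex into one."""
    return min(a, b)


# Hypothetical weighted graph; vertex 1 is the source.
source = 1
vertices = {v: (0 if v == source else INF) for v in [1, 2, 3, 4]}
edges = [(1, 2, 4), (1, 3, 1), (3, 2, 2), (2, 4, 3)]
result = pregel(vertices, edges, INF, 10, vprog, send_msg, merge_msg)
print(result)
```

The input `vertices` dict is left untouched and a new mapping is returned, which mirrors the point that the traversal yields a new graph with updated vertex values.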