184 lines
5.4 KiB
Markdown
184 lines
5.4 KiB
Markdown
|
# Graphs
|
||
|
|
||
|
## Intro
|
||
|
|
||
|
- A graph is a pair `G = (V, E)`, where
|
||
|
- V is the set of vertices
|
||
|
- E is the set of edges
|
||
|
- They may contain additional information
|
||
|
- Types:
|
||
|
- Directed vs. un-directed
|
||
|
- Presence or absence of cycles
|
||
|
- Examples
|
||
|
- Internet map
|
||
|
- Food web
|
||
|
- Friendship network
|
||
|
- Google PageRank
|
||
|
- Bipartite graph
|
||
|
- Epidemic
|
||
|
- Used to model **interactions**
|
||
|
- The size grows with input data
|
||
|
|
||
|
## Graph Storage
|
||
|
|
||
|
### The need of graph DBs
|
||
|
|
||
|
- Graph storage: traditional DB and NoSQL can store graphs, but their query
|
||
|
language doesn't natively support it
|
||
|
- We need languages or abstractions for finding the patterns
|
||
|
- This lead to graph DBs:
|
||
|
- Neo4j
|
||
|
- Titan
|
||
|
|
||
|
### Graph DBs:
|
||
|
|
||
|
- Use graph structures with nodes, edges or properties to store data
|
||
|
- Index free adjacency, since each node is a pointer to its adjacent element
|
||
|
- Edges hold information, and connect to nodes
|
||
|
|
||
|
### Graph queries
|
||
|
|
||
|
- Filter edges and nodes
|
||
|
- List sub-graphs
|
||
|
- Possibility of said nodes to reach
|
||
|
- Shortest path between two nodes
|
||
|
|
||
|
### neo4j
|
||
|
|
||
|
#### intro
|
||
|
|
||
|
- Is a java based graph database
|
||
|
- Follows ACID
|
||
|
- Schema-less modeling
|
||
|
- A graph has nodes and (multiple) edges
|
||
|
- Nodes and edges has properties attached
|
||
|
- Can also have labels
|
||
|
- Performance is good for smaller datasets
|
||
|
- Use Cypher
|
||
|
|
||
|
#### Cypher query language
|
||
|
|
||
|
- Becoming a standard
|
||
|
- Match queries, return elements that satisfies
|
||
|
- No joins, simpler than SQL
|
||
|
|
||
|
#### features
|
||
|
|
||
|
- Consistency: runs on single server, which guarantees consistency (no
|
||
|
distribution)
|
||
|
- ACID compliant: have to start a transaction before any changes
|
||
|
- HA: Can optionally use replication, that use a master slave structure
|
||
|
- Master will propagate changes to slaves (writing to master is fast)
|
||
|
- When slave is written to, will first update the master **Synchronously** ,
|
||
|
then master update others
|
||
|
|
||
|
## Graph processing
|
||
|
|
||
|
### Sample problem
|
||
|
|
||
|
- To find the shortest path: SSSP (Single Source Shortest Path) problem, from
|
||
|
one start node to some target nodes
|
||
|
- Usually done with Dijkstra, on one single machine
|
||
|
|
||
|
### Representation
|
||
|
|
||
|
- Record the structure, attributes of edges and vertices
|
||
|
- An example is edge list
|
||
|
- Basic structure:
|
||
|
- Vertices: single, individual, discrete entities in a dataset (person)
|
||
|
- Edges: relationship between vertices, used to enhance the understanding
|
||
|
and dynamics within the network (interactions, relationships)
|
||
|
- Attributes attached to edges and vertices
|
||
|
- Example:
|
||
|
- Social network:
|
||
|
- Users are vertices
|
||
|
- Friendship are edges
|
||
|
- Timestamp of friendship, personal interests are attributes
|
||
|
- Adjacency Matrices
|
||
|
- Represented with a $n \times n$ matrix $M$:
|
||
|
- Advantage:
|
||
|
- Encapsulates the iteration over nodes
|
||
|
- Rows and Columns correspond to in links and out links
|
||
|
- Disadvantages:
|
||
|
- The graph is a sparse graph, if $|E|$ is much smaller than $|V|^2$
|
||
|
- When the graph is sparse, the matrix wastes space.
|
||
|
|
||
|
### MapReduce and Graph
|
||
|
|
||
|
- Approach: parallel processing of each vertex
|
||
|
- Each function has access to local vertex info: node and the links
|
||
|
- Iterative execution: the output of reducer is the input of mapper in the
|
||
|
next iteration
|
||
|
- Example: Equal weight SSSP
|
||
|
- Problem: from orogin node, find shortest distance to every other node of
|
||
|
graph, with all link having same weight (distance)
|
||
|
- Intuition: Use BFS
|
||
|
- start with `distanceTo(node) = 0`
|
||
|
- Set directly reachable node's distanct to 1
|
||
|
- For other nodes accessible to the current set of nodes, set distance
|
||
|
to 1 plus min(distance to accessible node)
|
||
|
- Problem using map reduce:
|
||
|
- Using map reduce will write data to HDFS every iteration, which is
|
||
|
inefficient
|
||
|
- A lot of communication over the net, inefficient
|
||
|
- Should use a in-memory system
|
||
|
|
||
|
### Pregel model
|
||
|
|
||
|
#### Intro
|
||
|
|
||
|
- In every iteration, a function is executed at the vertex: iterative
|
||
|
computation
|
||
|
- vertices can send messages to neighbors
|
||
|
- messages arrive at next iteration
|
||
|
- Parallel computation
|
||
|
- Vertex is independent, is dependent on only the neighbors (They have to be
|
||
|
on same machine)
|
||
|
- Use message for synchronization
|
||
|
- Good for
|
||
|
- Computing shortest path
|
||
|
- Ranking pages
|
||
|
- BFS
|
||
|
|
||
|
#### Graph partitioning
|
||
|
|
||
|
- When the graph is too big, it can't fit on one single machine
|
||
|
- We can split the graph across machines
|
||
|
- A partition defines, which edge and vertices are allocated to which machine
|
||
|
- Performance impact:
|
||
|
- Large Overhead for sending messages to neighboring vertices on different
|
||
|
machines over the network.
|
||
|
|
||
|
#### Influences
|
||
|
|
||
|
- Pregel style graph processing systems:
|
||
|
- Pregel (original)
|
||
|
- Apach Giraph (Written in Java)
|
||
|
- Apache Spark GraphX (Extension of Spark)
|
||
|
|
||
|
### GraphX
|
||
|
|
||
|
#### Intro
|
||
|
|
||
|
- Spark library for graph processing
|
||
|
- Specialized RDD for graph representation and information
|
||
|
- Methods for creating, transforming and implementing multiple graph metrics and
|
||
|
algorithms
|
||
|
|
||
|
#### GraphX RDD
|
||
|
|
||
|
- Hold graph data, and provide methods
|
||
|
- `VertexRDD`: VertexId, VertexData
|
||
|
- `EdgeRDD` EdgeData
|
||
|
- Triplets: Source vertex, edge, dest vertex
|
||
|
|
||
|
#### Predefined methods
|
||
|
|
||
|
- Provides access to information, and operation
|
||
|
- Pregel-like iterative graph traversals:
|
||
|
- Number of iterations are needed
|
||
|
- `mergeMsg`: Combine incoming messages to a single one
|
||
|
- `vprog`: update vertex properties
|
||
|
- `sendMsg`: Send messages to neighbors
|
||
|
- New graph is returned, with updated vertex values
|