Finish 4-4
This commit is contained in:
parent
6b8ca497db
commit
53d7884dd5
183
4-4-graph.md
Normal file
183
4-4-graph.md
Normal file
|
@ -0,0 +1,183 @@
|
|||
# Graphs
|
||||
|
||||
## Intro
|
||||
|
||||
- A graph is a pair `G = (V, E)`, where
|
||||
- V is the set of vertices
|
||||
- E is the set of edges
|
||||
- They may contain additional information
|
||||
- Types:
|
||||
- Directed vs. un-directed
|
||||
- Presence or absence of cycles
|
||||
- Examples
|
||||
- Internet map
|
||||
- Food web
|
||||
- Friendship network
|
||||
- Google PageRank
|
||||
- Bipartite graph
|
||||
- Epidemic
|
||||
- Used to model **interactions**
|
||||
- The size grows with input data
|
||||
|
||||
## Graph Storage
|
||||
|
||||
### The need of graph DBs
|
||||
|
||||
- Graph storage: traditional DB and NoSQL can store graphs, but their query
|
||||
language doesn't natively support it
|
||||
- We need languages or abstractions for finding the patterns
|
||||
- This lead to graph DBs:
|
||||
- Neo4j
|
||||
- Titan
|
||||
|
||||
### Graph DBs:
|
||||
|
||||
- Use graph structures with nodes, edges or properties to store data
|
||||
- Index free adjacency, since each node is a pointer to its adjacent element
|
||||
- Edges hold information, and connect to nodes
|
||||
|
||||
### Graph queries
|
||||
|
||||
- Filter edges and nodes
|
||||
- List sub-graphs
|
||||
- Possibility of said nodes to reach
|
||||
- Shortest path between two nodes
|
||||
|
||||
### neo4j
|
||||
|
||||
#### intro
|
||||
|
||||
- Is a java based graph database
|
||||
- Follows ACID
|
||||
- Schema-less modeling
|
||||
- A graph has nodes and (multiple) edges
|
||||
- Nodes and edges has properties attached
|
||||
- Can also have labels
|
||||
- Performance is good for smaller datasets
|
||||
- Use Cypher
|
||||
|
||||
#### Cypher query language
|
||||
|
||||
- Becoming a standard
|
||||
- Match queries, return elements that satisfies
|
||||
- No joins, simpler than SQL
|
||||
|
||||
#### features
|
||||
|
||||
- Consistency: runs on single server, which guarantees consistency (no
|
||||
distribution)
|
||||
- ACID compliant: have to start a transaction before any changes
|
||||
- HA: Can optionally use replication, that use a master slave structure
|
||||
- Master will propagate changes to slaves (writing to master is fast)
|
||||
- When slave is written to, will first update the master **Synchronously** ,
|
||||
then master update others
|
||||
|
||||
## Graph processing
|
||||
|
||||
### Sample problem
|
||||
|
||||
- To find the shortest path: SSSP (Single Source Shortest Path) problem, from
|
||||
one start node to some target nodes
|
||||
- Usually done with Dijkstra, on one single machine
|
||||
|
||||
### Representation
|
||||
|
||||
- Record the structure, attributes of edges and vertices
|
||||
- An example is edge list
|
||||
- Basic structure:
|
||||
- Vertices: single, individual, discrete entities in a dataset (person)
|
||||
- Edges: relationship between vertices, used to enhance the understanding
|
||||
and dynamics within the network (interactions, relationships)
|
||||
- Attributes attached to edges and vertices
|
||||
- Example:
|
||||
- Social network:
|
||||
- Users are vertices
|
||||
- Friendship are edges
|
||||
- Timestamp of friendship, personal interests are attributes
|
||||
- Adjacency Matrices
|
||||
- Represented with a $n \times n$ matrix $M$:
|
||||
- Advantage:
|
||||
- Encapsulates the iteration over nodes
|
||||
- Rows and Columns correspond to in links and out links
|
||||
- Disadvantages:
|
||||
- The graph is a sparse graph, if $|E|$ is much smaller than $|V|^2$
|
||||
- When the graph is sparse, the matrix wastes space.
|
||||
|
||||
### MapReduce and Graph
|
||||
|
||||
- Approach: parallel processing of each vertex
|
||||
- Each function has access to local vertex info: node and the links
|
||||
- Iterative execution: the output of reducer is the input of mapper in the
|
||||
next iteration
|
||||
- Example: Equal weight SSSP
|
||||
- Problem: from orogin node, find shortest distance to every other node of
|
||||
graph, with all link having same weight (distance)
|
||||
- Intuition: Use BFS
|
||||
- start with `distanceTo(node) = 0`
|
||||
- Set directly reachable node's distanct to 1
|
||||
- For other nodes accessible to the current set of nodes, set distance
|
||||
to 1 plus min(distance to accessible node)
|
||||
- Problem using map reduce:
|
||||
- Using map reduce will write data to HDFS every iteration, which is
|
||||
inefficient
|
||||
- A lot of communication over the net, inefficient
|
||||
- Should use a in-memory system
|
||||
|
||||
### Pregel model
|
||||
|
||||
#### Intro
|
||||
|
||||
- In every iteration, a function is executed at the vertex: iterative
|
||||
computation
|
||||
- vertices can send messages to neighbors
|
||||
- messages arrive at next iteration
|
||||
- Parallel computation
|
||||
- Vertex is independent, is dependent on only the neighbors (They have to be
|
||||
on same machine)
|
||||
- Use message for synchronization
|
||||
- Good for
|
||||
- Computing shortest path
|
||||
- Ranking pages
|
||||
- BFS
|
||||
|
||||
#### Graph partitioning
|
||||
|
||||
- When the graph is too big, it can't fit on one single machine
|
||||
- We can split the graph across machines
|
||||
- A partition defines, which edge and vertices are allocated to which machine
|
||||
- Performance impact:
|
||||
- Large Overhead for sending messages to neighboring vertices on different
|
||||
machines over the network.
|
||||
|
||||
#### Influences
|
||||
|
||||
- Pregel style graph processing systems:
|
||||
- Pregel (original)
|
||||
- Apach Giraph (Written in Java)
|
||||
- Apache Spark GraphX (Extension of Spark)
|
||||
|
||||
### GraphX
|
||||
|
||||
#### Intro
|
||||
|
||||
- Spark library for graph processing
|
||||
- Specialized RDD for graph representation and information
|
||||
- Methods for creating, transforming and implementing multiple graph metrics and
|
||||
algorithms
|
||||
|
||||
#### GraphX RDD
|
||||
|
||||
- Hold graph data, and provide methods
|
||||
- `VertexRDD`: VertexId, VertexData
|
||||
- `EdgeRDD` EdgeData
|
||||
- Triplets: Source vertex, edge, dest vertex
|
||||
|
||||
#### Predefined methods
|
||||
|
||||
- Provides access to information, and operation
|
||||
- Pregel-like iterative graph traversals:
|
||||
- Number of iterations are needed
|
||||
- `mergeMsg`: Combine incoming messages to a single one
|
||||
- `vprog`: update vertex properties
|
||||
- `sendMsg`: Send messages to neighbors
|
||||
- New graph is returned, with updated vertex values
|
Loading…
Reference in a new issue