5.7 KiB
5.7 KiB
Graphs
Intro
- A graph is a pair
G = (V, E)
, where- V is the set of vertices
- E is the set of edges
- They may contain additional information
- Types:
- Directed vs. un-directed
- Presence or absence of cycles
- Examples
- Internet map
- Food web
- Friendship network
- Google PageRank
- Bipartite graph
- Epidemic
- Used to model interactions
- The size grows with input data
Graph Storage
The need of graph DBs
- Graph storage: traditional DB and NoSQL can store graphs, but their query language doesn't natively support it
- We need languages or abstractions for finding the patterns
- This lead to graph DBs:
- Neo4j
- Titan
Graph DBs:
- Use graph structures with nodes, edges or properties to store data
- Index free adjacency, since each node is a pointer to its adjacent element
- Edges hold information, and connect to nodes
Graph queries
- Filter edges and nodes
- List sub-graphs
- Possibility of said nodes to reach
- Shortest path between two nodes
neo4j
intro
- Is a java based graph database
- Follows ACID
- Schema-less modeling
- A graph has nodes and (multiple) edges
- Nodes and edges has properties attached
- Can also have labels
- Performance is good for smaller datasets
- Use Cypher
Cypher query language
- Becoming a standard
- Match queries, return elements that satisfies
- No joins, simpler than SQL
features
- Consistency: runs on single server, which guarantees consistency (no distribution)
- ACID compliant: have to start a transaction before any changes
- HA: Can optionally use replication, that use a master slave structure
- Master will propagate changes to slaves (writing to master is fast)
- When slave is written to, will first update the master Synchronously , then master update others
Graph processing
Sample problem
- To find the shortest path: SSSP (Single Source Shortest Path) problem, from
one start node to some target nodes
- Usually done with Dijkstra, on one single machine
Representation
- Record the structure, attributes of edges and vertices
- An example is edge list
- Basic structure:
- Vertices: single, individual, discrete entities in a dataset (person)
- Edges: relationship between vertices, used to enhance the understanding and dynamics within the network (interactions, relationships)
- Attributes attached to edges and vertices
- Example:
- Social network:
- Users are vertices
- Friendship are edges
- Timestamp of friendship, personal interests are attributes
- Social network:
- Adjacency Matrices
- Represented with a
n \times n
matrixM
: - Advantage:
- Encapsulates the iteration over nodes
- Rows and Columns correspond to in links and out links
- Disadvantages:
- The graph is a sparse graph, if
|E|
is much smaller than|V|^2
- When the graph is sparse, the matrix wastes space.
- The graph is a sparse graph, if
- Represented with a
MapReduce and Graph
- Approach: parallel processing of each vertex
- Each function has access to local vertex info: node and the links
- Iterative execution: the output of reducer is the input of mapper in the next iteration
- Example: Equal weight SSSP
- Problem: from orogin node, find shortest distance to every other node of graph, with all link having same weight (distance)
- Intuition: Use BFS
- start with
distanceTo(node) = 0
- Set directly reachable node's distanct to 1
- For other nodes accessible to the current set of nodes, set distance to 1 plus min(distance to accessible node)
- start with
- Problem using map reduce:
- Using map reduce will write data to HDFS every iteration, which is inefficient
- A lot of communication over the net, inefficient
- Should use a in-memory system
Pregel model
Intro
- In every iteration, a function is executed at the vertex: iterative
computation
- vertices can send messages to neighbors
- messages arrive at next iteration
- Parallel computation
- Vertex is independent, is dependent on only the neighbors (They have to be on same machine)
- Use message for synchronization
- Good for
- Computing shortest path
- Ranking pages
- BFS
Graph partitioning
- When the graph is too big, it can't fit on one single machine
- We can split the graph across machines
- A partition defines, which edge and vertices are allocated to which machine
- Performance impact:
- Large Overhead for sending messages to neighboring vertices on different machines over the network.
Influences
- Pregel style graph processing systems:
- Pregel (original)
- Apach Giraph (Written in Java)
- Apache Spark GraphX (Extension of Spark)
GraphX
Intro
- Spark library for graph processing
- Specialized RDD for graph representation and information
- Methods for creating, transforming and implementing multiple graph metrics and algorithms
GraphX RDD
- Hold graph data, and provide methods
VertexRDD
: VertexId, VertexDataEdgeRDD
EdgeData- Triplets: Source vertex, edge, dest vertex
Predefined methods
- Provides access to information, and operation:
Graph.vertices
,graph.edges
,graph.triples
: access propertyGraph.degrees
: Access connected edges: Provides a tuple with (vertexId
,degree
of each vertex)Graph.connectedComponents
: each of the connected components of the graph
- Pregel-like iterative graph traversals:
- Number of iterations are needed
mergeMsg
: Combine incoming messages to a single onevprog
: update vertex propertiessendMsg
: Send messages to neighbors- New graph is returned, with updated vertex values