# Graphs ## Intro - A graph is a pair `G = (V, E)`, where - V is the set of vertices - E is the set of edges - They may contain additional information - Types: - Directed vs. un-directed - Presence or absence of cycles - Examples - Internet map - Food web - Friendship network - Google PageRank - Bipartite graph - Epidemic - Used to model **interactions** - The size grows with input data ## Graph Storage ### The need of graph DBs - Graph storage: traditional DB and NoSQL can store graphs, but their query language doesn't natively support it - We need languages or abstractions for finding the patterns - This lead to graph DBs: - Neo4j - Titan ### Graph DBs: - Use graph structures with nodes, edges or properties to store data - Index free adjacency, since each node is a pointer to its adjacent element - Edges hold information, and connect to nodes ### Graph queries - Filter edges and nodes - List sub-graphs - Possibility of said nodes to reach - Shortest path between two nodes ### neo4j #### intro - Is a java based graph database - Follows ACID - Schema-less modeling - A graph has nodes and (multiple) edges - Nodes and edges has properties attached - Can also have labels - Performance is good for smaller datasets - Use Cypher #### Cypher query language - Becoming a standard - Match queries, return elements that satisfies - No joins, simpler than SQL #### features - Consistency: runs on single server, which guarantees consistency (no distribution) - ACID compliant: have to start a transaction before any changes - HA: Can optionally use replication, that use a master slave structure - Master will propagate changes to slaves (writing to master is fast) - When slave is written to, will first update the master **Synchronously** , then master update others ## Graph processing ### Sample problem - To find the shortest path: SSSP (Single Source Shortest Path) problem, from one start node to some target nodes - Usually done with Dijkstra, on one single machine ### Representation - Record the structure, attributes of edges and vertices - An example is edge list - Basic structure: - Vertices: single, individual, discrete entities in a dataset (person) - Edges: relationship between vertices, used to enhance the understanding and dynamics within the network (interactions, relationships) - Attributes attached to edges and vertices - Example: - Social network: - Users are vertices - Friendship are edges - Timestamp of friendship, personal interests are attributes - Adjacency Matrices - Represented with a $n \times n$ matrix $M$: - Advantage: - Encapsulates the iteration over nodes - Rows and Columns correspond to in links and out links - Disadvantages: - The graph is a sparse graph, if $|E|$ is much smaller than $|V|^2$ - When the graph is sparse, the matrix wastes space. ### MapReduce and Graph - Approach: parallel processing of each vertex - Each function has access to local vertex info: node and the links - Iterative execution: the output of reducer is the input of mapper in the next iteration - Example: Equal weight SSSP - Problem: from orogin node, find shortest distance to every other node of graph, with all link having same weight (distance) - Intuition: Use BFS - start with `distanceTo(node) = 0` - Set directly reachable node's distanct to 1 - For other nodes accessible to the current set of nodes, set distance to 1 plus min(distance to accessible node) - Problem using map reduce: - Using map reduce will write data to HDFS every iteration, which is inefficient - A lot of communication over the net, inefficient - Should use a in-memory system ### Pregel model #### Intro - In every iteration, a function is executed at the vertex: iterative computation - vertices can send messages to neighbors - messages arrive at next iteration - Parallel computation - Vertex is independent, is dependent on only the neighbors (They have to be on same machine) - Use message for synchronization - Good for - Computing shortest path - Ranking pages - BFS #### Graph partitioning - When the graph is too big, it can't fit on one single machine - We can split the graph across machines - A partition defines, which edge and vertices are allocated to which machine - Performance impact: - Large Overhead for sending messages to neighboring vertices on different machines over the network. #### Influences - Pregel style graph processing systems: - Pregel (original) - Apach Giraph (Written in Java) - Apache Spark GraphX (Extension of Spark) ### GraphX #### Intro - Spark library for graph processing - Specialized RDD for graph representation and information - Methods for creating, transforming and implementing multiple graph metrics and algorithms #### GraphX RDD - Hold graph data, and provide methods - `VertexRDD`: VertexId, VertexData - `EdgeRDD` EdgeData - Triplets: Source vertex, edge, dest vertex #### Predefined methods - Provides access to information, and operation: - `Graph.vertices`, `graph.edges`, `graph.triples`: access property - `Graph.degrees`: Access connected edges: Provides a tuple with (`vertexId`, `degree` of each vertex) - `Graph.connectedComponents`: each of the connected components of the graph - Pregel-like iterative graph traversals: - Number of iterations are needed - `mergeMsg`: Combine incoming messages to a single one - `vprog`: update vertex properties - `sendMsg`: Send messages to neighbors - New graph is returned, with updated vertex values