EBU6502_cloud_computing_notes/4-4-graph.md
2025-01-04 20:10:11 +08:00

5.7 KiB

Graphs

Intro

  • A graph is a pair G = (V, E), where
    • V is the set of vertices
    • E is the set of edges
    • They may contain additional information
  • Types:
    • Directed vs. un-directed
    • Presence or absence of cycles
  • Examples
    • Internet map
    • Food web
    • Friendship network
    • Google PageRank
    • Bipartite graph
    • Epidemic
  • Used to model interactions
  • The size grows with input data

Graph Storage

The need of graph DBs

  • Graph storage: traditional DB and NoSQL can store graphs, but their query language doesn't natively support it
  • We need languages or abstractions for finding the patterns
  • This lead to graph DBs:
    • Neo4j
    • Titan

Graph DBs:

  • Use graph structures with nodes, edges or properties to store data
  • Index free adjacency, since each node is a pointer to its adjacent element
  • Edges hold information, and connect to nodes

Graph queries

  • Filter edges and nodes
  • List sub-graphs
  • Possibility of said nodes to reach
  • Shortest path between two nodes

neo4j

intro

  • Is a java based graph database
  • Follows ACID
  • Schema-less modeling
    • A graph has nodes and (multiple) edges
    • Nodes and edges has properties attached
    • Can also have labels
  • Performance is good for smaller datasets
  • Use Cypher

Cypher query language

  • Becoming a standard
  • Match queries, return elements that satisfies
  • No joins, simpler than SQL

features

  • Consistency: runs on single server, which guarantees consistency (no distribution)
  • ACID compliant: have to start a transaction before any changes
  • HA: Can optionally use replication, that use a master slave structure
    • Master will propagate changes to slaves (writing to master is fast)
    • When slave is written to, will first update the master Synchronously , then master update others

Graph processing

Sample problem

  • To find the shortest path: SSSP (Single Source Shortest Path) problem, from one start node to some target nodes
    • Usually done with Dijkstra, on one single machine

Representation

  • Record the structure, attributes of edges and vertices
  • An example is edge list
  • Basic structure:
    • Vertices: single, individual, discrete entities in a dataset (person)
    • Edges: relationship between vertices, used to enhance the understanding and dynamics within the network (interactions, relationships)
    • Attributes attached to edges and vertices
  • Example:
    • Social network:
      • Users are vertices
      • Friendship are edges
      • Timestamp of friendship, personal interests are attributes
  • Adjacency Matrices
    • Represented with a n \times n matrix M:
    • Advantage:
      • Encapsulates the iteration over nodes
      • Rows and Columns correspond to in links and out links
    • Disadvantages:
      • The graph is a sparse graph, if |E| is much smaller than |V|^2
      • When the graph is sparse, the matrix wastes space.

MapReduce and Graph

  • Approach: parallel processing of each vertex
    • Each function has access to local vertex info: node and the links
    • Iterative execution: the output of reducer is the input of mapper in the next iteration
  • Example: Equal weight SSSP
    • Problem: from orogin node, find shortest distance to every other node of graph, with all link having same weight (distance)
    • Intuition: Use BFS
      • start with distanceTo(node) = 0
      • Set directly reachable node's distanct to 1
      • For other nodes accessible to the current set of nodes, set distance to 1 plus min(distance to accessible node)
  • Problem using map reduce:
    • Using map reduce will write data to HDFS every iteration, which is inefficient
    • A lot of communication over the net, inefficient
  • Should use a in-memory system

Pregel model

Intro

  • In every iteration, a function is executed at the vertex: iterative computation
    • vertices can send messages to neighbors
    • messages arrive at next iteration
  • Parallel computation
    • Vertex is independent, is dependent on only the neighbors (They have to be on same machine)
    • Use message for synchronization
  • Good for
    • Computing shortest path
    • Ranking pages
    • BFS

Graph partitioning

  • When the graph is too big, it can't fit on one single machine
  • We can split the graph across machines
  • A partition defines, which edge and vertices are allocated to which machine
  • Performance impact:
    • Large Overhead for sending messages to neighboring vertices on different machines over the network.

Influences

  • Pregel style graph processing systems:
    • Pregel (original)
    • Apach Giraph (Written in Java)
    • Apache Spark GraphX (Extension of Spark)

GraphX

Intro

  • Spark library for graph processing
  • Specialized RDD for graph representation and information
  • Methods for creating, transforming and implementing multiple graph metrics and algorithms

GraphX RDD

  • Hold graph data, and provide methods
    • VertexRDD: VertexId, VertexData
    • EdgeRDD EdgeData
    • Triplets: Source vertex, edge, dest vertex

Predefined methods

  • Provides access to information, and operation:
    • Graph.vertices, graph.edges, graph.triples: access property
    • Graph.degrees: Access connected edges: Provides a tuple with (vertexId, degree of each vertex)
    • Graph.connectedComponents: each of the connected components of the graph
  • Pregel-like iterative graph traversals:
    • Number of iterations are needed
    • mergeMsg: Combine incoming messages to a single one
    • vprog: update vertex properties
    • sendMsg: Send messages to neighbors
    • New graph is returned, with updated vertex values