EBU6502_cloud_computing_notes/4-4-graph.md

# Graphs

## Intro

- A graph is a pair `G = (V, E)`, where
    - V is the set of vertices
    - E is the set of edges
    - They may contain additional information
- Types:
    - Directed vs. un-directed
    - Presence or absence of cycles
- Examples
    - Internet map
    - Food web
    - Friendship network
    - Google PageRank
    - Bipartite graph
    - Epidemic
- Used to model **interactions**
- The size grows with input data

## Graph Storage

### The need of graph DBs

- Graph storage: traditional DB and NoSQL can store graphs, but their query
  language doesn't natively support it
- We need languages or abstractions for finding the patterns
- This lead to graph DBs:
    - Neo4j
    - Titan

### Graph DBs:

- Use graph structures with nodes, edges or properties to store data
- Index free adjacency, since each node is a pointer to its adjacent element
- Edges hold information, and connect to nodes

### Graph queries

- Filter edges and nodes
- List sub-graphs
- Possibility of said nodes to reach
- Shortest path between two nodes

### neo4j

#### intro

- Is a java based graph database
- Follows ACID
- Schema-less modeling
    - A graph has nodes and (multiple) edges
    - Nodes and edges has properties attached
    - Can also have labels
- Performance is good for smaller datasets
- Use Cypher

#### Cypher query language

- Becoming a standard
- Match queries, return elements that satisfies
- No joins, simpler than SQL

#### features

- Consistency: runs on single server, which guarantees consistency (no
  distribution)
- ACID compliant: have to start a transaction before any changes
- HA: Can optionally use replication, that use a master slave structure
    - Master will propagate changes to slaves (writing to master is fast)
    - When slave is written to, will first update the master **Synchronously** ,
      then master update others

## Graph processing

### Sample problem

- To find the shortest path: SSSP (Single Source Shortest Path) problem, from
  one start node to some target nodes
    - Usually done with Dijkstra, on one single machine

### Representation

- Record the structure, attributes of edges and vertices
- An example is edge list
- Basic structure:
    - Vertices: single, individual, discrete entities in a dataset (person)
    - Edges: relationship between vertices, used to enhance the understanding
      and dynamics within the network (interactions, relationships)
    - Attributes attached to edges and vertices
- Example:
    - Social network:
        - Users are vertices
        - Friendship are edges
        - Timestamp of friendship, personal interests are attributes
- Adjacency Matrices
    - Represented with a $n \times n$ matrix $M$:
    - Advantage:
        - Encapsulates the iteration over nodes
        - Rows and Columns correspond to in links and out links
    - Disadvantages:
        - The graph is a sparse graph, if $|E|$ is much smaller than $|V|^2$
        - When the graph is sparse, the matrix wastes space.

### MapReduce and Graph

- Approach: parallel processing of each vertex
    - Each function has access to local vertex info: node and the links
    - Iterative execution: the output of reducer is the input of mapper in the
      next iteration
- Example: Equal weight SSSP
    - Problem: from orogin node, find shortest distance to every other node of
      graph, with all link having same weight (distance)
    - Intuition: Use BFS
        - start with `distanceTo(node) = 0`
        - Set directly reachable node's distanct to 1
        - For other nodes accessible to the current set of nodes, set distance
          to 1 plus min(distance to accessible node)
- Problem using map reduce:
    - Using map reduce will write data to HDFS every iteration, which is
      inefficient
    - A lot of communication over the net, inefficient
- Should use a in-memory system

### Pregel model

#### Intro

- In every iteration, a function is executed at the vertex: iterative
  computation
    - vertices can send messages to neighbors
    - messages arrive at next iteration
- Parallel computation
    - Vertex is independent, is dependent on only the neighbors (They have to be
      on same machine)
    - Use message for synchronization
- Good for
    - Computing shortest path
    - Ranking pages
    - BFS

#### Graph partitioning

- When the graph is too big, it can't fit on one single machine
- We can split the graph across machines
- A partition defines, which edge and vertices are allocated to which machine
- Performance impact:
    - Large Overhead for sending messages to neighboring vertices on different
      machines over the network.

#### Influences

- Pregel style graph processing systems:
    - Pregel (original)
    - Apach Giraph (Written in Java)
    - Apache Spark GraphX (Extension of Spark)

### GraphX

#### Intro

- Spark library for graph processing
- Specialized RDD for graph representation and information
- Methods for creating, transforming and implementing multiple graph metrics and
  algorithms

#### GraphX RDD

- Hold graph data, and provide methods
    - `VertexRDD`: VertexId, VertexData
    - `EdgeRDD` EdgeData
    - Triplets: Source vertex, edge, dest vertex

#### Predefined methods

- Provides access to information, and operation:
    - `Graph.vertices`, `graph.edges`, `graph.triples`: access property
    - `Graph.degrees`: Access connected edges: Provides a tuple with
      (`vertexId`, `degree` of each vertex)
    - `Graph.connectedComponents`: each of the connected components of the graph
- Pregel-like iterative graph traversals:
    - Number of iterations are needed
    - `mergeMsg`: Combine incoming messages to a single one
    - `vprog`: update vertex properties
    - `sendMsg`: Send messages to neighbors
    - New graph is returned, with updated vertex values