Research/Blog
Working with Giant Graphs
- August 27, 2020
- Posted by: vsinghal
- Category: Graph Neural Networks, Healthcare, Pharma, Publishing
#CellStratAILab #disrupt4.0 #WeCreateAISuperstars #WhereLearningNeverStops
Recently, our AI Lab Researcher Gouthaman Asokan presented an excellent session on Giant Graphs, wherein we use specialized Graph Neural Networks to process and mine massive graphs – such as social networks or paper citation graphs.
Examples of Real Life Giant Graphs
Giant graphs usually have millions or billions of nodes and edges, as seen in real-life examples.
The Reddit dataset constructed by Hamilton et al., in which nodes are posts and an edge connects two posts if the same user has commented on both. This graph has 233,000 nodes, 114.6 million edges and 41 categories.
The sample biomedical knowledge graph below contains nearly 250,000 nodes and 1 billion edges:
Social networks like Facebook, Twitter and LinkedIn have user graphs ranging from 10 million to billions of nodes representing their users.
Strategies of Sampling Neighbors
Due to the storage and computation needed to train neural networks on such graphs, we need sampling techniques.
There are various ways in which sampling can be done from giant graphs. Two well-known strategies are:
- Neighbor Sampling
- Control Variate Sampling
Node Flow :-
Mini-batch sampling is done with a NodeFlow.
A NodeFlow is a layered graph with L+1 sequential layers and edges only between adjacent layers, which form blocks.
It is constructed backwards, starting from the last layer. The features of each node and edge in each block are stored as separate tensors.
Let $D^{(l)}$ be the number of neighbors to be sampled for each node at the $l$-th layer. The receptive field of each node can be bounded by this value: with $L$ layers, it contains at most $\prod_{l} D^{(l)}$ nodes.
Neighbor Sampling :-
It randomly samples a few neighbors to estimate the aggregation over all neighbors in the l-th GCN layer, using an unbiased estimator.
Control Variate Sampling :-
The neighbor sampling estimator will have very high variance if not enough neighbors are sampled. Control variate sampling yields a good estimator with as few as 2 sampled neighbors, using a standard variance reduction technique.
Our idea is to maintain a history $\bar{h}^{(l)}_v$ for each $h^{(l)}_v$ as an affordable approximation. Each time $h^{(l)}_v$ is computed, we update $\bar{h}^{(l)}_v$ with $h^{(l)}_v$. We expect $\bar{h}^{(l)}_v$ and $h^{(l)}_v$ to be similar if the model weights do not change too fast during training.
Receptive Field of a Vertex :-
The figure below shows a two-layer graph convolutional network and the receptive field of a single vertex. We can see how sampling narrows the receptive field.
Neighbor Sampling Equations :-
Neighbor Sampling reduces the receptive field size from all the L-hop neighbors to the number of sampled neighbors.
The equation below gives the estimator for the aggregation over all the neighbors in the l-th GCN layer.
$N^{(l)}(v)$ and $N(v)$ represent the sampled and total neighbors, respectively.
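A standard form of this unbiased estimator, consistent with the notation above (with $\tilde{A}$ the normalized adjacency matrix and $W^{(l)}$ the layer weights), is:

```latex
\hat{z}_v^{(l+1)} \;=\; \frac{|N(v)|}{|N^{(l)}(v)|} \sum_{u \in N^{(l)}(v)} \tilde{A}_{vu}\, h_u^{(l)},
\qquad
h_v^{(l+1)} \;=\; \sigma\!\left(\hat{z}_v^{(l+1)}\, W^{(l)}\right)
```

Sampling $N^{(l)}(v)$ uniformly makes the rescaled sum an unbiased estimate of $\sum_{u \in N(v)} \tilde{A}_{vu}\, h_u^{(l)}$.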
Control Variate Sampling Equations :-
While computing the neighbor average, we cannot afford to evaluate every $h^{(l)}_u$ exactly, because each one must itself be computed recursively, i.e., we would again need the activations $h^{(l-1)}_w$ of all of $u$'s neighbors $w$.
History of the nodes is used to update the sampled approximation by the following formulae :
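In the standard control variate formulation, the cheap history term is summed over all neighbors, while only the small residual is estimated from the sample:

```latex
\hat{z}_v^{(l+1)} \;=\; \frac{|N(v)|}{|N^{(l)}(v)|} \sum_{u \in N^{(l)}(v)} \tilde{A}_{vu}\,\bigl(h_u^{(l)} - \bar{h}_u^{(l)}\bigr)
\;+\; \sum_{u \in N(v)} \tilde{A}_{vu}\, \bar{h}_u^{(l)},
\qquad
h_v^{(l+1)} \;=\; \sigma\!\left(\hat{z}_v^{(l+1)}\, W^{(l)}\right)
```

Because $h_u^{(l)} - \bar{h}_u^{(l)}$ is small when the history is fresh, the sampled term has low variance even with only 2 sampled neighbors.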
GraphSage
GraphSage is a variant of the Graph Convolutional Network used to compute inductive node embeddings. In the code example, this technique is used with neighbor sampling for node classification on the PubMed dataset.
GraphSage is an abbreviation of Graph SAmple and aggreGatE.
The aggregator function aggregates feature information from the neighbors of a node, and this aggregation is done over different numbers of hops (search depths) away from the given node.
If $h^{k}_{N(v)}$ represents the aggregation at layer $k$, it is computed from the previous-layer representations $h^{k-1}_u$ of the neighboring nodes $u \in N(v)$:

$$h^{k}_{N(v)} = \text{AGGREGATE}_k\left(\{\, h^{k-1}_u : u \in N(v) \,\}\right)$$

The node's own previous-layer representation $h^{k-1}_v$ is then concatenated with the aggregation $h^{k}_{N(v)}$, multiplied by the weight matrix $W^{k}$, and passed through the activation function $\sigma$ to obtain the final node representation $h^{k}_v$:

$$h^{k}_v = \sigma\left(W^{k} \cdot \text{CONCAT}\left(h^{k-1}_v,\, h^{k}_{N(v)}\right)\right)$$
Process Flow for GraphSage :-
The figure below shows a visual representation of the sample-and-aggregate approach.
GraphSage – Code Snippet :-
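As a minimal sketch of what such a layer computes, here is one GraphSAGE layer with a mean aggregator in plain NumPy. All names and shapes are illustrative, not the exact code from the session; splitting the weight matrix into `W_self` and `W_neigh` is mathematically equivalent to concatenating the two inputs and multiplying by a single matrix.

```python
import numpy as np

def sage_layer(h, neighbors, W_self, W_neigh, activation=np.tanh):
    """One GraphSAGE layer with a mean aggregator.

    h         : (num_nodes, in_dim) node features from the previous layer
    neighbors : dict mapping node id -> list of neighbor ids
    W_self    : (in_dim, out_dim) weight applied to the node's own features
    W_neigh   : (in_dim, out_dim) weight applied to the aggregated neighbors
    """
    out = np.zeros((h.shape[0], W_self.shape[1]))
    for v in range(h.shape[0]):
        nbrs = neighbors.get(v, [])
        # Mean-aggregate neighbor features (zeros if the node is isolated).
        agg = h[nbrs].mean(axis=0) if nbrs else np.zeros(h.shape[1])
        # Combine self and neighbor information, then apply the nonlinearity.
        out[v] = activation(h[v] @ W_self + agg @ W_neigh)
    return out
```

Stacking k such layers gives each node a representation drawing on its k-hop neighborhood.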
Inference with SageNet – Code Snippet :-
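At inference time, a model trained with sampling is usually evaluated without sampling: each layer is computed for all nodes before moving to the next, which avoids the exponential growth of sampled neighborhoods. A plain-NumPy sketch of this layer-wise strategy (here `SageNet` is assumed to be the trained GraphSAGE model, and the weights/aggregator are illustrative):

```python
import numpy as np

def sage_inference(x, neighbors, weights, activation=np.tanh):
    """Layer-wise full-graph inference for a GraphSAGE-style model.

    x         : (num_nodes, in_dim) input features
    neighbors : dict mapping node id -> list of neighbor ids
    weights   : list of (W_self, W_neigh) pairs, one per layer
    """
    h = x
    for W_self, W_neigh in weights:
        nxt = np.zeros((h.shape[0], W_self.shape[1]))
        # Compute this layer's output for EVERY node before going deeper.
        for v in range(h.shape[0]):
            nbrs = neighbors.get(v, [])
            agg = h[nbrs].mean(axis=0) if nbrs else np.zeros(h.shape[1])
            nxt[v] = activation(h[v] @ W_self + agg @ W_neigh)
        h = nxt
    return h
```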
PubMed Dataset
The PubMed Diabetes dataset consists of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes (“Diabetes Mellitus, Experimental”, “Diabetes Mellitus Type 1”, “Diabetes Mellitus Type 2”). The citation network consists of 88,651 links.
Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words.
The dataset consists of tab-delimited files, where the first line describes the contents of the file and the second line describes the names and types of the attributes.
PubMed – Sample Node Information :-
Single and Multiple Layers in Sampling :-
We need to specify the number of nodes to be sampled in each layer of the GNN. In neighbor sampling, this choice directly determines the computation cost. For multiple layers, we need not only the seed nodes and their neighbors, but also the neighbors of those neighbors.
Computation dependency for multi-layer GNNs is a bottom-up process: we start from the output layer, and grow the node set towards the input layer.
DGL Functions used in Neighbour Sampling :-
dgl.sample_neighbors – Sample neighboring edges of the given nodes and return the induced subgraph
For each node, a number of inbound (or outbound when edge_dir == ‘out’) edges will be randomly chosen. The graph returned will then contain all the nodes in the original graph, but only the sampled edges.
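The behavior can be sketched in plain Python. This illustrates the semantics only; it is not the DGL implementation, and the data structures are assumptions:

```python
import random

def sample_neighbors(in_edges, seeds, fanout, seed=None):
    """Mimic dgl.sample_neighbors with edge_dir='in': for each seed node,
    keep at most `fanout` randomly chosen inbound edges.

    in_edges : dict mapping node id -> list of (src, dst) inbound edges
    Returns the sampled edge list (the induced subgraph keeps all nodes).
    """
    rng = random.Random(seed)
    sampled = []
    for v in seeds:
        edges = in_edges.get(v, [])
        k = min(fanout, len(edges))  # nodes with few edges keep them all
        sampled.extend(rng.sample(edges, k))
    return sampled
```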
dgl.to_block – Convert a graph into a bipartite-structured block for message passing
Bipartite graph (or bigraph) is a graph whose vertices can be divided into two disjoint and independent sets {U} and {V} such that every edge connects a vertex in {U} to one in {V}.
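A plain-Python sketch of what this block construction does (illustrative, not the DGL implementation): destination nodes are the seeds, source nodes are the seeds plus every sampled neighbor, and messages flow only from sources to destinations.

```python
def to_block(sampled_edges, seeds):
    """Mimic dgl.to_block: arrange sampled edges as a bipartite block.

    sampled_edges : list of (src, dst) edges produced by neighbor sampling
    seeds         : the destination (output) nodes of this layer
    """
    dst_nodes = list(dict.fromkeys(seeds))  # dedupe, keep order
    # DGL convention: destination nodes appear first among the source nodes.
    src_nodes = dst_nodes + [s for s, _ in sampled_edges]
    src_nodes = list(dict.fromkeys(src_nodes))
    return {"src": src_nodes, "dst": dst_nodes, "edges": sampled_edges}
```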
Creating Neighbor Sampler Class :-
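An illustrative plain-Python sampler following this idea (the class name and structure are assumptions, not the DGL API): for each layer, going backwards from the seeds, sample a fanout of in-neighbors and record a block, growing the node set toward the input layer as described above.

```python
import random

class NeighborSampler:
    """Multi-layer neighbor sampler sketch.

    fanouts[l] is the number of in-neighbors sampled per node at layer l.
    Blocks are built backwards, from the seeds (output layer) toward the
    input layer, mirroring the bottom-up computation dependency.
    """
    def __init__(self, in_neighbors, fanouts, seed=None):
        self.in_neighbors = in_neighbors  # dict: node -> list of in-neighbors
        self.fanouts = fanouts
        self.rng = random.Random(seed)

    def sample_blocks(self, seeds):
        blocks = []
        for fanout in reversed(self.fanouts):
            edges = []
            for v in seeds:
                nbrs = self.in_neighbors.get(v, [])
                k = min(fanout, len(nbrs))
                edges.extend((u, v) for u in self.rng.sample(nbrs, k))
            # Next (earlier) layer must cover the seeds plus their samples.
            seeds = list(dict.fromkeys(list(seeds) + [u for u, _ in edges]))
            blocks.insert(0, edges)  # input-layer block ends up first
        return blocks
```

Feeding each block to the matching GNN layer, input-layer block first, reproduces the mini-batch forward pass.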
Neighbor Sampling Flow :-
Summary
- Graphs in real life have millions of nodes and billions of edges. The storage and computation power needed make it infeasible to train graph neural networks on the full graph.
- Sampling techniques are used to work around this challenge in giant graphs, and two of the most used techniques are Neighbor Sampling and Control Variate Sampling.
- GraphSAGE with neighbor sampling efficiently generates embeddings for unseen nodes, while control variate sampling reduces the receptive field and enables fast stochastic training of graphs.
CellStrat Advanced Product Internship Program :-
CellStrat AI Lab is a leading AI Lab working on the cutting edge of Artificial Intelligence, including the latest algorithms in ML, DL, RL, Computer Vision, NLP, Graph Theory etc.
Interested in working in our world-class Healthcare AI Product team to gain FULL-STACK AI MODEL DEVELOPMENT AND DEPLOYMENT experience on our AWS Cloud Platform? If yes, enroll in our Advanced Product Internship Program. More details and enrollment here: https://bit.ly/CS-CPI
Questions? Please feel free to call me at +91-9742800566!
Best Regards,
Vivek Singhal
Co-Founder & Chief Data Scientist, CellStrat
+91-9742800566