Research/Blog
Generative Models for Graphs
- August 13, 2020
- Posted by: vsinghal
- Category: Generative Modeling, Graph Neural Networks, Pharma
#CellStratAILab #disrupt4.0 #WeCreateAISuperstars
Recently, our AI Lab Researcher Gouthaman Asokan presented an amazing session on Generative Models for Graphs. This is part of our webinar series on Graphs and Graph Neural Networks.
A graph is a nonlinear data structure consisting of nodes and edges. Graphs are widely used in applications such as physical-system interactions, knowledge graphs, molecular interactions, transportation routes, and information extraction.
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Intro-to-Graphs.png)
Generative Models
- Automatically generating graphs from inputs is useful for map building and relation extraction.
- Graph completion by generative models helps find new links, which is useful in knowledge-graph completion and social networks.
- Generating new structures is vital for drug and material discovery.
One can use pre-trained generative models to create graphs (along with their vertices and edges) on the fly; for example, a cycle can be generated during inference.
Deep Generative Models of Graphs (DGMG)
DGMG generates graph structure sequentially: probability distributions drive each decision to add a node, add an edge, and connect it to a destination node.
![](https://www.cellstrat.com/wp-content/uploads/2020/08/DGMG.png)
Teacher Forcing :-
Teacher forcing is a training technique in which the ground-truth target, rather than the model's own prediction, is passed as the next input to the decoder.
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Teacher-Forcing.png)
Image credit : https://towardsdatascience.com/what-is-teacher-forcing-3da6217fed1c
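A minimal sketch of the idea, where the `decode` function and the token values are purely illustrative stand-ins, not part of any real model:

```python
# Toy illustration of teacher forcing (hypothetical decoder step).
# With teacher forcing, the ground-truth token -- not the model's own
# prediction -- is fed back as the next decoder input during training.

def decode(step, prev_token):
    """Stand-in for a learned decoder step: returns an arbitrary toy guess."""
    return (prev_token + step) % 10

target = [3, 5, 2, 7]          # ground-truth sequence
inputs, predictions = [0], []  # 0 acts as the start token

for t, gold in enumerate(target):
    pred = decode(t, inputs[-1])
    predictions.append(pred)
    inputs.append(gold)        # teacher forcing: feed the target, not pred

print(inputs[1:])  # the decoder always saw the ground truth: [3, 5, 2, 7]
```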
DGMG: Optimization :-
For each graph, there exists a sequence of actions a1, …, aT that generates it. The objective is to compute the joint probability of such action sequences and maximize it using teacher forcing.
![](https://www.cellstrat.com/wp-content/uploads/2020/08/DGMG-Eq-1.png)
The Maximum Likelihood Estimation (MLE) loss (i.e., the negative log-likelihood) that serves as the optimization objective is as follows:
![](https://www.cellstrat.com/wp-content/uploads/2020/08/DGMG-Eq-2.png)
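To make the objective concrete, here is a small worked example, assuming hypothetical per-step probabilities assigned to the ground-truth actions:

```python
import math

# Hypothetical per-step probabilities p(a_t | a_<t) that a model assigns
# to the ground-truth actions a_1..a_T under teacher forcing.
step_probs = [0.9, 0.8, 0.95, 0.7]

# The joint probability of the sequence is the product of the step
# probabilities; MLE maximizes its log, i.e. minimizes the negative
# log-likelihood.
log_joint = sum(math.log(p) for p in step_probs)
nll = -log_joint

print(round(nll, 4))
```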
DGMG Skeleton :-
![](https://www.cellstrat.com/wp-content/uploads/2020/08/DGMG-skeleton.png)
Initializations made in the DGMG class are as follows :-
![](https://www.cellstrat.com/wp-content/uploads/2020/08/DGMG-code.png)
Encoding a Dynamic Graph
All the actions generating a graph are sampled from probability distributions. To do that, the structured data, namely the graph, is projected onto a Euclidean space. The challenge is that this process, called embedding, must be repeated as the graph mutates.
Graph Embedding – an approach that transforms the nodes, edges, and their features into a vector space while maximally preserving properties such as graph structure and information.
Graph Propagation – as new nodes and edges are added, the vector representations change; this update is performed via graph propagation.
Graph Embedding :-
Let the Graph be represented as G=(V,E).
Each node v has an embedding vector hv. Similarly, the graph has an embedding vector hG.
Sigmoid(gm(hv)) computes a gating function and can be thought of as how much the overall graph embedding attends to each node.
fm maps the node embeddings to the space of graph embeddings.
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Graph-Embedding-eq.png)
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Graph-Propagation-code.png)
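The gated readout can be sketched as follows; the maps gm and fm are stood in by fixed toy functions (in DGMG they are learned modules), and the embedding values are arbitrary:

```python
import math

# Sketch of the gated graph readout: h_G = sum_v sigmoid(g_m(h_v)) * f_m(h_v)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def g_m(h):            # gating score: how much the graph attends to this node
    return sum(h)      # toy stand-in for a learned linear layer

def f_m(h):            # maps a node embedding into graph-embedding space
    return [2.0 * x for x in h]  # toy stand-in

node_embeddings = [[0.5, -0.2], [1.0, 0.3]]
dim = len(node_embeddings[0])

h_G = [0.0] * dim
for h_v in node_embeddings:
    gate = sigmoid(g_m(h_v))          # scalar gate per node
    mapped = f_m(h_v)
    h_G = [a + gate * b for a, b in zip(h_G, mapped)]

print(h_G)
```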
Graph Propagation :-
For every node v in the graph, the neighboring nodes send messages as follows:
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Graph-Propagation-Eq-1.png)
where x(u,v) is the embedding of the edge between u and v.
After receiving messages from its neighbors, node v summarizes them into a node activation vector, which is then used to update its own feature:
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Graph-Propagation-Eq-2.png)
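One round of this message passing can be sketched in plain Python. The message and update functions below are toy stand-ins (DGMG uses learned modules, with a GRU-style node update), and the graph is a tiny illustrative example:

```python
# One propagation round: each node v receives messages from its neighbors u
# (a function of h_u, h_v, and the edge embedding x_{u,v}), sums them into
# an activation a_v, and updates its own feature from (h_v, a_v).

def message(h_u, h_v, x_uv):
    # toy message: elementwise (h_u + h_v) scaled by a scalar edge embedding
    return [(a + b) * x_uv for a, b in zip(h_u, h_v)]

def update(h_v, a_v):              # DGMG uses a GRU-style update here
    return [0.5 * h + 0.5 * a for h, a in zip(h_v, a_v)]

h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.5, 0.5]}   # node embeddings
edges = {(0, 1): 1.0, (1, 2): 2.0}                  # undirected, scalar x_{u,v}

new_h = {}
for v in h:
    a_v = [0.0, 0.0]
    for (u, w), x in edges.items():
        if v in (u, w):
            nbr = w if v == u else u                # the neighbor on this edge
            msg = message(h[nbr], h[v], x)
            a_v = [a + m for a, m in zip(a_v, msg)]
    new_h[v] = update(h[v], a_v)

print(new_h)
```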
Graph Propagation and Prediction Process :-
Run T rounds of propagation to update the node vectors, after which the graph representation vector is computed and the output is predicted.
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Graph-Propagation.png)
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Graph-Propagation-code-1.png)
Sequential Graph Generation Process :-
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Sequential-Graph-Generation-Process.png)
Sampling from Probability Distribution :-
Distribution is the abstract base class for probability distributions in PyTorch; sampling from its subclasses drives the add-node and choose-destination decisions of the generative model.
Adding nodes: a Bernoulli distribution, parameterized by probs or logits.
Choosing a destination: a Categorical distribution, parameterized by probs or logits, similar to Bernoulli.
Below are examples of how these distributions are invoked in PyTorch:
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Bernoulli-Eq.png)
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Categorial-Eq.png)
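For instance, the two distributions can be invoked from torch.distributions like this (the probability values are arbitrary placeholders, not model outputs):

```python
import torch
from torch.distributions import Bernoulli, Categorical

# Deciding whether to add a node: Bernoulli over {stop, add}.
add_node = Bernoulli(probs=torch.tensor(0.7))
decision = add_node.sample()            # 0.0 or 1.0

# Choosing a destination among existing nodes: Categorical over indices.
dest = Categorical(probs=torch.tensor([0.1, 0.2, 0.7]))
chosen = dest.sample()                  # index in {0, 1, 2}

# The log_prob of the taken action is what enters the training loss.
print(decision.item(), chosen.item(), dest.log_prob(chosen).item())
```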
Adding Nodes :-
Given the graph embedding vector hG, the equation below parameterizes a Bernoulli distribution for deciding whether to add a new node.
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Adding-Nodes-Eq-1-1.png)
If a new node is to be added, it is initialized as follows, where hinit is a learnable embedding module for untyped nodes.
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Adding-Nodes-Eq-2.png)
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Add-Node.png)
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Add-Node-code.png)
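A minimal sketch of the add-node step, with f_add_node and h_init as toy stand-ins for the learned layer and the learnable init embedding (all values are illustrative):

```python
import math, random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def f_add_node(h_G):                    # toy stand-in for a learned layer
    return sum(h_G) + 1.0               # maps h_G to a scalar logit

h_G = [0.6, 0.4]
p_add = sigmoid(f_add_node(h_G))        # Bernoulli parameter from h_G

add = random.random() < p_add           # sample the add-node decision
h_init = [0.1, 0.1]                     # stand-in for a learnable init embedding

if add:
    print("added node with embedding", h_init)
else:
    print("stopped adding nodes")
```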
Adding Edges :-
Given the graph embedding vector hG and the node embedding vector hv, the equation below parameterizes a Bernoulli distribution for deciding whether to add a new edge.
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Adding-Edges-Eq-1.png)
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Adding-Edges.png)
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Add-Edge.png)
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Add-Edge-Code.png)
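A minimal sketch of how the Bernoulli parameter for the add-edge decision can be computed; f_add_edge and the embedding values are toy stand-ins for the learned components:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def f_add_edge(h_G, h_v):               # toy stand-in for a learned layer
    return sum(h_G + h_v)               # logit from the concatenation

h_G = [0.6, 0.4]                        # graph embedding
h_v = [-0.2, 0.1]                       # latest node's embedding
p_edge = sigmoid(f_add_edge(h_G, h_v))  # P(add an edge from the new node)
print(round(p_edge, 4))
```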
Choosing Destination :-
The newly added node, together with its pending edge, must choose a destination node to connect to.
For each possible destination u ∈ {0, ⋯, v−1}, the probability of choosing it is given by:
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Choosing-Destination-Eq.png)
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Choosing-Destination.png)
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Choosing-Destination-2.png)
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Choosing-Destination-code.png)
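The destination choice can be sketched as a softmax over per-candidate scores; the scoring function and the embedding values below are illustrative stand-ins:

```python
import math

def score(h_u, h_v):                    # toy stand-in for the learned scorer
    return sum(a * b for a, b in zip(h_u, h_v))

h_v = [1.0, 0.0]                        # the newly added node
candidates = [[0.5, 0.5], [2.0, -1.0], [0.0, 1.0]]   # h_0 .. h_2

logits = [score(h_u, h_v) for h_u in candidates]
z = sum(math.exp(l) for l in logits)
probs = [math.exp(l) / z for l in logits]            # softmax -> Categorical

print([round(p, 4) for p in probs])
```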
Synthetic Graphs
Synthetic graphs are graphs not collected from real-life data but instead constructed using particular rules.
The Barabási–Albert model is a scale-free model that incorporates growth and preferential attachment.
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Synthetic-Graphs-1.png)
Image credit : https://www.geeksforgeeks.org/barabasi-albert-graph-scale-free-models/
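A minimal sketch of the preferential-attachment process (the function name and parameters are illustrative): each new node attaches m edges to existing nodes with probability proportional to their degree, so well-connected nodes keep attracting links.

```python
import random

random.seed(1)

def barabasi_albert(n, m):
    edges = [(0, 1), (1, 2), (0, 2)]            # small seed triangle
    targets = [u for e in edges for u in e]     # nodes repeated by degree
    for new_node in range(3, n):
        chosen = set()
        while len(chosen) < m:                  # degree-proportional picks
            chosen.add(random.choice(targets))
        for u in chosen:
            edges.append((new_node, u))
            targets += [new_node, u]
    return edges

g = barabasi_albert(20, 2)
print(len(g))   # 3 seed edges + 2 per each of the 17 added nodes = 37
```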
Generation of Synthetic Graphs :-
The model is trained on three different sets of graphs: cycles, trees, and graphs generated by the Barabási–Albert model. The percentage of valid samples generated and the training curves for each dataset, for both the DGMG and LSTM models, are shown below.
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Synthetic-Graphs-table.png)
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Synthetic-Graphs-2.png)
Grammar for Graphs
A graph grammar associates vertices with a set of attributes and rewrites these attributes with the help of functions.
Simplified Molecular-Input Line-Entry System (SMILES) is a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings.
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Grammar-for-Graphs.png)
Molecular Generation :-
The model's ability to generate molecules is also tested. It is trained on the ZINC dataset, and the results are compared with another model, GrammarVAE, which uses a SMILES grammar.
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Molecular-Generation.png)
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Molecular-Generation-2.png)
Dataset for Generative Model :-
- The dataset is generated given the range of vertex counts and the number of samples to create.
- A decision sequence is obtained for generating valid cycles with DGMG under teacher-forcing optimization.
- CycleModelEvaluation checks whether the vertex count is in range and whether the result is a valid cycle.
- The Adam optimizer is used; the loss is computed from log_prob and backpropagated.
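The optimization loop described above can be sketched on a single toy action; the Bernoulli logit below stands in for the full DGMG model's parameters:

```python
import torch

# Toy version of the training step: with teacher forcing, the model assigns
# a log-probability to the ground-truth action; the loss is the negative
# log-likelihood, minimized with Adam.

logit = torch.zeros(1, requires_grad=True)      # toy learnable parameter
optimizer = torch.optim.Adam([logit], lr=0.1)

target_action = torch.ones(1)                   # ground-truth "add node"
for _ in range(100):
    dist = torch.distributions.Bernoulli(logits=logit)
    loss = -dist.log_prob(target_action).sum()  # negative log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.sigmoid(logit).item())  # pushed toward 1.0
```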
Improving model and corresponding graphs generated :-
With more batches and epochs, the model improves over time; it can be seen to create better cycles as training progresses.
![](https://www.cellstrat.com/wp-content/uploads/2020/08/Improving-Model.gif)
Summary
- Deep Generative Model for Graphs (DGMG) is capable of generating arbitrary graphs through a sequential process.
- The sequential process iteratively adds a node, adds an edge, and chooses a destination.
- Graph embedding and propagation help encode the graph dynamically while the actions are taken.
- Applications such as synthetic-graph generation and molecular generation benefit greatly from DGMG models compared with traditional RNN models.