Written by Michael (Mike) Erlihson, PhD.

This review is part of a series of Machine & Deep Learning paper reviews originally published in Hebrew under the name #DeepNightLearners, aiming to make the papers accessible in plain language.

Good night friends, today we are back in our DeepNightLearners section with a review of a deep learning article. Today I've chosen to review the article "A causal view of compositional zero-shot recognition".

Reviewer Corner:

Reading recommendation: Warmly recommended if you have previous knowledge in the relevant fields.
Clarity of writing: High
Math and DL knowledge level: A good background in probability and in causality principles is required.
Practical applications: The ideas in the article can be used for creating a generative model for new, unseen, data (such as images) with objects which aren't in the training set.

Article Details:

Article link: available for download here
Code link: available here
Published on: 01/11/2020 on arXiv
Presented at: NeurIPS 2020

Article Domains:

  • zero-shot learning
  • compositional generalization - the ability to recognize new, previously unseen, combinations of known features

Mathematical Tools, Concepts and Notation:

  • Causality: causality graphs, confounding features, feature intervention
  • Disentangled representations
  • Hilbert-Schmidt Independence Criterion (HSIC): a tool to evaluate and measure the independence of two random variables from their samples
  • Post Interventional Disagreement (PIDA)

The Article in Essence:

One of the main challenges in zero-shot learning is giving the model the ability of compositional generalization. In other words, we would like to teach the model to recognize new combinations of basic data components that it has already learned to recognize in the training set (compositional generalization is a special case of zero-shot learning). Let's start with an example of compositional generalization in the visual domain. Suppose you've never seen a white wolf, but as soon as you see one, you would easily be able to recognize it as a "white wolf", because you know what a wolf looks like and you are able to recognize the color white. That is, in human memory the object "wolf" and the attribute "white" are stored separately, and it's easy for us to combine the two even if we have never seen that combination before. Unfortunately, deep learning models that are trained in a discriminative fashion struggle to demonstrate this ability, for two main reasons:

  1. Distribution differences between the training and the test sets: during training, the model never "sees" the test-set combinations. Moreover, the model learns specific connections between feature combinations, and these interfere with recognizing them properly on the test set. A model trained only on grey wolves learns to connect the attribute "grey" to the object "wolf" and will therefore have trouble detecting wolves of other colors.
  2. Training set labels: the model has difficulty breaking the labels into their basic components. For example, if the picture label is "grey wolf", a discriminatively trained model probably won't be smart enough to understand which visual features matter for recognizing the object "wolf" and which ones define the color "grey".

The article tries to overcome these difficulties by offering a generative model whose input is the label combination for the picture. For example, to generate a photo of a white wolf, we choose the object type (wolf) and the attribute (white), then generate an image based on these properties. The advantage of this approach is that the conditional distribution of an image, given the label combination, is identical in the training and the test sets (!).

Intuition corner:

Images conditioned on a combination of an object and an attribute tend to look similar in the training and the test sets, as opposed to images conditioned on the object type alone or on the attribute alone. This is the base assumption of the article.

You may ask what the connection is between a generative model and the zero-shot learning problem that the article tries to solve. The answer is intuitive - once the generative model is trained, during inference on an image x (represented by its feature vector), we choose the attribute-object combination (a, o) that maximizes the conditional probability P(x | a, o).

So far so good, but how are learning and (zero-shot style) inference carried out based on these assumptions? For this purpose, the article constructs a causal graph (G) that describes the process of creating 'real' images. G can be described as follows:

  1. A pair of an object (o) and an attribute (a) is chosen from the object space S_o and the attribute space S_a. Note that a and o depend on each other - a.k.a. confounding. This dependency is the main obstacle to compositional generalization in discriminative zero-shot models. The objects and the attributes are modeled by categorical variables (you can think of S_o and S_a as dictionaries of object and attribute types).
  2. The object o and the attribute a create the core features f_o and f_a. As we previously said, the base assumption of the article is that the distributions of f_o and f_a don't differ between the train and the test sets.
  3. The core features f_o and f_a jointly create the feature vector x of the image, through a mapping g.

But how is the described causal graph connected to zero-shot (ZS) learning? We need a way to model test-set combinations that were not seen in the training set, and we do it by changing G. The article suggests performing an "intervention" on G: we force specific values for a and o, which removes their dependency. So the zero-shot problem that the article solves can be put this way:

Find the intervention on the object and attribute types that has the highest probability of having created the picture.
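To make the graph and the intervention more tangible, here is a minimal NumPy sketch. Everything in it is a toy assumption made for illustration: the label names, the joint table JOINT, the 2-D prototypes, the Gaussian noise and the use of concatenation as a stand-in for the image-generating mapping. It is not the article's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

ATTRS = ["grey", "white"]
OBJS = ["wolf", "rabbit"]

# Hypothetical confounded joint P(a, o) of the training data:
# rows are attributes, columns are objects. "white wolf" never occurs.
JOINT = np.array([[0.5, 0.1],
                  [0.0, 0.4]])

# Hypothetical prototypes of the core features (2-D for readability).
PROTO_A = np.array([[1.0, 0.0], [0.0, 1.0]])
PROTO_O = np.array([[2.0, 2.0], [-2.0, -2.0]])


def sample_labels():
    """Draw (a, o) from the confounded joint distribution."""
    flat = rng.choice(JOINT.size, p=JOINT.ravel())
    a, o = np.unravel_index(flat, JOINT.shape)
    return int(a), int(o)


def generate(a=None, o=None, noise=0.1):
    """Walk the causal graph; passing a and o performs the intervention do(a, o)."""
    if a is None or o is None:
        a, o = sample_labels()
    f_a = PROTO_A[a] + noise * rng.standard_normal(2)  # attribute core features
    f_o = PROTO_O[o] + noise * rng.standard_normal(2)  # object core features
    # Concatenation plus noise is a stand-in for the image feature mapping.
    x = np.concatenate([f_a, f_o]) + noise * rng.standard_normal(4)
    return a, o, x


# Label pairs that show up when sampling from the confounded graph.
observed = {generate()[:2] for _ in range(2000)}

# Intervention: force "white" and "wolf", ignoring the joint table.
a, o, x = generate(a=ATTRS.index("white"), o=OBJS.index("wolf"))
print(sorted(observed))
print(ATTRS[a], OBJS[o], x.round(2))
```

Sampling from the confounded graph never produces the pair (white, wolf), while the intervention produces it directly and reuses the same core-feature distributions, which is exactly the property the article relies on.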

Explanation of the basic ideas

Now that we've understood the basic principles of the article, it's time to discuss the methodology of this approach. Our goal is to find, for a given picture, the pair of an object o and an attribute a for which the conditional probability of this picture, P(x | o, a), is maximized.

Definition: to solve this problem, the article defines two latent spaces F_o and F_a which contain the latent representations of objects and attributes. The object o induces a conditional distribution P(f_o | o), modeled as a Gaussian with mean h_o(o) and a diagonal covariance matrix. One can interpret h_o(o) as a prototype representation of the object o. The latent representation of the attribute a, marked as f_a, is defined similarly. Note that the article assumes that the distributions P(f_a | a) and P(f_o | o) are the same in the training set and the test set.

Figure 1: (a) The causal graph that generates an image. The solid arrows represent the real-world processes by which the two categorical variables “Object” and “Attribute” each generate “core features” [21, 17] φ_o and φ_a. The core features then jointly generate an image feature vector x. The core features are assumed to be stable for unseen combinations of objects and attributes. The dotted double-edged arrows between the Object and Attribute nodes indicate that there is a process “confounding” the two: they are not independent of each other. (b) An intervention that generates a test image of a purple-cauliflower, by enforcing a = purple and o = cauliflower. It cuts the confounding link between the two nodes [49] and changes the joint distribution of the nodes to the “interventional distribution”. (c) Illustration of the learned mappings, detailed in Section 4.

The feature vector of an image, x, is modeled as a Gaussian with mean g(f_a, f_o) and a fixed diagonal covariance matrix. As usual when training generative models, we also need to model the posterior distribution of the latent representation vectors f_o and f_a given the image feature vector x. These models are marked by g_io and g_ia.
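The definitions so far can be written compactly as follows. This is only a restatement of the text above; the symbols Σ_o, Σ_a and Σ_x are names introduced here for the diagonal covariance matrices and do not necessarily match the article's notation.

```latex
\begin{aligned}
f_o \mid o &\sim \mathcal{N}\big(h_o(o),\ \Sigma_o\big), \\
f_a \mid a &\sim \mathcal{N}\big(h_a(a),\ \Sigma_a\big), \\
x \mid f_a, f_o &\sim \mathcal{N}\big(g(f_a, f_o),\ \Sigma_x\big).
\end{aligned}
```

Because all three distributions are Gaussian, each log-density is, up to an additive constant, a negative squared distance from the corresponding mean, scaled by the covariance.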

Now that we're done with all the definitions, we can move on to describing the learning process itself. The goal of the learning process is to train 5 networks (all MLPs): h_a, h_o, g, g_ia, g_io.

The loss function is composed of 3 parts:

  1. L_like - for a labeled image in the training set with an object o and an attribute a, this loss is composed of 3 components:
  • One that ensures that the estimate of the latent representation of the object o, given by the network g_io(x), is brought closer to the prototype representation h_o of o. The distance here is measured as the squared difference between h_o and g_io(x).
  • Another that measures the squared distance between g_ia(x) and its prototype h_a.
  • And a triplet loss, where the anchor is the feature vector of the image x, the positive example is the image's real label pair (a, o) and the negative example is some random pair of an object and an attribute. The distance function here is the Euclidean distance between x and g(h_a, h_o), the feature vector generated for the pair. The purpose of the triplet loss is to minimize the distance between the anchor and the positive example and to maximize the distance between the anchor and the negative example. In our case, we want the features generated from the true object/attribute pair to be similar to x, and the features generated from the random object/attribute pair to be far from x.

2. Part 2 of the loss, L_indep: this part tries to minimize the conditional dependency between the core features f_a and f_o, given the object/attribute types. The causal graph in Figure 1 (a) above implies that the core feature f_o is independent of the attribute a, given the selected object o. By the way, the article states that this independence is related to a metric that measures the disentanglement of data representations (PIDA). Furthermore, f_a must be independent of f_o given the selected object o, and this independence must also hold given the attribute a. Since we can't sample the latent features f_o and f_a, we try to force the conditional independence on their posterior estimates given by g_ia(x) and g_io(x). But how can we build a loss that minimizes the statistical dependency between samples of random vectors? A simple correlation between the vectors isn't enough here, because it only measures linear dependency. There are parametric methods based on mutual information, and others based on adversarial training, but the article chose a non-parametric method called the Hilbert-Schmidt Independence Criterion (HSIC). Without getting too much into the mathematical details (HSIC is quite complex), think about this criterion as a generalization of the correlation between vectors, where the vectors are non-linearly transformed through a kernel. Let me just state that L_indep is composed of 4 HSIC components (we want to force conditional independence between 4 pairs among the core features, objects and attributes, some of which were listed at the beginning of this section).

3. Part 3 of the loss, L_invert, tries to force the embeddings h_o, h_a and the image feature vectors g(h_a, h_o) to contain as much information as possible about the true image labels a and o. If this isn't done, h_a and h_o might converge to a trivial solution, because we don't have access to the true values of the latent features f_o and f_a (see the explanation about the first loss, L_like). To this end:

  • A linear layer is added on top of each of h_a and h_o to classify the attribute and the object type, respectively (each receives its own linear layer), and each is trained with a cross-entropy loss (2 losses in total).
  • A linear layer is added on top of the representation network g to classify the object, and another linear layer to classify the attribute. Both are trained with a cross-entropy loss as well (2 losses in total).
  • The L_invert loss is the sum of the 4 losses described above.
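For readers who want a feel for the HSIC term used in part 2 of the loss, here is a small NumPy sketch of the standard biased empirical HSIC estimator with a Gaussian kernel. It is a simplification under stated assumptions: it measures plain (unconditional) dependence between two sample sets, the kernel bandwidth sigma is an arbitrary choice, and the function names are made up for this post. The article's actual loss conditions on the labels and is trained with gradients, which this sketch does not do.

```python
import numpy as np


def rbf_kernel(z, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of z."""
    sq = np.sum(z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * z @ z.T  # pairwise squared distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))


def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC estimate: trace(K H L H) / (n - 1)^2."""
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    k = rbf_kernel(x, sigma)
    l = rbf_kernel(y, sigma)
    return float(np.trace(k @ h @ l @ h)) / (n - 1) ** 2


rng = np.random.default_rng(0)
u = rng.standard_normal((200, 1))
independent = rng.standard_normal((200, 1))
dependent = u ** 2  # nearly uncorrelated with u, yet fully determined by it

print("HSIC(u, independent):", hsic(u, independent))
print("HSIC(u, dependent):  ", hsic(u, dependent))
```

The pair (u, u²) is a case where linear correlation is close to zero although the variables are clearly dependent; the kernel-based estimate is noticeably larger for it than for the independent pair, which is why a criterion of this kind is a better fit here than plain correlation.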

Now we must discuss the inference method.


As we previously said, we are trying to locate the pair (a, o) which maximizes the probability of a given image x. The article shows that log p(x | a, o) can be approximated, up to sign and additive constants, by a sum of these 3 components:

  1. The squared distance between g_ia(x) and a's prototype h_a. This distance represents "how much the image contains the attribute a", where g_ia(x) serves as the estimate of the core feature f_a of x.
  2. The squared distance between g_io(x) and o's prototype h_o.
  3. The squared distance between g(h_a, h_o) and the image x, which represents how accurately the image x can be reconstructed from the pair (a, o).

At the end, the pair (a, o) which maximizes log p(x | a, o), i.e. the pair with the smallest total distance, is chosen.
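The inference rule can be sketched as a brute-force search over all attribute/object pairs. This is a minimal illustration and not the article's code: the function name infer_pair, the weights w, the toy prototypes and the use of concatenation as a stand-in for g are all assumptions. In the real model, the estimated core features would come from the trained networks g_ia and g_io.

```python
import numpy as np


def infer_pair(x, fa_hat, fo_hat, h_a, h_o, g, w=(1.0, 1.0, 1.0)):
    """Return the (attribute, object) indices with the lowest total squared distance.

    fa_hat, fo_hat: estimated core features of the image (outputs of g_ia, g_io).
    h_a, h_o: prototype matrices, one row per attribute / object.
    g: callable mapping (attribute prototype, object prototype) to image-feature space.
    w: hypothetical weights of the three distance terms.
    """
    best, best_score = None, np.inf
    for a in range(h_a.shape[0]):
        for o in range(h_o.shape[0]):
            score = (w[0] * np.sum((fa_hat - h_a[a]) ** 2)
                     + w[1] * np.sum((fo_hat - h_o[o]) ** 2)
                     + w[2] * np.sum((x - g(h_a[a], h_o[o])) ** 2))
            if score < best_score:
                best, best_score = (a, o), score
    return best


# Toy setup: 2 attributes, 3 objects, concatenation as a stand-in for g.
h_a = np.array([[1.0, 0.0], [0.0, 1.0]])
h_o = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]])
g = lambda fa, fo: np.concatenate([fa, fo])

x = np.array([0.1, 0.9, -1.9, -2.1])  # lies close to g(h_a[1], h_o[2])
print(infer_pair(x, fa_hat=x[:2], fo_hat=x[2:], h_a=h_a, h_o=h_o, g=g))
```

Note that the search runs over every pair, including pairs that never appeared together in training, which is what makes the prediction zero-shot.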


The article demonstrates performance improvements on zero-shot tasks on the MIT-States and UT-Zappos50K datasets and on the synthetic dataset AO-CLEVR, compared with several other zero-shot methods, such as TMN, ATTOP and VisProd.

Table 1: Results for Zappos. ± denotes the Standard Error of the Mean (S.E.M.) over 5 random model initializations.


This is a very interesting article which suggests a new method for zero-shot learning, addressing the difficulties that discriminative models experience when encountering new, unseen combinations in the test set. The article suggests a causal framework to overcome this difficulty, and reports significant performance improvements on zero-shot tasks on 3 datasets. The article uses quite heavy mathematical tools, but is written in a very clear manner, which allows the reader to easily understand the main idea. In short - I highly recommend reading it!


This post was written by Michael (Mike) Erlihson, Ph.D.

Michael works in the cybersecurity company Salt Security as a principal data scientist. Michael researches and works in the deep learning field while lecturing and making scientific material more accessible to the public audience.