Generating Image / Video from text using GANs

If you use Instagram or Facebook and want to upload a photo with a fitting caption, you can find many applications that will suggest a caption based on your image. But have you ever wondered about the reverse: an application where you enter a piece of text or a caption and get back an image or a video based on it? Interesting, isn’t it? For a long time this was not possible, but models based on Generative Adversarial Networks (GANs) can now synthesize an image or a video from text entered by the user. Something like this:

Figure 1. Image synthesized by different GAN architectures on the entered text.

As we see above, the user enters text and the model synthesizes an image from it, using the representations it has learned from the training data. In a very similar way, a GAN model can generate a video from text, depending on the data it was trained on. In this article, we will discuss two research papers that use different GAN architectures to generate images and videos from a text caption: “StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks” and “To Create What You Tell: Generating Videos from Captions”.

Before diving into the two papers, we will first look at Generative Adversarial Networks, the basic model both papers build on: its architecture and its objective function, which will be helpful for understanding the objective functions in the two papers.

About GANs

Figure 2. GAN

A GAN consists of two models, a generator and a discriminator. The two are set up in a contest or a game (in the game-theoretic sense) where the generator seeks to fool the discriminator, and the discriminator is provided with both real and generated samples. The generator produces a batch of samples, and these, along with real examples from the domain, are provided to the discriminator, which classifies them as real or fake. The discriminator is then updated to get better at telling real and fake samples apart in the next round, and importantly, the generator is updated based on how well, or not, the generated samples fooled the discriminator.

The generator G is optimized to reproduce the true data distribution p_data by generating images that are difficult for the discriminator D to differentiate from real images. Meanwhile, D is optimized to distinguish real images from synthetic images generated by G. Overall, the training procedure is similar to a two-player min-max game with the following objective function:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

where x is a real image from the true data distribution p_data, and z is a noise vector sampled from distribution p_z (e.g., uniform or Gaussian distribution).
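To make the two sides of this game concrete, here is a minimal NumPy sketch that estimates the value E[log D(x)] + E[log(1 - D(G(z)))] from discriminator outputs. The function name `gan_value` and the example numbers are chosen here for illustration and do not come from either paper; the discriminator outputs are assumed to be probabilities in (0, 1).

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, in (0, 1)
    """
    # Clip to avoid log(0) when the discriminator saturates.
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# The discriminator tries to push this value up, the generator tries to push it down.
confident_d = gan_value([0.9, 0.95], [0.1, 0.05])  # D separates real from fake well
fooled_d = gan_value([0.5, 0.5], [0.5, 0.5])       # D cannot tell them apart
print(confident_d, fooled_d)
```

When the discriminator cannot separate real from generated samples and outputs 0.5 everywhere, the value is 2·log(0.5), which is lower than the value a confident discriminator reaches.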

Text to Image synthesis: StackGAN

In this paper, the authors propose Stacked Generative Adversarial Networks (StackGAN) to generate 256×256 photo-realistic images conditioned on text descriptions. They decompose the hard problem into more manageable sub-problems through a sketch-refinement process, which splits the text-to-image generative process into two stages:

  • Stage-I GAN: it sketches the primitive shape and basic colors of the object conditioned on the given text description, and draws the background layout from a random noise vector, yielding a low-resolution image.
  • Stage-II GAN: it corrects defects in the low-resolution image from Stage-I and completes details of the object by reading the text description again, producing a high-resolution photo-realistic image.

Figure 3. The architecture of StackGAN.

As we can see in Figure 3 above, the Stage-I generator draws a low-resolution image by sketching rough shapes and basic colors of the object from the given text and painting the background from a random noise vector. Conditioned on Stage-I results, the Stage-II generator corrects defects and adds compelling details into Stage-I results, yielding a more realistic high-resolution image.

The text description t is first encoded by an encoder, yielding a text embedding 𝜑ₜ. The paper then introduces a Conditioning Augmentation technique to produce additional conditioning variables ĉ. The latent variables ĉ are randomly sampled from an independent Gaussian distribution N(µ(𝜑ₜ), Σ(𝜑ₜ)), where the mean µ(𝜑ₜ) and diagonal covariance matrix Σ(𝜑ₜ) are functions of the text embedding 𝜑ₜ. This yields more training pairs from a small number of image-text pairs, and thus encourages robustness to small perturbations along the conditioning manifold. To further enforce smoothness over the conditioning manifold and avoid overfitting, the following regularization term is added to the objective of the generator during training:

$$D_{KL}\left(\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t)) \,\|\, \mathcal{N}(0, I)\right)$$

which is the Kullback-Leibler divergence (KL divergence) between the standard Gaussian distribution and the conditioning Gaussian distribution.
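Conditioning Augmentation can be sketched in a few lines of NumPy. The sketch below assumes the mean and the log-variance of the diagonal Gaussian are already available as vectors (in a real model they would be produced from 𝜑ₜ by a learned layer), draws ĉ with the reparameterization trick, and evaluates the closed-form KL divergence to the standard Gaussian. The function name and argument names are illustrative.

```python
import numpy as np

def conditioning_augmentation(mu, log_var, rng):
    """Sample c_hat ~ N(mu, diag(exp(log_var))) and return it with the KL term.

    mu, log_var: 1-D arrays, assumed here to be given (hypothetical inputs;
    a trained model would compute them from the text embedding).
    """
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    sigma = np.exp(0.5 * log_var)
    # Reparameterization: c_hat = mu + sigma * eps with eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    c_hat = mu + sigma * eps
    # Closed form of KL(N(mu, diag(sigma^2)) || N(0, I)).
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return c_hat, float(kl)

rng = np.random.default_rng(0)
c_hat, kl = conditioning_augmentation(np.zeros(4), np.zeros(4), rng)
c_shifted, kl_shifted = conditioning_augmentation(np.ones(2), np.zeros(2), rng)
print(c_hat.shape, kl, kl_shifted)
```

The KL term is zero when the conditioning Gaussian already equals the standard Gaussian and grows as the mean or variance drifts away from it, which is what makes it act as a smoothness regularizer.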

Stage-I GAN:

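Let 𝜑ₜ be the text embedding of the given description, which is generated by a pre-trained encoder. The Gaussian conditioning variables ĉ₀ for the text embedding are sampled from N(µ₀(𝜑ₜ), Σ₀(𝜑ₜ)) to capture the meaning of 𝜑ₜ with variations. Conditioned on ĉ₀ and the random variable z, Stage-I GAN trains the discriminator D₀ and the generator G₀ by alternately maximizing ℒ_D₀ and minimizing ℒ_G₀ in the equations below:

$$\mathcal{L}_{D_0} = \mathbb{E}_{(I_0, t) \sim p_{data}}[\log D_0(I_0, \varphi_t)] + \mathbb{E}_{z \sim p_z,\, t \sim p_{data}}[\log(1 - D_0(G_0(z, \hat{c}_0), \varphi_t))]$$

$$\mathcal{L}_{G_0} = \mathbb{E}_{z \sim p_z,\, t \sim p_{data}}[\log(1 - D_0(G_0(z, \hat{c}_0), \varphi_t))] + \lambda D_{KL}\left(\mathcal{N}(\mu_0(\varphi_t), \Sigma_0(\varphi_t)) \,\|\, \mathcal{N}(0, I)\right)$$

where the real image I₀ and the text description t are from the true data distribution p_data, z is a noise vector randomly sampled from a given distribution p_z (a Gaussian distribution in this paper), and λ is a regularization parameter that balances the two terms in ℒ_G₀.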

Stage-II GAN:

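Stage-II GAN is conditioned on the low-resolution images and, once again, on the text embedding, in order to correct defects in the Stage-I results. Conditioning on the low-resolution result s₀ = G₀(z, ĉ₀) and Gaussian latent variables ĉ, the discriminator D and generator G in Stage-II GAN are trained by alternately maximizing ℒ_D and minimizing ℒ_G in the equations below, with λ weighting the KL regularization term:

$$\mathcal{L}_{D} = \mathbb{E}_{(I, t) \sim p_{data}}[\log D(I, \varphi_t)] + \mathbb{E}_{s_0 \sim p_{G_0},\, t \sim p_{data}}[\log(1 - D(G(s_0, \hat{c}), \varphi_t))]$$

$$\mathcal{L}_{G} = \mathbb{E}_{s_0 \sim p_{G_0},\, t \sim p_{data}}[\log(1 - D(G(s_0, \hat{c}), \varphi_t))] + \lambda D_{KL}\left(\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t)) \,\|\, \mathcal{N}(0, I)\right)$$

Unlike in the original GAN formulation, the random noise z is not used in this stage, on the assumption that the randomness has already been preserved by s₀. The Gaussian conditioning variables ĉ used in this stage and ĉ₀ used in Stage-I GAN share the same pre-trained text encoder, generating the same text embedding 𝜑ₜ.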


The paper uses a recently proposed numerical assessment approach, the “Inception score”, for quantitative evaluation:

$$I = \exp\left(\mathbb{E}_{x}\, D_{KL}\left(p(y|x) \,\|\, p(y)\right)\right)$$

where x denotes one generated sample, and y is the label predicted by the Inception model. StackGAN is compared with state-of-the-art text-to-image methods on the CUB, Oxford-102 and COCO datasets based on the Inception score and human rank, with the following result:
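The Inception score is the exponential of the average KL divergence between each sample's predicted label distribution p(y|x) and the marginal label distribution p(y). The sketch below assumes the class probabilities have already been obtained from a classifier and are passed in as a plain array; `inception_score` is a name chosen here, and loading an actual Inception model is left out.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception score from predicted class probabilities.

    probs: array of shape (num_samples, num_classes); row i is p(y | x_i).
    """
    probs = np.asarray(probs, dtype=float)
    # Marginal label distribution p(y), averaged over all generated samples.
    p_y = probs.mean(axis=0, keepdims=True)
    # Per-sample KL(p(y|x) || p(y)); eps guards against log(0).
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Every sample gets the same uninformative prediction: lowest possible score.
uniform_score = inception_score(np.full((5, 3), 1.0 / 3.0))
# Confident predictions spread evenly over 3 classes: score approaches 3.
diverse_score = inception_score(np.eye(3))
print(uniform_score, diverse_score)
```

A higher score therefore rewards samples that are individually recognizable (peaked p(y|x)) and collectively diverse (flat p(y)).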

StackGAN achieves the best Inception score and average human rank on all three datasets. Compared with GAN-INT-CLS, StackGAN achieves a 28.47% improvement in Inception score on the CUB dataset (from 2.88 to 3.70), and a 20.30% improvement on Oxford-102 (from 2.66 to 3.20). The better average human rank of StackGAN also indicates that the method is able to generate more realistic samples conditioned on text descriptions.
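The quoted percentages are relative improvements over the GAN-INT-CLS scores and can be checked from the two pairs of numbers. `relative_improvement` is a helper name introduced here only for this check.

```python
def relative_improvement(baseline, new):
    """Percentage improvement of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0

cub = relative_improvement(2.88, 3.70)     # CUB: GAN-INT-CLS vs. StackGAN
oxford = relative_improvement(2.66, 3.20)  # Oxford-102: GAN-INT-CLS vs. StackGAN
print(f"{cub:.2f}% {oxford:.2f}%")
```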

Figure 4. Samples generated by StackGAN from unseen texts in CUB test set.


The proposed method decomposes text-to-image synthesis into a novel sketch-refinement process. Stage-I GAN sketches the object following basic color and shape constraints from the given text descriptions. Stage-II GAN corrects the defects in the Stage-I results and adds more details, yielding higher-resolution images with better image quality.

Text to Video synthesis: TGANs-C

This paper presents a novel architecture, Temporal GANs conditioning on Captions (TGANs-C), in which the input to the generator network is a concatenation of a latent noise vector and a caption embedding, which is then transformed into a frame sequence with 3D spatio-temporal convolutions.

Unlike a naive discriminator, which only judges whether its input is fake or real, the TGANs-C discriminator additionally checks whether the video matches the correct caption. In particular, the discriminator network consists of three discriminators: a video discriminator, which separates realistic videos from generated ones and optimizes video-caption matching; a frame discriminator, which discriminates between real and fake frames and aligns frames with the conditioning caption; and a motion discriminator, which emphasizes that adjacent frames in the generated videos should be as smoothly connected as in real ones.

Figure 5. Temporal GANs conditioning on Captions (TGANs-C) framework

As we can see from the above figure, the TGANs-C framework mainly consists of a generator network 𝐺 and a discriminator network 𝐷. Given an input caption, a bi-LSTM is first utilized to contextually embed the input word sequence, followed by an LSTM-based encoder to obtain the sentence representation S. The generator network 𝐺 tries to synthesize realistic videos from the concatenation of the sentence representation S and a random noise variable z. The discriminator network 𝐷 includes three discriminators: a video discriminator to distinguish real videos from synthetic ones and align each video with the correct caption, a frame discriminator to determine whether each frame is real or fake and semantically matched or mismatched with the given caption, and a motion discriminator to exploit temporal coherence between consecutive frames. Accordingly, the whole architecture is trained with the video-level matching-aware loss, the frame-level matching-aware loss, and the temporal coherence loss in a two-player minimax game.

Video-level matching-aware loss:

The input video-caption pair {𝑣, S} might not only come from distinct sources (i.e., real or synthetic), but also contain matched or mismatched semantics. Hence, given the real-synthetic video triplet {𝑣_𝑠𝑦𝑛+ , 𝑣_𝑟𝑒𝑎𝑙+ , 𝑣_𝑟𝑒𝑎𝑙− } and the conditioning caption S, the video-level matching-aware loss is measured as:

By minimizing this loss over the positive video-caption pair (i.e., {𝑣_𝑟𝑒𝑎𝑙+ , S}) and the negative video-caption pairs (i.e., {𝑣_𝑠𝑦𝑛+ , S} and {𝑣_𝑟𝑒𝑎𝑙− , S}), the video discriminator 𝐷₀ is trained not only to distinguish real videos from synthetic ones but also to separate semantically matched video-caption pairs from mismatched ones.
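The idea of one positive pair and two negative pairs can be sketched as a cross-entropy over the three discriminator outputs. This is only an illustration of the description above, not the paper's exact formula: the function name is hypothetical, the discriminator outputs are assumed to be probabilities in (0, 1), and weighting the three terms equally is an assumption of the sketch.

```python
import numpy as np

def matching_aware_loss(d_real_pos, d_real_neg, d_syn_pos, eps=1e-12):
    """Cross-entropy over one real-synthetic triplet for a caption S.

    d_real_pos: D(v_real+, S), the positive pair (should be scored near 1)
    d_real_neg: D(v_real-, S), a negative pair (should be scored near 0)
    d_syn_pos:  D(v_syn+, S), a negative pair (should be scored near 0)
    Equal weighting of the three terms is an assumption of this sketch.
    """
    p = np.clip([d_real_pos, d_real_neg, d_syn_pos], eps, 1.0 - eps)
    return float(-(np.log(p[0]) + np.log(1.0 - p[1]) + np.log(1.0 - p[2])) / 3.0)

sharp = matching_aware_loss(0.99, 0.01, 0.01)  # discriminator ranks pairs correctly
unsure = matching_aware_loss(0.5, 0.5, 0.5)    # discriminator cannot tell pairs apart
print(sharp, unsure)
```

A discriminator that scores the positive pair high and both negative pairs low gets a loss close to zero, while an undecided one sits at log 2.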

Frame-level matching-aware loss:

To further enhance the realism of each frame and its semantic alignment with the conditioning caption, a frame-level matching-aware loss is introduced, which forces the frame discriminator 𝐷₁ to judge whether each frame of the input video is both real and semantically matched with the caption. Therefore, given the real-synthetic video triplet {𝑣_𝑠𝑦𝑛+ , 𝑣_𝑟𝑒𝑎𝑙+ , 𝑣_𝑟𝑒𝑎𝑙− } and the conditioning caption S, the frame-level matching-aware loss is calculated as:

where 𝑓^𝑖_𝑟𝑒𝑎𝑙+ , 𝑓^𝑖_𝑟𝑒𝑎𝑙− and 𝑓^𝑖_𝑠𝑦𝑛+ denote the 𝑖-th frame in 𝑣_𝑟𝑒𝑎𝑙+ , 𝑣_𝑟𝑒𝑎𝑙− and 𝑣_𝑠𝑦𝑛+ , respectively.
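At the frame level the same positive/negative scheme is applied to every frame and then aggregated over the video. The sketch below assumes the frame discriminator's outputs are given as one probability per frame, and it averages over frames and over the three terms; both the averaging and the function name are assumptions made for illustration rather than the paper's stated normalization.

```python
import numpy as np

def frame_matching_aware_loss(d_real_pos, d_real_neg, d_syn_pos, eps=1e-12):
    """Frame-level cross-entropy for one triplet.

    Each argument is a 1-D sequence with one entry per frame:
    D1(f^i_real+, S), D1(f^i_real-, S) and D1(f^i_syn+, S).
    """
    def clip(p):
        return np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)

    real_pos, real_neg, syn_pos = clip(d_real_pos), clip(d_real_neg), clip(d_syn_pos)
    # Per-frame loss: real+ frames should score high, real- and syn+ frames low.
    per_frame = -(np.log(real_pos) + np.log(1.0 - real_neg) + np.log(1.0 - syn_pos)) / 3.0
    return float(per_frame.mean())

sharp = frame_matching_aware_loss([0.9, 0.8, 0.9], [0.1, 0.2, 0.1], [0.1, 0.1, 0.2])
unsure = frame_matching_aware_loss([0.5] * 3, [0.5] * 3, [0.5] * 3)
print(sharp, unsure)
```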

Temporal coherence loss:

Temporal coherence is a generic prior for video modeling, reflecting the intrinsic characteristic of video that consecutive frames are usually visually and semantically coherent. To incorporate this temporal coherence prior into TGANs-C for video generation, the paper considers two kinds of schemes on the basis of the motion discriminator 𝐷₂(𝑓^𝑖 , 𝑓^𝑖−1).

  • (1) Temporal coherence constraint loss: given the real-synthetic video triplet, the temporal coherence of the synthetic video 𝑣_𝑠𝑦𝑛+ is characterized as a constraint loss by accumulating the Euclidean distances over every two consecutive frames:
  • (2) Temporal coherence adversarial loss: similar to the frame discriminator 𝐷₁, the motion tensor m_𝑓^𝑖 in the motion discriminator 𝐷₂ is first augmented with the embedded sentence representation 𝜑₂(S). Next, this concatenated tensor representation is used to measure the final probability Φ₂(m_𝑓^𝑖 , S) of classifying the temporal dynamics between consecutive frames as real, conditioned on the given caption. Thus, given the real-synthetic video triplet {𝑣_𝑠𝑦𝑛+ , 𝑣_𝑟𝑒𝑎𝑙+ , 𝑣_𝑟𝑒𝑎𝑙− } and the conditioning caption S, the temporal coherence adversarial loss is measured as:

where m_𝑓^𝑖_𝑟𝑒𝑎𝑙+ , m_𝑓^𝑖_𝑟𝑒𝑎𝑙− and m_𝑓^𝑖_𝑠𝑦𝑛+ denote the motion tensors in 𝑣_𝑟𝑒𝑎𝑙+ , 𝑣_𝑟𝑒𝑎𝑙− and 𝑣_𝑠𝑦𝑛+ , respectively. By minimizing the temporal coherence adversarial loss, the motion discriminator 𝐷₂ is trained not only to distinguish the temporal dynamics of synthetic frames from those of real ones but also to align the temporal dynamics with the matched caption.
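Scheme (1), the constraint loss, is the simpler of the two to picture: it penalizes large jumps between neighbouring frames. The sketch below accumulates Euclidean distances between consecutive frames and averages them. Measuring the distance directly on raw pixel values and averaging over frame pairs are assumptions of this sketch; the actual model may measure it on learned frame representations.

```python
import numpy as np

def temporal_coherence_constraint(frames):
    """Mean Euclidean distance between every two consecutive frames.

    frames: array of shape (num_frames, ...), e.g. (T, H, W) or (T, H, W, C).
    """
    frames = np.asarray(frames, dtype=float)
    flat = frames.reshape(frames.shape[0], -1)  # one row vector per frame
    diffs = flat[1:] - flat[:-1]                # frame i minus frame i-1
    return float(np.linalg.norm(diffs, axis=1).mean())

static_video = np.ones((4, 2, 2))                              # nothing moves
drifting_video = np.stack([np.full((2, 2), t) for t in range(3)])  # each pixel +1 per frame
print(temporal_coherence_constraint(static_video))
print(temporal_coherence_constraint(drifting_video))
```

A static clip gives a loss of zero, and the loss grows with the amount of change between neighbouring frames, so minimizing it pushes the generator toward smooth motion.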


The overall training objective function of TGANs-C integrates the video-level matching-aware loss, the frame-level matching-aware loss and either the temporal coherence constraint loss or the temporal coherence adversarial loss. As TGANs-C is a variant of the GAN architecture, the whole architecture is trained in a two-player minimax game. The parameters of the discriminator network 𝐷 are updated according to the following overall loss:

where 𝒯 is the set of real-synthetic video triplets, and L̂^(1)_𝐷 and L̂^(2)_𝐷 denote the discriminator network 𝐷’s overall adversarial loss in the unconditional scheme (i.e., TGANs-C with temporal coherence Constraint loss, TGANs-C-C) and the conditional scheme (i.e., TGANs-C with temporal coherence Adversarial loss, TGANs-C-A), respectively. By minimizing this term, the discriminator network 𝐷 is trained to classify both videos and frames by their correct sources, and simultaneously to align videos and frames with semantically matching captions.

For the generator network 𝐺, its parameters are adjusted with the following overall loss:

where L̂^(1)_𝐺 and L̂^(2)_𝐺 denote the generator network 𝐺’s overall adversarial loss in TGANs-C-C and TGANs-C-A, respectively. The generator network 𝐺 is trained to fool the discriminator network 𝐷 on video and frame source prediction with its synthetic videos and frames, while aligning the synthetic videos and frames with the conditioning captions.


TGANs-C is evaluated and compared with state-of-the-art approaches on the video generation task using three datasets of progressively increasing complexity: Single-Digit Bouncing MNIST GIFs (SBMG), Two-Digit Bouncing MNIST GIFs (TBMG), and Microsoft Research Video Description Corpus (MSVD). The first two are recently released GIF-based datasets consisting of frames of moving MNIST digits, and the last is a popular video captioning benchmark of YouTube videos.

As shown in the above figure, each GIF in the first two datasets is accompanied by a single sentence describing the digit and its moving direction; two randomly selected examples from the MSVD dataset are also shown.


This paper presented the Temporal GANs conditioning on Captions (TGANs-C) architecture, which succeeds in generating videos that correspond to a given input caption. The model expands on the adversarial learning paradigm in three respects. First, it extends the 2D generator network to 3D to explicitly model spatio-temporal connections in videos. Second, in addition to a naive discriminator network that only judges fake or real, its discriminator further evaluates whether the generated videos or frames match the conditioning caption. Finally, to ensure that adjacent frames are coherently formed over time, the motion information between consecutive real or generated frames is taken into account in the discriminator network.


Pursuing MS in CS at Columbia University
