NeurIPS 2022

CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers

Ming Ding, Wendi Zheng, Wenyi Hong, Jie Tang

417 citations

Abstract

The development of transformer-based text-to-image models is impeded by slow generation and high complexity, especially for high-resolution images. In this work, we put forward a solution based on hierarchical transformers and local parallel autoregressive generation. We pretrain a 6B-parameter transformer with a simple and flexible self-supervised task, a cross-modal general language model (CogLM), and finetune it for fast super-resolution. The new text-to-image system, CogView2, achieves generation performance competitive with the concurrent state-of-the-art DALL-E-2, and naturally supports interactive text-guided editing on images.

Figure 1: Text-to-image samples from CogView2, which supports both Chinese and English. The actual input text is in Chinese, translated into English here for better understanding. Prompts shown: A lion man is typing in the office. A beautiful girl is hugging a husky. A lion teacher wearing a suit is in front of a blackboard. A robot is riding under the blue and cloudy sky. Several youths are talking in a bar. A young woman is taking photos. A tiger with angel's wings. A girl holding an oil-paper umbrella in a rainy lane. Earth in the Eye. A magnificent church. Sketch. Mount Fuji, cherry blossom and Akita dog. Oil painting. A pirate captain with a skull.

Code and a demo website will be made available at https://github.com/THUDM/CogView2.