CVPR 2023

Generalized Deep 3D Shape Prior via Part-Discretized Diffusion Process

Yuhan Li, Yishun Dou, Xuanhong Chen, Bingbing Ni, Yilin Sun, Yutian Liu, Fuzhen Wang

Abstract

We first provide the implementation details of the P-VQ-VAE, the discrete diffusion generator, and the condition pipeline in Sec. 1. Further ablation studies on important settings are reported in Sec. 2. Technical details of the experiments are given in Sec. 3, with more visual results in Sec. 4.

Implementation

P-VQ-VAE Backbone

Architecture details. Our P-VQ-VAE backbone consists of three components: an encoder E, a decoder D, and a vector quantizer VQ with convolutions. Following AutoSDF [8], we adapt the VQ-VAE from the VAE backbone of LDM [11]. We show the details of the encoder in Tab. 1, the decoder in Tab. 3, and the vector quantizer in Tab. 2.

Dataset details. We train the P-VQ-VAE on objects from 13 categories of ShapeNet [2]: airplane, bench, cabinet, car, chair, display, lamp, speaker, rifle, sofa, table, phone, and watercraft. We first extract the Truncated SDF (T-SDF) following the pre-processing steps in DISN [15] and PixelTransformer [13]. The shapes are normalized to lie in an origin-centered cube $[-1, 1]^3$, and the absolute values of most shapes' T-SDFs are less than 0.5. The signed distance function is evaluated at the locations of a uniformly sampled $64^3$ grid. Following AutoSDF [8], we use 0.2 as the truncation threshold to obtain the final T-SDF representations.

Training details. The whole 3D shape, in the form of a T-SDF $X \in \mathbb{R}^{64 \times 64 \times 64}$, is divided into partial regions $X' \in \mathbb{R}^{N \times 8 \times 8 \times 8}$, where $N = 512$ is the number of non-overlapping regions. We do this because operating directly on the whole shape is computationally unaffordable: the cost grows cubically with resolution. Afterward, all regions are vectorized by the encoder as $Z = E(X) \in \mathbb{R}^{N \times n_z}$, with each patch treated independently and equally.
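The truncation and partition steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' code: the helper names `truncate_sdf` and `split_into_patches` are hypothetical, the sphere SDF is a toy stand-in for a ShapeNet shape, and we assume that truncation means clamping values to [-0.2, 0.2] and that patches are ordered in row-major order over the 8×8×8 grid of regions.

```python
import numpy as np

def truncate_sdf(sdf, threshold=0.2):
    """Clamp signed distances to [-threshold, threshold] (assumed truncation rule)."""
    return np.clip(sdf, -threshold, threshold)

def split_into_patches(tsdf, patch=8):
    """Split a cubic (R, R, R) grid into non-overlapping (patch, patch, patch) regions."""
    r = tsdf.shape[0]
    assert tsdf.shape == (r, r, r) and r % patch == 0
    g = r // patch  # number of regions along each axis
    # Axes after reshape: (gx, px, gy, py, gz, pz)
    blocks = tsdf.reshape(g, patch, g, patch, g, patch)
    # Bring the three region indices to the front: (gx, gy, gz, px, py, pz)
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5)
    return blocks.reshape(g ** 3, patch, patch, patch)

# Toy input: SDF of a sphere of radius 0.5, sampled on a 64^3 grid over [-1, 1]^3
axis = np.linspace(-1.0, 1.0, 64)
x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
sdf = np.sqrt(x ** 2 + y ** 2 + z ** 2) - 0.5

tsdf = truncate_sdf(sdf)              # X, shape (64, 64, 64)
patches = split_into_patches(tsdf)    # X', shape (N, 8, 8, 8) with N = 512
print(patches.shape)
```

Each of the 512 patches would then be passed through the encoder independently; since a patch holds 8^3 = 512 voxels instead of 64^3 = 262,144, the per-sample cost of the encoder no longer scales with the full grid resolution.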