We describe a new training methodology for generative adversarial networks. The key idea is to grow both the generator and discriminator progressively: starting from a low resolution, we add new layers that model increasingly fine details as training progresses. This both speeds the training up and greatly stabilizes it, allowing us to produce images of unprecedented quality, e.g., CELEBA images at $1024^2$. We also propose a simple way to increase the variation in generated images, and achieve a record inception score of 8.80 in unsupervised CIFAR10. Additionally, we describe several implementation details that are important for discouraging unhealthy competition between the generator and discriminator. Finally, we suggest a new metric for evaluating GAN results, both in terms of image quality and variation. As an additional contribution, we construct a higher-quality version of the CELEBA dataset.
Generative methods that produce novel samples from high-dimensional data distributions, such as images, are finding widespread use, for example in speech synthesis (van den Oord et al., 2016a), image-to-image translation (Zhu et al., 2017; Liu et al., 2017; Wang et al., 2017), and image inpainting (Iizuka et al., 2017). Currently the most prominent approaches are autoregressive models (van den Oord et al., 2016b;c), variational autoencoders (VAE) (Kingma & Welling, 2014), and generative adversarial networks (GAN) (Goodfellow et al., 2014). They all have significant strengths and weaknesses. Autoregressive models - such as PixelCNN - produce sharp images but are slow to evaluate and do not have a latent representation as they directly model the conditional distribution over pixels, potentially limiting their applicability. VAEs are easy to train but tend to produce blurry results due to restrictions in the model, although recent work is improving this (Kingma et al., 2016). GANs produce sharp images, albeit only in fairly small resolutions and with somewhat limited variation, and the training continues to be unstable despite recent progress (Salimans et al., 2016; Gulrajani et al., 2017; Berthelot et al., 2017; Kodali et al., 2017). Hybrid methods combine various strengths of the three, but so far lag behind GANs in image quality (Makhzani & Frey, 2017; Ulyanov et al., 2017; Dumoulin et al., 2016).
Typically, a GAN consists of two networks: generator and discriminator (aka critic). The generator produces a sample, e.g., an image, from a latent code, and the distribution of these images should ideally be indistinguishable from the training distribution. Since it is generally infeasible to engineer a function that tells whether that is the case, a discriminator network is trained to do the assessment, and since networks are differentiable, we also get a gradient we can use to steer both networks in the right direction. Typically, the generator is of main interest - the discriminator is an adaptive loss function that gets discarded once the generator has been trained.
There are multiple potential problems with this formulation. When we measure the distance between the training distribution and the generated distribution, the gradients can point in more or less random directions if the distributions do not have substantial overlap, i.e., are too easy to tell apart (Arjovsky & Bottou, 2017). Originally, Jensen-Shannon divergence was used as a distance metric (Goodfellow et al., 2014), and recently that formulation has been improved (Hjelm et al., 2017) and a number of more stable alternatives have been proposed, including least squares (Mao et al., 2016b), absolute deviation with margin (Zhao et al., 2016), and Wasserstein distance (Arjovsky et al., 2017; Gulrajani et al., 2017). Our contributions are largely orthogonal to this ongoing discussion, and we primarily use the improved Wasserstein loss, but also experiment with least-squares loss.
The generation of high-resolution images is difficult because higher resolution makes it easier to tell the generated images apart from training images (Odena et al., 2017), thus drastically amplifying the gradient problem. Large resolutions also necessitate using smaller minibatches due to memory constraints, further compromising training stability. Our key insight is that we can grow both the generator and discriminator progressively, starting from easier low-resolution images, and add new layers that introduce higher-resolution details as the training progresses. This greatly speeds up training and improves stability in high resolutions, as we will discuss in Section 2.
The GAN formulation does not explicitly require the entire training data distribution to be represented by the resulting generative model. The conventional wisdom has been that there is a tradeoff between image quality and variation, but that view has been recently challenged (Odena et al., 2017). The degree of preserved variation is currently receiving attention and various methods have been suggested for measuring it, including inception score (Salimans et al., 2016), multi-scale structural similarity (MS-SSIM) (Odena et al., 2017; Wang et al., 2003), birthday paradox (Arora & Zhang, 2017), and explicit tests for the number of discrete modes discovered (Metz et al., 2016). We will describe our method for encouraging variation in Section 3, and propose a new metric for evaluating the quality and variation in Section 5.
Section 4.1 discusses a subtle modification to the initialization of networks, leading to a more balanced learning speed for different layers. Furthermore, we observe that mode collapses traditionally plaguing GANs tend to happen very quickly, over the course of a dozen minibatches. Commonly they start when the discriminator overshoots, leading to exaggerated gradients, and an unhealthy competition follows where the signal magnitudes escalate in both networks. We propose a mechanism to stop the generator from participating in such escalation, overcoming the issue (Section 4.2).
We evaluate our contributions using the CELEBA, LSUN, and CIFAR10 datasets. We improve the best published inception score for CIFAR10. Since the datasets commonly used in benchmarking generative methods are limited to a fairly low resolution, we have also created a higher quality version of the CELEBA dataset that allows experimentation with output resolutions up to $1024 \times 1024$ pixels. This dataset and our full implementation are available at [https://github.com/tkarras/progressive_growing_of_gans], trained networks can be found at [https://drive.google.com/open?id=0B4qLcYyJmiz0NHFULTdYc05lX0U] along with result images, and a supplementary video illustrating the datasets, additional results, and latent space interpolations is at [https://youtu.be/G06dEcZ-QTg].
Our primary contribution is a training methodology for GANs where we start with low-resolution images, and then progressively increase the resolution by adding layers to the networks as visualized in Figure 1. This incremental nature allows the training to first discover the large-scale structure of the image distribution and then shift attention to increasingly finer scale detail, instead of having to learn all scales simultaneously.
We use generator and discriminator networks that are mirror images of each other and always grow in synchrony. All existing layers in both networks remain trainable throughout the training process. When new layers are added to the networks, we fade them in smoothly, as illustrated in Figure 2. This avoids sudden shocks to the already well-trained, smaller-resolution layers. Appendix A describes the structure of the generator and discriminator in detail, along with other training parameters.
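To make the fade-in concrete, the following minimal sketch shows one way the blending of Figure 2 can be implemented on the generator side; the helper names (old_to_rgb, new_block, new_to_rgb) and the use of PyTorch are illustrative assumptions rather than part of our released implementation.

```python
import torch.nn.functional as F

def faded_generator_output(x, old_to_rgb, new_block, new_to_rgb, alpha):
    """Blend a newly added generator resolution block into the output.

    x          : features from the previously trained, lower-resolution layers
    old_to_rgb : 1x1 convolution mapping old features to RGB
    new_block  : the newly introduced higher-resolution layers
    new_to_rgb : 1x1 convolution mapping new features to RGB
    alpha      : fade-in weight, increased linearly from 0 to 1 during training
    """
    # Path that skips the new layers: upsample the old RGB output by 2x.
    skip_rgb = F.interpolate(old_to_rgb(x), scale_factor=2, mode='nearest')
    # Path through the new, higher-resolution block.
    new_rgb = new_to_rgb(new_block(x))
    # Linear blend: at alpha = 0 the new block has no effect, at alpha = 1 it is fully active.
    return (1.0 - alpha) * skip_rgb + alpha * new_rgb
```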
We observe that the progressive training has several benefits. Early on, the generation of smaller images is substantially more stable because there is less class information and fewer modes (Odena et al., 2017). By increasing the resolution little by little we are continuously asking a much simpler question compared to the end goal of discovering a mapping from latent vectors to e.g. $1024^2$ images. This approach has conceptual similarity to recent work by Chen & Koltun (2017). In practice it stabilizes the training sufficiently for us to reliably synthesize megapixel-scale images using WGAN-GP loss (Gulrajani et al., 2017) and even LSGAN loss (Mao et al., 2016b).
Another benefit is the reduced training time. With progressively growing GANs most of the iterations are done at lower resolutions, and comparable result quality is often obtained up to 2-6 times faster, depending on the final output resolution.
The idea of growing GANs progressively is related to the work of Wang et al. (2017), who use multiple discriminators that operate on different spatial resolutions. That work in turn is motivated by Durugkar et al. (2016), who use one generator and multiple discriminators concurrently, and Ghosh et al. (2017), who do the opposite with multiple generators and one discriminator. Hierarchical GANs (Denton et al., 2015; Huang et al., 2016; Zhang et al., 2017) define a generator and discriminator for each level of an image pyramid. These methods build on the same observation as our work - that the complex mapping from latents to high-resolution images is easier to learn in steps - but the crucial difference is that we have only a single GAN instead of a hierarchy of them. In contrast to early work on adaptively growing networks, e.g., growing neural gas (Fritzke, 1995) and neuroevolution of augmenting topologies (Stanley & Miikkulainen, 2002) that grow networks greedily, we simply defer the introduction of pre-configured layers. In that sense our approach resembles layer-wise training of autoencoders (Bengio et al., 2007).
GANs have a tendency to capture only a subset of the variation found in training data, and Salimans et al. (2016) suggest “minibatch discrimination” as a solution.
They compute feature statistics not only from individual images but also across the minibatch, thus encouraging the minibatches of generated and training images to show similar statistics.
This is implemented by adding a minibatch layer towards the end of the discriminator, where the layer learns a large tensor that projects the input activation to an array of statistics.
A separate set of statistics is produced for each example in a minibatch and it is concatenated to the layer’s output, so that the discriminator can use the statistics internally. We simplify this approach drastically while also improving the variation.
Our simplified solution has neither learnable parameters nor new hyperparameters.
We first compute the standard deviation for each feature in each spatial location over the minibatch. We then average these estimates over all features and spatial locations to arrive at a single value.
We replicate the value and concatenate it to all spatial locations and over the minibatch, yielding one additional (constant) feature map. This layer could be inserted anywhere in the discriminator, but we have found it best to insert it towards the end (see Appendix A.1 for details).
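A minimal sketch of this layer, assuming PyTorch-style (N, C, H, W) activations, is given below; the exact placement and surrounding layers follow Appendix A.1, and this code is illustrative rather than our released implementation.

```python
import torch

def minibatch_stddev_layer(x, eps=1e-8):
    """Append one constant feature map holding minibatch statistics.

    x: discriminator activations of shape (N, C, H, W); returns (N, C + 1, H, W).
    """
    # Standard deviation of each feature at each spatial location, over the minibatch.
    std = torch.sqrt(x.var(dim=0, unbiased=False) + eps)          # shape (C, H, W)
    # Average over all features and spatial locations -> a single value.
    mean_std = std.mean()
    # Replicate the value into one constant feature map per example and concatenate.
    extra = mean_std.view(1, 1, 1, 1).expand(x.shape[0], 1, x.shape[2], x.shape[3])
    return torch.cat([x, extra], dim=1)
```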
We experimented with a richer set of statistics, but were not able to improve the variation further. In parallel work, Lin et al. (2017) provide theoretical insights about the benefits of showing multiple images to the discriminator.
Alternative solutions to the variation problem include unrolling the discriminator (Metz et al., 2016) to regularize its updates, and a “repelling regularizer” (Zhao et al., 2017) that adds a new loss term to the generator, trying to encourage it to orthogonalize the feature vectors in a minibatch.
The multiple generators of Ghosh et al. (2017) also serve a similar goal. We acknowledge that these solutions may increase the variation even more than our solution - or possibly be orthogonal to it - but leave a detailed comparison to a later time.
GANs are prone to the escalation of signal magnitudes as a result of unhealthy competition between the two networks. Most if not all earlier solutions discourage this by using a variant of batch normalization (Ioffe & Szegedy, 2015; Salimans & Kingma, 2016; Ba et al., 2016) in the generator, and often also in the discriminator.
These normalization methods were originally introduced to eliminate covariate shift. However, we have not observed that to be an issue in GANs, and thus believe that the actual need in GANs is constraining signal magnitudes and competition. We use a different approach that consists of two ingredients, neither of which include learnable parameters.
We deviate from the current trend of careful weight initialization, and instead use a trivial $\mathcal{N}(0,1)$ initialization and then explicitly scale the weights at runtime. To be precise, we set $\hat{w}_i = w_i/c$, where $w_i$ are the weights and $c$ is the per-layer normalization constant from He's initializer (He et al., 2015). The benefit of doing this dynamically instead of during initialization is somewhat subtle, and relates to the scale-invariance in commonly used adaptive stochastic gradient descent methods such as RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). These methods normalize a gradient update by its estimated standard deviation, thus making the update independent of the scale of the parameter.
As a result, if some parameters have a larger dynamic range than others, they will take longer to adjust. This is a scenario modern initializers cause, and thus it is possible that a learning rate is both too large and too small at the same time. Our approach ensures that the dynamic range, and thus the learning speed, is the same for all weights. A similar reasoning was independently used by van Laarhoven (2017).
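As an illustration, a convolution layer with this runtime scaling could look roughly as follows; here the scaling is written as a multiplication by the He-initializer standard deviation $\sqrt{2/\mathrm{fan\_in}}$, which is equivalent to the division by $c$ above up to how the constant is defined, and the PyTorch module is our own sketch, not the released implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class EqualizedConv2d(nn.Module):
    """Convolution with trivial N(0,1) weight initialization and explicit runtime scaling."""

    def __init__(self, in_ch, out_ch, kernel_size, padding=0):
        super().__init__()
        # Trivial initialization: unit-variance Gaussian, identical for every layer.
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size))
        self.bias = nn.Parameter(torch.zeros(out_ch))
        # Per-layer He constant, applied at runtime instead of baked into the weights.
        self.scale = math.sqrt(2.0 / (in_ch * kernel_size * kernel_size))
        self.padding = padding

    def forward(self, x):
        # Adam/RMSProp now see unit-scale parameters in every layer, so the effective
        # learning speed (dynamic range) is the same for all weights.
        return F.conv2d(x, self.weight * self.scale, self.bias, padding=self.padding)
```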
To disallow the scenario where the magnitudes in the generator and discriminator spiral out of control as a result of competition, we normalize the feature vector in each pixel to unit length in the generator after each convolutional layer.
We do this using a variant of “local response normalization” (Krizhevsky et al., 2012), configured as $b_{x,y} = a_{x,y} / \sqrt{\frac{1}{N}\sum_{j=0}^{N-1}(a^j_{x,y})^2+\epsilon}$, where $\epsilon = 10^{-8}$, $N$ is the number of feature maps, and $a_{x,y}$ and $b_{x,y}$ are the original and normalized feature vector in pixel $(x,y)$, respectively. We find it surprising that this heavy-handed constraint does not seem to harm the generator in any way, and indeed with most datasets it does not change the results much, but it prevents the escalation of signal magnitudes very effectively when needed.
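A sketch of this pixelwise normalization, again assuming (N, C, H, W) activations and written as an illustrative PyTorch function, is:

```python
import torch

def pixel_norm(x, eps=1e-8):
    """Normalize the feature vector in each pixel to unit length (over the channel axis).

    Implements b_{x,y} = a_{x,y} / sqrt(mean_j (a^j_{x,y})^2 + eps) with eps = 1e-8.
    """
    return x / torch.sqrt((x * x).mean(dim=1, keepdim=True) + eps)
```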
In order to compare the results of one GAN to another, one needs to investigate a large number of images, which can be tedious, difficult, and subjective. Thus it is desirable to rely on automated methods that compute some indicative metric from large image collections. We noticed that existing methods such as MS-SSIM (Odena et al., 2017) find large-scale mode collapses reliably but fail to react to smaller effects such as loss of variation in colors or textures, and they also do not directly assess image quality in terms of similarity to the training set.
We build on the intuition that a successful generator will produce samples whose local image structure is similar to the training set over all scales. We propose to study this by considering the multiscale statistical similarity between distributions of local image patches drawn from Laplacian pyramid (Burt & Adelson, 1987) representations of generated and target images, starting at a low-pass resolution of $16\times16$ pixels. As per standard practice, the pyramid progressively doubles until the full resolution is reached, each successive level encoding the difference to an up-sampled version of the previous level.
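The pyramid decomposition itself is standard; the sketch below builds it with OpenCV's pyrDown/pyrUp operations for brevity, which is one possible choice rather than the exact filtering used in our implementation.

```python
import numpy as np
import cv2  # used only for the standard pyramid down/upsampling operations

def laplacian_pyramid(img, n_levels):
    """Return the Laplacian pyramid of an image, finest band first.

    img: float32 array of shape (H, W, 3) with power-of-two dimensions;
    the last returned level is the low-pass residual.
    """
    levels = []
    current = img
    for _ in range(n_levels - 1):
        down = cv2.pyrDown(current)
        up = cv2.pyrUp(down, dstsize=(current.shape[1], current.shape[0]))
        levels.append(current - up)   # band-pass detail at this resolution
        current = down
    levels.append(current)            # low-pass residual
    return levels
```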
A single Laplacian pyramid level corresponds to a specific spatial frequency band. We randomly sample 16384 images and extract 128 descriptors from each level in the Laplacian pyramid, giving us $2^{21}$ (2.1M) descriptors per level. Each descriptor is a $7\times7$ pixel neighborhood with 3 color channels, denoted by $x\in\mathbb{R}^{7 \times 7 \times 3}= \mathbb{R}^{147}$. We denote the patches from level $l$ of the training set and generated set as $\{x^l_i\}_{i=1}^{2^{21}}$ and $\{y^l_i\}_{i=1}^{2^{21}}$, respectively. We first normalize $\{x^l_i\}$ and $\{y^l_i\}$ w.r.t. the mean and standard deviation of each color channel, and then estimate the statistical similarity by computing their sliced Wasserstein distance $\mathrm{SWD}(\{x^l_i\}, \{y^l_i\})$, an efficiently computable randomized approximation to earth mover's distance, using 512 projections (Rabin et al., 2011).
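Given the normalized descriptor sets, the sliced Wasserstein distance reduces to sorting random 1-D projections. The following numpy sketch (with descriptor extraction and per-channel normalization assumed done as described above) illustrates the computation; it is a simplified stand-in for our actual evaluation code.

```python
import numpy as np

def sliced_wasserstein_distance(x, y, n_projections=512, seed=0):
    """Randomized approximation of the earth mover's distance between two patch sets.

    x, y: arrays of shape (n_patches, 147) holding per-channel-normalized descriptors
    from the training images and the generated images, with equal n_patches.
    """
    rng = np.random.RandomState(seed)
    dim = x.shape[1]
    dists = []
    for _ in range(n_projections):
        # Project both descriptor sets onto a random unit direction.
        direction = rng.normal(size=dim)
        direction /= np.linalg.norm(direction)
        proj_x = np.sort(x @ direction)
        proj_y = np.sort(y @ direction)
        # In 1-D, the Wasserstein-1 distance between equally sized empirical
        # distributions is the mean absolute difference of the sorted projections.
        dists.append(np.mean(np.abs(proj_x - proj_y)))
    return float(np.mean(dists))
```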
Intuitively a small Wasserstein distance indicates that the distribution of the patches is similar, meaning that the training images and generator samples appear similar in both appearance and variation at this spatial resolution. In particular, the distance between the patch sets extracted from the lowest-resolution $16\times16$ images indicates similarity in large-scale image structures, while the finest-level patches encode information about pixel-level attributes such as sharpness of edges and noise.
In this section we discuss a set of experiments that we conducted to evaluate the quality of our results. Please refer to Appendix A for a detailed description of our network structures and training configurations. We also invite the reader to consult the accompanying video ([https://youtu.be/G06dEcz-QTg]) for additional result images and latent space interpolations. In this section we will distinguish between the network structure (e.g., convolutional layers, resizing), training configuration (various normalization layers, minibatch-related operations), and training loss (WGAN-GP, LSGAN).
We will first use the sliced Wasserstein distance (SWD) and multi-scale structural similarity (MS-SSIM) (Odena et al., 2017) to evaluate the importance of our individual contributions, and also perceptually validate the metrics themselves. We will do this by building on top of a previous state-of-the-art loss function (WGAN-GP) and training configuration (Gulrajani et al., 2017) in an unsupervised setting using CELEBA (Liu et al., 2015) and LSUN BEDROOM (Yu et al., 2015) datasets in $128^2$ resolution. CELEBA is particularly well suited for such comparison because the training images contain noticeable artifacts (aliasing, compression, blur) that are difficult for the generator to reproduce faithfully. In this test we amplify the differences between training configurations by choosing a relatively low-capacity network structure (Appendix A.2) and terminating the training once the discriminator has been shown a total of 10M real images. As such the results are not fully converged.
Table 1 lists the numerical values for SWD and MS-SSIM in several training configurations, where our individual contributions are cumulatively enabled one by one on top of the baseline (Gulrajani et al., 2017). The MS-SSIM numbers were averaged from 10000 pairs of generated images, and SWD was calculated as described in Section 5. Generated CELEBA images from these configurations are shown in Figure 3. Due to space constraints, the figure shows only a small number of examples for each row of the table, but a significantly broader set is available in Appendix H. Intuitively, a good evaluation metric should reward plausible images that exhibit plenty of variation in colors, textures, and viewpoints. However, this is not captured by MS-SSIM: we can immediately see that configuration (h) generates significantly better images than configuration (a), but MS-SSIM remains approximately unchanged because it measures only the variation between outputs, not similarity to the training set. SWD, on the other hand, does indicate a clear improvement.
The first training configuration (a) corresponds to Gulrajani et al. (2017), featuring batch normalization in the generator, layer normalization in the discriminator, and a minibatch size of 64. (b) enables progressive growing of the networks, which results in sharper and more believable output images. SWD correctly finds the distribution of generated images to be more similar to the training set.
Our primary goal is to enable high output resolutions, and this requires reducing the size of minibatches in order to stay within the available memory budget. We illustrate the ensuing challenges in (c), where we decrease the minibatch size from 64 to 16. The generated images are unnatural, which is clearly visible in both metrics. In (d), we stabilize the training process by adjusting the hyperparameters as well as by removing batch normalization and layer normalization (Appendix A.2). As an intermediate test (e*), we enable minibatch discrimination (Salimans et al., 2016), which somewhat surprisingly fails to improve any of the metrics, including MS-SSIM that measures output variation. In contrast, our minibatch standard deviation (e) improves the average SWD scores and images. We then enable our remaining contributions in (f) and (g), leading to an overall improvement in SWD and subjective visual quality. Finally, in (h) we use a non-crippled network and longer training - we feel the quality of the generated images is at least comparable to the best published results so far.
Figure 4 illustrates the effect of progressive growing in terms of the SWD metric and raw image throughput. The first two plots correspond to the training configuration of Gulrajani et al. (2017) without and with progressive growing. We observe that the progressive variant offers two main benefits: it converges to a considerably better optimum and also reduces the total training time by about a factor of two. The improved convergence is explained by an implicit form of curriculum learning that is imposed by the gradually increasing network capacity. Without progressive growing, all layers of the generator and discriminator are tasked with simultaneously finding succinct intermediate representations for both the large-scale variation and the small-scale detail. With progressive growing, however, the existing low-resolution layers are likely to have already converged early on, so the networks are only tasked with refining the representations by increasingly smaller-scale effects as new layers are introduced. Indeed, we see in Figure 4(b) that the largest-scale statistical similarity curve (16) reaches its optimal value very quickly and remains consistent throughout the rest of the training. The smaller-scale curves (32, 64, 128) level off one by one as the resolution is increased, but the convergence of each curve is equally consistent. With non-progressive training in Figure 4(a), each scale of the SWD metric converges roughly in unison, as could be expected.
The speedup from progressive growing increases as the output resolution grows. Figure 4(c) shows training progress, measured in number of real images shown to the discriminator, as a function of training time when the training progresses all the way to $1024^2$ resolution. We see that progressive growing gains a significant head start because the networks are shallow and quick to evaluate at the beginning. Once the full resolution is reached, the image throughput is equal between the two methods. The plot shows that the progressive variant reaches approximately 6.4 million images in 96 hours, whereas it can be extrapolated that the non-progressive variant would take about 520 hours to reach the same point. In this case, the progressive growing offers roughly a $5.4\times$ speedup.
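The quoted factor follows directly from the two wall-clock figures above:
$$\frac{520\ \text{hours (extrapolated, non-progressive)}}{96\ \text{hours (progressive)}} \approx 5.4.$$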
To meaningfully demonstrate our results at high output resolutions, we need a sufficiently varied high-quality dataset. However, virtually all publicly available datasets previously used in GAN literature are limited to relatively low resolutions ranging from $32^2$ to $480^2$. To this end, we created a high-quality version of the CELEBA dataset consisting of 30000 of the images at $1024 \times 1024$ resolution. We refer to Appendix C for further details about the generation of this dataset.
Our contributions allow us to deal with high output resolutions in a robust and efficient fashion. Figure 5 shows selected $1024\times 1024$ images produced by our network. While megapixel GAN results have been shown before in another dataset (Marchesi, 2017), our results are vastly more varied and of higher perceptual quality. Please refer to Appendix F for a larger set of result images as well as the nearest neighbors found from the training data. The accompanying video shows latent space interpolations and visualizes the progressive training. The interpolation works so that we first randomize a latent code for each frame (512 components sampled individually from $\mathcal{N}(0,1)$), then blur the latents across time with a Gaussian ($\sigma = 45$ frames @ 60 Hz), and finally normalize each vector to lie on a hypersphere.
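A sketch of this latent path generation is given below; the code is our own illustration, and the hypersphere radius $\sqrt{512}$ is an assumption chosen to match the typical norm of an $\mathcal{N}(0,1)$ sample, as the text does not specify the radius.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_latent_path(n_frames, dim=512, sigma_frames=45, seed=0):
    """Random latent codes per frame, blurred over time, renormalized onto a hypersphere."""
    rng = np.random.RandomState(seed)
    latents = rng.randn(n_frames, dim)                 # one N(0,1) latent per frame
    # Blur the latents across time with a Gaussian (sigma measured in frames, e.g. 45 @ 60 Hz).
    latents = gaussian_filter1d(latents, sigma=sigma_frames, axis=0, mode='wrap')
    # Normalize each vector to lie on a hypersphere (radius sqrt(dim) is our assumption).
    norms = np.linalg.norm(latents, axis=1, keepdims=True)
    return latents * (np.sqrt(dim) / norms)
```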
We trained the network on 8 Tesla V100 GPUs for 4 days, after which we no longer observed qualitative differences between the results of consecutive training iterations. Our implementation used an adaptive minibatch size depending on the current output resolution so that the available memory budget was optimally utilized.
In order to demonstrate that our contributions are largely orthogonal to the choice of a loss function, we also trained the same network using LSGAN loss instead of WGAN-GP loss. Figure 1 shows six examples of $1024^2$ images produced using our method using LSGAN. Further details of this setup are given in Appendix B.
Figure 6 shows a purely visual comparison between our solution and earlier results in LSUN BEDROOM. Figure 7 gives selected examples from seven very different LSUN categories at $256^2$. A larger, non-curated set of results from all 30 LSUN categories is available in Appendix G, and the video demonstrates interpolations. We are not aware of earlier results in most of these categories, and while some categories work better than others, we feel that the overall quality is high.
The best inception scores for CIFAR10 (10 categories of $32\times 32$ RGB images) we are aware of are 7.90 for unsupervised and 8.87 for label conditioned setups (Grinblat et al., 2017). The large difference between the two numbers is primarily caused by “ghosts” that necessarily appear between classes in the unsupervised setting while label conditioning can remove many such transitions.
When all of our contributions are enabled, we get 8.80 in the unsupervised setting. Appendix D shows a representative set of generated images along with a more comprehensive list of results from earlier methods. The network and training setup were the same as for CELEBA, progression limited to $32 \times 32$ of course. The only customization was to the WGAN-GP’s regularization term