Selfie : Self-supervised Pretraining for Image Embedding

번역하자면 이미지 임베딩을 위한 자기지도 전처리? 정도겠네요 얼마전부터 구상했던 모델이 있는데 왠지 비슷한 느낌이… 한번 봐야겠네요 비슷하긴한데 조금 틀리긴 한거같애 이거보니 빨리 연구를 해야겠 ㅠㅠ

영어가 딸려 느리지만 번역을 해봅시다 ㅠㅠ

Abstract

We introduce a pretraning technique called Selfie

셀피라 불리는 전처리 테크닉을 소개하겠다

which stands for SELF-supervised Image Embedding

자기주도학습 이미지 임베딩을 위한

Selfie generalizes the concept of masked language modeling to continuous data, such as image

셀피는 마스크된 언어 모델링을 연속된 데이터로 만드는 개념을 일반화한다, 이미지와 에서 처럼

Given masked-out patches in an input image,

입력된 이미지에 masked-out(가려지는) 패치가 주어지고

our method learns to select the correct patch,

우리의 방식은 올바른 패치가 되도록 학습시킨다

among other “distractor” patches sampled from the same image,

다른 distractor 패치들과 같은 이미지에서 샘플링을 한다

to fill in the masked location.

가려진 부분이 채워진

This classification objective sidesteps the need for prediction exact pixel values of the target patches

이 분류 목적은 패치들의 정확한 픽셀값을 예측할 필요를 없애준다

The pretraining architecture of Selfie

셀피의 전처리 아키텍쳐는

includes a network of convolutional blocks to process patches

패치를 처리하기 위한 컨볼루션 블록 을 포함한다

followed by an attention pooling network to summarize the content of unmasked patches

마스크되지 않으 컨텐츠를 요약한 어텐션 풀링 네트워크에 이어

before predicting masked ones

마스크된 것을 예측하기전에

During finetuning. we reuse the convolutional weights found by pretraning.

좋은 튜닝을 위해잘 전처리된 가중치를 재사용한다

We evaluate Selfie on three benchmarks (CIFAR-10, ImageNet32, ImageNet224)

세가지 벤치마크에서 셀피를 평가한다

with varying amounts of labeled data, from 5% to 100% of the traning sets

라벨링된 데이터를 5%에서 100%로 학습 데이터를 바꿔가며

Our pretraining method provides consistent improvements to ResNet-50

우리의 전처리 작업은 일관된 향상을 제공한다

across all settings compared to the standard supervised traning of the same network.

같은 모델을 쓴 일반적인 학습모델의 모든 세팅과 비교해서

Notably, on ImageNet 224 with 60 examples per class(5%),

특히 224x224 에 클래스당 60개의 이미지를 사용했을때

our method improves the mean accuracy of ResNet-50

우리의 방법은 ResNet-50의 평균 정확도를 향상시켰다

from 35.6% to 46.7%, an improvement of 11.1 points in absolute accuracy

35.6에서 46.7로 11.1 의 절대적 정확도가 향상됐다

Our pretraining method also improves ResNet-50 traning stability,

이 전처리 기법은 ResNey의 학습 안정도도 향상시켰는데

especially on low data regime,

특히 적은 데이터 에서

by significantly lowering the standard deviation of test accuracies across datasets

현저하게 낮아지는 테스트 셋의 정확도의 표준편차를

Introduction

A weakness of neural networks is that they often require a large amount of labeled data to perform well

NN의 약점은 잘 작동하기 위해선 많은 라벨된 데이터가 필요하다는 것이다

Although self-supervised/unsupervised representation learning was attempted to address this weakness

이 약점을 해결하는것을 시도하는 자기지도 혹은 비지도라는 대표적인 학습 방법이 있긴 하지만

most practical neural network systems today are trained with supervised learning

현재 대부분 실용적인 NN 시스템은 지도학습으로 학습된 것이다

Making use of unlabeled data through unsupervised representation learning

이 라벨없는 데이터를 사용하여 비지도학습으로

to improve data-efficiency of neural networks remains an open challenge for the field

NN에서 데이터의 능률을 향상시키는것은 현재에도 열린과제로 남아있다

Recentely, language model pretraning has been suggested as a method for unsupervised representation learning in NLP

요새 언어모델 전처리는 NLP의 비지도학습을 사용한다

Most notably Devlin et al. made an observation

가장 주목할 부분은 데블린에 의하면 관측을 만들때

that bidirectional representations from input sentences are better than left-to-right or right-to-left representations alone.

양방향표현의 입력이 단방향보다 성능이 좋다는 것이다

Based on this observation, they proposed the concept of masked language modeling

이 관점에 기반하여 그들은 마스킹된 언어 모델의 개념을 제안했다

by masking out words in a context to learn representations for text, also known as BERT.

텍스트 표현을 학습하기위한 문맥에서 단어를 마스킹하는. BERT라고 불리는

This is crucially achieved by replacing the LSTM architecture with the Transformer feedforward architecture

이것은 LSTM 을 Transformer feedforward식으로 바꿔서 적용한다

The feedforward nature of the architecture makes BERT more ready to be applied to images

아키텍쳐의 feedforward 특성은 BERT를 이미지에 적용하기 더 쉽게 해준다

Yet BERT still cannot be used for images because images are continuous objects unlike disrete words in sentences.

아직 BERT는 이미지엔 적용하지 못하고있는데 이미지는 문장의 분리된 단어처럼 연속된 오브젝트가 아니기 때문이다

We hypothesize that bridging this last gap is key

마지막 갭을 채워줄 방법이 중요하다는 가설을 세웠다

to translating the success of language model pretraining to the image domain

언어 전처리를 이미지 도메인으로 성공적으로 변경하는것을

In this paper, we propose a pretaining method called Selfie, which stands for SELF-supervised Image Emedding

이 논문에선 우리는 Selfie라 불리는 전처리 모델을 제안한다 이미지 임베딩 자기지도학습을 하기 위한

Selfie generalizes BERT to continuous spaces, such as images

셀피는 BERT를 연속적인 공간으로 일반화한다 , 이미지 에서 처럼

In Selfie, we propose to continue to use classification loss

셀피에서 우리는 분류loss를 지속하는것을 제안한다

because it is less sensitive to small changes in the image(such as translation of an edge)

그 이유는 그것은 이미지에 작은변화에 덜 민감하기 때문이다 (선이 바뀌는것처럼)

compared to regression loss which is more sensitive to small perturbations

작은 변동에 더 민감한 회귀 손실과 비교해서

Similar to BERT, we mask out a few patches in an image and try to reconstruct the original image

BERT 처럼 우린 몇몇의 패치로 원본 이미지를 가려서 새로운 이미지를 구성한다

To enable the classification loss,

분류손실을 사용하러면

we sample “distractor” patches from the same image,

같은 이미지에서 패치한 distractor 샘플을 만들고

and ask the model to classify the right patch to fill in a target masked location

모델에게 질문한다 해당 마스킹된 부분를 채우는 올바른 패치를 분류하는지

Experiments show that Selfie works well across many datasets, especially when the datasets have a small number of labeled examples.

실험은 셀피가 많은 데이터에서 잘 작동하는것을 보여준다특히 라벨링된 데이터가 적은 데이터 셋에서

On CIFAR-10, ImageNet32, and ImageNet224, we observe consistent accuracy gains as we vary the amount of labeled data from 5% to 100% of the typical traning sets.

라벨링된 전형적인 트레이닝 셋에서 5%부터 100%까지의 다른 데이터 셋에서 정확도를 관찰했다

The gain tends to be biggist when the labeled set is small.

이득은 라벨링 데이터가 적을때 극대화 되는 경향이 있었다

For example, on ImageNet 224 with only 60 labeled examples per class,

예를들면 이미지넷224 는 오직 클래스당 60개의 샘플만 있다

pretraining using our method improves the mean accuracy of ResNet-50 by 11.1%, going from 35.6% to 46.7%.

우리의 전처리는 ResNet-50에서 평균 정확도를 11.1% 향상 시켰다

additional analysis on ImageNet224 provides evidence that the benefit of self-supervised pretraining

추가적 ImageNet224를 분석할때 제공된 자기지도 전처리의 이점은

significantly takes off when there is at least an order of magnitude (10X) more unlabeled data than labeled data.

현저하게 상승하기 시작했다 레이블이 지정된 대상보다 지정되지 않은대상의 크기가 10배이상에 도달할때

In addition to improving the averaged accuracy, pretraining ResNet-50 on unlabeled data also stabilizes its training on the supervised task.

평균 정확도의 향상 외에 추가로, 라벨링안된 전처리는 지도학습을 안정시키기도 했다

We observe this by computing the standard deviaion of the final test accuracy across 5 different runs for all experiments

모든 실험에 대해 5가지 다른 실행에 거쳐 최종 테스트 정확도의 표준편차를 계산해서 이것을 관찰하였다

On CIFAR-10 with 400 examples per class, 클래스당 400개의 샘플을 가진 CIFAR-10 에서는

the standard deviation of the final accuracy reduces 3 times comparing to training with the original initialization method. 기본 메소드를 활용한것에 비교하여 최종 정확도의 표준편차를 3배 정도 감소시켰다

Similarly, on ImageNet rescaled to 32 our pretraining process gives an 8X reduction on the test accuracy variability when training on 5% of the full training set 비슷하게 32로 스케일된 이미지넷에서도 전체의 5%의 데이터를 사용했을때 우리의 전처리는 8배의 테스트정확도의 변동을 감소시켰다

Method

An overview of Selfie is shown in Figure 1.

전체적인 셀피의 오버뷰는 피규어 1에서 보여준다

Similar to previous works in unsupervised/self-supervised representation learning, 앞의 비지도/자기지도 학습 작업들과 비슷하게

our method also has two stages 우리의 메소드는 두가지 부분으로 나눠진다

(1)Pretrain the model on unlabeled data and then (2) Finetune on the target supervised task.

첫번째는 라벨링 안된데이터로 전처리를 하고 두번째는 해당 지도학습을 잘 튜닝을 한다

To make it easy to understand,

이해를 쉽게하게 하기 위해

let us first focus on the fine-tuning stage

튜닝 스테이지부터 먼저 초점을 맞춘다

In this paper, our goal is to improve ResNet-50,

이 논문에서 우리의 목적은 ResNet-50을 향상 시키는거고

so we will pretrain the first three blocks of this architecture

그러기위해 아키텍쳐의 첫 세부분을 전처리한다

Let us call this network P.

이 네트워크를 P 라고 부른다

The pretraining stage is therefore created for training this P network in an unsupervised fashion

전처리과정은 이 비지도 형태의 P 네트워크를 훈련하기위해 만들어진다

Now let us focus on the pretrining stage

그럼 전처리 스테이지를 봅시다

In the pretraining stage, P, a patch processing network,

전처리 과정에서 P 라고 불리는 패치 네트워크는

will be applied to small patches in an image to produce one feature vector per patch for both the encoder and decoder

패치마다 작은 피쳐백터를 생기게 하기 위해 이미지 안에 작은 패치에 적용된다 이는 인코더와 디코더 양쪽을 위해서

In the encoder, the feature vectors ard pooled together by an attention pooling network A to produce a single vector u.

인코더에서 이 피쳐벡터들은 벡터 u를 만들기 위해 어텐션 풀링 네트워크 A 에 의해 모아진다

In the decoder, no pooling takes place;

디코더에서는 풀링이 일어나지 않는다

instead the feature vectors are sent directly to the computation loss to form an unsupervised classification task.

대신 피쳐벡터들은 비지도 분류작업 구조를 위한 손실을 계산하는데 직접 보내진다

The representations from the encoder and decoder networks are jointly trained

이 인코더와 디코더 네트워크의 표현은 동시에 학습된다

during pretraining to predict what patch is being masked out at a particular location among other distracting patches

다른 많은 패치들 중 가려진 부분의 패치를 예측하기 위해 전처리를 하는동안

In our implementation,

이 이행 과정에서

to make sure the distracting patches are hard,

산만한 패치는 확인이 어렵고

we sample them from the same input image and also mask them out in the input image

우리는 같은 입력 이미지로부터 샘플링하고 또한 마스킹한다

Next we will describe in detail the interaction between the encoder and decoder networks during pretraining as well as different design choices

이제 전처리중 인코더와 디코더의 사이의 상호작용뿐만 아니라 다른 디자인을 선택하는것을 기술할 것이다

###2.1 Pretraining Details

The main idea is to use a part of the input image to predict the rest of the image during this phase

메인 아이디어는 이미지의 부분을 나머지 부분을 예측을 위해 사용하는것이다 이 페이즈에서

To do so, we first sample different square patches from the input.

그러기 위해 우린 첫번째로 입력을 다른 정사각형의 패치로 샘플링한다

These patches are then routed into the encoder and decoder networks depending on whether they are randomized to be masked out or not

이 패치들은 랜덤하게 마스킹 되었는지 여부에 따라 인코더와 디코더에 들어간다

Let us take Figure 1 as an example, where Patch1,2,5,6,7,9 are sent into the encoder, whereas Patch3,4,8 are sent into the decoder

피규어 1을보면 125679는 인코더에 들어간데 반해 348은 디코더에 들어간다

All the patches are processed by the same patch processing network P

모든 패치는 같은 P 패치 과정에 의해 처리된다

On the encoder side, the output vectors produced by P are routed into the attention pooling network to summarize there representations into a single vector u

인코더쪽에서는 P에 의해 생산된 출력 벡터는 어텐션 풀링 네트워크에 들어간다 그들의 표현의 요약인 싱글벡터 u를 만들기 위해서

On the decoder side, P create output vectors h1,2,3.

P는 벡터 h1,2,3 을 출력하고

The decoder then queries the encoder by adding to the output vector u the location embedding of a patch,

디코더는 인코더에게 질문한다 합쳐진 u패치의 위치가 어디인지

아옥 이거 수정좀 도저히 직역이안된다 결론은 u 에 디코더에서 랜덤하게 고른 패치의 위치 임베딩을 더한 v를 만든다는것

selected at random among the patches in the decoder to create a vector v

v를 만들기 위해 디코더에 있는 패치중에서 랜덤하게 선택한다

The vector v is then used in a dot product to compute the similarity between v and each h.

v는 각각의 h와의 유사도를 계산하기 위해 dot 연산을 사용한다

Having seen the dot products between v and h’s, the decoder has to decide which patch is most relevant to fill in the chosen location

v와 h의 곱을 보고 디코더는 선택된 위치를 채우는데 가장 적합한 패치를 결정한다

The cross entropy loss is applied for this classification task,

크로스 엔트로피 손실이 이 분류작업에 사용되고

whereas the encoder and decoder are trained jointly with gradients back-propagated from this loss

인코더와 디코더는 이 손실에 역전파에 의해 동시에 학습된다

During this pretaining precess, the encoder network learns to compress the information in the input image to a vector u such that when seeded by a location of a missing patch, it can recover that patch accurately.

이 전처리 과정에서 인코더는 입력된 이미지의 정보를 u 로 압축하는것을 배우고 이때 빠진 패치에 위치가 부여되고 해당 패치를 정확하게 복구한다

To perform this task successfully, the network needs to understand the global content of the full image,

이 작업이 성공적으로 실행되려면 네트워크는 이미지의 전체를 이해할 필요가 있다

as well as the local content of each individual patch and their relative relationship

각각의 패치들과 그들의 상대적인 관계 처럼

This ability proves to be useful in the downstream task of reconizing and classifying

이 능력은 인식과 분류의 다운스트림 작업에서 유용한것을 증명한다

###Patch sampling method

On small images of size 32x32, we use a patch size of 8, while on largerimiges of size 224, we use a patch size of 32.

32사이즈엔 8사이즈의 패치를적용, 224사이즈에는 32사이즈의 패치를 적용했다

The patch size is intentionally selected to divide the image evenly, so that the image can be cut into a grid as illustrated in Figure 1,

패치사이즈는 이미지를 고르게 나눌수 있도록 의도적으로 선택되었고 그래서 피규어 1에 그려진 그리드처럼 잘릴수 있다

To add more randomness to the position of the image patches,

이미지의 위치에 더많은 무작위성을 더하기 위해

we perform zeropadding of 4pixels on images with size 32 and then random crop the image to its original size

32사이즈 이미지에 4픽셀의 제로페딩을 실행했고 원래 이미지사이즈로 무작위로 잘라냈다

Patch processing network

In this work, we focus on improving ResNet-50 on various benchmarks by pretraining it on unlabeled data.

이 작업에서 다양한 벤치마크에서 ResNet-50을 향상시키는데 주목했다 라벨링안된 데이터를 전처리하는 방식으로

For this reason, we use ResNet-50 as the patch processing network P.

그 이유에서 ResNet-50 을 패치처리 네트워크 P 로 사용한다

As described before, only the first three bolocks of ResNet-50 is used.

앞에 기술한 대로 ResNet-50 의 앞의 3 블록이 사용된다

Since the goal of P is reduce any image patch into a single feature vector,

P의 목적이 어떤 이미지를 싱글 벡터로 압축하는것이기에

we therefore perform average pooling across the spatial dimensions of the output of ResNet-36

따라서 ResNet-36 의 출력의 공간차원을 가로질러 평균풀링을 적용했다

Efficient implementation of mask prediction

For a more efficient use of computation, the decoder is implemented to predict multiple correct patches for multiple locations at the same time.

계산을 효율적으로 하기 위해 디코더는 여러위치를 동시에 하기위해 다수의 올바른 예측을 할수 있게 만든다

For example, in the example above, besides finding the right patch for location4, the decoder also tries to find the right patch for location3 as well as location8

예를들면 위의 예에서 4에 적합한 패치를 찾는것 외에 디코더는 또한 3과 8에서도 올바른 패치를 찾도록 시도한다

This way, we reuse three times as much computation from the encoder-decoder architecture.

이 방법으로 인코더디코더 아키텍쳐보다 3배의 많은 계산을 재사용한다

Our method is, therefore, analogous to solving a jigsaw puzzle where a few patches are knocked out from the image and are required to be put back to their original locations.

게다가 직소퍼즐을 푸는것과 유사하다 작은 조각을 이미지로부터 떼어내고 원래의 자리로 돌려놔야 한다

This procedure is demonstrated in Figure 2. 이 프로시져는 피규어 2에 시연되어있다

2.2 Attention Pooling

In this section, we describe in detail the attention pooling network A introduced in Section 2.1 and the way positional embeddings are built for images in our work

여기서 2.1에서 소개했던 어텐션풀링 네트워크 A에 대해 설명하고 이미지 위치 임베딩을 구축 하는방법을 설명한다

###Transformer as polling operation

We make use of Transformer layers to perform pooling

Given a set of input vectors ${ h_1, h_2,\cdots, h_n }$ produced by applying the patch processing network P on different patches, we want to pool them into a single vector u to represent the entire image.

다른 패치들의 P 처리를 적용해 만들어진 인풋벡터의 집합이 주어지고 이것을 해당 이미지를 대표할 싱글벡터 u 로 만들려 한다

There are multiple choices at this stage including max pooling or average pooling

여기엔 최대, 평균 풀링을 포함한 많은 선택권이 있다

Here, we consider these choices special cases of the attention operation (where the softmax has a temperature approaching zero or infinity respectively) and let the network learn to pool by itself.

이러한 것들은 어텐션 오퍼레이션의 특수한 케이스라 생각하고 ( 소프트맥스가 0 또는 inf 에접근하는 점수를 가짐 ) 네트워크가 그것을 스스로 배우게 한다

To do this, we learn a vector $u_0$ with the same dimension with $h$’s and feed them together through the Transformer layers

이것을 위해 벡터 h와 같은 차원을 가진 u 를 배우고 그것흘 함께 트랜스포머 레이어에 넣는다

\[u, h^{output}_1, h^{output}_2, \cdots, h^{output}_n, = \mathrm{TransformerLayers}(u_0, h_1, h_2, cdots, h_n)\]

The output $u$ corresponding to input $u_0$ is the pooling result.

출력 u는 u0 의 풀링 결과에 해당한다

Attention block

Each self-attention block follows the design in BERT where self-attention layer

BERT의 self-attention later의 디자인을 따른 self-attention 블록 각각은

is followed with two fully connected layers that sequentially project the input vector to an intermediate size and back to the original hidden size 입력벡터를 중간크기로 했다가 원래의 히든사이즈로 돌리는 두개의 FC layer 가 뒤따른다

The only non-linearity used is GeLU and is applied at the intermediate layer. 비선형성은 오직 GeLU 하나만 중간레이어에 적용된다

We perform dropout with rate 0.1 on the output, followed by a residual connection connecting from the block’s input and finally layer normalization

0.1의 드랍아웃을 출력에 적용하고 뒤이어 블록의 인풋으로부터의 잔차연결이 연결되고 마지막으로 레이어 노말라이제이션이 적용된다

Positional embeddings

For images of size 32, we learn a positional embedding vector for each of the 16 patches of size 8.

32 이미지에서는 각각 8사이즈의 16패치들의 위치 임베딩 벡터를 학습시켰다

Images of size 224, on the other hand, are divided into a grid of 7 patches of size 32.

224에서는 다르게 32크기의 7패치 격자로 나뉘었다

Since there are significantly more positions in this case, we decompose each positional embedding into two different components: row and column embeddings.

이 케이스에선 현저하게 많은 포지션이 있기 때문에 포지션 임베딩을 행, 열 두가지 컴포넌트로 분해한다.

The resulting embedding is the sum of these two components.

결과 임베딩은 두 컴포넌트의 합계이다

For example, instead of learning 49 positional embeddings, we only need to learn 7+7=14 positional embeddings.

예를들면 49개의 포지션 임베딩을 학습하는 대신 7+7인 14개의 포지션만 학습하면 된다

This greatly reduces the number of parameters and helps with regularzing the model.

이것은 파라미터의 수를 크게 줄여주고 모델을 일반화시키는데 도움을 준다

Finetunning Details

As mentioned above, in this phase, the first three convolution blocks of ResNet-50 is initialized from the pretrained patch processing network.

기술한바와 같이 이 과정에서는 ResNet-50의 3개의 컨볼루션 블럭을 전처리된 네트워크로 초기화 한다

The last concolution block of ResNet-50 is initialized by the standard initialization method

ResNet-50의 마지막 컨볼류션 블럭은 기본 초기화 기법을 쓴다

ResNet-50 is then appied on the full image and finetuned end-to-end

ResNet-50은 전체 이미지에 적용하고 End-to-end 에도 잘 튜닝된다

3 Experiments and Results

In the following sections, we investigate the performance of our proposed pretraning method, Selfie on standard image datasets, such as CIFAR-10 and ImageNet.

이 섹션에서 우리가 제안한 전처리 모델 셀피가 일반적인 이미지셋인 CIFAR-10이나 ImageNet 에서의 퍼포먼스를 조사했다

To simulate the scenario when we have much more unlabeled data than labeled data,

시나리오를 말하자면 라벨링된 데이터에 비해 많은 라벨없는 데이터가 있고

we sample small fractions of these datasets and use them as labeled datasets,

라벨링 데이터 셋에서 작은 일부분을 샘플링했고

while the whole dataset is used as unlabeled data for the pretraining task.

전체 데이터셋은 전처리를위한 라벨없는 데이터로 사용했다

3.1 Datasets

We consider three different datasets : CIFAR-10, ImageNet resized to 32, ImageNet original 224.

세가지 데이터셋을 고려했다 시파10과 이미지넷을 32로 리사이즈, 그리고 이미지넷 오리지널인 224

For each of these datasets, we simulate a scenario where an additional amount of unlabeled data is available besides the labeled data used for the original supervised task.

각각의 데이터셋에 지도학습에 사용되지 않은 데이터를 라벨없이 사용한다

For that purpose, we create four different subsets of the supervised traning data with approximately 5%, 10%, 20%, and 100% of the total number of training examples.

우리는 네가지 다른 지도학습용 서브셋을 만들었고 훈련용 샘플의 약 5,10,20,100 퍼센트를 사용했다.

On CIFAR-10, we replace the 10% subset with one of 4000 training examples(8%), as this setting is used in.

씨파10에서는 10% 대신 8%인 4000개의 훈련 샘플을 사용했다

In all cases, the whole training set is used for pretraining (50K images for CIFAR-10, and 1.2M images for ImageNet)

모든 케이스에서 전체 훈련셋은 전처리를 사용하였다

3.2 Experimental setup

###Model architecture

We reuse all settings for ResNet convolution blocks from ResNet-50v2 including hidden sizes and initialization

ResNet 컨볼루션 블럭의 히든 사이즈 및 초기화를 포함해 모든 세팅을 재사용한다

Batch normalization is performed at the beginning of each residual block.

배치노말은 각각의 residual 블럭의 시작에 실행한다

For self-attention layers, we apply dropout on the attention weights and befor each residual connection with a drop rate of 10%

self-attention 레이어에는 가중치와 residual 연결 전에 드랍아웃을 10% 적용한다

The sizes of all of our models are chosen such that each architecture has roughly 25M parameters and 50 layers,

모든 모델의 사이즈는 각각의 아키텍쳐가 25백만개의 파라미터와 50레이어를 갖도록 선택되고

the same size and depth of a standard ResNet-50.

이는 일반적인 ResNet-50의 사이즈 및 깊이와 같

For attention pooling, three attention blocks are added with a hidden size of 1024,

어탠센 풀링에서는 3개의 어텐션 블록의 히든유닛수는 1024개

intermediate size 640 and 32 attention heads on top of the patch processing network P

중간 사이즈는 640 그리고 32 개의 attention heads 가 P의 꼭대기에 올라간다

###Model training

Both pretraining and finetuning tasks are trained using Momentum Optimizer with Nesterov coefficient of 0.9.

전처리와 파인튜닝 양쪽작업은 네스트로프 비율이 0.9인 모멘텀을 사용했다

We use a batch size of 512 for CIFAR-10 and 1024 for ImageNeet

배치사이즈는 씨파10은 512, 이미지넷은 1024개를 사용

Learning rate is sceduled to decay in a cosine shape with a warm up phase of 100 steps and the maximum learning rate is tuned in the range of [0.01,0.02,0.05,0.1,0.2,0.4]

학습율은 100페이즈마다 코사인 감쇠를 적용하고 최대 학습율은 0.01부터 0.4까지 했다

We do not use any extra regularization besides an L2 weight decay of magnitude 0.0001

0.0010크기의 L2 가중치 감쇠를 제외한 다른 일반화 기법은 사용하지 않았다

The full training is done in 120,000 steps.

전체학습은 12만 스텝을 사용했다

Furthermore, as described in Section 2.1, we divide the image into non-overlapping square patches of size 8 or 32 during pretraining and sample a fraction $p$ of these patches to predict the remaining

게다가 섹션 2.1에서 설명했듯이 전처리 하는동안 사이즈가 8 혹은 32인 겹치지않는 사각 패치로 나누고 p만큼 예측을 위해 남길부분을 샘플링한다

We try for two values of $p$: 75% or 50% and tune it as a hyper-parameter

이 p값을 75%와 50% 두가지를 시도했고 하이퍼파라미터처럼 튜닝했다

Reporting results.

For each reported experiment, we first tune its hyper-parameters by using 10% of traning data as validation set and train the neural net on the remaining 90%

각 실험 에서 첫번째 튜닝 은 10%의 검증셋을 남기고 남은 90%로 학습을 했다

Once we obtain the best hyper-parameter setting,

가장 좋은 하이퍼파라미터 세팅을 얻고

the neural network is retrained on 100% training data 5 times with different random seeds.

신경망을 5번 다른 랜덤시드를 사용해서 재학습 시켰다

We report the mean and starndard deviation values of these five runs.

그리고 이 다섯번의 표준편차의 평균을 보고한다

3.3 Result

We report the accuracies with and without pretraining across different labeled dataset size in Table 1.

전처리한것과 전처리하지 않은 것의 비교표가 Table 1에 있다

As can be seen from the table, Selife yeilds consistent improvements in test accuracy across all three benchmarks with varying amounts of labeled data

이 표에서 볼수 있는건 셀피는 테스트셋에서 일관된 산출량의 향상을 보여준다 다양한 라벨 데이터의 양을 가진 3가지 벤치마크 전부에 걸쳐

Notably, on ImageNet 224, a gain of 11.1% in absolute accuracy is achieved when we use only 5% of the labeled data.

특히 이미지넷224에서 11%의 정확도를 달성한다 5%의 라벨 데이터만 사용했을때

We find the pretrained models usually converge to a higher training loss, but generalizes significantly better than model with random initialization on test set.

우린 가끔 전처리된 모델이 학습손실에서 높게 수렴하는 것을 발견했다 그러나 일반화에선 현저하게 랜덤 초기화를 한 테스트셋보다 좋게 나타났다

This highlights the strong effect of regularization of our proposed pretraining procedure

주목할점은 우리가 제안한 전처리의 일반화에 대한 강한 영향이다

An example is shown in Figure3 when training on 10% subset of Imagenet224

Figure3에서 보여준 샘플은 10%의 데이터를 학습한 이미지넷224 이다

Beside the gain in mean accuracy, training stability is also enhanced as evidenced by the reduction in standard deviation in almost all experiments

평균 정확도의 향상과 함께 학습 안정도는 모든 실험에서 거의 표준편차가 감소한걸로 입증된다

When the unlabeled dataset is the same with the labeled dataset, the gain becomes small as expected

라벨없는 데이터셋이 라벨된 데이터셋과 같을때는 예상대로 이득이 작아진다

###Baseline Comparison

We want to emphasize that our ResNet baselines are very strong compared to those in (He et al. 2016a)

우린 ResNet 베이스라인이 위 논문에 나온 것들보다 매우 강하길 원한다

Particularly, on CIFAR-10, our ResNet with pure supervised leaning on 100% labeled data achieves 95.5% in accuracy,

특히 시파10에서 순수하게 지도학습만 100프로 사용한 경우 95.5%의 정확도가 나왓고

which is better than the accuracy 94.8% achieved by DenseNet and close to 95.6% obtained by Wide-ResNet

94.8%를 달성한 DenseNet 보다 좋고 95.6% 를 획득한 Wide-ResNet에 가깝다

Likewise, on ImageNet224, our baseline reaches 76.9% in accuracy, which is on par with the result reported in (He et al, 2016a), and supasses the 76.2% accuracy of DenseNet.

마찬가지로 이미지넷224에서 우리 베이스라인은 76.9%으 정확도를 가지고, 허씨 논문 에서 나온 결과와 동등하고 DenseNet보다 훌륭하다.

Our pretrained models further improve on our strong baselines

우리의 전처리 모델은 이 강력한 기본모델을 더 향상시킨다

Contrast to Other Works

Notice that our classification accuracy of 77.0% on ImageNet224 is also significantly better than previously reported results in unsupervised representation learning

주목할것은 우리의 분류 정확도가 이미지넷224에서 77% 나온것 또한 현저하게 좋다 이전 비지도학습의 결과에들에 비하여

For example, in a comprehensive study by (Kolesnikov et al., 2019), the best accuracy on ImageNet of all pretraining methods is around 55.2%, which is well below the accuracy of our models

예를들면 코레느니코프씨의 Comprehensive Study 가 있는데 모든 훈련방법에 있어 최고의 정확도가 는 55%이며 이것은 우리의 모델에 비해 매우 떨어진다.

Similarly, the best accuracy reported by Context Autoencoders and Contrastive Predictive Coding are 56.5% and 48.7% respectively

비슷하게 가장 정확도가 높은 Context Autoencoders 와 Contrastive Predictive Coding 에서도 56, 48%의 정확도가 나왔다

We suspect that such poor performance is perhaps due to the fact that past works did not finetune into the representations learned by unsupervised learning

그런 낮은 퍼포먼스의 원인은 아마도 과거의 작업들이 비지도학습으로 학습한 표현으로 튜닝되지 않아서 그럴것이다

Concurrent to our work, there are also other attempts at using unlabeled data in semi-supervised learning settings

우리의 작업을 포함해 많은 사람들은 또한 라벨없는 데이터를사용한 준지도학습에 다른 시도를하고있다.

Henaff et al. showed the effectiveness of pretraining in low-data regime using cross-entropy loss with negative samples similar to our loss

헤나프씨는 우리처럼 적은 데이터 체재에서 부정적 샘플을 이용한 크로스 엔트로피를 사용한 전처리의 효과를 보여줬다

However, their result are not comparable to ours because they employed a much lager network, ResNet-171, compared to the ResNet-50 architecture that we use through out this work

그러나 우리의 작업과 비교할수없는데 ResNet-171이라는 우리가 사용한 ResNet-50 보다 더 큰 네트워크를 사용했기 때문이다

Consistency training with label propagation has also achieved remarkable result.

Label Propagation 을 사용한 일관성잇는 학습은 놀라운 결과를 달성했다

For example, the recent Unsupervised Data Augmentation (Xie et al., 2019) reported 94.7% accuracy on the 8% subset of CIFAR-10.

예를들면 Unsupervised Data Augmentation는 씨파10의 8%만 사용하고도 94.7%의 정확도를 달성했다

We expect that ur self-supervised pretraining method can be combined with label propagation to provide additional gains, as shown in (Zhai et al., 2019)

우리는 자기지도 전처리기법이 라벨 전파와 결합하면 더 이득을 제공할것이라 예상하고 그것을 보여줬다 (Zhai et al., 2019)

###Finetuning on ResNet-36 + attention pooling

In the previous expriments, we finetune ResNet-50, which is essentially ResNet-36 and one concolution block on top, dropping the attention pooling network used in pretraining.

이전 실험에서 우린 ResNet-50으로 튜닝하고 본질적으로 ResNet-36 과 한개의 컨볼류션 블럭이 탑에 올라가고 전처리시 어텐션 풀링 네트워크에 사용되었다

We also explore finetunning on ResNet36 + attention pooling and find that it slightly outperforms finetunning on Resnet-50 in some cases. More in Section 4.2

우리는 또한 ResNet36 + 어텐션풀링에서 좋은 튜닝을 검색하고 어떤케이스에선 ResNet-50 튜닝보다 우위에 있을때도 있음을 발견했다. 4.2 섹션에서 보자

###Finetuning Sensitivity and Mismatch to Pretraining

Despite the encouraging results, we found that there are difficulties in tansferring pretrained models across tasks such as from ImageNet to CIFAR

좋은 결과에도 불구하고 우린 ImageNet 에서 CIFAR로 전처리 가중치를 이동하기 어려운 문제를 발견했다

For the 100% subset of Imagenet 224, additinal tuning of the pretraining phase using a development set is needed to achieve the result reported in Table 1.

이미지넷 224를 100퍼 사용할때 dev셋을 사용할때 전처리에 추가적인 튜닝 이 필요하다 테이블 1의 결과를 적용하기 위해선

There is also a slight mismatch between our pretraining and finetuning settings : 여기엔 또한 작은 불일치가 있다 우리의 전처리과정과 fineguning 세팅에는

during pretraining we process image patches independently whereas for finetuning the model sees an image as a whole

전처리동안에는 이미지패치를 독립적으로 처리하는 반면에 파이튜닝시엔 이미지 전체를 한번에 본다

we hope to address these concerns in subsequent works

우리는 이 문제를 뒤에 작업에서 잡아주길 바란다

#4 Analysis

##4.1 Pretraining benefits more when there is less labeled data

In this section, we conduct further expriments to better understand our method, Selfie, especially how it performs as we decrease the amount of labeled data

이 섹션에서 우리의 메소드를 잘이해하기위해 더 많은 실험을 수행했다 특히 라벨링된 데이터의 수를 줄이는 방법

To do so, we evaluate test accuracy when finetunning on 2%, 5%, 10%, 20%, and 100% subset of ImageNet224, as well as the accuracy with purely superbised training at each of the five marks

그러러면 우리는 이미지넷224의 2%부터 100퍼까지를 사용해서 테스트 정확도를 평가했다 뿐만 아니라 순수하게 지도학습을 사용한 각각의 정확도도

similar to previous sections, we average results across five different runs for a more stable assessment,

이전 섹션과 비슷하게 좀더 안정적인 평가를 위해서 5가지 다른 실험을 평균내었다

As shown if Figure4, the ResNet mean accuracy improves drastically when there is at least an order of magnitude more unlabeled image than the labeled set(i.e., finetuning on the 10% subset)

피규어4에서 보여주듯이 레즈넷의 평균 정확도는 대폭 향상된다 라벨링되지않은 이미지의 규모가 라벨링된 이미지의 규모보다 클때

With less unlabeled data, the gain quickly diminishes. At the 20% mark there is still a slight improvement of 1.4% mean accuracy, while at the 100% mark the positive gain becomes minimal, 0.1%

적은 라벨안된 데이터에서는, 효과는 매우 미미했다 20%를 사용했을땐 여전히 적은 향상인 1.4의 향상이 있었지만 100프로를 사용했을땐 0.1 로 줄어들었다

##4.2 Self-attention as the last layer helps finetuning performance

As mentioned in Setion 3.3, we explore training ResNet-36 + attention pooling on CIFAR-10 and ImageNet 224 on two settings : limited labeled data and full access to the labeled set.

섹션 3.3에 언급된것처럼 ResNet-36 과 어텐션 풀링을 씨파10과 이미지넷224 두 세팅에서 탐구했다. 제한된 라벨링 데이타와 전체 라벨링 셋에서

The architectures of the two networks ard shown in Figure 5. Experimental results on these two architectures with and without pretraining are reported in Table 2.

이 두가지 네트워크는 피규어5에서 보여진다 전처리 유무 두가지 아키텍쳐의 실험적인 결과 는 테이블 2에 나와있다

With pretraining on unlabeled data, ResNet-36+attention pooling outperforms ResNet-50 on both datasets with limited data.

전처리할땐 ResNet36+attention pooling 이 ResNet50보다 양쪽셋에서 전부 뛰어나다

On the full training set, this hybrid convolution-attention architecture gives 0.5% gain on ImageNet224

전체 학습셋에선 이 하이브리드 아키텍쳐는 이미지넷에서 0.5%의 이득을 주었다

These show great promise for this hybrid architecture which we plan to further explore in future work

이것은 큰 장래성을 보여준다 이 하이브리드 아키텍쳐를 앞으로 계속 연구할 계획이라는것을

#5. Related Work

###supervised representation learning for text

Much of the success in unsupervised representation learning is in NLP

NLP에서 비지도 학습은 많은 성공을 했다

First, using language models to learn embeddings for words is commonplace in many NLP applications

첫번쨰로 단어임베딩을 위한 언어모델의 사용은 많은 NLP 어플에 흔하게 들어있다

Building on this success, similar methods are then proposed for sentence and paragraph representations

이 성공을 바탕으로 에 비슷한 방법들이 문장과 단락표현을 위해 위해 제안되었다

Recent successful methods however focus on the use of language models or ‘masked’ language models as pretraining objectives

최근 성공한 방법으로는 전처리목적의 언어모델 혹은 가려진 언어모델 사용이다

A general principle to all of these successful methods is the idea of context prediction : given some adjacent data and their locations, predict the missing words

이 성공한 모든 법칙의 일반적인 원칙은 맥락예측의 아이디어이다 그 뒤치로부터 몇개의 근접한 데이터 데이터를 주고 빈칸을 맞추는

###Unsupervised representation learning for images

Recent successful methods in unsupervised representation learning for images can be divided into four categories

현재 이미지를 위한 성공한 비지도학습 메소드는 4가지 카테고리로 나뉜다

1) prediction rotation angle from an original image

원본 이미지를 얼마나 회전했는지 예측하기

2) prediction if a perturbed image belongs to the same category with an unperturbed image

동요하는 이미지와 안정적인 이미지를 같은 카테고리로 예측하기

3) predicting relative locations of patches, solving Jigsaw puzzles

패치들의 관련 위치 예측하기

4) and impainting

그리고 그리기

Their success, however, is limited to small datasets or small settings, some resort to expensive jointing training to surpass their purely supervised counterpart

이 성공은 하지만 작은 데이터셋과 적은 세팅으로 제한되었다 그들의 순수한 지도학습보다 비싼 몇몇 학습이 조인트 되어야 하는것에 의존한다

On the challenging benchmark ImageNet, our method is the first to report gain with and without additional unlabeled data as shown in Table1.

이이미지넷에서의 도전에서 우리의 메소드는 라벨이없는 추가 데이터가 있든없든 상관없이 이득을 보여주는 첫번째 레포트다 테이블 1에서 보여주듯

Selfie is also closely related to denosing autoencoders, where various kinds of noise are applied to the input and the model is required to reconstruct the clean input.

셀피는 또한 노이즈를 제거하는 인코딩과 밀접한 관련이 있다 많은 종류의 노이즈가 적용된 많은 종류의 노이즈가 적용된 입력과 모델은 깔끔한 인풋으로 재건축이 필요하다

The main difference between our method and denoising autoencoders is how the reconstruction step is done : our method focuses only on the missing patches, and tries to select the right patch among other distraction patches.

우리의 메소드와 디노이즈 오토인코더의 가장 큰 차이는 어떻게 재구성스텝을 하는가이다 우리의 메소드는 오직 빈칸에만 초점을 맞추고 다른 혼란된 패치중에 옳은 패치를 찾는것을 시도한다

Our method is also related to Contrastive Predictive Coding where negative sampling was also used to classify continuous objects.

우리의 메소드는 Constrastive Predictive Coding 과도 관련이 있다 연속적인 대상을 분류 하기 위해 네거티브 샘플링을 사용하는것이에서

Semi-supervised learning

Semi-superbised learning is another branch of representation learning methods that take advantage of the exitence of labeled data.

준지도학습은 라벨된 데이터에서 탈출구의 이점을 가지는 또다른 기법이다

Unlike pure unsupervised representation learning, semi-supervised learning does not need a separate fine-tuning stage to improve accuracy, which is more common in unsupervised representation learning.

순수한 비지도학습과는 다르게 준지도학습은 정확도 향상을위한 분리된 미세조정 단계를 필요로 하지 않는다, 흔한 비지도학습에서 처럼

Successful recent semi-supervised learning methods for deep learning are based on consistency training.

최근 성공한 준지도학습 일관적인 학습을 기본으로 한다

#6. Conclusion

We introduce Selfie, a self-supervised pretraining technique that generalizes the concept of masked language modeling to continuous data, such as images

우린 셀피를 소게한다. 마스크된 언어모델을 이미지같은 연속된 데이터로 일반화하는 자기지도 전처리 기술을

Given a masked-out position of a square patch in the input image, our method learns to select the target masked patches from negative samples obtained from the same image.

가려진 포지션을 주고 우리의 메소드는 학습한다 선택하는것을 해당 가려진 패치에 동일한 이미지에서 얻은 네거티브 샘플링된 로 부터

This classification objective therefore sidesteps the need for predicting the exact pixel values of the target patches

이 분류의 객관적으로 대상 패치의 정확한 픽셀값을 예측할 필요를 회피한다

Experiments show the Selfie achieves significant gains when labeled set is small compared to the unlabeld set.

실험은 셀피의 적용이 언라벨 데이터가 라벨데이터보다 클때 중대한 이득을 보여준다

Besides the gain in mean accuracy across different runs, the standard deviation of results is also reduced thanks to a better initialization from our pretraining method.

게다가 여러가지 다른 실험의 평균 정확도에서 다른 초기화보다 더 결과의 표준편차 또한 감소했다

Our analysis demonstrates the revived potential of unsupervised pretraining over supervised learning and that a hybrid convolutrion-attention architecture shows promise.

우리의 분석은 설명한다 지도학습 전 비지도 선행학습의 잠재력을 되살리는것과 컨볼루션 어텐션 기법이 보여준 장래성을.