๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ

๋”ฅ๋Ÿฌ๋‹

Vision Transformer๋กœ CIFAR 10 ํ•™์Šตํ•˜๊ธฐ

โœ๐Ÿป EXP  Vision Transformer๋กœ  CIFAR 10 ํ•™์Šตํ•˜๊ธฐ  [Korean] 

ViT ๊ฒฐ๋ก  (TL;DR)
๐Ÿ”– MNIST ๋Š” ํ•™์Šต์ด ์•„์ฃผ ์‰ฝ๋‹ค. 
๐Ÿ”– CIFAR 10 ์„ CrossEntropy๋กœ Scratch ํ•™์Šต์€ ์–ด๋ ต๋‹ค. 
๐Ÿ”– Pretrain ๋œ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜๋ฉด 1 epoch ๋งŒ์— ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค. 

 

์ด ์‹คํ—˜์„ ์ง„ํ–‰ํ•˜๊ธฐ ์ „์— ๋ชจ๋ธ์— ๋Œ€ํ•ด์„œ ํ•œ ๊ฐ€์ง€ ๋ฏฟ์Œ์ด ์žˆ์—ˆ๋‹ค.

ํ•™์Šต๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ Loss๋ฅผ ์ค„์ด๋Š” ๊ฒƒ์€ Validation Loss๋ฅผ ์–ด๋Š์ •๋„ ์ค„์ธ๋‹ค.

"Decreasing the training loss takes care of a large portion of the validation loss."

๊ทธ๋Ÿฌ๋‚˜ ๊ทธ๋ ‡์ง€ ์•Š์€ ๋ชจ๋ธ์ด ์žˆ์Œ์„ ์•Œ๊ฒŒ ๋˜์—ˆ๋‹ค. 

 

โœ๐ŸปPost Structure 
1. ViT ์„ค๋ช… 
2. MNIST ํ•™์Šต
3. CIFAR 10 ํ•™์Šต 
4. Pretrained -> CIFAR 10 ํ•™์Šต

๋จผ์ € ์ด์•ผ๊ธฐ๋Š” ViT๋ฅผ ํ•™์Šต์‹œํ‚ค๋Š”๋ฐ์„œ ์‹œ์ž‘ํ•œ๋‹ค. ViT๋Š” Transformer ์ธ์ฝ”๋” ๋ธ”๋ก์„ ์—ฌ๋Ÿฌ ๊ฐœ ์Œ“์•„์˜ฌ๋ฆฐ ๊ตฌ์กฐ๋กœ, ๊ฐ ๋ธ”๋ก์€ Multi-head Attention๊ณผ MLP ๊ทธ๋ฆฌ๊ณ  Layer Normalization์œผ๋กœ ์ด๋ฃจ์–ด์ ธ์žˆ๋‹ค. ๋ชจ๋ธ์€ input sequence์— ๋Œ€ํ•ด์„œ ์–ดํ…์…˜ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•™์Šต์„ ์ง„ํ–‰ํ•œ๋‹ค. [๊ทธ๋ฆผ์ฐธ์กฐ] 

์ด๋Ÿฌํ•œ ๊ตฌ์กฐ๋กœ๋ถ€ํ„ฐ ์˜ค๋Š” ViT์˜ ๋‘ ๊ฐ€์ง€ ํŠน์ง•์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. 

  1. ๋” ์ด์ƒ Translation Equivariance ํ•˜์ง€ ์•Š๋Š”๋‹คโŒ. ์„œ๋กœ ๋‹ค๋ฅธ ํŒจ์น˜๋“ค์€ Positional Encoding์œผ๋กœ ์œ„์น˜ ์ •๋ณด๊ฐ€ ์ถ”๊ฐ€๋˜์–ด ์žˆ๋‹ค. (CNN์—์„œ๋Š” Weight๊ฐ€ ๊ณฑํ•ด์งˆ ๋•Œ ์œ„์น˜์— ๋Œ€ํ•œ ์ •๋ณด๊ฐ€ ์—†๋‹ค.) 
  2. CNN์€ Localํ•œ Neighbor ์ •๋ณด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•™์Šตํ•˜๋Š” Local Receptive Field ์ธ ๋ฐ˜๋ฉด, ViT๋Š” Attention ์„ ์‚ฌ์šฉํ•˜๋Š” Global Receptive Field ๋ฅผ ๊ฐ€์ง„๋‹ค. 

๋…ผ๋ฌธ ์ฐธ์กฐ : Transformers in Vision: A Survey (https://arxiv.org/abs/2101.01169)

๊ทธ๋ฆผ ์ถœ์ฒ˜ : ๋‚ด๋ธ”๋กœ๊ทธ https://fxnnxc.github.io/blog/2022/exp_20/

 

CNN์€ inductive bias๊ฐ€ ์‹ฌํ•ด์„œ ํ•™์Šต์ด ์‰ฝ์ง€๋งŒ ์ž์œ ๋„๊ฐ€ ๋‚ฎ๋‹ค. ๋ฐ˜๋ฉด์— ViT๋Š” Attention์„ ๋ชจ๋“  ํŒจ์น˜์— ๋Œ€ํ•ด์„œ ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์— ํ•™์Šต์˜ ์ž์œ ๋„๊ฐ€ ๋†’๋‹ค. ์ž์œ ๋„๊ฐ€ ๋†’์€ ๋งŒํผ, Classification์—์„œ๋Š” CNN์— ๋น„ํ•ด์„œ ๋” ๋งŽ์€ ์ƒ˜ํ”Œ์ด ํ•„์š”ํ•˜๋‹ค. ์ด๋Ÿฌํ•œ ํŠน์ง•์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๋‘ ๊ฐœ ๋ชจ๋ธ์„ ๊ฒฐํ•ฉํ•œ ๋ชจ๋ธ์„ ์ œ์•ˆํ•˜๊ธฐ๋„ ํ•˜์˜€๋Š”๋ฐ, ์—ฌ๊ธฐ์„œ๋Š” ๋…ผ์™ธ๋กœ ํ•˜๊ณ  Pureํ•œ ViT ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•˜์˜€๋‹ค. 

๋…ผ๋ฌธ ์ฐธ์กฐ :  Convolution Vision Transformer (https://arxiv.org/abs/2103.15808)


๐ŸŒŠ Story 1: MNIST Training  ํ•™์Šต

์ผ๋‹จ ๊ธฐ๋ณธ์ ์œผ๋กœ ViT์„ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•ด์„œ MNIST ๋กœ ํ•™์Šตํ•ด๋ดค๋‹ค. ๋ชจ๋ธ ๊ตฌ์กฐ๋Š” Pytorch ์—์„œ ๊ตฌํ˜„ํ•œ vit_16 ๋ชจ๋ธ ๊ตฌ์กฐ๋ฅผ ๋”ฐ๋ž๋‹ค.  ๊ทธ๋ƒฅ ํ•™์Šตํ•˜๋ฉด ์‹ฌ์‹ฌํ•˜๋‹ˆ, Layer ๊ฐœ์ˆ˜๋ฅผ ๋‹ค์–‘ํ•˜๊ฒŒ ์„ค์ •ํ•ด๋ดค๋‹ค. 

์‹คํ—˜ ๊ฒฐ๊ณผ, Training Loss๋Š” ์ž˜ ์ค„์–ด๋“ค์—ˆ๊ณ , Validation Loss๋„ ๋Œ€๋ถ€๋ถ„์˜ 3๊ฐœ ์ด์ƒ์˜ ๋ ˆ์ด์–ด์—์„œ๋Š” 99% ์ด์ƒ์˜ ์ ์ˆ˜๐Ÿฆธ๐Ÿป‍โ™‚๏ธ๋ฅผ ์–ป์—ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ํ•˜๋‚˜์˜ ๋ ˆ์ด์–ด๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ, ์ œ๋Œ€๋กœ ํ•™์Šตํ•˜์ง€ ๋ชปํ•˜์˜€๋‹ค. ํ•™์Šต์— ์‚ฌ์šฉ๋œ hyperparameter๋“ค์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. 

 

image_size: 56, patch_size: 4, batch_size: 32, learning_rate: 1e-4
training for 200 epochs; StepLR with gamma 0.5 every 100 epochs
hidden_dim: 128, dropout: 0.5
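Put together, the setup looks roughly like this, built on torchvision's VisionTransformer class (num_heads and mlp_dim were not listed above, so the values here are assumptions):

```python
import torch
from torch import nn, optim
from torchvision.models.vision_transformer import VisionTransformer

model = VisionTransformer(
    image_size=56, patch_size=4,
    num_layers=3,                 # varied per run: 1, 3, 6, ...
    num_heads=4,                  # assumed; must divide hidden_dim
    hidden_dim=128, mlp_dim=256,  # mlp_dim assumed
    dropout=0.5, num_classes=10,
)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)
criterion = nn.CrossEntropyLoss()
# note: MNIST is single-channel, so resize to 56x56 and repeat to 3 channels first
```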

 

(์‹คํ—˜ ๊ฒฐ๋ก ) ์ด ์‹คํ—˜์œผ๋กœ ViT๋ฅผ ํ•™์Šตํ•˜๋Š”๋ฐ ์žˆ์–ด์„œ, ๋ ˆ์ด์–ด ์ˆ˜๋ฅผ ๋งŽ์ด ํ•  ํ•„์š”๊ฐ€ ์—†์œผ๋ฉฐ,  3๊ฐœ ์ •๋„ ์ด์ƒ์˜ ๋ชจ๋ธ์ด ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ ์ œ๋Œ€๋กœ ํ•™์Šตํ•œ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ๊ฒŒ๋˜์—ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด์•ผ๊ธฐ๋Š” CIFAR 10 ์„ ํ•™์Šตํ•˜๋Š”๋ฐ๋กœ ๋„˜์–ด๊ฐ„๋‹ค. 

 

์ฐธ๊ณ ๋กœ ViT ๊ณ„์—ด์€ 12๊ฐœ ๋ ˆ์ด์–ด๋ฅผ ๊ฐ€์ง€๋Š” ๊ฒƒ์ด ๋ณดํ†ต์ด๋‹ค.  
๋” ํฐ ๋ชจ๋ธ์€ ๋ธ”๋ก๋„ ๋งŽ๊ณ  ์ฐจ์›๋„ ์ปค์„œ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ํ›จ์”ฌ ๋งŽ๋‹ค. 

๋…ผ๋ฌธ ์ฐธ์กฐ :  An Empirical Study of Training Self-Supervised Vision Transformers

https://openaccess.thecvf.com/content/ICCV2021/html/Chen_An_Empirical_Study_of_Training_Self-Supervised_Vision_Transformers_ICCV_2021_paper.html

 

 


๐ŸŒŠ Story 2 : Cifar 10 ํ•™์Šต ์‹คํŒจ ์Šคํ† ๋ฆฌ 

 

MNIST์™€ ๋™์ผํ•œ ๋ชจ๋ธ๊ตฌ์กฐ๋กœ ์ด๋ฏธ์ง€๋ฅผ ํ•™์Šต์‹œ์ผฐ๋‹ค. ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๋‹ค๋ฅธ ๋ ˆ์ด์–ด ์ˆ˜์— ๋Œ€ํ•ด์„œ ๊ฒ€์ฆํ•˜์˜€๊ณ  ๊ฒฐ๊ณผ๋Š” ๋ณ„๋กœ ์ข‹์ง€ ๋ชปํ•˜๋‹ค.

 

๋จผ์ € Training Loss๋ฅผ ๋ณด์ž. ํ—ค๋“œ ๊ฐœ์ˆ˜๋ฅผ ๋Š˜๋ฆด์ˆ˜๋ก Loss๋Š” ๋น ๋ฅด๊ฒŒ ์ค„์–ด๋“ ๋‹ค. ์ด๋Š” ๋ ˆ์ด์–ด๊ฐ€ ๋งŽ์•„์ง์œผ๋กœ ์ธํ•ด์„œ ํ‘œํ˜„๊ณต๊ฐ„์ด ๋” ๋งŽ์€ ์ •๋ณด๋ฅผ ๋‹ด๊ธฐ ๋•Œ๋ฌธ์œผ๋กœ ๋ณด์ธ๋‹ค. Loss ๊ฐ€ 0์— ์ ์  ๊ฐ€๊นŒ์›Œ์ง€๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ํ•™์Šต๋œ ๋ชจ๋ธ๋กœ Validation์„ ์ง„ํ–‰ํ•˜๋Š” ๊ฒฝ์šฐ, ์•„์ฃผ ์—‰๋ง์ธ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด์— ๋Œ€ํ•ด์„œ ํ•œ ๊ฐ€์ง€ ๊ฐ€์ •์€ Attention์— ๋Œ€ํ•ด์„œ ๋ฐฐ์šฐ๊ธฐ ์ „์—, Training ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ Cheating ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์› ์ง€๋งŒ, ๊ทธ๊ฒŒ Validation์—๋Š” ํ†ตํ•˜์ง€ ์•Š์•˜๊ธฐ ๋•Œ๋ฌธ์œผ๋กœ ๋ณด์ธ๋‹ค. Attention์€ ๊ธฐ๋ณธ์ ์œผ๋กœ ์ด๋ฏธ์ง€์˜ ์ค‘์š”ํ•œ ๋ถ€๋ถ„๋“ค์— ๋Œ€ํ•œ ๊ฒฐ๋‹จ์„ ๋‚ด๋ ค์•ผ ํ•˜๋Š”๋ฐ, ํ•ด๋‹น ํ‘œํ˜„ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๊ธฐ ์ „์— ์ข€๋” ์‰ฌ์šด ๋ฐฉ์‹์„ ๋ฐฐ์šฐ๋Š” ๊ฒƒ์ด๋‹ค. ์ •ํ™•๋„๊ฐ€ 0.7 ์„ ๋„˜์ง€๋ชปํ•˜๊ณ , ๋” ์ด์ƒ ์ฆ๊ฐ€ํ•˜์ง€ ์•Š๋Š”๋‹ค. ๋ฌผ๋ก Regularization์ด๋‚˜ Drop-out, Self-supervised learning ๊ณผ ๊ฐ™์€ ์—ฌ๋Ÿฌ๊ฐ€์ง€ ํŠธ๋ฆญ์„ ์ ์šฉํ•  ์ˆ˜ ์žˆ์ง€๋งŒ, ์—ฌ๊ธฐ์„œ๋Š” ๋‹จ์ˆœํžˆ Supervised loss ๋งŒ ๊ณ ๋ คํ•˜์˜€๋‹ค. 

 

์ด์ œ ๋ฐ์ดํ„ฐ์˜ ๊ตฌ์กฐ๋ฅผ ๋”์šฑ ์ž˜ ํŒŒ์•…ํ•˜๊ณ  ์žˆ๋Š” ๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ : ๐Ÿฆธ๐Ÿป‍โ™‚๏ธPRETRAIN ๋ชจ๋ธ๐Ÿฆธ๐Ÿป‍โ™‚๏ธ๋กœ ํ•™์Šต์‹œ์ผœ๋ณด์ž.

๋˜ํ•œ ์ด์ œ ๋ ˆ์ด์–ด ๊ฐœ์ˆ˜๋ฅผ Base model์ธ 12 โ…ซ๊ฐœ๋กœ ๋งž์ถ”์ž. 

 


๐ŸŒŠ Story 3 : Pretrained Model --> CIFAR 10

 

Pretrained Model ์„ ๊ตฌํ•  ์ˆ˜ ์žˆ๋˜ ๊ณณ์€ torchvision์ด๋‹ค.
๋ฌผ๋ก  HuggingFace์—์„œ๋„ ์ œ๊ณตํ•˜๋Š”๋ฐ, torch๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ์œ ์ €๋กœ์„œ ๋” ์‚ฌ์šฉํ•˜๊ธฐ ๊ฐ„๋‹จํ•œ ๊ฑด torchvision ์ด์—ˆ๋‹ค. 

 

 

torchvision models ์—๋Š” ๊ธฐ๋ณธ์ ์œผ๋กœ 3 ์ข…๋ฅ˜์˜ Vision Transformer Weights ๊ฐ€ ์กด์žฌํ•œ๋‹ค. 

 

  1. SWAG๋กœ  ImageNet 1K - Finetuning๐Ÿ‹๏ธ‍โ™€๏ธ
  2. SWAG๋กœ ImageNet1K (Frozen) + ImageNet 1K Finetuning๐Ÿ‹๏ธ‍โ™€๏ธ
  3. ImageNet 1K Scratch ํ•™์Šต ๐Ÿ‹๏ธ‍โ™€๏ธ

๋…ผ๋ฌธ ์ฐธ์กฐ : SWAG(Revisiting Weakly Supervised Pre-Training of Visual Perception Models)

https://arxiv.org/abs/2201.08371
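In code, the three weights correspond to the ViT_B_16_Weights enum (assuming torchvision >= 0.13):

```python
from torchvision.models import vit_b_16, ViT_B_16_Weights

# 1. SWAG pretraining, fine-tuned end-to-end on ImageNet-1K
w_e2e = ViT_B_16_Weights.IMAGENET1K_SWAG_E2E_V1
# 2. SWAG pretraining frozen, only a linear head trained on ImageNet-1K
w_linear = ViT_B_16_Weights.IMAGENET1K_SWAG_LINEAR_V1
# 3. Supervised training from scratch on ImageNet-1K
w_scratch = ViT_B_16_Weights.IMAGENET1K_V1

model = vit_b_16(weights=w_e2e)  # 12 encoder layers, the Base configuration
```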

 

 

์„ธ ๊ฐ€์ง€ ๋ชจ๋ธ ๋ชจ๋‘ ๋ฐ์ดํ„ฐ ImageNet์— ๋Œ€ํ•ด์„œ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค. ์ด ๋ชจ๋ธ๋“ค์„ ๊ฐ€์ ธ์™€์„œ CIFAR 10 ์„ ํ•™์Šต์‹œ์ผฐ๋‹ค. ์ฐธ๊ณ ๋กœ ์„ธ ๋ชจ๋ธ์€ ๊ฐ๊ฐ Resize, CropSize, Patch Size๊ฐ€ ๋‹ค๋ฅด๋‹ค. ๋”ฐ๋ผ์„œ ๋ชจ๋‘ ๋‹ค๋ฅธ ๊ฐ’์„ ์‚ฌ์šฉํ•ด์ค˜์•ผ ํ•œ๋‹ค. 

 

batch_size ๋Š” finetuning ์‹œ ํ•„์ž๊ฐ€ ์‚ฌ์šฉํ•œ ์‚ฌ์ด์ฆˆ์ด๋‹ค.

 

Running the Fine-Tuning 🎯

Accuracy ๊ฐ€ ์ œ์ผ ๋†’์•˜๋˜ ๊ฒƒ์€ 1 epoch ์˜ ๐ŸŒŸ94.1%๐ŸŒŸ ์˜€๋‹ค.

 

๊ฒฐ๊ณผ๋ฅผ ๋ณด๋ฉด, ์Šคํฌ๋ž˜์น˜์— ๋น„ํ•ด์„œ ์ดˆ๋ฐ˜์— accuracy๊ฐ€ ํ™• ๋†’์•„์ง€๋Š” ๊ฒƒ์„ ๊ด€์ฐฐํ•  ์ˆ˜ ์žˆ๋‹ค. ๋˜ํ•œ Frozen ๋˜์—ˆ๋˜ Linear ๋ชจ๋ธ์€ ์‚ฌ์‹ค์ƒ ImageNet-1K์— ๋Œ€ํ•ด์„œ ํŠœ๋‹์‹œํ‚จ๊ฒŒ ์•„๋‹ˆ๋ฏ€๋กœ Classification์— ๋‹ค์‹œ ํ•™์Šตํ•˜๋Š”๋ฐ ์‹œ๊ฐ„์ด ์†Œ์š”๋˜์—ˆ๋‹ค. ํ™•์‹คํ•œ ์ ์€ Pretrain์ด ์Šคํฌ๋ž˜์น˜๋ณด๋‹ค ๋‚˜์œผ๋ฉฐ, Classification์— Bias ๋œ ๋ชจ๋ธ์€ ๋” ์ข‹๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. 

 

์œ„์˜ ์‹คํ—˜๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ training loss๋ฅผ ์ค„์ด๋Š” ๊ฒƒ์€ ์‰ฌ์šฐ๋‚˜ Validation ์˜ ์„ฑ๋Šฅ์„ ๋†’์ด๋Š” ๊ฒƒ์€ ์‰ฝ์ง€ ์•Š์•˜๋‹ค. ์ถ”๊ฐ€์ ์œผ๋กœ ํŠœ๋‹์„ ํ•˜๋ฉด ํ•™์Šต์„ ๋”์šฑ ์ž˜ ์‹œํ‚ฌ ์ˆ˜ ์žˆ์œผ๋‚˜, Validation์— overfitting๋  ๊ฑฐ ๊ฐ™์•„์„œ, ๊ฐ€์žฅ ๊ฐ„๋‹จํ•œ Adam Optimizer + learning rate 0.0001 ์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ๋‹จ์ˆœํžˆ CrossEntropy Loss๋กœ ์„ฑ๋Šฅ์„ ๋”์šฑ ๋†’์ด๋Š”๋ฐ๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ๋Š” ๊ฒƒ ๊ฐ™๋‹ค. 

FINAL ๊ฒฐ๋ก  ๐Ÿ”š

 

์‹คํ—˜์„ ๊ธฐํšํ•˜๋ฉด์„œ ์ œ์ผ ๊ถ๊ธˆํ–ˆ๋˜ ๊ฒƒ์€ "Transformer ๋ฅผ Supervised Learning์œผ๋กœ ํ•™์Šตํ•˜๋ฉด ์–ด๋–ป๊ฒŒ ๋˜๋Š”๊ฐ€" ์˜€๋‹ค.  ์‹คํ—˜์ ์œผ๋กœ ๋ณด์ธ ๊ฒƒ์€ ์Šคํฌ๋ž˜์น˜๋กœ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด ์•ˆ์ข‹๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ๋ชจ๋ธ์€ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ ์˜๋ฏธ์žˆ๋Š” ์ •๋ณด๋ฅผ ์ถ”์ถœํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, ๋‹จ์ˆœํžˆ ํ•™์Šต๋ฐ์ดํ„ฐ์— Overfitting ํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด์—ˆ๊ณ , ์ด๋Š” Self-Supervised Learning์œผ๋กœ๋ถ€ํ„ฐ ๋ฐฐ์šด ๋ชจ๋ธ์ด ๊ฐ€์ง€๋Š” Global Receptive๋ฅผ ๊ฐ–์ง€ ๋ชปํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋‚˜ํƒ€๋‚ธ๋‹ค.

 

๋ชจ๋ธ์„ ํ•™์Šตํ•  ๋•Œ ๋ณดํ†ต Trainining Loss ๊ฐ€ ์ค„์–ด๋“ค๋ฉด Validation๋„ ์ค„์–ด๋“ค ๊ฒƒ์ด๋ผ๊ณ  ์˜ˆ์ƒํ•˜์ง€๋งŒ ViT ๋ชจ๋ธ์€ ๊ทธ๋Ÿฌํ•œ ๊ธฐ๋Œ€๋ฅผ ๊ณผ๊ฐํ•˜๊ฒŒ ๋ถ€์ˆ˜๊ณ  ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ Cheating ํ•˜๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์คฌ๋‹ค. ๊ฒฐ๊ตญ Pretraining ์ž์ฒด๊ฐ€ ๊ต‰์žฅํžˆ ์ค‘์š”ํ•˜๋ฉฐ, ์ด๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ Space์— ๋Œ€ํ•ด์„œ ์˜๋ฏธ์žˆ๋Š” ๊ณต๊ฐ„์ด ๋”ฐ๋กœ ์กด์žฌํ•˜๋Š” ๊ฒƒ์„ ๋‚˜ํƒ€๋‚ธ๋‹ค. ์ฆ‰ Initialization / Supervised / Self-Supervised ์— ๋Œ€ํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ณต๊ฐ„์ด ๋‹ค๋ฅด๊ธฐ ๋•Œ๋ฌธ์—, ๊ฐ ๊ณต๊ฐ„์—์„œ ์‹œ์ž‘ํ•ด์„œ Downstream Task์— ๋Œ€ํ•œ ํ•™์Šต์„ ์ง„ํ–‰ํ•œ๋‹ค๋ฉด ๋ชจ๋ธ์€ ๊ทธ ๊ณต๊ฐ„์˜ ๊ทผ์ฒ˜์—์„œ Loss๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ์„ ์ฐพ๋Š”๋‹ค. ์ด๋Ÿฌํ•œ ๊ด€์ฐฐ์€ ๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋“ค์ด ๋ฐ์ดํ„ฐ๋ฅผ ์ตœ๋Œ€ํ•œ ์•Œ๊ณ ์žˆ๋Š”์ง€ ๊ฒ€์ฆํ•˜๋Š” ๊ณผ์ •๊ณผ ์ ˆ์ฐจ๋ฅผ ํ•„์š”๋กœ ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

 

 ์š”์•ฝ :  ๐ŸŸจ ์Šคํฌ๋ž˜์น˜ < ๐ŸŸจ Self-Supervised < ๐ŸŸจ Classification Pretraining