์ตœ๋Œ€ 1 ๋ถ„ ์†Œ์š”

image

ICLR 2022์—์„œ ๋ฐœํ‘œ๋œ iBOT ๐Ÿค–: Image BERT Pre-Training with Online Tokenizer์ž…๋‹ˆ๋‹ค.

In language model training, Masked Language Modeling (MLM) has established itself as a successful paradigm.

๋Œ€ํ‘œ์ ์œผ๋กœ BERT๊ฐ€ ๊ทธ๋Ÿฌํ–ˆ์ฃ .

์ด๋Ÿฌํ•œ ์„ฑ๊ณต ๊ธฐ๋ฐ˜์—๋Š” lingual tokenizer (ex. WordPiece, BPE, Unigram)๋ฅผ ํ™œ์šฉํ•ด input์„ semantically meaningful token์œผ๋กœ ๋งŒ๋“ค์–ด์ฃผ๋Š” ๊ฒƒ์ด ์ค‘์š”ํ–ˆ์Šต๋‹ˆ๋‹ค.

ํ•˜์ง€๋งŒ Visual semantics๋Š” ligual semantics๊ณผ ๋‹ฌ๋ฆฌ image์˜ ์—ฐ์†์ ์ธ ํŠน์„ฑ์œผ๋กœ ์ธํ•ด ์‰ฝ๊ฒŒ ๋ฝ‘์•„๋‚ด๊ธฐ๊ฐ€ ์–ด๋ ต์Šต๋‹ˆ๋‹ค.

์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด BEIT : Pre-Training of Image Transformer์—์„œ๋Š” DALL-E์˜ pre-trained VAE๋ฅผ visual tokenizer๋กœ ํ™œ์šฉํ–ˆ์œผ๋‚˜ ์ด๋กœ ์ธํ•ด multi-stage training pipeline์ด ๋ถˆ๊ฐ€ํ”ผํ–ˆ๊ณ  ๋˜ํ•œ tokenizer๊ฐ€ high-level semantics์„ ์žก์•„๋‚ด๋Š”๋ฐ ์–ด๋ ค์›€์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค.

๊ทธ๋ ‡๊ธฐ์— ์ด ๋…ผ๋ฌธ์—์„œ ์ €์ž๋Š” Vision transformer๋ฅผ ์ž˜ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด Online tokenizer์™€ Knowlege distillation์„ ํ†ตํ•ด ์ƒˆ๋กœ์šด Masked Image Modeling (MIM) framework๋ฅผ ์ œ์‹œํ•˜์˜€์Šต๋‹ˆ๋‹ค.

์•„๋ž˜์˜ ์œ ํŠœ๋ธŒ ๋™์˜์ƒ์œผ๋กœ ๋…ผ๋ฌธ์— ๋Œ€ํ•œ ์„ค๋ช…์„ ์ง„ํ–‰ํ•˜์˜€์œผ๋‹ˆ ํ•œ๋ฒˆ ๋ด์ฃผ์‹œ๋ฉด ๊ฐ์‚ฌํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

๋Œ“๊ธ€๋‚จ๊ธฐ๊ธฐ