How well can VLMs perform deductive reasoning?

Hello, this is your blog host, currently down with a summer cold.

Today I'd like to introduce a paper on VLMs (Vision and Language Models).

 

The paper is titled How Far Are We from Intelligent Visual Deductive Reasoning? It was released by Apple in March 2024 and presented at the ICLR 2024 AGI Workshop.

 

Let's go over some background first.

What is multi-modal?

The attached picture is a bit lame(?), but multi-modal means handling several senses or kinds of data at the same time.

For example, people can understand what they see and what they hear at the same time. Multi-modal technology is about giving computers that ability.

A model that processes only natural language, only images, or only audio is called a uni-modal model.

A model that combines these different modalities to learn, reason, and generate is called a multi-modal model.

 

For the rest of this post, I'll treat Vision Language Model = Multimodal Model.

 

In the past, multimodal tasks were solved by processing each modality with its own model and then combining the vectors that came out as each model's final output.

The old multimodal approach (just 2-3 years ago)

 

But it has since been shown that pre-training and drastically scaling up a model's parameters improve performance. That is why people started building Large Multimodal Models (LMMs), and approaches that merge the modalities at the input rather than at the output have become common.

In these approaches, each modality is mapped into the same dimension to minimize heterogeneity, and the result is then fed into the model.
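To make the difference between the two styles concrete, here is a toy sketch. It is not taken from the paper or from any specific model: the dimensions, the random projection matrices, and the helper names (`linear`, `mean_pool`, `random_matrix`) are all made up for illustration, and real systems use learned encoders and projections.

```python
import random

def linear(vec, weights):
    """Multiply a vector by a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def random_matrix(rows, cols, rng):
    """Random stand-in for a learned projection matrix."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def mean_pool(vectors):
    """Average a list of equally sized vectors into one vector."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

rng = random.Random(0)
IMG_DIM, TXT_DIM, SHARED_DIM = 4, 3, 5  # toy sizes, chosen arbitrarily

# Stand-ins for encoder outputs: 2 image patches and 3 text tokens.
image_patches = [[rng.random() for _ in range(IMG_DIM)] for _ in range(2)]
text_tokens = [[rng.random() for _ in range(TXT_DIM)] for _ in range(3)]

# Older style: each modality is summarized separately,
# and only the final vectors are concatenated.
late_fused = mean_pool(image_patches) + mean_pool(text_tokens)

# Newer style: project every modality into one shared dimension,
# then hand a single mixed sequence to one model.
img_proj = random_matrix(SHARED_DIM, IMG_DIM, rng)
txt_proj = random_matrix(SHARED_DIM, TXT_DIM, rng)
early_sequence = ([linear(p, img_proj) for p in image_patches]
                  + [linear(t, txt_proj) for t in text_tokens])

print(len(late_fused))                              # 7 = IMG_DIM + TXT_DIM
print(len(early_sequence), len(early_sequence[0]))  # 5 tokens, each of size 5
```

In the second style, the model sees image and text tokens side by side in the same space, which is the "merge at the input" idea described above.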

 

Next is a timeline of multimodal models up to March 2024.

I forgot the source. I'll look it up and post it when I remember.

 

As you can tell from using the ChatGPT service, even just GPT-4o shows a big jump in the ability to solve multimodal tasks.

- Captioning (describing an image it is shown), multimodal world knowledge and commonsense (commonsense understanding across different kinds of data), VQA (answering questions about an input image), OCR (accurately extracting text from an image), chart and table understanding, and so on have all improved enormously compared to before.

 

Hey VLMs, can you solve this?

The folks at Apple now raise a question.

VLMs keep getting better, but... they say plenty of hard problems still remain.
One of them is "complex, multi-step relational and deductive reasoning."


The paper uses SOTA VLMs to point out a blind spot that had not been considered before, and suggests directions for future work.

<Key Point>
1. Build a benchmark and framework for analyzing the visual deductive reasoning performance of VLMs
2. Evaluate VLMs on three datasets (Mensa IQ Test, IntelligenceTest, RAVEN)
3. Check whether strategies that currently work for LLMs (few-shot learning, self-consistency) also work for VLMs
4. Compare which the model makes more effective use of: vision information or language information
5. Describe the limitations of VLMs so far and propose future directions

 

Datasets

Three datasets were used in the experiments.

1. Mensa Test: 35 questions; 1 is reserved as the one-shot example, so 34 are used for evaluation
2. IntelligenceTest (IT): covers verbal, pattern recognition, math, and more. The 66 RPM questions are used
3. RAVEN: 14,000 examples across 7 types; 20 per type are used, 140 in total
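The counts above are simple arithmetic, but here is a rough sketch of how a "20 per type" subset could be drawn reproducibly. The record layout (`id`, `type`), the type names, and the assumption that the 14,000 examples are split evenly across the 7 types are all hypothetical; how the authors actually sampled is not something this post covers.

```python
import random
from collections import Counter, defaultdict

def sample_per_type(items, per_type, seed=0):
    """Pick `per_type` items from each puzzle type, reproducibly."""
    by_type = defaultdict(list)
    for item in items:
        by_type[item["type"]].append(item)
    rng = random.Random(seed)
    subset = []
    for type_name in sorted(by_type):
        subset.extend(rng.sample(by_type[type_name], per_type))
    return subset

# Hypothetical stand-in for RAVEN: 14,000 records spread over 7 types.
raven = [{"id": i, "type": f"type_{i % 7}"} for i in range(14000)]
raven_subset = sample_per_type(raven, per_type=20)
print(len(raven_subset))  # 140

# Mensa: 35 questions minus the single one-shot example.
mensa_total, mensa_one_shot = 35, 1
print(mensa_total - mensa_one_shot)  # 34
```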

 

*What is RPM?

-----------------------------------------------------------------

RPM (Raven's Progressive Matrices) is a non-verbal test typically used to measure general human intelligence and abstract reasoning. (Wikipedia)

-----------------------------------------------------------------

 

From left to right: examples from the Mensa, IT, and RAVEN datasets.

 

Prompts & Models

As an example, here is just the zero-shot prompt for the Mensa Test dataset.

You can see a grid of 9 boxes, one of which is empty (marked as ?). You have to choose which of the 6 alternative shapes (A-F) should be placed in the empty box in order to complete the pattern that connects the shapes. Finally, provide your prediction as Answer: "X". {query image}

 

All the prompts used in the experiments are listed like this in the paper's appendix, so check there if you're curious.
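Since the prompt asks the model to finish with Answer: "X", scoring comes down to pulling that letter out of the response. Below is a minimal sketch of one way to do it. The regex and the helper names (`extract_answer`, `accuracy`) are my own assumptions, not the authors' evaluation code, and the sample responses are invented.

```python
import re

# Matches Answer: "C", Answer:"C", or Answer: C (letters A-F only).
ANSWER_PATTERN = re.compile(r'Answer:\s*["“]?([A-F])\b')

def extract_answer(response):
    """Return the last answer letter found in a model response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None

def accuracy(responses, gold):
    """Fraction of responses whose extracted letter matches the gold letter."""
    correct = sum(extract_answer(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)

responses = [
    'The shapes rotate clockwise, so Answer: "C"',
    "I cannot tell.",
    'Answer:"A"',
]
print([extract_answer(r) for r in responses])  # ['C', None, 'A']
print(accuracy(responses, ["C", "B", "D"]))    # 0.3333333333333333
```

A response with no parsable answer is simply counted as wrong here, which is one possible convention among several.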

 

The following models were used in the experiments.

- gpt-4-vision-preview
- Gemini-pro
- Qwen-VL-Max
- LLaVA-1.5-13B

 

Can VLMs solve problems like these well?

Now, shall we look at the results?

* Values are averages over 10 runs with different seeds
* On the IT test, human accuracy ranges from 30% to 93.4%
* On RAVEN, the average human accuracy is 84.67%
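To get a feel for what "averaged over 10 seeds" looks like next to pure guessing, here is a small simulation. It assumes the Mensa setup of 34 questions with 6 options (A-F); the function name `random_guess_accuracy` is hypothetical and nothing here reproduces the paper's actual numbers.

```python
import random
from statistics import mean, stdev

def random_guess_accuracy(num_questions, num_options, seed):
    """Accuracy of a guesser that picks uniformly among the options."""
    rng = random.Random(seed)
    gold = [rng.randrange(num_options) for _ in range(num_questions)]
    guesses = [rng.randrange(num_options) for _ in range(num_questions)]
    return sum(g == y for g, y in zip(guesses, gold)) / num_questions

# 34 Mensa questions with 6 options, repeated over 10 seeds.
runs = [random_guess_accuracy(34, 6, seed) for seed in range(10)]
print(f"mean={mean(runs):.3f} std={stdev(runs):.3f} chance={1 / 6:.3f}")
```

With only 34 questions, a single run can swing quite a bit, which is one reason averaging over several seeds is useful before comparing a model against chance.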

 

What is entropy?

-----------------------------------------------------------------

A measure of how uncertain the model is when it makes predictions.
The higher it is, the more uncertain the model (the answer changes every time you ask); the lower it is, the more deterministic (confident).

In Python, this looks like:

import numpy as np
# Example distribution over six answer options
probabilities = [0.90, 0.05, 0.02, 0.01, 0.01, 0.01]
entropy = -sum(p * np.log(p) for p in probabilities)  # Shannon entropy in nats

-----------------------------------------------------------------
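The snippet in the box starts from known probabilities. In practice you usually only have the answers a model gave over repeated runs, so one option is to estimate the distribution from answer counts first. The sketch below does that. Whether the paper computes its entropy exactly this way is an assumption on my part, and `answer_entropy` is a name I made up.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(answers)
    total = len(answers)
    probs = [c / total for c in counts.values()]
    return sum(p * math.log(1 / p) for p in probs)

consistent = ["A"] * 10  # same answer on every run
scattered = ["A", "B", "C", "D", "E", "F", "A", "B", "C", "D"]

print(answer_entropy(consistent))
print(answer_entropy(scattered))
print(answer_entropy(scattered) > answer_entropy(consistent))  # True
```

A model that always answers "A" gets zero entropy even if "A" is wrong, so low entropy only means the model is consistent, not that it is correct.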

 

The results of this experiment are as follows.

1. Current models stay at roughly random-guess level
2. They are confident in wrong answers (the authors attribute this to the uncertainty not being calibrated during training)
3. The models appear limited in their ability to perceive and describe visual patterns

 

Are standard LLM strategies also effective on visual deductive reasoning tasks?

* SC = Self-Consistency (a technique that runs Chain-of-Thought several times and picks the most consistent answer)
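The voting step of self-consistency is short enough to sketch. This assumes the final answer letter has already been extracted from each sampled chain of thought (with `None` for chains where no answer could be parsed); the function name and sample data are illustrative only.

```python
from collections import Counter

def self_consistency(sampled_answers):
    """Majority vote over final answers from several sampled reasoning chains."""
    votes = Counter(a for a in sampled_answers if a is not None)
    if not votes:
        return None  # no chain produced a usable answer
    answer, _ = votes.most_common(1)[0]
    return answer

samples = ["C", "A", "C", None, "C", "B"]
print(self_consistency(samples))  # C
```

Voting only helps when the correct answer is the most frequent one among the samples. If the model is consistently wrong, the vote just locks in the wrong answer.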

 

The results of this experiment are as follows.


1. Performance goes up slightly, but not enough to call it effective
2. If anything, the models state wrong answers with even more confidence
3. New approaches and strategies, different from the standard LLM ones, are needed

 

Which modality helps the most?

The experiment was run on GPT-4V.

The model was given examples containing a description, a rationale, and an answer, and then its performance was evaluated.

The results of this experiment are as follows.

1. Performance is highest when only the description is given
2. Adding the image actually lowers performance
3. Research on minimizing the heterogeneity between modalities is needed

 

 

I'll adjust the difficulty for you. Give it a try!

* Experiment run with GPT-4V

Performance improves when the elements are separated in the input (each pattern written out individually and in detail)

 

The authors used the dataset to build a new version split by difficulty (Easy, Medium, Hard).

Here they found two kinds of errors in total.

1. Compounding error: the model repeats an error made in an earlier pattern description, and the error gets amplified as it goes
2. Confounding error: the model gets confused between similar patterns and blends them, ending up describing a strange (new) pattern

 

The results of this experiment are as follows.

1. The models do well on ordinary images and easy reasoning tasks, but make many errors on things like describing abstract patterns
2. Separating each pattern and describing it in detail before feeding it to the model improves accuracy and reduces confounding errors
3. More refined datasets are needed

 

Let's analyze by difficulty

* Gen. Desc (CoT): solve the problem based on a description generated with CoT
* Oracle Desc: solve the problem based on an accurate description written by the authors (oracle description)
* - Vision: run without the image
* + Rationale: accurate description plus a rationale

The results of this experiment are as follows.


1. On easy or medium-difficulty tasks, performance holds up even without the image, but hard tasks need the image
2. To raise model performance, accurate and clear text descriptions and rationales have to be provided

 

To pause and summarize the parts of the results so far that might be confusing:

1. From the model's point of view, text helps more than images when solving a problem (text > image).

2. On easy tasks (describing or answering questions about everyday photos, simple reasoning), text alone is enough to get good performance, and adding the image actually lowers it.

3. However!!! On tasks that are hard for the model, like the ones above, the image is essential. A text description alone is not enough.

I think that about sums it up.

 

Additional experiments!

* [{BEGIN/END} OF EXAMPLE]

Putting the text before the image reportedly helps performance more than putting the image first.

Using sentinel tokens like the one in the note above also reportedly helps performance.
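Here is a rough sketch of what those two tips could look like when assembling a few-shot prompt. The list-of-parts format, the file names, and the `build_few_shot_prompt` helper are all hypothetical (real VLM APIs each have their own message format), and the exact layout the authors used is in the paper's appendix, not here.

```python
def build_few_shot_prompt(instruction, examples, query_image):
    """Assemble an interleaved prompt as a list of (kind, value) parts."""
    parts = [("text", instruction)]  # text goes first, before any image
    for ex in examples:
        parts.append(("text", "[BEGIN OF EXAMPLE]"))  # sentinel: example starts
        parts.append(("image", ex["image"]))
        parts.append(("text", f'Answer: "{ex["answer"]}"'))
        parts.append(("text", "[END OF EXAMPLE]"))    # sentinel: example ends
    parts.append(("image", query_image))
    return parts

prompt = build_few_shot_prompt(
    instruction="Choose which shape (A-F) completes the pattern.",
    examples=[{"image": "example_1.png", "answer": "B"}],  # made-up file name
    query_image="query.png",                               # made-up file name
)
for kind, value in prompt:
    print(kind, value)
```

The sentinels make it explicit where each in-context example begins and ends, so the model is less likely to mix up an example image with the query image.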

 

Conclusion!!

The latest models do very well on text-based reasoning, but still struggle with visual deductive reasoning.

In other words, current VLMs are said to lack the ability to understand and perceive complex, confusing abstract patterns.

So taking the paper's findings into account should help advance the visual information processing ability of VLMs.

Also, several well-known reasoning strategies from natural language (in-context learning, self-consistency, and so on) are not effective for visual deductive reasoning (or tasks that are hard for the model), so the paper says new strategies are needed.

Apple keeps publishing multimodal papers, and so do many other companies, so it seems to me the trend is shifting from LLMs to LMMs faster than ever.

Go multimodal! Haha

 

 

 
 
 
 
 
 