Learning rate is the step size of the weight updates made during backpropagation: it scales how much each weight changes in response to its gradient.
During backpropagation, the model's weights are updated to reduce the estimated error of the loss function.
learning rate * estimated weight error (the gradient of the loss with respect to the weight, i.e. how the overall error changes) >>>> weight update
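
As a rough sketch of that update rule (plain SGD with made-up numbers, ignoring momentum and other optimizer details):

lr = 0.01          # learning rate (step size)
w = 0.5            # a single weight
grad = 0.2         # dL/dw: gradient of the loss with respect to this weight
w = w - lr * grad  # weight update: move against the gradient, scaled by the learning rate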

The learning rate controls the size of the steps the optimizer takes toward the minimum of the loss function.
Because it directly affects performance, a badly chosen learning rate can keep the model from learning at all.
That is why how you set the learning rate is such an important part of model training.

 

 

(Figure: the process of finding the optimal weight on the loss curve under different learning rate settings)

 

A large learning rate makes the algorithm learn fast.
But the algorithm may bounce back and forth around the minimum instead of settling into it, or jump right past it.
+ The weight updates can also be so large that the weights overflow (diverge).

 

A low learning rate makes the weight updates small, so the optimizer moves toward the minimum slowly.

= The optimizer may take far too long to converge, stall, or get stuck in a local minimum instead of the global minimum.

 

If you can set the learning rate well, the performance payoff is huge.

The algorithm can converge quickly.

The chance of it jumping back and forth without ever reaching the global minimum goes down.

The problem is that hitting exactly the right learning rate is hard. It comes down to trial and error.

 

The theory behind finding a suitable learning rate is simple, but actually finding one that is neither too large nor too small, the one that fits just right, is hard.

 

What came along to solve this problem >>> the "Learning rate schedule"

 

You could use the same learning rate from the start of training to the end, but using a learning rate scheduler that adjusts the learning rate during training usually gives better performance.

 

It is well known that training works better when you start with a large learning rate (step size) to optimize quickly, then shrink the learning rate (step size) as you approach the optimum so the updates become fine adjustments.

There are also research results showing that cycling the learning rate down and back up, rather than just decaying it monotonically, helps performance even more.
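
As a minimal sketch of such a cyclical schedule, PyTorch's built-in CyclicLR can be used (the toy model and the base_lr/max_lr/step_size_up values below are made up for illustration):

import torch

model = torch.nn.Linear(10, 1)  # toy model, just to have parameters to optimize
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# The learning rate cycles between base_lr and max_lr instead of only decaying
scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer,
                                              base_lr=0.001,     # lower bound of the cycle
                                              max_lr=0.01,       # upper bound of the cycle
                                              step_size_up=200)  # steps to climb from base_lr to max_lr
# scheduler.step() is then called after every batch rather than once per epoch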

 

 

Learning rate schedule = a predefined scheme that adjusts the learning rate between epochs or iterations as training progresses.

 

The two most common techniques for learning rate schedules:

 

- Constant learning rate:
Initialize the learning rate and do not change it during training.

 

- Decaying learning rate:
Choose an initial learning rate, then gradually reduce it according to a schedule.
Early in training, the learning rate is set large to reach reasonably good weights quickly.
As training goes on, the weights are fine-tuned with a small learning rate to reach higher accuracy.
In short, it keeps shrinking.


Because the conditions differ across tasks and situations every time you train a model, it is not always easy to know exactly when the learning rate should be decayed (Learning Rate Decay).

A high learning rate can drive the loss down quickly, but it can also push the model past the optimum,

while a low learning rate can reach the optimum, but takes far too long to get there.

 

So the approach of starting with a large learning rate and then reducing it every set number of epochs, so that the best solution is reached faster, is called Learning Rate Decay.

 

Step Decay > applies Learning Rate Decay in discrete steps, keyed to specific epochs.

Reducing the learning rate by a fixed ratio every fixed epoch interval (step) is called Step Decay.

In practice, Step Decay is the most widely used way to adjust the learning rate.

Because it only needs a constant ratio and an epoch interval, it is easy to use and intuitive, with no complicated math to understand.
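
As a sketch of the rule itself (lr0, gamma and step_size below are illustrative values, not taken from the post):

lr0, gamma, step_size = 0.1, 0.5, 10

def step_decay_lr(epoch):
    # 0.1 for epochs 0-9, 0.05 for 10-19, 0.025 for 20-29, ...
    return lr0 * gamma ** (epoch // step_size)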

 

The catch is that Step Decay adds more hyperparameters to think about, so there are more values to set and many more combinations to try, which pushed people to look for alternatives that need fewer hyperparameters.

 

The continuous method that came out of that is Cosine Decay.

Looking at the loss curve with Cosine Decay, unlike with Step Decay, the loss decreases smoothly and without sudden jumps.

Because it needs fewer hyperparameters and changes the learning rate continuously, Cosine Decay is said to be considered more often as the learning rate decay of choice.
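
A sketch of the cosine decay curve itself (lr_max, lr_min and total_steps are illustrative values):

import math

lr_max, lr_min, total_steps = 0.1, 0.0, 1000

def cosine_decay_lr(step):
    # the lr follows half a cosine wave from lr_max down to lr_min over total_steps
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))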

 

Other options include Linear Decay and Inverse Sqrt (Inverse Square Root) Decay.
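
For example, transformers also provides a linear-decay-with-warmup schedule; a minimal sketch (the toy model, lr and step counts here are illustrative):

import torch
from transformers import get_linear_schedule_with_warmup

toy_model = torch.nn.Linear(2, 1)
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.01)

# lr rises linearly over the warmup steps, then decays linearly to 0 over the remaining steps
linear_scheduler = get_linear_schedule_with_warmup(toy_optimizer,
                                                   num_warmup_steps=100,
                                                   num_training_steps=1000)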

from transformers import get_cosine_schedule_with_warmup
from transformers import AdamW

# Optimizer setup (model and train_dataloader are assumed to be defined elsewhere)
optimizer = AdamW(model.parameters(),
                  lr=1e-5,   # learning rate
                  eps=1e-8)  # epsilon to avoid division by zero

epochs = 3

# Total number of training steps
total_steps = len(train_dataloader) * epochs

# Scheduler for learning rate decay
scheduler = get_cosine_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,            # warmup not used here
                                            num_training_steps=total_steps)

ex)

Optimizer = AdamW

Epochs = 3

Total number of training steps = (number of batches the dataset is split into * epochs)

 

This value is used to define the total number of training steps (num_training_steps) for the scheduler.

num_warmup_steps is not used here, so it is set to 0.

 

The learning rate (lr) set this way is then adjusted at every training step, applying learning rate scheduling (cosine decay in this example).


LambdaLR

Adjusts the learning rate through a function written as a lambda expression.

The learning rate is computed by multiplying the initial learning rate by the value returned by the lambda function.

optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
# lr at epoch e = initial lr * 0.95 ** e
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer,
                                              lr_lambda=lambda epoch: 0.95 ** epoch)
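
These epoch-based schedulers are all driven the same way: call scheduler.step() once per epoch. A self-contained sketch with a toy model (the actual training code is omitted):

import torch

model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: 0.95 ** epoch)

for epoch in range(5):
    # ... one epoch of training would go here ...
    scheduler.step()                               # update the learning rate once per epoch
    print(epoch, optimizer.param_groups[0]["lr"])  # the lr shrinks by 5% each epoch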

 

MultiplicativeLR

Looks just like LambdaLR, but instead of scaling the initial learning rate, it multiplies the current learning rate by the lambda's value at every step.

optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
# each step: lr = previous lr * (0.95 ** epoch), so this decays faster than the LambdaLR example above
scheduler = torch.optim.lr_scheduler.MultiplicativeLR(optimizer,
                                                      lr_lambda=lambda epoch: 0.95 ** epoch)

 

StepLR

Decays the lr by a factor of gamma every step_size epochs (multiplies the lr by gamma once per step_size).

optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
# lr: 0.001 for epochs 0-9, 0.0005 for 10-19, 0.00025 for 20-29, ...
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

 

MultiStepLR

Instead of a fixed step size, you specify the exact epochs (milestones) at which the learning rate is decayed.

optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
# lr is multiplied by gamma at epoch 30 and again at epoch 80
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 80], gamma=0.5)

 

ReduceLROnPlateau

Reduces the learning rate when performance stops improving.

The validation loss or a metric (evaluation score) has to be passed as input to the scheduler's step() function.

So when the metric stops improving, it waits for `patience` epochs and then reduces the learning rate.

(The example below uses SGD with momentum, as in the PyTorch docs, though momentum is not strictly required for this scheduler.)

from torch.optim.lr_scheduler import ReduceLROnPlateau

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min')  # 'min': act when the metric stops decreasing / 'max': when it stops increasing
for epoch in range(100):
     train(...)
     val_loss = validate(...)

     # Note that step should be called after validate()
     scheduler.step(val_loss)

# patience: how many epochs to wait while the metric is not improving before reducing the lr
# threshold: how large a change counts as a new optimum (when is the best value considered updated?)
# threshold_mode: 'rel' or 'abs'. In 'rel' mode the target is best*(1-threshold) for 'min' and best*(1+threshold) for 'max'; in 'abs' mode it is best-threshold ('min') or best+threshold ('max')
# cooldown: how many epochs to pause the scheduler after the lr has been reduced
# min_lr: lower bound on the lr
# eps: if the difference between the old and new lr is smaller than eps, the update is ignored

 

get_cosine_schedule_with_warmup

Creates a schedule where the learning rate first increases linearly from 0 up to the initial lr set in the optimizer during the warmup period,

and then decreases from that initial lr down to 0 following the values of a cosine function.


import torch.nn as nn
import torch.optim as optim
import transformers

total_samples = 968
bs = 32
n_epochs = 10

num_warmup_steps = (total_samples // bs) * 2        # warm up over the first 2 epochs
num_total_steps = (total_samples // bs) * n_epochs  # total number of optimizer steps

model = nn.Linear(2, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)
scheduler = transformers.get_cosine_schedule_with_warmup(optimizer,
                                                         num_warmup_steps=num_warmup_steps,
                                                         num_training_steps=num_total_steps)
lrs = []
for i in range(num_total_steps):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])  # record the lr used at this step
    scheduler.step()
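
To see the warmup-then-cosine shape, the collected lrs can be plotted (this continues the snippet above; matplotlib is assumed to be installed):

import matplotlib.pyplot as plt

plt.plot(lrs)  # linear ramp-up over the warmup steps, then cosine decay toward 0
plt.xlabel("training step")
plt.ylabel("learning rate")
plt.show()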

 

 

 

 


 
