
  • gradient descent
  • momentum
  • RMSProp
  • Adam

 

 

https://www.slideshare.net/yongho/ss-79607172

 

1. Gradient Descent

Gradient Descent: update the parameters using the gradient of the loss with respect to the parameters.

 

  • batch gradient descent : train on the entire dataset at once
    • look at all the data, compute the gradient at the current position, and decide which direction to move
    • one parameter update per iteration (a full pass over the data) 😥 -> drawback: training takes a long time
    • can be vectorized 🙂

  • mini-batch gradient descent : train one mini-batch at a time
    • many parameter updates per pass over the data 🙂
    • can be vectorized 🙂

  • stochastic gradient descent : train on one example at a time
    • cannot be vectorized 😥
    • reacts sensitively to the local characteristics of each example

  • So which method should we use?
    • if the dataset is small (about 2,000 examples or fewer) -> use batch gradient descent
    • if the dataset is large -> mini-batch gradient descent
      • the mini-batch size is typically chosen from 64, 128, 256, or 512.
      • make sure every example in a mini-batch fits into CPU or GPU memory at once.
    • but even better optimizers exist! Let's take a look (a minimal mini-batch sketch follows this list)
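
Below is a minimal NumPy sketch of mini-batch gradient descent for a simple linear-regression loss, just to make the loop structure concrete. The function name, the X/y data, and the default lr, batch_size, and epochs values are illustrative assumptions, not something from the notes above.

```python
import numpy as np

def minibatch_gradient_descent(X, y, lr=0.01, batch_size=64, epochs=10):
    """Minimal mini-batch GD for linear regression (illustrative only)."""
    n, d = X.shape
    W = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        perm = np.random.permutation(n)           # shuffle once per pass over the data
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]  # indices of one mini-batch
            Xb, yb = X[idx], y[idx]
            err = Xb @ W + b - yb                 # prediction error on the mini-batch
            dW = Xb.T @ err / len(idx)            # gradient of the MSE loss w.r.t. W
            db = err.mean()                       # gradient of the MSE loss w.r.t. b
            W -= lr * dW                          # one update per mini-batch,
            b -= lr * db                          # so many updates per pass
    return W, b
```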

✔ Background worth knowing before moving on to the Momentum optimizer!

 

  • 1. exponentially weighted average
    • $$ V_t = \beta V_{t-1}+(1-\beta)\theta_{t} $$
      • V_t is the accumulated information (the weighted average)
      • theta_t is the observation at the current time step
    • V_t can be thought of as an average over roughly the last 1/(1-beta) time steps.
      • beta = 0.9 -> weighted average over roughly the last 10 days
      • beta = 0.99 -> weighted average over roughly the last 100 days
    • What effect does a larger beta have?
      • the curve becomes smoother, and
      • recent trends are reflected more slowly (the curve shifts to the right)

  • 2. bias correction
    • If V_0 is initialized to 0, the early values of V_t do not reflect the data well. Bias correction is the method for fixing this problem.
    • $$ V_t = \frac{V_t}{1-\beta^{t}} $$
      • Early on, this scales V_t up a bit.
      • But as t grows, beta^t converges to 0, so the correction has almost no effect on V_t. (A short sketch of both ideas follows this list.)
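
Here is a short sketch of the exponentially weighted average with optional bias correction, assuming a 1-D NumPy sequence of observations; the function name and the example signal are made up for illustration.

```python
import numpy as np

def exponentially_weighted_average(theta, beta=0.9, bias_correction=True):
    """Running EWA of a sequence theta_1 .. theta_T."""
    v = 0.0                                       # V_0 = 0, which is why bias correction helps early on
    averages = []
    for t, x in enumerate(theta, start=1):
        v = beta * v + (1 - beta) * x             # V_t = beta*V_{t-1} + (1-beta)*theta_t
        v_hat = v / (1 - beta ** t) if bias_correction else v  # scale up the early, underestimated values
        averages.append(v_hat)
    return np.array(averages)

# example: smooth a noisy signal; beta=0.9 averages over roughly the last 10 points
noisy = np.sin(np.linspace(0, 3, 50)) + 0.3 * np.random.randn(50)
smooth = exponentially_weighted_average(noisy, beta=0.9)
```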

 

2. Momentum

 

momentum: keep moving along the direction we were already descending in, i.e., with inertia! (adjusts the direction)

https://www.slideshare.net/yongho/ss-79607172

  • $$ VdW = \beta VdW+(1-\beta)dW $$
  • $$ Vdb = \beta Vdb + (1-\beta)db $$
  • $$ W := W-\alpha VdW $$
  • $$ b := b-\alpha Vdb $$
  • beta is typically set to 0.9 (roughly speaking, the last 10 steps of gradient information are used).
  • How does this differ from the plain updates W := W - alpha*dW and b := b - alpha*db?
  • Instead of dW and db we use VdW and Vdb, which incorporate the previous gradient information
    (here VdW and Vdb are exponentially weighted averages of the gradients, which is why we covered that concept above).
  • In other words, even when the current gradient is close to 0, the previous gradient information lets us keep moving.
  • To summarize: keep going in the direction of the momentum we already built up! (A minimal sketch follows below.)
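
A minimal sketch of one momentum update step in NumPy, following the formulas above; the function name, argument order, and default hyperparameters are illustrative assumptions.

```python
import numpy as np

def momentum_step(W, b, dW, db, VdW, Vdb, lr=0.01, beta=0.9):
    """One momentum update; VdW/Vdb are EWAs of past gradients."""
    VdW = beta * VdW + (1 - beta) * dW   # VdW = beta*VdW + (1-beta)*dW
    Vdb = beta * Vdb + (1 - beta) * db   # Vdb = beta*Vdb + (1-beta)*db
    W = W - lr * VdW                     # step along the averaged direction, not the raw gradient
    b = b - lr * Vdb
    return W, b, VdW, Vdb

# VdW and Vdb start at zero with the same shape as the corresponding parameters:
# W, b, VdW, Vdb = momentum_step(W, b, dW, db, VdW, Vdb)
```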


3. RMSProp

RMSProp (Root Mean Square Propagation): adjust the step size depending on the situation! (adjusts the step)

https://www.slideshare.net/yongho/ss-79607172

  • $$ SdW = \beta SdW + (1-\beta)dW^2 $$
  • $$ Sdb = \beta Sdb + (1-\beta)db^2 $$
  • $$ W := W - \alpha \frac{dW}{\sqrt{SdW}+\epsilon} $$
  • $$ b := b - \alpha \frac{db}{\sqrt{Sdb}+\epsilon} $$
  • The squares above are element-wise products.
  • In RMSProp, beta = 0.999 is commonly used.
  • epsilon is added to avoid dividing by zero when SdW or Sdb is 0.
  • How does this differ from the plain updates W := W - alpha*dW and b := b - alpha*db?
  • Notice that dW is divided by sqrt(SdW).
    This means that when dW has been large the step is made smaller, and when dW has been small the step is made larger.
  • In other words, instead of using the same step size at every moment, the step size is adapted to the magnitude of the gradient (here too, SdW and Sdb are exponentially weighted averages). (A minimal sketch follows below.)
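
A minimal sketch of one RMSProp update step in NumPy, matching the formulas above; again, the function name and default hyperparameters are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def rmsprop_step(W, b, dW, db, SdW, Sdb, lr=0.001, beta=0.999, eps=1e-8):
    """One RMSProp update; SdW/Sdb are EWAs of the squared gradients."""
    SdW = beta * SdW + (1 - beta) * dW ** 2      # element-wise square of the gradient
    Sdb = beta * Sdb + (1 - beta) * db ** 2
    W = W - lr * dW / (np.sqrt(SdW) + eps)       # large past gradients -> smaller step
    b = b - lr * db / (np.sqrt(Sdb) + eps)       # eps avoids division by zero
    return W, b, SdW, Sdb
```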



4. Adam (Adaptive Moment Estimation): momentum + RMSProp

Adam (Adaptive Moment Estimation): adjust both the direction and the step size appropriately!

 

https://www.slideshare.net/yongho/ss-79607172

  • $$ VdW = \beta_1 VdW+(1-\beta_1)dW $$
  • $$ Vdb = \beta_1 Vdb + (1-\beta_1)db $$
  • $$ SdW = \beta_2 SdW + (1-\beta_2)dW^2 $$
  • $$ Sdb = \beta_2 Sdb + (1-\beta_2)db^2 $$

--- bias correction ---

  • $$ VdW = \frac{VdW}{1-\beta_1^{t}} $$
  • $$ Vdb = \frac{Vdb}{1-\beta_1^{t}} $$
  • $$ SdW = \frac{SdW}{1-\beta_2^{t}} $$
  • $$ Sdb = \frac{Sdb}{1-\beta_2^{t}} $$

--- parameter update ---

  • $$ W := W - \alpha \frac{VdW}{\sqrt{SdW}+\epsilon} $$
  • $$ b := b - \alpha \frac{Vdb}{\sqrt{Sdb}+\epsilon} $$
  • alpha should be tuned appropriately for the problem,
  • and typically beta_1 = 0.9, beta_2 = 0.999, and epsilon = 10^{-8}.
  • Adam is an optimizer that combines momentum (which adjusts the direction) with RMSProp (which adjusts the step size). (A minimal sketch follows below.)
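
A minimal sketch of one Adam update step in NumPy that strings the three stages above together (momentum averages, RMSProp averages, bias correction, then the update); the function name and default values are illustrative assumptions.

```python
import numpy as np

def adam_step(W, b, dW, db, VdW, Vdb, SdW, Sdb, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    # momentum part: EWA of the gradients
    VdW = beta1 * VdW + (1 - beta1) * dW
    Vdb = beta1 * Vdb + (1 - beta1) * db
    # RMSProp part: EWA of the squared gradients
    SdW = beta2 * SdW + (1 - beta2) * dW ** 2
    Sdb = beta2 * Sdb + (1 - beta2) * db ** 2
    # bias correction for the zero-initialized averages
    VdW_hat, Vdb_hat = VdW / (1 - beta1 ** t), Vdb / (1 - beta1 ** t)
    SdW_hat, Sdb_hat = SdW / (1 - beta2 ** t), Sdb / (1 - beta2 ** t)
    # parameter update: direction from momentum, scale from RMSProp
    W = W - lr * VdW_hat / (np.sqrt(SdW_hat) + eps)
    b = b - lr * Vdb_hat / (np.sqrt(Sdb_hat) + eps)
    return W, b, VdW, Vdb, SdW, Sdb
```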

 

 
