
  • Learning rate decay
    • Gradually reduce the learning rate as training progresses.
  • Problem of local optima
    • In practice, plateaus are a more serious problem than local optima.
  • Hyperparameter tuning process
    • Priorities among hyperparameters
  • Appropriate scale for hyperparameter search
    • Search on a log scale rather than a linear scale.
  • Re-test hyperparameters occasionally

1. Learning rate decay

 

  • ์ผ๋ฐ˜์ ์œผ๋กœ ํ•™์Šต์ด ์ง„ํ–‰๋ ์ˆ˜๋ก learning rate๋ฅผ ์ค„์—ฌ์ค€๋‹ค. (์ดˆ๊ธฐ์—๋Š” ํฐ step, ๊ฐˆ์ˆ˜๋ก ์ž‘์€ step)
    • ํƒ์ƒ‰์—์„œ ๊ณ ๋ คํ•˜๋Š” ๋‘ ๊ฐ€์ง€๊ฐ€ ์žˆ๋Š”๋ฐ, ๊ทธ๊ฒƒ์€ ๋ฐ”๋กœ exploitation๊ณผ exploration์ด๋‹ค.
    • ํ•™์Šต ์ดˆ๊ธฐ์—๋Š” exploration์„ ์œ„ํ•ด ๋น„๊ต์  ํฐ learning rate๋ฅผ ์‹œ๋„ํ•˜๊ณ ,
    • ํ•™์Šต ํ›„๊ธฐ์—๋Š” exploitation์„ ์œ„ํ•ด ๋น„๊ต์  ์ž‘์€ learning rate๋ฅผ ์ด์šฉํ•œ๋‹ค.
  • $$ \alpha = \frac{1}{1+decayRate*epochNum}\alpha_0 $$
  • ์—ฌ๊ธฐ์„œ alpha_0๋Š” ์ดˆ๊ธฐ learning rate๋ฅผ ์˜๋ฏธํ•œ๋‹ค.
  • ์œ„์˜ ๊ณต์‹์— ๋”ฐ๋ฅด๋ฉด epochNum์ด ์ปค์งˆ์ˆ˜๋ก alpha๊ฐ’์ด ์ž‘์•„์ง€๊ฒŒ ๋œ๋‹ค.
  • ์œ„์˜ ๊ณต์‹ ์™ธ์—๋„ ๋‹ค์–‘ํ•œ learning rate decay ๋ฐฉ๋ฒ•์ด ์กด์žฌํ•œ๋‹ค.

 


2. Problem of local optima

 

Image source: https://sacko.tistory.com/38

 

  • ๊ณ ์ฐจ์›์˜ ๊ณต๊ฐ„์—์„œ๋Š” ์ƒ๊ฐ๋ณด๋‹ค local optima์— ๋น ์ง€๊ธฐ๋Š” ์‰ฝ์ง€ ์•Š๋‹ค ๐Ÿ™‚
    local optima์ธ ์ง€์ ์€ ๋ชจ๋“  parameter์— ๋Œ€ํ•ด convexํ•˜๊ฑฐ๋‚˜, concave ํ•ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.
  • ๋Œ€๋ถ€๋ถ„์˜ ๊ธฐ์šธ๊ธฐ๊ฐ€ 0์ธ ์ง€์ ์€ local optima๋ณด๋‹ค๋Š” saddle point์ด๋‹ค.
  • ๊ณ ์ฐจ์›์—์„œ๋Š” local optima์˜ ๋ฌธ์ œ๋ณด๋‹ค problem of plateau๊ฐ€ ๋” ์‹ฌ๊ฐํ•˜๋‹ค. 
  • problem of plateau๋Š” ๊ธฐ์šธ๊ธฐ๊ฐ€ 0์— ๊ทผ์ ‘ํ•œ ๊ธด ๊ตฌ๊ฐ„์„ ๋งํ•˜๋ฉฐ, ์ด ๊ตฌ๊ฐ„์— ๋น ์ง€๋ฉด ํ•™์Šต์‹œ๊ฐ„์ด ๊ธธ์–ด์ง„๋‹ค.
    ์ด๋Ÿฐ ๊ฒฝ์šฐ Adam Optimizer์˜ ์‚ฌ์šฉ์ด ๋„์›€์ด ๋˜๊ธฐ๋„ ํ•œ๋‹ค. 
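As a concrete (hypothetical) illustration of why a zero-gradient point is usually a saddle rather than a local optimum, consider f(w1, w2) = w1^2 - w2^2: the gradient vanishes at the origin, but the surface curves up along w1 and down along w2, so the origin is not a minimum.

```python
import numpy as np

# f(w1, w2) = w1**2 - w2**2: zero gradient at (0, 0),
# convex along w1 but concave along w2 -> a saddle point, not a local optimum.
def f(w):
    return w[0] ** 2 - w[1] ** 2

def grad(w):
    return np.array([2.0 * w[0], -2.0 * w[1]])

origin = np.array([0.0, 0.0])
print("gradient at origin:", grad(origin))        # [0. 0.]
print("step along w1:", f(np.array([0.1, 0.0])))  # > 0, curves up
print("step along w2:", f(np.array([0.0, 0.1])))  # < 0, curves down
```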

 

 


 

3. Tuning process

 

  • hyperparameter tuning์˜ ์ค‘์š”๋„ (๋Œ€๋žต์ )
    • 1. alpha (learning rate)
    • 2. # hidden units, mini-batch size
    • 3. # layers, learning rate decay
    • 4. beta1, beta2, epsilon (Adam optimizer)
  • hidden units์€ ์„ ์˜ ์ˆ˜๋ฅผ ์˜๋ฏธํ•˜๊ณ , layer์˜ ์ˆ˜๋Š” ์„ ์˜ ๊ตฌ๋ถ€๋Ÿฌ์ง„ ์ •๋„๋ฅผ ๋‚˜ํƒ€๋‚ธ๋‹ค.

  • So how should hyperparameter tuning be carried out?
    • Use random search rather than grid search.
      • For each individual hyperparameter, grid search explores fewer distinct values than random search does (a 5x5 grid tries only 5 values per hyperparameter, while 25 random points try 25).
    • Coarse to fine
      • Search the region of hyperparameter space that performed well more finely (see the sketch below).

 


4. Appropriate scale to pick hyperparameters

 

  • Suppose we search the learning rate over the range 0.0001 to 1.
    If we sample uniformly on a linear scale, roughly 90% of the random samples land between 0.1 and 1,
    and values between 0.0001 and 0.1 are almost never drawn.
  • To avoid this, the parameter should be searched on a log scale rather than a linear scale.
  • Since log_10(0.0001) = -4 and log_10(1) = 0, draw a random value r from the interval [-4, 0] and use 10^r as the candidate. Sampling this way spreads the search evenly across the orders of magnitude (see the sketch below).
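A minimal sketch of log-scale sampling for the learning rate (NumPy assumed; the sample count is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear sampling: roughly 90% of draws land in [0.1, 1].
linear = rng.uniform(0.0001, 1.0, size=10_000)

# Log-scale sampling: r ~ Uniform(-4, 0), alpha = 10**r,
# so each decade [1e-4, 1e-3), [1e-3, 1e-2), ... gets roughly equal coverage.
r = rng.uniform(-4, 0, size=10_000)
log_scale = 10.0 ** r

print("linear sampling, fraction below 0.1:", np.mean(linear < 0.1))     # ~0.10
print("log sampling,    fraction below 0.1:", np.mean(log_scale < 0.1))  # ~0.75
```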

5. Re-test hyperparameters occasionally

 

  • 1. Panda approach
    • Babysit a single model, checking and adjusting it day by day.
    • Use this approach when computing resources are limited.
  • 2. Caviar approach
    • Train many models in parallel with different hyperparameters.
    • Worth trying when sufficient computing resources are available.
