WEEK4 : problem of optimization :: 순간 기록



2020. 12. 19. 13:46

<content>

learning rate decay
- A technique that lowers the learning rate as training progresses
problem of local optima
- In practice, plateaus are a more serious problem than local optima.
hyperparameter tuning process
- Priority order of hyperparameters
appropriate scale of hyperparameter search
- Search on a log scale rather than a linear scale
re-test hyperparameter occasionally

1. Learning rate decay

The learning rate is usually reduced as training progresses (large steps early, smaller steps later).
- A search has to balance two concerns: exploitation and exploration.
- Early in training, try a relatively large learning rate for exploration.
- Late in training, use a relatively small learning rate for exploitation.
$\alpha = \frac{1}{1 + \text{decayRate} \times \text{epochNum}}\alpha_0$
Here $\alpha_0$ is the initial learning rate.
By this formula, $\alpha$ shrinks as epochNum grows.
Many other learning rate decay schemes exist besides this one.
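
A minimal sketch of the decay formula above in Python. The function name `decayed_learning_rate` and the values `alpha_0 = 0.2` and `decay_rate = 1.0` are illustrative assumptions, not values from the lecture.

```python
def decayed_learning_rate(alpha_0: float, decay_rate: float, epoch_num: int) -> float:
    """Inverse-time decay: alpha = alpha_0 / (1 + decay_rate * epoch_num)."""
    return alpha_0 / (1.0 + decay_rate * epoch_num)


# Hypothetical settings, chosen only to show the shape of the schedule.
alpha_0 = 0.2
decay_rate = 1.0

# Learning rate for the first five epochs (epoch_num = 0, 1, 2, 3, 4).
schedule = [decayed_learning_rate(alpha_0, decay_rate, epoch) for epoch in range(5)]
for epoch, alpha in enumerate(schedule):
    print(f"epoch {epoch}: alpha = {alpha:.4f}")
```

At epoch 0 the result equals `alpha_0`, and every later epoch gives a smaller value, which matches the large-steps-first, small-steps-later idea.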

2. Problem of local optima

Source: https://sacko.tistory.com/38

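In a high-dimensional space, getting stuck in a local optimum is harder than you might expect 🙂
That is because a local optimum must be convex along every parameter direction (a minimum) or concave along every direction (a maximum).
Most points where the gradient is 0 are saddle points rather than local optima.
In high dimensions, the plateau problem is more serious than the local optima problem.
A plateau is a long region where the gradient stays close to 0; once training enters one, it takes much longer.
In such cases, using the Adam optimizer can help.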
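
A back-of-the-envelope sketch of why zero-gradient points are rarely local minima in high dimensions. It assumes, purely for illustration, that at such a point each parameter direction curves upward or downward independently with probability 1/2. Real loss surfaces do not satisfy this assumption, so the numbers only convey the intuition. The function name is hypothetical.

```python
def all_directions_curve_up_probability(num_params: int) -> float:
    """Chance that every one of num_params directions curves upward,
    under the simplifying assumption of independent 50/50 directions."""
    return 0.5 ** num_params


for num_params in (2, 10, 100):
    probability = all_directions_curve_up_probability(num_params)
    print(f"{num_params} parameters: {probability:.3e}")
```

Under this assumption the probability halves with every added parameter, so with thousands or millions of parameters almost every zero-gradient point has at least one direction that curves the other way, which makes it a saddle point.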

3. Tuning process

Rough importance ranking for hyperparameter tuning
- 1. alpha (learning rate)
- 2. # hidden units, mini-batch size
- 3. # layers, learning rate decay
- 4. beta1, beta2, epsilon (Adam optimizer)
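As a rough intuition, the number of hidden units corresponds to how many line segments the model can draw, and the number of layers to how sharply those lines can bend.
So how should hyperparameter tuning be carried out?
- Prefer random search to grid search.
  - For any single hyperparameter, grid search yields less information than random search, because a grid only tries a few distinct values of it while random search tries a new value on every trial.
- coarse to fine
  - Search more finely within the hyperparameter region that performed well.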
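
The grid versus random comparison above can be sketched by counting how many distinct learning rates each strategy tries with the same budget. The function names, the 5 x 5 grid, and the ranges (alpha in [0.0001, 1], hidden units in [50, 250]) are assumptions made for this example.

```python
import random


def grid_search_points(alphas, hidden_unit_counts):
    """Every (alpha, hidden_units) combination on a regular grid."""
    return [(a, h) for a in alphas for h in hidden_unit_counts]


def random_search_points(num_trials, rng):
    """Independent random draws for both hyperparameters.

    alpha is drawn as 10**r with r uniform in [-4, 0] (log scale);
    hidden_units is a uniform integer in [50, 250]. Both ranges are assumptions.
    """
    return [
        (10 ** rng.uniform(-4, 0), rng.randint(50, 250))
        for _ in range(num_trials)
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable
grid = grid_search_points(
    alphas=[0.0001, 0.001, 0.01, 0.1, 1.0],
    hidden_unit_counts=[50, 100, 150, 200, 250],
)
randomized = random_search_points(num_trials=len(grid), rng=rng)

# Count how many different learning rates each strategy actually tested.
distinct_alphas_grid = len({alpha for alpha, _ in grid})
distinct_alphas_random = len({alpha for alpha, _ in randomized})
print("trials:", len(grid))
print("distinct alpha values, grid search:", distinct_alphas_grid)
print("distinct alpha values, random search:", distinct_alphas_random)
```

With the same 25 trials, the grid only ever tests 5 learning rates, while random search tests a different one on nearly every trial. A coarse-to-fine pass would then repeat the random draw inside a narrower range around the best-performing points.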

4. Appropriate scale to pick hyperparameter

Suppose we search the learning rate from 0.0001 to 1.
If we sample on a linear scale, about 90% of the random samples land between 0.1 and 1,
and values between 0.0001 and 0.1 are rarely drawn.
To avoid this, search the hyperparameter on a log scale instead of a linear scale.
Since $\log_{10}(0.0001) = -4$ and $\log_{10}(1) = 0$, draw a random value $r$ from the interval [-4, 0] and use $10^r$ as the learning rate. Sampling this way spreads the search more evenly across orders of magnitude during hyperparameter tuning.
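
A small sketch comparing the two sampling schemes for the range 0.0001 to 1 discussed above. The function names, the seed, and the sample count of 10,000 are assumptions made for illustration.

```python
import random


def sample_learning_rate_linear(rng, low=0.0001, high=1.0):
    """Draw a learning rate uniformly on a linear scale."""
    return rng.uniform(low, high)


def sample_learning_rate_log(rng, low_exp=-4.0, high_exp=0.0):
    """Draw an exponent r uniformly in [low_exp, high_exp] and return 10**r."""
    r = rng.uniform(low_exp, high_exp)
    return 10 ** r


def fraction_below(samples, threshold):
    """Share of samples that are strictly smaller than threshold."""
    return sum(s < threshold for s in samples) / len(samples)


rng = random.Random(0)  # fixed seed so the run is repeatable
num_samples = 10_000
linear_samples = [sample_learning_rate_linear(rng) for _ in range(num_samples)]
log_samples = [sample_learning_rate_log(rng) for _ in range(num_samples)]

linear_fraction = fraction_below(linear_samples, 0.1)
log_fraction = fraction_below(log_samples, 0.1)
print(f"linear scale, share below 0.1: {linear_fraction:.3f}")
print(f"log scale, share below 0.1: {log_fraction:.3f}")
```

On a linear scale roughly 10% of the draws fall below 0.1, while on a log scale roughly 75% do, because three of the four decades in [0.0001, 1] lie below 0.1.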

5. Re-test hyperparameter occasionally

1. panda approach
- Babysit a single model, adjusting it day by day
- Train this way when computing resources are limited
2. caviar approach
- Train many models in parallel
- Worth trying if computing resources allow

License: Attribution, NonCommercial, NoDerivatives
