CS231n Lecture 3: Loss Functions and Optimization

강의링크: youtube

1. Last Lecture Summary

image 데이터에 대한 linear classification을 가정
간단하게 다음 그림같이 \(32x32x3\)의 이미지가 입력으로 들어왔을 때 그 그림이 10개의 클래스 중에 어디에 분류되는지를 맞추는 모델을 생각
모델은 다음 수식으로 정의 : \(f(x,W)=Wx+b\)
\(W\)가 학습의 대상이 되는 parameters of weight

다음과 같은 그림으로 간단히 어떤 식으로 이미지를 분류하는지 알 수 있음
이미지 데이터의 형태를 유지하지 않고 일자로 펴서 weight matrix와 곱함
그로부터 나온 값이 score이고 입력으로 들어온 이미지가 각 클래스에 얼마나 맞는지를 수치화함
다른 표현 : learning templates for the class. each row of weight matrix is template for the class
또 다른 표현 : linear classification is learning linear decision boundaries(svm이 떠오르는 말이군) between pixels in some high dimensional space (각 dimension은 pixel)
lecture 3에서 다룰 내용
- 어떻게 좋은 \(W\)를 얻을 수 있는지?
- 어떤 score가 더 안 좋고 어떤 건 더 좋은데 그것을 어떻게 수치적으로 측정할 것인가: loss function
- 모든 가능한 \(W\)에 대해서 효율적인 과정을 통해 search 하는 것: optimization

\[L=\frac1N\sum_iL_i(f(x_i, W), y_i)\]

multiclass svm loss function: hinge loss
- 수식은 다음과 같음
- \(s=f(x,W)\), \(S_{y_i}\): score of correct category, \(S_{j}\): score of incorrect category
- \(1\)은 safety margin
- 이 loss function은 다음 그래프와 같이 hinge와 같은 모양이므로 hinge loss라고도 부름

방금 구한 loss는 data loss로서 train data에 대한 loss
하지만 그 loss만 가지고 학습을 하면 train data에 대해서만 잘 되고 test data에 대해서는 분류를 잘 못할 것임
따라서 “regularization” loss 항이 더 필요함. 다음 수식에서 뒤 부분인데 regularization은 model이 simple해지도록 하는 것임. simple 할수록 overfitting의 요소는 줄어들고 test data에 대해서 분류를 잘 하게 됌.
penelize the complexity of the model rather than explicitly trying to fit the training data
L2 regularization : \(R(W)=\sum_k\sum_lW_{k,l}^2\)
- MAP inference using a Gaussian prior on W –> 무슨 뜻인지..
L1 regularization : \(R(W)=\sum_k\sum_l \vert W_{k,l} \vert\)
- generally preter sparse solution
- increasing sparsity of matrix W
Elastic net(L2 + L1)
Max norm regularization
Dropout
Batch normalization
Question
- Q1. How L2 regularization measure the complexity of model?
  - 어떤 label인지 판단하는데 있어서 x 벡터의 특정 element만 보지 않고 spreadout inference
  - 다음 그림과 같은 경우에서 w1보다 w2의 regular 텀의 크기가 작음. L2는 따라서 w2를 더 선호함. L1은 w1과 같은 sparse한 경우를 더 선호함
  - 간단히 생각해보면 L2를 미분하면 w가 남아서 w의 크기에 따라서 decay를 할텐데 L1은 w의 크기에 상관없이 다 decay를 하기 때문에 쉽게 weight가 0이 되어버린다.

scores = unnormalized log probabilities of the classes
loss는 다음과 같이 계산함. exponential을 취한 다음에 normalize를 한 이후 나온 값 중에서 label에 해당하는 값만 가져온 것이 \(L_i\)

Question
- Q1. what is the min/max possible loss L_i
  - min= 0 / max = \(\infty\)
- Q2. w is small so all s~0, what is the loss?
  - log C
svm loss와 cross entropy loss를 둘 다 나타내면 다음 그림

random search
- take a bunch of Ws randomly, see how well they do
- don’t use
Follow the slop
- 함수의 미분값을 따라서 조금씩 이동해서 최적점에 도달하는 것
- 1-dimension은 그냥 미분
- multi-dimension은 gradient \(\nabla\)
- the direction of steepest descent is the negative gradient
- linear 1st order approximation to your function at your current point
- gradient를 계산하고 그 값을 가지고 parameter vector를 iteratively update
how to compute gradient?
- finite difference method: 각 element를 조금씩 바꾸면서 loss의 변화를 보고 gradient 계산 –> 겁나 느림
- 보통은 수식적으로 미분식을 구해서 gradient 구함. numerical gradient는 gradient check 용도로 사용 가능함
gradient descent

모든 데이터에 대해서 한꺼번에 gradient를 계산하는 건 too expensive
따라서 minibatch를 만들어서 이 데이터에 대해 gradient 계산. 보통 32, 64, 128을 많이 사용함.
(random하게 minibatch를 sampling할 경우에 전체 데이터셋에 대해 계산한 gradient와 expectation 값이 같을 것이므로 결국 비슷하게 수렴할 수 있음. 또한 현재 데이터에 대해서 잘 학습하는 것보다 일반화 성능을 높이는 것이 머신러닝과 딥러닝의 목표임. stochastic하게 gradient를 계산하는 것 자체가 일반화 성능에 도움을 준다는 내용도 어딘가에서 봤음.)