
[Deep Learning] On Xavier Initialization..

Major-related/Deep Learning 2016. 2. 22. 10:06




When working with CNNs, Xavier is one of the weight initialization methods you come across.


There's a blog post that explains what this method does, so I'm copying it here for reference..

I'll tidy it up later..


============================================================================================


An Explanation of Xavier Initialization



If you work through the Caffe MNIST tutorial, you’ll come across this curious line

weight_filler { type: "xavier" }

and the accompanying explanation

For the weight filler, we will use the xavier algorithm that automatically determines the scale of initialization based on the number of input and output neurons.

Unfortunately, as of the time this post was written, Google hasn’t heard much about “the xavier algorithm”. To work out what it is, you need to poke around the Caffe source until you find the right docstring and then read the referenced paper, Xavier Glorot & Yoshua Bengio’s Understanding the difficulty of training deep feedforward neural networks.

Why’s Xavier initialization important?

In short, it helps signals reach deep into the network.

  • If the weights in a network start too small, then the signal shrinks as it passes through each layer until it’s too tiny to be useful.
  • If the weights in a network start too large, then the signal grows as it passes through each layer until it’s too massive to be useful.

Xavier initialization makes sure the weights are ‘just right’, keeping the signal in a reasonable range of values through many layers.
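
As a rough illustration (a minimal NumPy sketch added here, not part of the original post; the layer width and depth are arbitrary), you can watch the signal's variance collapse or blow up through a stack of linear layers depending on the weight scale:

import numpy as np

rng = np.random.default_rng(0)
n = 512        # neurons per layer (arbitrary for this demo)
depth = 50     # number of stacked linear layers

x = rng.standard_normal(n)
for scale in (0.01, 1.0 / np.sqrt(n), 0.2):   # too small, "just right", too large
    h = x.copy()
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * scale   # weights with std = scale
        h = W @ h                                  # linear layer, no nonlinearity
    print(f"weight std {scale:.4f} -> output variance after {depth} layers: {h.var():.3e}")

With std = 1/sqrt(n) the variance stays near 1; the other two settings shrink it toward zero or grow it by dozens of orders of magnitude.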

To go any further than this, you’re going to need a small amount of statistics - specifically you need to know about random distributions and their variance.

Okay, hit me with it. What’s Xavier initialization?

In Caffe, it’s initializing the weights in your network by drawing them from a distribution with zero mean and a specific variance,

Var(W) = 1 / n_in

where W is the initialization distribution for the neuron in question, and n_in is the number of neurons feeding into it. The distribution used is typically Gaussian or uniform.
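
In code, that amounts to something like this (a hedged NumPy sketch of the idea, not Caffe's actual filler; the helper name and arguments are my own):

import numpy as np

def xavier_caffe(n_in, n_out, distribution="uniform", rng=None):
    """Caffe-style Xavier fill: zero mean, Var(W) = 1 / n_in."""
    rng = rng if rng is not None else np.random.default_rng()
    if distribution == "gaussian":
        # a Gaussian with std = sqrt(1 / n_in) has the required variance
        return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in))
    # a uniform distribution on [-a, a] has variance a^2 / 3,
    # so a = sqrt(3 / n_in) also gives Var(W) = 1 / n_in
    a = np.sqrt(3.0 / n_in)
    return rng.uniform(-a, a, size=(n_out, n_in))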

It’s worth mentioning that Glorot & Bengio’s paper originally recommended using

Var(W) = 2 / (n_in + n_out)

where n_out is the number of neurons the result is fed to. We’ll come to why Caffe’s scheme might be different in a bit.

And where did those formulas come from?

Suppose we have an input X with n components and a linear neuron with random weights W that spits out a number Y. What’s the variance of Y? Well, we can write

Y = W_1 X_1 + W_2 X_2 + ⋯ + W_n X_n

And from Wikipedia we can work out that W_i X_i is going to have variance

Var(W_i X_i) = E[X_i]^2 Var(W_i) + E[W_i]^2 Var(X_i) + Var(W_i) Var(X_i)

Now if our inputs and weights both have mean 0, that simplifies to

Var(W_i X_i) = Var(W_i) Var(X_i)

Then if we make a further assumption that the X_i and W_i are all independent and identically distributed, we can work out that the variance of Y is

Var(Y) = Var(W_1 X_1 + W_2 X_2 + ⋯ + W_n X_n) = n Var(W_i) Var(X_i)

Or in words: the variance of the output is the variance of the input, but scaled by n Var(W_i). So if we want the variance of the input and output to be the same, that means n Var(W_i) should be 1. Which means the variance of the weights should be

Var(W_i) = 1/n = 1/n_in

Voila. There’s your Caffe-style Xavier initialization.
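
A quick way to convince yourself (a small Monte Carlo check in NumPy, added here rather than taken from the post) is to sample weights with Var(W_i) = 1/n and confirm the output variance matches the input variance:

import numpy as np

rng = np.random.default_rng(1)
n, trials = 256, 100_000

X = rng.standard_normal((trials, n))                  # inputs: mean 0, variance 1
W = rng.normal(0.0, np.sqrt(1.0 / n), (trials, n))    # weights: Var(W_i) = 1 / n
Y = (W * X).sum(axis=1)                               # Y = W_1 X_1 + ... + W_n X_n

print(Y.var())   # ≈ 1.0, i.e. n * Var(W_i) * Var(X_i)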

Glorot & Bengio’s formula needs a tiny bit more work. If you go through the same steps for the backpropagated signal, you find that you need

Var(W_i) = 1 / n_out

to keep the variance of the input gradient & the output gradient the same. These two constraints can only be satisfied simultaneously if n_in = n_out, so as a compromise, Glorot & Bengio take the average of the two:

Var(W_i) = 2 / (n_in + n_out)
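
As a sketch (again hedged, mirroring the NumPy helper above rather than any framework's actual API), the averaged variant only changes the denominator:

import numpy as np

def xavier_glorot(n_in, n_out, rng=None):
    """Glorot & Bengio's compromise: zero-mean Gaussian with Var(W) = 2 / (n_in + n_out)."""
    rng = rng if rng is not None else np.random.default_rng()
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))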

I’m not sure why the Caffe authors used the n_in-only variant. The two possibilities that come to mind are

  • that preserving the forward-propagated signal is much more important than preserving the back-propagated one.
  • that for implementation reasons, it’s a pain to find out how many neurons in the next layer consume the output of the current one.

That seems like an awful lot of assumptions.

It is. But it works. Xavier initialization was one of the big enablers of the move away from per-layer generative pre-training.

The assumption most worth talking about is the “linear neuron” bit. This is justified in Glorot & Bengio’s paper because immediately after initialization, the parts of the traditional nonlinearities - tanh, sigm - that are being explored are the bits close to zero, and where the gradient is close to 1. For the more recent rectifying nonlinearities, that doesn’t hold, and in a recent paper by He, Zhang, Ren and Sun they build on Glorot & Bengio and suggest using

Var(W) = 2 / n_in

instead. Which makes sense: a rectifying linear unit is zero for half of its input, so you need to double the size of weight variance to keep the signal’s variance constant.
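
A matching sketch for ReLU layers (my own illustration of He et al.'s rule under the same conventions as above, not code from their paper or from Caffe):

import numpy as np

def he_init(n_in, n_out, rng=None):
    """He et al.'s initialization for rectified linear units: zero mean, Var(W) = 2 / n_in."""
    rng = rng if rng is not None else np.random.default_rng()
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))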

Source: http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization

