docs/tutorial/solver.md: 14 additions, 14 deletions
@@ -8,12 +8,12 @@ The responsibilities of learning are divided between the Solver for overseeing t
The Caffe solvers are:
-- Stochastic Gradient Descent (`SGD`),
-- AdaDelta (`ADADELTA`),
-- Adaptive Gradient (`ADAGRAD`),
-- Adam (`ADAM`),
-- Nesterov's Accelerated Gradient (`NESTEROV`) and
-- RMSprop (`RMSPROP`)
+- Stochastic Gradient Descent (`type: "SGD"`),
+- AdaDelta (`type: "AdaDelta"`),
+- Adaptive Gradient (`type: "AdaGrad"`),
+- Adam (`type: "Adam"`),
+- Nesterov's Accelerated Gradient (`type: "Nesterov"`) and
+- RMSprop (`type: "RMSProp"`)
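As a sketch of how the new string-valued selector is used, a minimal solver definition might look like the following (the `net` path and hyperparameter values are illustrative placeholders, not taken from the tutorial):

```
# illustrative solver.prototxt sketch -- values are placeholders
net: "models/example/train_val.prototxt"  # placeholder path to the net definition
type: "SGD"             # solver selected by the string-valued type field
base_lr: 0.01           # initial learning rate
lr_policy: "step"       # drop the learning rate in steps
gamma: 0.1              # multiply the rate by this factor at each step
stepsize: 10000         # step every 10000 iterations
momentum: 0.9
weight_decay: 0.0005
max_iter: 45000
```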
The solver
@@ -51,7 +51,7 @@ The parameter update $$\Delta W$$ is formed by the solver from the error gradien
### SGD
-**Stochastic gradient descent** (`solver_type: SGD`) updates the weights $$ W $$ by a linear combination of the negative gradient $$ \nabla L(W) $$ and the previous weight update $$ V_t $$.
+**Stochastic gradient descent** (`type: "SGD"`) updates the weights $$ W $$ by a linear combination of the negative gradient $$ \nabla L(W) $$ and the previous weight update $$ V_t $$.
The **learning rate** $$ \alpha $$ is the weight of the negative gradient.
The **momentum** $$ \mu $$ is the weight of the previous update.
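For reference, the standard SGD-with-momentum update that combines these two terms can be written as

$$ V_{t+1} = \mu V_t - \alpha \nabla L(W_t) $$

$$ W_{t+1} = W_t + V_{t+1} $$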
@@ -113,7 +113,7 @@ If learning diverges (e.g., you start to see very large or `NaN` or `inf` loss v
### AdaDelta
-The **AdaDelta** (`solver_type: ADADELTA`) method (M. Zeiler [1]) is a "robust learning rate method". It is a gradient-based optimization method (like SGD). The update formulas are
+The **AdaDelta** (`type: "AdaDelta"`) method (M. Zeiler [1]) is a "robust learning rate method". It is a gradient-based optimization method (like SGD). The update formulas are
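A rough sketch of the AdaDelta update, in Zeiler's original notation rather than the tutorial's, accumulates running averages of squared gradients and squared updates with decay $$ \delta $$ (all operations elementwise):

$$ E[g^2]_t = \delta E[g^2]_{t-1} + (1-\delta) \left( \nabla L(W_t) \right)^2 $$

$$ \Delta W_t = -\frac{\sqrt{E[\Delta W^2]_{t-1} + \varepsilon}}{\sqrt{E[g^2]_t + \varepsilon}} \nabla L(W_t) $$

$$ E[\Delta W^2]_t = \delta E[\Delta W^2]_{t-1} + (1-\delta) (\Delta W_t)^2, \qquad W_{t+1} = W_t + \Delta W_t $$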
### AdaGrad

-The **adaptive gradient** (`solver_type: ADAGRAD`) method (Duchi et al. [1]) is a gradient-based optimization method (like SGD) that attempts to "find needles in haystacks in the form of very predictive but rarely seen features," in Duchi et al.'s words.
+The **adaptive gradient** (`type: "AdaGrad"`) method (Duchi et al. [1]) is a gradient-based optimization method (like SGD) that attempts to "find needles in haystacks in the form of very predictive but rarely seen features," in Duchi et al.'s words.

Given the update information from all previous iterations $$ \left( \nabla L(W) \right)_{t'} $$ for $$ t' \in \{1, 2, ..., t\} $$,
the update formulas proposed by [1] are as follows, specified for each component $$i$$ of the weights $$W$$:
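For reference, the per-component AdaGrad update proposed by Duchi et al. scales each gradient component by the accumulated history of its squared gradients (practical implementations add a small smoothing term in the denominator):

$$ (W_{t+1})_i = (W_t)_i - \alpha \frac{\left( \nabla L(W_t) \right)_i}{\sqrt{\sum_{t'=1}^{t} \left( \nabla L(W_{t'}) \right)_i^2}} $$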
@@ -159,7 +159,7 @@ Note that in practice, for weights $$ W \in \mathcal{R}^d $$, AdaGrad implementa
### Adam
-The **Adam** (`solver_type: ADAM`), proposed in Kingma et al. [1], is a gradient-based optimization method (like SGD). This includes an "adaptive moment estimation" ($$m_t, v_t$$) and can be regarded as a generalization of AdaGrad. The update formulas are
+The **Adam** (`type: "Adam"`), proposed in Kingma et al. [1], is a gradient-based optimization method (like SGD). This includes an "adaptive moment estimation" ($$m_t, v_t$$) and can be regarded as a generalization of AdaGrad. The update formulas are
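For reference, a sketch of the Adam update following Kingma et al. [1] (elementwise): it maintains first- and second-moment estimates of the gradient and applies a bias-corrected step,

$$ m_t = \beta_1 m_{t-1} + (1-\beta_1) \nabla L(W_t), \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) \left( \nabla L(W_t) \right)^2 $$

$$ W_{t+1} = W_t - \alpha \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t} \cdot \frac{m_t}{\sqrt{v_t} + \varepsilon} $$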
@@ -181,7 +181,7 @@ Kingma et al. [1] proposed to use $$\beta_1 = 0.9, \beta_2 = 0.999, \varepsilon
### NAG
-**Nesterov's accelerated gradient** (`solver_type: NESTEROV`) was proposed by Nesterov [1] as an "optimal" method of convex optimization, achieving a convergence rate of $$ \mathcal{O}(1/t^2) $$ rather than the $$ \mathcal{O}(1/t) $$.
+**Nesterov's accelerated gradient** (`type: "Nesterov"`) was proposed by Nesterov [1] as an "optimal" method of convex optimization, achieving a convergence rate of $$ \mathcal{O}(1/t^2) $$ rather than the $$ \mathcal{O}(1/t) $$.
Though the required assumptions to achieve the $$ \mathcal{O}(1/t^2) $$ convergence typically will not hold for deep networks trained with Caffe (e.g., due to non-smoothness and non-convexity), in practice NAG can be a very effective method for optimizing certain types of deep learning architectures, as demonstrated for deep MNIST autoencoders by Sutskever et al. [2].
The weight update formulas look very similar to the SGD updates given above:
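Sketching the standard NAG form for reference, the only change from SGD is that the gradient is evaluated at the point the momentum would carry the weights to:

$$ V_{t+1} = \mu V_t - \alpha \nabla L(W_t + \mu V_t) $$

$$ W_{t+1} = W_t + V_{t+1} $$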
@@ -206,10 +206,10 @@ What distinguishes the method from SGD is the weight setting $$ W $$ on which we
### RMSprop
-The **RMSprop** (`solver_type: RMSPROP`), suggested by Tieleman in a Coursera course lecture, is a gradient-based optimization method (like SGD). The update formulas are
+The **RMSprop** (`type: "RMSProp"`), suggested by Tieleman in a Coursera course lecture, is a gradient-based optimization method (like SGD). The update formulas are
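For reference, RMSprop as usually presented (elementwise, with decay $$ \delta $$) divides each gradient component by a running root-mean-square of recent gradients:

$$ \operatorname{MS}\left( (W_t)_i \right) = \delta \operatorname{MS}\left( (W_{t-1})_i \right) + (1-\delta) \left( \nabla L(W_t) \right)_i^2 $$

$$ (W_{t+1})_i = (W_t)_i - \alpha \frac{\left( \nabla L(W_t) \right)_i}{\sqrt{\operatorname{MS}\left( (W_t)_i \right)}} $$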