NN Tips
From Jeong-Yoon Lee's Wiki
Terminology
Let:
- j be an index for cases
- X or X_j be an input vector
- W be a collection (vector, matrix, or some more complicated structure) of weights and possibly other parameter estimates
- y or y_j be a target scalar
- M(X,W) be the output function computed by the network (the letter M is used to suggest "mean", "median", or "mode")
- p or p_j = M(X_j,W) be an output (the letter p is used to suggest "predicted value" or "posterior probability")
- r or r_j = y_j - p_j be a residual
- Q(y,X,W) be the case-wise error function written to show the dependence on the weights explicitly
- L(y,p) be the case-wise error function in simpler form where the weights are implicit (the letter L is used to suggest "loss" function)
- D be a list of indices designating a data set, including inputs and target values
- DL designate the training (learning) set
- DV designate the validation set
- DT designate the test set
- #(D) be the number of elements (cases) in D
- NL be the number of cases in the training (learning) set
- NV be the number of cases in the validation set
- NT be the number of cases in the test set
- TQ(D,W) be the total error function
- AQ(D,W) be the average error function
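As a minimal sketch of how these definitions fit together, the Python below computes residuals, the total error TQ, and the average error AQ for a toy data set. It assumes that TQ(D,W) is the sum of the case-wise errors over D and that AQ(D,W) = TQ(D,W) / #(D). The squared-error loss, the linear model standing in for M(X,W), and the data values are illustrative assumptions, not part of the notation above.

```python
def M(X_j, W):
    """Output function M(X, W). Assumption: a plain linear model for illustration."""
    return sum(w * x for w, x in zip(W, X_j))


def L(y_j, p_j):
    """Case-wise loss L(y, p). Assumption: squared error."""
    return (y_j - p_j) ** 2


def residuals(D, X, y, W):
    """r_j = y_j - p_j for every case index j in D."""
    return [y[j] - M(X[j], W) for j in D]


def TQ(D, X, y, W):
    """Total error, assumed here to be the sum of case-wise errors over D."""
    return sum(L(y[j], M(X[j], W)) for j in D)


def AQ(D, X, y, W):
    """Average error, assumed here to be TQ(D, W) / #(D)."""
    return TQ(D, X, y, W) / len(D)


# Toy data: two cases, two inputs each (hypothetical values).
X = [[1.0, 2.0], [3.0, 4.0]]
y = [5.0, 10.0]
W = [1.0, 2.0]
DL = [0, 1]  # indices of the training (learning) set

print(residuals(DL, X, y, W))  # r_0 = 5 - 5, r_1 = 10 - 11
print(TQ(DL, X, y, W), AQ(DL, X, y, W))
```

Passing DV or DT instead of DL gives the validation or test error with the same functions, since D is only a list of case indices.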
How to Select Parameters
- How to choose activation functions
  - Sigmoid for hidden units
  - Linear for output units
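The choice above can be sketched as a forward pass through a network with one hidden layer: sigmoid activations on the hidden units and an identity (linear) activation on the single output unit. This is only an illustration; the layer sizes, the names `W_hidden`, `b_hidden`, `w_out`, `b_out`, and the use of the logistic function as the sigmoid are assumptions, not something this page specifies.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, squashing any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W_hidden, b_hidden, w_out, b_out):
    """One hidden layer with sigmoid units, one linear output unit.

    x        : list of inputs
    W_hidden : one row of weights per hidden unit
    b_hidden : one bias per hidden unit
    w_out    : one weight per hidden unit
    b_out    : output bias
    """
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_hidden, b_hidden)
    ]
    # Linear output: a weighted sum with no squashing, so p is unbounded.
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


# Hypothetical weights: zero hidden weights make every hidden unit output 0.5.
p = forward(
    x=[1.0, -2.0],
    W_hidden=[[0.0, 0.0], [0.0, 0.0]],
    b_hidden=[0.0, 0.0],
    w_out=[2.0, 4.0],
    b_out=1.0,
)
print(p)  # 1.0 + 2.0 * 0.5 + 4.0 * 0.5
```

Because the output unit is linear, the prediction is not confined to (0, 1), which suits real-valued targets; the sigmoid hidden units supply the nonlinearity.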
How to Select Nonlinear Optimization Algorithms
- For a small number of weights
  - Stabilized Newton algorithms
  - Gauss-Newton algorithms
  - These include various Levenberg-Marquardt and trust-region algorithms
  - The memory required by these algorithms is proportional to the square of the number of weights.
- For a moderate number of weights
  - Quasi-Newton algorithms
  - The memory required by these algorithms is proportional to the square of the number of weights.
- For a large number of weights
  - Conjugate-gradient algorithms
  - The memory required by these algorithms is proportional to the number of weights.
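To make the memory scaling above concrete, here is a rough back-of-the-envelope sketch in Python. It assumes 8-byte floating-point values, one dense n-by-n matrix for the Newton, Gauss-Newton, and quasi-Newton families, and a handful of length-n work vectors for conjugate gradients. The constant `n_vectors=4` and both function names are hypothetical; real implementations differ in their constant factors.

```python
BYTES_PER_FLOAT64 = 8


def quadratic_memory_bytes(n_weights):
    """Memory for one dense n-by-n matrix (a Hessian or its approximation).

    Assumption: this single matrix dominates the memory of Newton-type,
    Gauss-Newton, and quasi-Newton methods.
    """
    return n_weights * n_weights * BYTES_PER_FLOAT64


def linear_memory_bytes(n_weights, n_vectors=4):
    """Memory for a few length-n vectors, as in conjugate-gradient methods.

    Assumption: n_vectors=4 is an illustrative constant, not a fixed property
    of any particular implementation.
    """
    return n_vectors * n_weights * BYTES_PER_FLOAT64


for n in (1_000, 100_000, 1_000_000):
    print(
        f"{n:>9} weights: "
        f"quadratic ~ {quadratic_memory_bytes(n) / 1e6:,.0f} MB, "
        f"linear ~ {linear_memory_bytes(n) / 1e6:,.3f} MB"
    )
```

Under these assumptions, 1,000 weights need about 8 MB for the dense matrix, while 1,000,000 weights would need about 8 TB, which is why methods with memory linear in the number of weights are preferred for large networks.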