Lecture: Automatic Step-Size Control for Minimization Iterations
William Morton Kahan
Abstract:
The "Training" of "Deep Learning" for "Artificial Intelligence" is a process that minimizes a "Loss Function" ƒ(w) subject to memory constraints that allow the computation of ƒ(w) and its Gradients G(w) := dƒ(w)/dw` but not the Hessian
d2ƒ(w)/dw2 nor estimates of it from many stored pairs {G(w), w}. Therefore the process is iterative using "Gradient Descent" or an accelerated modification of it like "Gradient Descent Plus Momentum". These iterations require choices of one or two scalar "Hyper-Parameters" which cause divergence if chosen badly. Fastest convergence requires choices derived from the Hessian's two attributes, its "Norm" and "Condition Number", that can almost never be known in advance. This retards Training, severely if the Condition Number is big. A new scheme chooses Gradient Descent's Hyper-Parameter, a step-size called "the Learning Rate", automatically without any prior information about the Hessian; and yet that scheme has been observed always to converge ultimately almost as fast as could any acceleration of Gradient Descent with optimally chosen Hyper-Parameters. Alas, a mathematical proof of that scheme's efficacy has not been found yet.
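
For concreteness, below is a minimal numerical sketch, not Kahan's automatic scheme: plain Gradient Descent and Gradient Descent Plus Momentum applied to a small quadratic Loss Function whose Hessian is known, so that the classical optimal Hyper-Parameters can be written down explicitly. The Hessian H, the step counts, and the starting point are illustrative assumptions; only the Hyper-Parameter formulas are standard textbook facts for quadratics.

# Illustrative sketch only: Gradient Descent and Gradient Descent Plus Momentum
# on a quadratic Loss Function f(w) = 0.5 * w.T @ H @ w with a known Hessian H.
# The matrix H, step counts, and starting point are assumptions for illustration;
# this is NOT the lecture's automatic step-size scheme.
import numpy as np

H = np.diag([1.0, 10.0, 100.0])      # Hessian of the quadratic test loss
lam_min, lam_max = 1.0, 100.0        # extreme eigenvalues; Condition Number = 100
grad = lambda w: H @ w               # Gradient G(w) = df(w)/dw for this f

def gradient_descent(w, eta, steps=200):
    # eta is the step-size Hyper-Parameter, "the Learning Rate"
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

def gradient_descent_momentum(w, eta, beta, steps=200):
    # heavy-ball acceleration: beta is the second scalar Hyper-Parameter
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v - eta * grad(w)
        w = w + v
    return w

w0 = np.ones(3)
# Classical theory for quadratics: plain Gradient Descent converges fastest with
# eta = 2/(lam_min + lam_max) and diverges once eta exceeds 2/lam_max; the
# optimally tuned momentum method uses eta = 4/(sqrt(lam_max)+sqrt(lam_min))^2
# and beta = ((sqrt(kappa)-1)/(sqrt(kappa)+1))^2, kappa the Condition Number.
eta_gd = 2.0 / (lam_min + lam_max)
eta_mom = 4.0 / (np.sqrt(lam_max) + np.sqrt(lam_min)) ** 2
beta_mom = ((np.sqrt(lam_max / lam_min) - 1) / (np.sqrt(lam_max / lam_min) + 1)) ** 2

print(np.linalg.norm(gradient_descent(w0, eta_gd)))                 # near 0
print(np.linalg.norm(gradient_descent_momentum(w0, eta_mom, beta_mom)))  # near 0

Both runs drive w toward the minimizer w = 0, but the optimal eta and beta above were computed from the Hessian's extreme eigenvalues, which is exactly the prior information the abstract says is almost never available in practice.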
Details, a continually evolving work in progress, are posted at:
https://people.eecs.berkeley.edu/~wkahan/STEPSIZE.pdf