Karl Welzel, University of Oxford
Optimization algorithms which aim to find a set of inputs that minimize a given function are incredibly useful. In deep learning, most algorithms restrict themselves to inspecting only evaluations of the first derivative. The classical optimization literature has also investigated second-order methods at length. But what happens when we allow algorithms to make use of derivatives all the way up to pth-order (for p > 2)? Extending existing results in the literature, I show that close to the solution the order of convergence increases for larger p and that on certain problems, knowledge of higher derivatives is necessary to achieve superfast convergence. I will talk, in particular, about adaptive regularization methods (which do not require knowledge of a global Lipschitz constant of the objective function) and the difficulties they face with non-convex local models.