BFGS in a Nutshell: An Introduction to Quasi-Newton Methods


Demystifying the inner workings of BFGS optimization

Surely anyone who has dabbled in machine learning is familiar with gradient descent, and possibly even its close counterpart, stochastic gradient descent. If you have more than dabbled, then you’re surely also aware of the fancier extensions like gradient descent with momentum and Adam optimization.

Perhaps less well-known are a class of optimization algorithms known as quasi-Newton methods. Though these optimization methods are less fervently advertised in popular accounts of machine learning, they hold an important place in the arsenal of machine learning practitioners.

The goal of this article is to provide an introduction to the mathematical formulation of BFGS optimization, by far the most widely used quasi-Newton method. As such, the focus will be on the mathematical derivation of results, rather than the application of BFGS in code. It is my hope that by the end of this article, you will have gained an appreciation for what BFGS is, how it works, and why it was developed. A top priority of mine when crafting this article was to make it as accessible as possible. To this end, any non-trivial derivation will be explicitly shown, and the dreaded phrase “it is straightforward to show” will never make an appearance.

Instead of jumping right into quasi-Newton methods and BFGS, my strategy is to start off by doing a run-through of a few of the more basic optimization methods first, and explore their deficiencies. This would then provide a natural segue to quasi-Newton methods and how they aim to address these deficiencies. This might seem like a long-winded way of getting to the main topic, but I believe it is well-justified if we are to truly appreciate the motivation behind developing BFGS, and to get a sense of where it stands within the landscape of optimization methods.

So without further ado, let’s start from the very beginning, with gradient descent.

We begin with a lightning quick review of gradient descent, which is an iterative method for finding a local minimum of a real-valued, differentiable objective function f(x).

To find a local minimum, we start off at a random initial point and iteratively take steps proportional to the negative gradient of the function f at the current point. Since the gradient of a function points in the direction of steepest ascent, the negative gradient points in the direction of steepest descent, and thus at each step of gradient descent, we are moving in the direction where f(x) decreases the fastest. Symbolically, an iteration of gradient descent is written as

$$x_{k+1} = x_k - \alpha\, \nabla f(x_k) \tag{1}$$

where 𝛼 is a positive real number known as the learning rate, which controls the step size taken at each iteration. Too large of a learning rate and our model diverges out of control; too small of a learning rate and our model becomes highly inefficient, taking unnecessarily long to converge to the minimum. There is no a priori way of choosing a good learning rate that lies in the sweet spot between these two extremes, and in practice the optimal learning rate is typically found empirically, by looping through various values and seeing how they perform.

Looking at equation (1), we see that gradient descent is a first-order optimization method, as it uses first-order information (i.e. the gradient) to find the minimum. While this often reliably gets the job done, its main disadvantage lies in the fact that it is quite inefficient, even for a suitably chosen learning rate. By approximating our objective function linearly at each step, we are only working with very limited local information, and we thus have to be cautious and restrict ourselves to small step sizes at each iteration.
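
To make the update in equation (1) concrete, here is a minimal sketch of gradient descent in Python. The quadratic test function, its hand-coded gradient, and the learning rate are illustrative assumptions, not choices made in the article:

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.05, tol=1e-8, max_iter=10_000):
    """Minimize a function given its gradient, using fixed-step gradient descent."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stop once the gradient is (almost) zero
            break
        x = x - lr * g                # equation (1): step along the negative gradient
    return x, k

# Illustrative example: f(x, y) = x^2 + 10*y^2, with gradient (2x, 20y).
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
print(gradient_descent(grad, x0=[-5.0, 5.0]))
```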

Perhaps we can do better, by obtaining more local information of the objective function at each iteration, in hopes that we can make more well-informed steps. A natural extension would be to look at the second-order behavior of the objective function.

For simplicity, let’s first consider a function f of one variable. Consulting Taylor, we know that the second-order approximation of f at the point xᴋ + ϵ, expanded about xᴋ, is given by

$$f(x_k + \epsilon) \approx f(x_k) + f'(x_k)\,\epsilon + \tfrac{1}{2} f''(x_k)\,\epsilon^2$$

Our Taylor approximation is minimized when

$$\frac{d}{d\epsilon}\left[ f(x_k) + f'(x_k)\,\epsilon + \tfrac{1}{2} f''(x_k)\,\epsilon^2 \right] = f'(x_k) + f''(x_k)\,\epsilon = 0$$

which corresponds to a step size of

$$\epsilon = -\frac{f'(x_k)}{f''(x_k)}$$

We can thus try a new iteration scheme

$$x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}$$

which also takes into account the second-order behavior of the objective function.

Generalizing to n dimensions, the first derivative is replaced by the gradient ∇f and the second derivative is replaced by the Hessian H.

$$\nabla f = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{bmatrix},
\qquad
H = \begin{bmatrix}
\dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}$$

In n dimensions, our new iterative scheme is thus written as:

$$x_{k+1} = x_k - H_k^{-1}\, \nabla f(x_k) \tag{2}$$

This method of optimization, where we take into account the objective function’s second-order behavior in addition to its first-order behavior, is known as Newton’s method. At each iteration k, Newton’s method approximates the function f at the point xᴋ with a paraboloid, and then proceeds to minimize that approximation by stepping to the minimum of that paraboloid (there is actually a caveat here that we will get to later). Notice that in contrast to regular gradient descent, there is no longer a need to set a learning rate parameter, because now our step size is determined exactly by the distance to the minimum of the fitted paraboloid at that point.
To see how much we stand to gain from computing the Hessian, consider figure 1 below where we want to minimize our loss function, starting from the point (-5, 5). For a suitably chosen learning rate, gradient descent takes 229 steps to converge to the minimum. On the other hand, Newton’s method converges to the minimum in only six steps!

[Figure 1: Comparison of the paths taken by gradient descent (229 steps) and Newton’s method (6 steps) when minimizing the loss function, starting from the point (-5, 5).]
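
To connect equation (2) with code, here is a minimal sketch of Newton’s method in Python; the quadratic test function and its hand-coded gradient and Hessian are illustrative assumptions (they are not the function used to produce figure 1):

```python
import numpy as np

def newtons_method(grad, hess, x0, tol=1e-8, max_iter=100):
    """Minimize a twice-differentiable function with full Newton steps, equation (2)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve H @ step = g instead of explicitly forming H^{-1}.
        x = x - np.linalg.solve(hess(x), g)
    return x, k

# Illustrative example: f(x, y) = x^2 + 10*y^2, gradient (2x, 20y), constant Hessian diag(2, 20).
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
hess = lambda x: np.diag([2.0, 20.0])
print(newtons_method(grad, hess, x0=[-5.0, 5.0]))  # a quadratic is minimized in a single Newton step
```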

If this is your first time learning about Newton’s method, you’re probably just as shocked as I was when I first encountered it. “Surely this is too good to be true!” you might exclaim. Why ever use gradient descent if Newton’s method converges to the minimum so much faster?

It turns out that the benefits we gain from Newton’s method come at a cost. There are two main issues with Newton’s method:

  1. It is sensitive to initial conditions. This is especially apparent if our objective function is non-convex. Unlike gradient descent, which ensures that we’re always going downhill by moving in the direction opposite the gradient, Newton’s method fits a paraboloid to the local curvature and proceeds to move to the stationary point of that paraboloid. Depending on the local behavior at our initial point, Newton’s method could take us to a maximum or a saddle point instead of a minimum. This problem is exacerbated in higher-dimensional non-convex functions, where saddle points are much more likely to occur than minima. We thus see that Newton’s method is really only appropriate for minimizing convex objective functions. The silver lining here, though, is that a considerable number of optimization problems in machine learning are convex, such as linear (and ridge) regression, logistic regression, and support vector machines. An obvious example of a non-convex model would be neural networks. For the rest of this article, we will restrict our attention to convex objective functions. Correspondingly, we require the Hessian to be positive-definite.
  2. Newton’s method is very computationally expensive. While the computation of the gradient scales as O(n), the computation of the inverse Hessian scales as O(n³) (computing the Hessian scales as O(n²), and inverting it scales as O(n³)). As the dimension of our problem increases, the overhead in memory and time gets out of hand very quickly. For example, in 50 dimensions, we’ll have to calculate 50(50+1)/2 = 1275 values for the Hessian at each step, and then perform approximately another 50³ operations to invert it. It’s clear at this point that the benefit of an increased convergence rate will be far outweighed by the large cost of the additional computation time.

Despite its limited practical use, Newton’s method is still of great theoretical value. We did see how efficient second-order optimization can be when used appropriately. What if we could somehow leverage the efficiency gained from considering second-order behavior, but avoid the computational cost of calculating the inverse Hessian? In other words, can we get a sort of hybrid between gradient descent and Newton’s method, where we have faster convergence than gradient descent, but a lower cost per iteration than Newton’s method?

It turns out there’s a class of optimization methods, called quasi-Newton methods, that does just that.

We went through Newton’s method for optimization, which, in contrast to vanilla gradient descent, leverages second-order behavior in addition to first-order behavior at each step, making for a much faster convergence to the minimum. However, we also saw the downsides to Newton’s method — one of which is how computationally expensive it is to both calculate the Hessian and to invert it, especially when dimensions get large. Quasi-Newton methods are a class of optimization methods that attempt to address this issue.

Recall that in Newton’s method, we make the following update at each iteration:

$$x_{k+1} = x_k - H_k^{-1}\, \nabla f(x_k)$$

where the Hessian is computed and inverted at each step. In quasi-Newton methods, instead of computing the actual Hessian, we just approximate it with a positive-definite matrix B, which is updated from iteration to iteration using information computed from previous steps (we require B to be positive-definite because we are optimizing a convex function, and this automatically takes care of the symmetry requirement of the Hessian). We immediately see that this scheme would yield a much less costly algorithm compared to Newton’s method, because instead of computing a large number of new quantities at each iteration, we’re largely making use of previously computed quantities.

At this stage the nature of the update scheme for B has been left vague, because the specific update for B depends on the particular quasi-Newton method used. There is, however, one condition that all quasi-Newton methods share: the Hessian approximation B must satisfy the quasi-Newton condition (or secant equation):

$$B_{k+1}\,\left( x_{k+1} - x_k \right) = \nabla f(x_{k+1}) - \nabla f(x_k) \tag{3}$$

which is obtained from the first-order Taylor expansion of ∇f(xᴋ₊₁) about xᴋ (we can also view this as a sort of finite-difference equation for the gradient itself). We can rewrite the quasi-Newton condition more succinctly by letting yᴋ = ∇f(xᴋ₊₁) − ∇f(xᴋ) and Δxᴋ = xᴋ₊₁ − xᴋ, so that we have

$$B_{k+1}\,\Delta x_k = y_k \tag{4}$$

Furthermore, pre-multiplying the quasi-Newton condition by Δxᵀ gives ΔxᵀBᴋ₊₁Δx = Δxᵀy, and positive-definiteness of Bᴋ₊₁ requires the left-hand side to be positive for any non-zero Δx. Our requirement for positive-definiteness (i.e. the curvature condition) can thus be expressed as Δxᵀy > 0.

Before we go further, however, let’s take a step back and consider only one dimension, where our intuition is strong. The secant equation, in one dimension, is

$$f''_{k+1}\,\left( x_{k+1} - x_k \right) = f'(x_{k+1}) - f'(x_k)$$

and the curvature condition is satisfied by requiring (f′(xᴋ₊₁) − f′(xᴋ))/(xᴋ₊₁ − xᴋ) > 0. Solving the secant equation for f″ and substituting into Newton’s method in one dimension, we get

$$x_{k+1} = x_k - f'(x_k)\, \frac{x_k - x_{k-1}}{f'(x_k) - f'(x_{k-1})}$$

We thus have here an optimization method that leverages the (approximate) second-order behavior of the objective function in order to converge faster, without actually taking any second derivatives. Instead, at each iteration k+1, we construct an approximate inverse second derivative using only quantities from previous steps — in this case the first derivative from the previous two iterations k and k−1.

This method of approximating Newton’s method in one dimension by replacing the second derivative with its finite-difference approximation is known as the secant method, a one-dimensional special case of the quasi-Newton methods.
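
Here is a minimal sketch of this one-dimensional secant method in Python; the test function and the two starting points are illustrative assumptions:

```python
def secant_minimize(fprime, x0, x1, tol=1e-10, max_iter=100):
    """Minimize a 1-D function by applying the secant update to its first derivative."""
    x_prev, x = x0, x1
    for _ in range(max_iter):
        g_prev, g = fprime(x_prev), fprime(x)
        if abs(g) < tol or g == g_prev:
            break
        # Replace f'' with its finite-difference approximation (the secant equation).
        x_prev, x = x, x - g * (x - x_prev) / (g - g_prev)
    return x

# Illustrative example: f(x) = (x - 2)^2 + 1, so f'(x) = 2*(x - 2) and the minimum is at x = 2.
print(secant_minimize(lambda x: 2.0 * (x - 2.0), x0=0.0, x1=1.0))
```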

Now back to our n-dimensional quasi-Newton condition (equation 4). At first glance, it looks like we could just work analogously to our one-dimensional case: we can simply solve for Bᴋ₊₁ directly, and substitute it into Newton’s iterative step (2). Job done, right?

Not quite, actually. Despite looking deceptively similar to our one-dimensional case, remember that B is in general a symmetric n × n matrix with n(n+1)/2 independent components, whereas our equation only has n components. This means that we’re trying to solve for n(n+1)/2 unknowns with only n equations, making this an underdetermined system. In fact, we were only able to find a unique solution to the secant equation in one dimension because the number of unknown components of the Hessian, 1(1+1)/2 = 1, coincides with the one equation that we have. In general, there are n(n+1)/2 − n = n(n−1)/2 unknowns that we cannot solve for.

This is where quasi-Newton methods come in, where the secant method is generalized to multidimensional objective functions. Instead of approximating the second derivative merely by using the finite difference like in the secant method, quasi-Newton methods have to impose additional constraints. The common theme still runs through though — at each iteration k+1, the new Hessian approximation Bᴋ₊₁ is obtained using only previous gradient information.

Various quasi-Newton methods have been developed over the years, and they differ in how the approximate Hessian B is updated at each iteration. As of today, the most widely used quasi-Newton method is the BFGS method, and this will be our focus for the remainder of this article.

It’s been somewhat of a long trek so far, so let’s pause for a moment and do a quick recap before moving on. Our objective is to find the minimum of a (twice-differentiable) convex function. A simple approach to this is gradient descent: starting from some initial point, we slowly move downhill by taking iterative steps proportional to the negative gradient of the function at each point. Newton then taught us that we can take far fewer steps and converge much more quickly to the minimum if we also consider the second-order behavior of the function, by computing the (inverse) Hessian at each step. This comes at a cost, however, as calculating (and inverting) the Hessian takes a lot of resources. A compromise is to instead just approximate the Hessian at each step, subject to the quasi-Newton condition. In one dimension, this amounts to replacing the second derivative with a finite-difference approximation; this method of optimization is called the secant method. In more than one dimension, the quasi-Newton condition does not uniquely specify the Hessian estimate B, and we need to impose further constraints on B to solve for it. Different quasi-Newton methods offer their own way of constraining the solution. Here, we will focus on one of the most popular methods, known as the BFGS method. The name is an acronym of the algorithm’s creators: Broyden, Fletcher, Goldfarb, and Shanno, who each came up with the algorithm independently in 1970 [7–10].


We saw previously that in n>1 dimensions, the quasi-Newton condition (4) is underdetermined. To determine an update scheme for B, then, we will need to impose additional constraints. One such constraint that we’ve already mentioned is the symmetry and positive-definiteness of B: these properties should be preserved by each update. Another desirable property is for Bᴋ₊₁ to be sufficiently close to Bᴋ at each update k+1. A formal way to characterize the notion of “closeness” for matrices is the matrix norm. Thus, we should look to minimize the quantity ||Bᴋ₊₁−Bᴋ||. Putting all our conditions together, we can formulate our problem as

$$\begin{aligned}
\min_{B_{k+1}}\; & \lVert B_{k+1} - B_k \rVert \\
\text{subject to}\; & B_{k+1} = B_{k+1}^T, \quad B_{k+1}\,\Delta x_k = y_k
\end{aligned} \tag{5}$$

Referring back to Newton’s method, we recall that it’s actually the Hessian’s inverse, instead of the Hessian itself, that makes an appearance. So instead of computing the approximate Hessian B, we can just as well compute the approximate inverse B⁻¹ directly. We can thus alternatively formulate (5) as

$$\begin{aligned}
\min_{B_{k+1}^{-1}}\; & \lVert B_{k+1}^{-1} - B_k^{-1} \rVert \\
\text{subject to}\; & B_{k+1}^{-1} = \left( B_{k+1}^{-1} \right)^T, \quad B_{k+1}^{-1}\, y_k = \Delta x_k
\end{aligned} \tag{6}$$

(Notice the inverted quasi-Newton condition.) To repeat, we minimize the change in B⁻¹ at each iteration, subject to the (inverted) quasi-Newton condition and the requirement that it be symmetric. We have also previously mentioned many times the requirement that B (and by extension B⁻¹) be positive-definite; it turns out this property comes along for the ride when we solve this problem. We will show this eventually; in the meantime, I’ll have to ask you to hold your breath.

To solve for Bᴋ₊₁, we still have to specify the particular matrix norm we’re using in (6). Different choices of norms result in different quasi-Newton methods, characterized by a different resulting update scheme. In the BFGS method, the norm is chosen to be the Frobenius norm:

$$\lVert A \rVert_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} \left| a_{ij} \right|^2}$$

which is just the square root of the sum of the squared absolute values of the matrix elements.
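
As a quick sanity check of this definition, here is a trivial NumPy illustration with an arbitrarily chosen matrix:

```python
import numpy as np

A = np.array([[1.0, -2.0], [3.0, 4.0]])
by_hand = np.sqrt((np.abs(A) ** 2).sum())        # square root of the sum of squared entries
print(by_hand, np.linalg.norm(A, ord="fro"))     # both print the same Frobenius norm
```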

Solving (6) from scratch is no easy feat, and we will not go through it here. For the highly mathematically inclined reader, refer to references [4-6] for detailed derivations. For our purposes, it suffices to know that the problem ends up being equivalent to updating our approximate Hessian at each iteration by adding two symmetric, rank-one matrices U and V:

$$B_{k+1} = B_k + U_k + V_k \tag{7}$$

To fulfill the aforementioned conditions, the update matrices can then be chosen to be of the form U = a u uᵀ and V = b v vᵀ, where u and v are linearly independent non-zero vectors, and a and b are constants. The outer product of any two non-zero vectors is always rank one, and since U and V are both the outer product of a vector with itself, they are also symmetric, taking care of the requirement that the Hessian remains symmetric upon updates. We then have

$$B_{k+1} = B_k + a\, u u^T + b\, v v^T$$

Since Uᴋ and Vᴋ are rank-one (and u and v are linearly independent), their sum is rank-two, and an update of this form is known as a rank-two update. Because Uᴋ and Vᴋ are also symmetric, if we start out with a symmetric matrix B₀, then Bᴋ remains symmetric for all k. The low rank of the correction is what keeps Bᴋ₊₁ “close” to Bᴋ at each iteration.
Furthermore, we have to impose the quasi-Newton condition Bᴋ₊₁ Δx = y:

$$B_{k+1}\,\Delta x = B_k\,\Delta x + a\,\left( u^T \Delta x \right) u + b\,\left( v^T \Delta x \right) v = y$$

At this point, a natural choice for u and v would be u = y and v = Bᴋ Δx. We then have

$$a = \frac{1}{y^T \Delta x}, \qquad b = -\frac{1}{\Delta x^T B_k\, \Delta x}$$

Substituting a and b back into (7), we have the BFGS update in all its glory:

$$B_{k+1} = B_k + \frac{y\, y^T}{y^T \Delta x} - \frac{B_k\, \Delta x\, \Delta x^T B_k}{\Delta x^T B_k\, \Delta x} \tag{8}$$
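
As a sketch, the rank-two update (8) translates almost line by line into NumPy (the function and variable names are mine, not from the article):

```python
import numpy as np

def bfgs_update(B, dx, y):
    """One BFGS rank-two update of the Hessian approximation B, following equation (8)."""
    Bdx = B @ dx
    return (B
            + np.outer(y, y) / (y @ dx)          # + y y^T / (y^T dx)
            - np.outer(Bdx, Bdx) / (dx @ Bdx))   # - (B dx)(B dx)^T / (dx^T B dx)
```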

As advertised, we are updating the approximate Hessian at each iteration using only previous gradient information. Note, however, that in practice we actually want to compute B⁻¹ directly, because it’s the inverse Hessian that appears in a Newton update. To invert (8), we make use of the Woodbury formula, which tells us how to invert the sum of an invertible matrix A and a rank-k correction:

$$\left( A + UCV \right)^{-1} = A^{-1} - A^{-1} U \left( C^{-1} + V A^{-1} U \right)^{-1} V A^{-1}$$
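
The Woodbury identity is easy to spot-check numerically; here is a small sketch with arbitrarily chosen random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # an (almost surely) invertible matrix
U = rng.standard_normal((n, k))
C = np.eye(k)
V = rng.standard_normal((k, n))

A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + U @ C @ V)
rhs = A_inv - A_inv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U) @ V @ A_inv
print(np.allclose(lhs, rhs))                         # True, up to floating-point error
```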

We can thus obtain the formula for a BFGS update for the inverse of B. We first rewrite the update in a more suitable form for applying the Woodbury formula (to avoid clutter, we’ll suppress the subscripts k+1 and k, instead denoting the update as B₊):

$$B_+ = B + \begin{bmatrix} y & B\,\Delta x \end{bmatrix}
\begin{bmatrix} \frac{1}{y^T \Delta x} & 0 \\ 0 & -\frac{1}{\Delta x^T B\, \Delta x} \end{bmatrix}
\begin{bmatrix} y^T \\ \Delta x^T B \end{bmatrix}$$

Plugging into the Woodbury formula,

$$B_{k+1}^{-1} = \left( I - \frac{\Delta x\, y^T}{y^T \Delta x} \right) B_k^{-1} \left( I - \frac{y\, \Delta x^T}{y^T \Delta x} \right) + \frac{\Delta x\, \Delta x^T}{y^T \Delta x} \tag{9}$$

So while equation (8) is cleaner and allows for more straightforward analysis, equation (9) is what we actually want to compute in practice. Again, each update requires only previous gradient information. At each iteration, we update the value of B⁻¹ using only Δxᴋ = xᴋ₊₁ − xᴋ and yᴋ = ∇f(xᴋ₊₁) − ∇f(xᴋ) from the previous step, in accordance with equation (9). By directly estimating the inverse Hessian at each step, we completely do away with the laborious O(n³) operations of inverting the Hessian that Newton’s method requires.
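
Putting the pieces together, here is a compact sketch of a BFGS loop built around the inverse update (9). It is a bare-bones illustration, not the reference algorithm: it uses B₀⁻¹ = I, a crude backtracking line search, and skips the update whenever the curvature condition fails, whereas production implementations typically use a line search satisfying the Wolfe conditions. The convex quadratic test function is again an arbitrary choice:

```python
import numpy as np

def bfgs_inverse_update(B_inv, dx, y):
    """Update the inverse Hessian approximation according to equation (9)."""
    rho = 1.0 / (y @ dx)
    I = np.eye(len(dx))
    return ((I - rho * np.outer(dx, y)) @ B_inv @ (I - rho * np.outer(y, dx))
            + rho * np.outer(dx, dx))

def bfgs_minimize(f, grad, x0, tol=1e-8, max_iter=500):
    """Minimal BFGS loop: identity initialization plus a crude backtracking line search."""
    x = np.asarray(x0, dtype=float)
    B_inv = np.eye(len(x))                   # B_0^{-1} = I, so the first step is gradient descent
    g = grad(x)
    for k in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -B_inv @ g                       # quasi-Newton search direction
        t = 1.0
        for _ in range(30):                  # backtrack until sufficient (Armijo) decrease
            if f(x + t * p) <= f(x) + 1e-4 * t * (g @ p):
                break
            t *= 0.5
        dx = t * p
        g_new = grad(x + dx)
        y = g_new - g
        if y @ dx > 1e-12:                   # only update when the curvature condition holds
            B_inv = bfgs_inverse_update(B_inv, dx, y)
        x, g = x + dx, g_new
    return x, k

# Illustrative example: the convex quadratic f(x, y) = x^2 + 10*y^2, starting from (-5, 5).
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
print(bfgs_minimize(f, grad, x0=[-5.0, 5.0]))
```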

We are now also in a position to show that our rank-two update preserves positive-definiteness. From (9), we can calculate the quantity

$$z^T B_{k+1}^{-1}\, z = w^T B_k^{-1}\, w + \frac{\left( \Delta x^T z \right)^2}{y^T \Delta x}, \qquad \text{where } w = z - \frac{\Delta x^T z}{y^T \Delta x}\, y$$

for a non-zero vector z. If Bᴋ⁻¹ is positive-definite and the curvature condition Δxᵀy > 0 holds, then both terms are non-negative, and the second term is zero only when Δxᵀz = 0; but in that case w = z ≠ 0, so the first term is strictly positive. Positive-definiteness is thus preserved by BFGS’s rank-two update. If we were to go back to (7) and repeat the development with a rank-one update instead, we would not end up with an update that preserves positive-definiteness [4, 5]. This explains why BFGS uses a rank-two update: it is the lowest-rank update that preserves positive-definiteness.
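
A quick numerical illustration of this property (the random vectors are arbitrary; the small perturbation just ensures the curvature condition Δxᵀy > 0 holds):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
B_inv = np.eye(n)                          # a positive-definite starting approximation
dx = rng.standard_normal(n)
y = dx + 0.1 * rng.standard_normal(n)      # close to dx, so the curvature condition holds
assert y @ dx > 0

rho = 1.0 / (y @ dx)
I = np.eye(n)
B_inv_new = ((I - rho * np.outer(dx, y)) @ B_inv @ (I - rho * np.outer(y, dx))
             + rho * np.outer(dx, dx))
print(np.linalg.eigvalsh(B_inv_new))       # all eigenvalues stay strictly positive
```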

One final implementation detail that we previously glossed over: since an update of B⁻¹ depends on its previous value, we have to initialize B₀⁻¹ to kick off the algorithm. There are two natural ways to do this. The first approach is to set B₀⁻¹ to the identity matrix, in which case the first step will be equivalent to vanilla gradient descent, and subsequent updates will incrementally refine it to get closer to the inverse Hessian. Another approach would be to compute and invert the true Hessian at the initial point. This would start the algorithm off with more efficient steps, but comes at an initial cost of computing the true Hessian and inverting it.

Being an introductory piece, the aim of this discussion was to present quasi-Newton methods and BFGS in a manner that is as accessible as possible. As such, we’ve really only just skimmed the surface of all the mathematical intricacies and performance considerations that underlie BFGS and other quasi-Newton methods. There is a wealth of resources out there on this subject, a very small selection of which I have included in the references below. I encourage readers who are interested in diving deeper to seek out these resources. Be forewarned though — some of these references aren’t easy!

Thank you for making it all the way to the end! Hopefully you’ve gained an appreciation for how BFGS works under the hood, so that next time you call scipy.optimize.minimize(method='BFGS'), instead of viewing it as a complete black box, you’ll smile to yourself knowingly, fully aware that BFGS is simply iteratively making positive-definite-preserving rank-two updates to the loss function’s approximate inverse Hessian, to efficiently find you its minimum.
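
For reference, that call looks roughly like this in practice; the convex quadratic and its gradient are the same illustrative choices used in the sketches above:

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2            # an arbitrary convex quadratic
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])

result = minimize(f, x0=np.array([-5.0, 5.0]), jac=grad, method="BFGS")
print(result.x)         # approximate minimizer, close to [0, 0]
print(result.hess_inv)  # the final BFGS approximation of the inverse Hessian
```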

Cursory run-throughs:
[1] BFGS on Wikipedia
[2] Quasi-Newton methods on Wikipedia
[3] Newton’s method on Wikipedia

Advanced references on BFGS:
[4] J. Nocedal and S. Wright. Numerical Optimization (Chapter 6). Springer, 2nd edition, 2006.
[5] Oxford University lecture notes by R. Hauser
[6] MIT 18.335 lecture notes by S. G. Johnson

The original papers:
[7] C. G. Broyden, “The convergence of a class of double-rank minimization algorithms”, J. Inst. Math. Appl. 6, 76–90 (1970)
[8] R. Fletcher, “A new approach to variable metric algorithms”, Comp. J. 13, 317–322 (1970)
[9] D. Goldfarb, “A family of variable-metric methods derived by variational means”, Math. Comp. 24, 23–26 (1970)
[10] D. F. Shanno, “Conditioning of quasi-Newton methods for function minimization”, Math. Comp. 24, 647–656 (1970)
