Introduction
========================================================================
The downhill simplex, introduced by [#Nelder65], is a method for solving
multidimensional, unconstrained, numerical optimization problems.
Such problems are among the most common in engineering and
science, and this is one of the simplest, yet often fairly effective,
methods used to solve them.
$\newcommand{\xfat}{\textbf{x}}$
$\newcommand{\cfat}{\textbf{c}}$
$\newcommand{\sfat}{\textbf{s}}$
$\newcommand{\Real}{\mathbb{R}}$
Such problems can be said to be the search for a set of $N$ parameters,
$\xfat$, such that a mathematical function $f(\xfat)$, called the _objective
function_, takes on either a maximum or a minimum value. Formally this can be described as:
\begin{equation}
\underset{\xfat \in \Real^N}{\text{arg min}} ~ f(\xfat),
\qquad f: \Real^N \to \Real
\end{equation}
More intuitively, we can look at a function plot such as the one in Figure
[1] and
say that we would like to find the two grid coordinates that put us at
the lowest point on the surface, in this case located at the center.
![Figure [1]: Optimization problems are concerned with finding global extreme values of
functions such as the minimum at the center of this surface.](hatsimp_cropped.svg width=100%)
The parameters may, for example, represent a model that is being fit to data,
such as a machine learning model, the rigid transform in an image registration problem, or
the parameterized shape of some commercial product for which material use and mechanical
stress should be balanced.
!!!
**The method is simple yet it took a long time before I understood it!**
Like many a dear child, this method has many names, which is confusing in
itself. Even more so since _downhill simplex_ closely resembles the name of
the _simplex method_, a different optimization method used exclusively to solve
so-called _linear programs_: a special but common family of constrained optimization
problems with a linear objective function on a convex domain defined by
linear constraints.
Second, it simply was not on the curriculum of the optimization course I
took a long time ago, which prioritized linear programs, least squares methods and gradient
based methods. Nor had I seen any explanation of it in any of the books
on numerical optimization and adjacent subjects that I have skimmed over the years!
And for some reason I always found the animations and illustrations of it
sometimes seen online confusing as they simply paint over the old simplex
as they go, leaving a mess of lines behind!
Good explanations of the method, which inspired both this post and my own successful
use of it for volumetric image registration, are given by
[#Joglekar16, #Nelder09].
A very compact and easy to implement C++ version can be found in [#NumericalRecipies].
It is also a default function minimizer in the computational library [#SciPy].
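As a quick illustration of that interface, the following minimal sketch calls SciPy's
Nelder-Mead implementation through `scipy.optimize.minimize`; the quadratic objective and
the starting point are purely illustrative stand-ins, not taken from this article's code.
```python
import numpy as np
from scipy.optimize import minimize

def objective(x):
    # Simple quadratic bowl with its minimum at (1, -2); purely illustrative.
    return np.sum((x - np.array([1.0, -2.0]))**2)

result = minimize(objective, x0=np.zeros(2), method='Nelder-Mead')
print(result.x)  # approximately [ 1. -2.]
```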
Method and simplex transforms
========================================================================
First, a little formalism: what is a _simplex_?
A simplex is a collection of connected points in $\Real^N$.
We call it a $k$-simplex, where the degree $k$ indicates
how many points it contains: a $k$-simplex has $k+1$ points, so _e.g._ a
triangle with its 3 points is a 2-simplex, a tetrahedron with its 4 points is
a 3-simplex, and so on.
See [#Crane19] if you are further interested in discrete geometry,
because this is all we need here.
When optimizing a function with $N$ parameters we will use a simplex with
$N+1$ points. Meaning the degree $k$ of the simplex is the same as the dimension
of the parameter space. So for a problem with 2 parameters, $N=2$,
we use a 2-simplex, a triangle, that has three points.
A 1 parameter problem would use a line segment in 1D and a 3 parameter problem
would use a tetrahedron in 3D.
Each point of the simplex, denoted $\xfat_i, ~ 0 \leq i \leq N$,
corresponds to a complete set of parameters and therefore to a
value of the objective function $f_i = f(\xfat_i)$.
The core of the method is then this.
We keep the points sorted so that $f(\xfat_0)$ is the current lowest value,
_i.e._ the best point, and $\xfat_N$ is the worst.
We then calculate the center of mass of all points except $\xfat_N$:
\begin{equation}
\cfat = \frac{1}{N} \sum^{N-1}_{i=0} \xfat_i
\end{equation}
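As a concrete sketch of this bookkeeping, assuming the simplex is stored as an
`(N+1, N)` NumPy array and that `f` is the objective function; the function and
variable names are illustrative, not taken from the code linked at the end of this post.
```python
import numpy as np

def sort_and_centroid(simplex, f):
    """Sort the simplex points from best (lowest f) to worst and compute the
    center of mass of all points except the worst one."""
    values = np.array([f(x) for x in simplex])
    order = np.argsort(values)
    simplex, values = simplex[order], values[order]
    centroid = simplex[:-1].mean(axis=0)   # excludes the worst point, simplex[-1]
    return simplex, values, centroid
```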
Afterwards we attempt to replace the point $\xfat_N$ with one along the line formed
by $\overrightarrow{\xfat_N \cfat}$ according to a set of transforms
called _reflection_, _expansion_ and _contraction_.
A _shrink_ transform that moves all points closer to the current best
can also be performed.
We associate each transform with a parameter, $\alpha$ for reflection,
$\beta$ for contraction, $\gamma$ for expansion and $\delta$ for shrink, that
should satisfy the following constraints:
+ $\alpha > 0$
+ $0 < \beta < 1$
+ $\gamma > 1, \quad \gamma > \alpha$
+ $0 < \delta < 1$
These standard values are used for the examples in this article:
+ $\alpha = 1$
+ $\beta = \frac{1}{2}$
+ $\gamma = 2$
+ $\delta = \frac{1}{2}$
The following illustrations are of a 2D problem with a triangle simplex.
The point $\xfat_0$, blue, is the current best, $\xfat_1$, green, the second best,
and $\xfat_2$, red, is the one being transformed, except when the simplex
performs a shrink, in which case all points but the best are moved.
The blue lines indicate the current simplex and the red lines the new one.
Reflection
-------------------------------------------------------------------------------
In every iteration the reflected point $\xfat_r$ is the one tried first.
$\xfat_r = \cfat + \alpha(\cfat-\xfat_2)$
If this point is better than the second worst but not better than the current best, such that
$f(\xfat_0) \leq f(\xfat_r) < f(\xfat_1)$,
we replace $\xfat_2$ with $\xfat_r$ and end the iteration; otherwise we continue with
expansion or contraction depending on $f(\xfat_r)$.
![Figure [2]: $\xfat_2$ is mirrored in $\cfat$ and creates a reflected point $\xfat_r$.](transforms/reflect.svg width=100%)
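Continuing the array convention from the earlier sketch, the reflection step might look
roughly like this, with the standard value $\alpha = 1$ and illustrative names.
```python
ALPHA = 1.0  # reflection coefficient

def reflect(simplex, values, centroid, f):
    """Compute the reflected point and accept it if it falls between the
    current best and second worst in function value."""
    x_r = centroid + ALPHA * (centroid - simplex[-1])   # simplex[-1] is the worst point
    f_r = f(x_r)
    if values[0] <= f_r < values[-2]:
        simplex[-1], values[-1] = x_r, f_r
        return True, x_r, f_r
    return False, x_r, f_r
```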
Expansion
-------------------------------------------------------------------------------
If $\xfat_r$ is better than the current best, $f(\xfat_r) < f(\xfat_0)$, we
continue to explore and compute the expansion point $\xfat_e$.
$\xfat_e = \cfat + \gamma (\xfat_r - \cfat)$
Then one of two approaches can be used.
_Greedy minimization_, in which the reflected and expanded points are compared and
the better of the two is picked: if
$f(\xfat_e) < f(\xfat_r) < f(\xfat_0)$
we accept $\xfat_e$ and terminate the iteration; otherwise, when
$f(\xfat_r) \leq f(\xfat_e)$, $\xfat_r$ is accepted.
_Greedy expansion_ will accept $\xfat_e$ if $f(\xfat_e) < f(\xfat_1)$ and $f(\xfat_r) < f(\xfat_0)$,
skipping the comparison of $f(\xfat_e)$ with $f(\xfat_r)$. This keeps the simplex as
large as possible, which can have some advantages for non-smooth functions. The
code used for the Example section implements this approach.
![Figure [3]: The expansion point $\xfat_e$ is computed by moving further away
from $\cfat$.](transforms/expand.svg width=60%)
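A corresponding sketch of the expansion step, following the greedy expansion rule as stated
above, with the standard value $\gamma = 2$; again the names are illustrative only.
```python
GAMMA = 2.0  # expansion coefficient

def expand(simplex, values, centroid, x_r, f_r, f):
    """Expansion step, tried when the reflected point beats the current best."""
    x_e = centroid + GAMMA * (x_r - centroid)
    f_e = f(x_e)
    # Greedy expansion as described above: keep x_e if it beats the second best
    # point, without comparing it against the reflected point.
    if f_e < values[1]:
        simplex[-1], values[-1] = x_e, f_e
    else:
        simplex[-1], values[-1] = x_r, f_r
```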
Contraction
-------------------------------------------------------------------------------
If $\xfat_r$ is worse than the current second worst point, in our 2D case $\xfat_1$,
so that $f(\xfat_r) \geq f(\xfat_1)$, we compute a contraction point using the better of
the worst point $\xfat_2$ and $\xfat_r$.
_Outside_, if $f(\xfat_r) < f(\xfat_2)$: we compute the contraction point as
$\xfat_c = \cfat + \beta(\xfat_r - \cfat)$.
If $f(\xfat_c) \leq f(\xfat_r)$, $\xfat_c$ is accepted. Otherwise a shrink of
the simplex is performed.
_Inside_, if $f(\xfat_r) \geq f(\xfat_2)$: we compute the contraction point as
$\xfat_c = \cfat + \beta(\xfat_2 - \cfat)$.
If $f(\xfat_c) \leq f(\xfat_2)$, $\xfat_c$ is accepted. Otherwise a shrink of
the simplex is performed.
![Figure [4]: Contraction point $\xfat_c$ computed outside old simplex.](transforms/contract_o.svg width=90%)
![Figure [5]: Contraction point $\xfat_c$ computed inside old simplex.](transforms/contract_i.svg width=90%)
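A sketch of both contraction cases under the same conventions, with the standard value
$\beta = \frac{1}{2}$ and illustrative names.
```python
BETA = 0.5  # contraction coefficient

def contract(simplex, values, centroid, x_r, f_r, f):
    """Contraction step, tried when the reflected point is no better than the
    second worst. Returns False if the caller should shrink the simplex instead."""
    if f_r < values[-1]:
        # Outside: contract towards the reflected point.
        x_c = centroid + BETA * (x_r - centroid)
        f_c = f(x_c)
        accept = f_c <= f_r
    else:
        # Inside: contract towards the worst point.
        x_c = centroid + BETA * (simplex[-1] - centroid)
        f_c = f(x_c)
        accept = f_c <= values[-1]
    if accept:
        simplex[-1], values[-1] = x_c, f_c
    return accept
```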
Shrink
-------------------------------------------------------------------------------
The last resort if none of the criteria so far have been met is to shrink the entire
simplex by moving all points closer to the current best point.
$\sfat_i = \xfat_0 + \delta(\xfat_i - \xfat_0), \quad i = 1, \dots, N$
![Figure [6]: Shrink the simplex by moving every point closer to the current best](transforms/shrink.svg width=90%)
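And a sketch of the shrink step, with the standard value $\delta = \frac{1}{2}$ and the
same illustrative conventions as before.
```python
DELTA = 0.5  # shrink coefficient

def shrink(simplex, values, f):
    """Shrink step: move every point except the best towards the best point."""
    best = simplex[0]
    for i in range(1, len(simplex)):
        simplex[i] = best + DELTA * (simplex[i] - best)
        values[i] = f(simplex[i])
```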
The reasoning behind the shrink transform is expressed in the original paper
[#Nelder65].
> _A failed contraction is much rarer, but can occur when a valley is curved and_
> _one point of the simplex is much farther from the valley bottom than the_
> _others; contraction may then cause the reflected point to move away from the_
> _valley bottom instead of towards it. Further contractions are then useless._
> _The action proposed contracts the simplex towards the lowest point, and will_
> _eventually bring all points into the valley._
!!!
The paper [#Nelder65] has an explanatory flowchart that shows all the tests
and when to end one iteration.
Initialization and termination
-------------------------------------------------------------------------------
When initializing the simplex, usually one point $\xfat_0$ is given by the user, and
the other points are created by placing them a certain distance
$\mu$ from the initial point along the coordinate axes of the parameter space
$\Real^N$ of the objective function.
$\xfat_i = \xfat_0 + \mu \hat{e}_i, \quad i = 1, \dots, N$
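A minimal sketch of such an initialization, with an assumed step size `mu` and illustrative names.
```python
import numpy as np

def initial_simplex(x0, mu=1.0):
    """Create an (N+1, N) simplex from a user supplied starting point by
    stepping a distance mu along every coordinate axis."""
    x0 = np.asarray(x0, dtype=float)
    n = x0.size
    simplex = np.tile(x0, (n + 1, 1))
    for i in range(n):
        simplex[i + 1, i] += mu
    return simplex
```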
The algorithm terminates in one of three different ways.
The first is when the simplex has become sufficiently small, such that the active
search space is assumed to have no considerable variation in function value.
Another criterion is that the function values at the points of the simplex
are similar enough.
Finally, the algorithm should always be limited to a maximum number of iterations
or function evaluations.
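A sketch of how these three criteria could be checked together; the tolerances and names
are illustrative, not the exact tests used for the figures in this article.
```python
import numpy as np

def should_stop(simplex, values, iteration, x_tol=1e-8, f_tol=1e-8, max_iter=100):
    """Stop when the simplex is small, the function values are similar,
    or the iteration budget is spent."""
    small_simplex = np.max(np.abs(simplex[1:] - simplex[0])) < x_tol
    flat_values   = np.max(values) - np.min(values) < f_tol
    out_of_budget = iteration >= max_iter
    return small_simplex or flat_values or out_of_budget
```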
Example
========================================================================
The example problem solved is the same one that is plotted in Figure [1].
It is a 2D radially symmetric function consisting of the difference
between two Gaussians, with an added parabolic term, such that the smallest
value of the function is $f(0,0) = 0$.
\begin{equation}
f(\xfat) = f(x,y)
= e^{-\frac{x^2 + y^2}{2 t^2}} -
e^{-\frac{x^2 + y^2}{2 s^2}} +
\frac{1}{10}x^2 + \frac{1}{10}y^2
, \qquad s = 0.75, \quad t = 1
\end{equation}
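A sketch of this objective written as a Python function of a parameter vector, matching
the expression above; the constant and function names are illustrative.
```python
import numpy as np

S, T = 0.75, 1.0

def example_objective(x):
    """Difference of two Gaussians plus a parabolic bowl, with f(0, 0) = 0."""
    r2 = x[0]**2 + x[1]**2
    return np.exp(-r2 / (2 * T**2)) - np.exp(-r2 / (2 * S**2)) + r2 / 10
```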
Finding the minimum
------------------------------------------------------------------------------
The initial simplex is created manually with one point close to the center, one
close to the circular, local minimum rift, and one further from the center.
The manually selected initial simplex and the first few iterations are plotted below.
The simplex quickly reduces in size by moving the two points further from the
center towards the point close to it.
When running the algorithm with no termination strategy for 100 iterations, the
following result is obtained.
$\xfat = [-2.82 \times 10^{-12}, ~ -4.81 \times 10^{-13}]$
$f(\xfat) = 8.20 \times 10^{-26}$
This is very close to the true optimal parameters and function value
$f(0,0) = 0$.
Further, the following is obtained regarding the transforms that the simplex
makes while searching for the solution.
Transform | Performed
------- |------
Total | 100
Reflection | 12
Expansion | 8
Contraction (Outside) | 0
Contraction (Inside) | 78
Shrink | 2
[Table [states]: The number of different transforms made while solving the minimization problem.]
Animation
-------------------------------------------------------------------------------
The algorithm is sometimes referred to by the rather colorful name of the
_amoeba method_ due to the way it crawls, by transforming, towards the local
optimum. The following animation is not completely indicative of how the method
actually works, but it shows this _amoeba_-like motion of the simplex for the
first 20 iterations.
It also shows the two shrink transforms that were performed.
As previously, the blue point corresponds to the current best, $\xfat_0$,
green to the second best, $\xfat_1$, and the red point, $\xfat_2$,
is the current worst point.
![Figure [video]: The simplex crawling towards the minimum during the first 20 iterations.](animation/amoeba_animation.mp4 width=75%)
Computational aspects and conclusion
========================================================================
The downhill simplex method has some pros and cons.
Its major strength is that it does not require the gradient of $f$. The gradient
of the objective function is often not known as an analytical expression and can
therefore be very expensive to evaluate, since it then has to be approximated using
some form of finite difference scheme.
This computational cost increases with the dimensionality of the problem and is
also subject to potential numerical stability problems if some care is not taken.
Not needing a gradient also makes the method applicable to non-smooth functions
where the gradient may not exist on the whole domain.
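For comparison, a simple forward difference approximation of the gradient costs $N+1$
function evaluations every time it is needed; a sketch, with an illustrative step size `h`.
```python
import numpy as np

def finite_difference_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x with forward differences.
    Requires N + 1 function evaluations per call."""
    x = np.asarray(x, dtype=float)
    f0 = f(x)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f0) / h
    return grad
```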
The method has a memory footprint on the order of about $(N+2) \times N$ since we always have
to store $N+2$ sets of $N$ parameters.
Even though we do not have to compute any gradients, its somewhat slow
convergence compared to gradient based methods generally causes it
to require more function evaluations.
Nevertheless it remains a very attractive method for its robustness and
simplicity.
+++++
**Bibliography**:
[#Nelder65]: Nelder J. A. & Mead R. (1965).
A simplex method for function minimization.
The Computer Journal, 7(4), 308-313,
http://www.ii.uib.no/~lennart/drgrad/Nelder1965.pdf
[#Nelder09]: Nelder J. A. & Singer S. (2009).
Nelder-Mead algorithm. Scholarpedia,
http://www.scholarpedia.org/article/Nelder-Mead_algorithm/
[#Joglekar16]: Sachin Joglekar (2016).
https://codesachin.wordpress.com/2016/01/16/nelder-mead-optimization/
[#NumericalRecipies]: W. H. Press, S. A. Teukolsky, W. T. Vetterling & B. P. Flannery. (2007).
Numerical Recipes: The Art of Scientific Computing (3rd Edition)
[#Crane19]: K. Crane. (2019).
Discrete Differential Geometry: An Applied Introduction
http://www.cs.cmu.edu/~kmcrane/Projects/DGPDEC/paper.pdf
[#SciPy]: https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.optimize.fmin.html
Code for all figures and examples:
https://bitbucket.org/nils_olovsson/downhill_simplex/src/master/