
Linear regression

2021-11-26

Keywords: Statistics, Optimization, Calculus, Python, Radiation protection

Introduction
========================================================================

I remember what I think was the first time I was instructed to fit a line to a
set of data points. It was during a physics lab exercise in science class when I was
around 16 and it might have concerned resistivity and resistance in a conducting
wire.
We had the results of some rather noisy instrument readings jotted down
on a set of graph paper and were instructed to fit a line to the data
in order to get the slope which corresponded to some material constant.

What I remember our teacher telling us was to adjust a ruler so that the data
points lay equally far away on both sides of the line we were about to draw.
Hopefully this would produce something close to the orange line in Figure [header],
but more likely one would end up with a suboptimal result.
A better way is to find the line using linear regression [#wiki].

![Figure [header]: We can see that the orange line seems to fit the blue dots
better than the other lines.](figures/linreg.svg width=100%)

!!! Info
    _**I walk the line, regression line**_

    Recently I participated in a short course in biostatistics and apart from
    learning handy things about various statistical tests, p-values, confidence
    intervals and survival analysis I also encountered an old acquaintance, linear
    regression.

    I can count at least three courses during my first two years as an
    engineering student that covered linear regression, typically with some
    extra flavor. However, it was not until now, some ten years later, that
    I got a better understanding of the problem, seeing it after having
    worked through a fair share of general optimization problems.

    While waiting for some computations to finish, formulating the
    linear regression problem and solving it was a fun exercise.

In this article I will derive the solution to the linear regression problem in
closed form and show how it can be implemented as a very small Python function.

I will also discuss some related problems and apply the linear regression to
real data to solve a problem in radiation protection.

Finally, I wanted to try out the mathematical animation framework Manim [#manim]
and put together a small example that attempts to show how the
linear regression line fits the data in an optimal manner.
$\require{mhchem}\require{xcolor}$
$\newcommand{a}{{\color{#1F77B4}a}}$
$\newcommand{b}{{\color{#FF7F0E}b}}$
$\newcommand{AN}{{\color{#BCBD22}{D-C\frac{N}{B}}}}$
$\newcommand{AD}{{\color{#D62728}{B-A\frac{N}{B}}}}$
$\newcommand{BN}{{\color{#9467BD}{C-D\frac{A}{B}}}}$
$\newcommand{BD}{{\color{#2CA02C}{B-N\frac{A}{B}}}}$
$\newcommand{ANb}{{\color{#BCBD22}{BD-CN}}}$
$\newcommand{ADb}{{\color{#D62728}{BB-AN}}}$
$\newcommand{BNb}{{\color{#9467BD}{BC-DA}}}$
$\newcommand{BDb}{{\color{#2CA02C}{BB-NA}}}$
$\newcommand{xi}{x_{i}}$
$\newcommand{yi}{y_{i}}$
$\newcommand{\Real}{\mathbb{R}}$

Linear regression
========================================================================
Let's say we have $N$ pairs of data points $\textbf{x} = \{x_{1}, \dots, x_{N}\}$
and $\textbf{y} = \{y_{1}, \dots, y_{N}\}$.
We assume that there is a linear relationship between the $x$ and $y$ values such that they
can be described by the equation
\begin{equation}
    y = \a \cdot x + \b
\end{equation}

Since we typically don't have data that align perfectly with this model
we decide that what we want is a line that fits our data points by having the
smallest possible sum of squared distances to the values in $\textbf{y}$.

I write this in the way we typically formulate optimization problems, even though
we will solve it analytically.

\begin{equation}
    \text{arg min } \epsilon(\a, \b) =
    \sum_{i=1}^{N}(\a \cdot x_{i} + \b - y_{i})^{2}
\end{equation}

Now, what we are really looking for here is an extreme value of this two-parameter
function. When looking for an extreme value of a function of one
parameter, $f(x)$, we would use the first order derivative and look for a point where it
is zero, $f'(x) = 0$. We will do the same thing here, but instead look at the gradient.

\begin{equation}
\label{eq:diff}
    0 = \nabla \epsilon (\a, \b) =
    \frac{\partial}{\partial \a} \epsilon \hat{\a} +
    \frac{\partial}{\partial \b} \epsilon \hat{\b}
\end{equation}

We note that both components in the gradient should be zero and calculate the
derivatives of these two terms.

\begin{equation}
    \left\{
    \begin{split}
        0 &= \frac{\partial}{\partial \a} \epsilon =
        \frac{\partial}{\partial \a} \sum_{i=1}^{N}(\a \cdot x_{i} + \b - y_{i})^{2} =
        2 \sum_{i=1}^{N} x_{i} (\a x_{i} + \b - y_{i})\\
        0 &= \frac{\partial}{\partial \b} \epsilon =
        \frac{\partial}{\partial \b} \sum_{i=1}^{N}(\a \cdot x_{i} + \b - y_{i})^{2} =
        2 \sum_{i=1}^{N} (\a x_{i} + \b - y_{i})
    \end{split}
    \right.
\end{equation}

We can immediately eliminate the leading factor of 2 and we get a system of two
equations with two unknowns.
\begin{equation}
    \left\{
    \begin{split}
        0 &= \sum_{i=1}^{N} x_{i} (\a x_{i} + \b - y_{i})\\
        0 &= \sum_{i=1}^{N} (\a x_{i} + \b - y_{i})
    \end{split}
    \right.
\end{equation}
Collecting the terms, we can write this as a matrix equation.
\begin{equation}
    \begin{pmatrix}
        \Sigma \xi^2 & \Sigma \xi \\
        \Sigma \xi   & N
    \end{pmatrix}
    \begin{pmatrix}
        \a \\ \b
    \end{pmatrix}
    =
    \begin{pmatrix}
        \Sigma \xi \yi \\ \Sigma \yi
    \end{pmatrix}
\end{equation}
We substitute the summation expressions just to reduce our cognitive load while we solve
the system.
\begin{equation}
    \label{eq:matrix-system}
    \begin{pmatrix}
        A & B \\
        B & N
    \end{pmatrix}
    \begin{pmatrix}
        \a \\ \b
    \end{pmatrix}
    =
    \begin{pmatrix}
        C \\ D
    \end{pmatrix}
\end{equation}
Then we solve it by eliminating terms and isolating $\a$ and $\b$.
First we multiply the first row by $\frac{N}{B}$ and subtract it from the second
to get $\a$.
\begin{equation}
    \begin{pmatrix}
        A              & B \\
        \AD & N - B\frac{N}{B}
    \end{pmatrix}
    \begin{pmatrix}
        \a \\ \b
    \end{pmatrix}
    =
    \begin{pmatrix}
        A              & B \\
        \AD & 0
    \end{pmatrix}
    \begin{pmatrix}
        \a \\ \b
    \end{pmatrix}
    =
    \begin{pmatrix}
        C \\ \AN
    \end{pmatrix}
\end{equation}

Then, again in Equation [eq:matrix-system], we multiply the second row by $\frac{A}{B}$ and subtract it from the first
to get $\b$.
\begin{equation}
    \begin{pmatrix}
        A-B\frac{A}{B} & \BD \\
        B              & N
    \end{pmatrix}
    \begin{pmatrix}
        \a \\ \b
    \end{pmatrix}
    =
    \begin{pmatrix}
        0 & \BD \\
        B & N
    \end{pmatrix}
    \begin{pmatrix}
        \a \\ \b
    \end{pmatrix}
    =
    \begin{pmatrix}
        \BN \\ D
    \end{pmatrix}
\end{equation}

This gives
\begin{equation}
    \left\{
    \begin{split}
    \a &= \frac{\AN}{\AD} \\
    \b &= \frac{\BN}{\BD}
    \end{split}
    \right.
\end{equation}

Multiplying the numerators and denominators by $B$ removes the risk of division by
zero when $x$ is sampled symmetrically around zero (I learned this the hard way) and
cleans up the expression.
We note that we get the same denominator in both expressions and finally we have
\begin{equation}
    \left\{
    \begin{split}
    \a &=
        \frac{\ANb}{\ADb} &=
        \frac{\ANb}{\ADb} &=
        \frac{\Sigma \xi \Sigma \yi - \Sigma \xi \yi N}{\Sigma \xi \Sigma \xi - \Sigma \xi^2 N}\\
    \b &=
        \frac{\BNb}{\BDb} &=
        \frac{\BNb}{\ADb} &=
        \frac{\Sigma \xi \Sigma \xi \yi - \Sigma \yi \Sigma \xi^2}{\Sigma \xi \Sigma \xi - \Sigma \xi^2 N}
    \end{split}
    \right.
\end{equation}
which can be used to directly calculate the parameters $\a$ and $\b$ that
define the line that best fits our data.


Python implementation
========================================================================

**Linear regression**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
import numpy as np
from typing import Tuple


def linearRegression(x: np.ndarray, y: np.ndarray) -> Tuple[float, float]:
    """
    Fit a line to a set of points in a least-squares fashion.
    Returns the slope a and the intercept b.
    y = a*x + b
    """
    N = len(x)
    A = np.sum(x*x)  # sum of x_i^2
    B = np.sum(x)    # sum of x_i
    C = np.sum(x*y)  # sum of x_i*y_i
    D = np.sum(y)    # sum of y_i

    a = (B*D - C*N) / (B*B - A*N)
    b = (B*C - D*A) / (B*B - A*N)

    return a, b
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
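
As a quick sanity check we can compare the result against NumPy's built-in
polynomial fit, which for degree one solves the same least-squares problem.
The data below are made-up values, only there to illustrate the call.

**Sanity check**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
import numpy as np

# Noisy samples scattered around the line y = 2x + 1 (made-up data).
rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 50)
y = 2.0*x + 1.0 + rng.normal(scale=0.5, size=x.size)

a, b = linearRegression(x, y)
a_np, b_np = np.polyfit(x, y, 1)  # degree 1: returns slope, intercept

print(a, b)        # close to 2 and 1
print(a_np, b_np)  # should match the values above
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~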

!!! Tip:
    **Type hints in python**

    I recently discovered type hints for Python, even though they have been around since
    Python 3.5. It's not a strict type checking system, but it helps me a lot
    both when writing new functions, starting with a sort of prototype, and
    when I have to remember what old ones are supposed to do.

    But by far the most important reason to use them is that environments
    such as Visual Studio Code can pass these hints on to you.

Related problems
========================================================================

Use distance instead of squared distance?
-------------------------------------------------------------------------------
So why did we use the squared distance instead of just the distance?
First of all we want a distance without sign, but then could we not just
use the absolute value of the distance?

\begin{equation}
    \text{arg min } \epsilon(\a, \b) =
    \sum_{i=1}^{N}|\a \cdot x_{i} + \b - y_{i}|
\end{equation}

The derivative of the absolute value of a function is

\begin{equation}
    \frac{d}{dx}|f(x)| = \frac{f(x)}{|f(x)|}\frac{d}{dx}f(x)
\end{equation}

The problem now is that if you try to calculate the derivatives with respect to $\a$ and $\b$ for
this expression, as we did in Equation [eq:diff], things are not so nice anymore
since the derivative has a discontinuity where $\a \cdot x_{i} + \b - y_{i}$ is zero.

The point is that this becomes a much harder problem since we no longer
have a function with a nice derivative and an easily calculated, unique
minimum.

Multiple linear regression
-------------------------------------------------------------------------------
One of the extensions we could imagine is to have more explanatory variables
than just $x$ to model something.
In other words we want to find a hyperplane
\begin{equation}
    y = b + a_{1}x_{1} + a_{2}x_{2} + \ldots + a_{k}x_{k}
\end{equation}
such that it best fits our $k$-variable data.

I might investigate how this is solved in a later blog post.
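
In the meantime, a minimal sketch is possible using NumPy's least-squares
solver, which fits such a hyperplane numerically. The function name and the
column-of-ones trick for the intercept are my own choices here, not an
established API.

**Multiple linear regression sketch**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
import numpy as np

def multipleLinearRegression(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """
    Fit y = b + a_1*x_1 + ... + a_k*x_k in a least-squares sense.
    X has shape (N, k) and y has shape (N,).
    Returns the coefficients as [b, a_1, ..., a_k].
    """
    # Prepend a column of ones so that the intercept b becomes part
    # of the solution vector.
    M = np.column_stack([np.ones(len(X)), X])
    coefficients, *_ = np.linalg.lstsq(M, y, rcond=None)
    return coefficients
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~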

Geometric line fitting
-------------------------------------------------------------------------------
Consider the case where you don't really have points along two dimensions
of different units that correspond to some measurements but instead points
in space such as $\Real^{2}$ or $\Real^{3}$. Then the distance that you want to minimize
might not be the vertical distance between the line and a point but rather the
sum of the Euclidean closest distances between the line and the points.

I expect this to also have a closed form solution but I have not investigated
this any further. However we should note that we have two solutions, one going
from _left to right_ and one in the other direction, and that these two define
the same line.

![Figure [geom]: The closest distance to the line is not equal to the vertical
distance used in linear regression.](figures/lingeom.svg width=100%)

Example - Linear regression and radiation protection
========================================================================
I thought a practical example would be more interesting than just showing
made-up numbers perturbed by some noise, and I found this one that I did
for a lab while attending a course on radiation protection and medical effects last year.

Attenuation of radiation
-------------------------------------------------------------------------------
When a flux of high energy photons such as x-rays or gamma rays (keV to MeV)
with intensity $I_{0}$ passes through a material, some photons will scatter or be absorbed
such that only a fraction, $I_{x}$, reaches the other side.
This attenuation depends on the energy of the photons, the material and its
thickness $x$.

\begin{equation}
\label{eq:att}
    I_{x} = I_{0} e^{-\mu x}
\end{equation}

For a certain energy and shielding made of a certain material, the thickness
required to reduce the intensity by half is called the half thickness, $x_{h}$ [#Half-value].

\begin{equation}
\label{eq:ht}
    x_{h} = \frac{\text{ln}(2)}{\mu}
\end{equation}

!!! Tip:
    **Half thickness and half life**

    A more familiar concept might be that of a radioactive half life.
    That is, the time it takes for a radioactive isotope to decay such that
    the activity, and the emitted intensity, is reduced by half.

    The attenuation of the intensity as the radiation passes through
    a material is described using an identical equation but with
    an attenuation coefficient instead of a decay constant. The
    half thickness is completely analogous to the half life, expressed
    as a length instead of a time.


Experimental setup
-------------------------------------------------------------------------------
A sample of an unknown radioactive isotope emitting gamma and/or x-rays is placed in
front of a portable NaI detector. An increasing number of lead sheets are placed
between the sample and the detector while the detected intensity, in counts per
second, is observed.

![Figure [experiment]: Experiment setup. A sample is placed in front of a detector and sheets of lead with combined thickness $x$ are put in between.](./example/experiment.jpg width=70%)

The goal is to fit a line to the log-transformed data and find the half thickness,
both in mm and in the unit g/cm$^{2}$, by using the known density of lead to form $\rho x_{h}$. This second
form can be used together with a diagram to determine the energy of the detected
attenuated gamma and/or x-rays, which in turn can be used to identify the isotope,
provided we only have a few possible ones to pick from.

Finding the half thickness and identifying the isotope
-------------------------------------------------------------------------------
The measured intensity values, Figure [radiation], are transformed using the
natural logarithm.
Then we fit a line to the data points acquired before the background level is reached,
indicated by the green squares.
The attenuation coefficient is the negative slope of the line, $\mu = 0.572 \text{ mm}^{-1}$.
Then we use Equation [eq:ht] to find the half thickness, $x_{h} = 1.21 \text{ mm}$.
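
The reason a straight line is the right model for the transformed data can be
seen by taking the natural logarithm of both sides of Equation [eq:att],
\begin{equation}
    \text{ln}(I_{x}) = \text{ln}(I_{0}) - \mu x
\end{equation}
which is a line $y = \a x + \b$ with slope $\a = -\mu$ and intercept
$\b = \text{ln}(I_{0})$.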

![Figure [radiation]: Fitting a line to the logarithm of the measurements of the number of counts and the material thickness makes it possible to find the half thickness. Here shown using a $\text{log}_{10}$ scale.](./figures/example_radiation_B.svg width=80%)

Using the density for lead, $\rho=11.35 \text{ g cm}^{-3}$, we get $\rho x_{h}=1.37 \text{ g cm}^{-2}$.
If we look this up in a [half thickness vs photon energy data sheet graph](./example/halfthickness.jpg) we can
see that this corresponds to an energy of about 200 keV.
Out of the possible isotopes, $\ce{^{111}In}$ is a likely candidate, having two
prevalent gamma emissions at 171 and 245 keV.
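
With the linearRegression function from above, the whole analysis takes only a
few lines. The thickness and count values below are illustrative placeholders
rather than the actual lab readings.

**Half thickness from attenuation data**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
import numpy as np

# Sheet thicknesses in mm and detected count rates (placeholder values,
# not the actual lab readings).
thickness = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
counts    = np.array([1000.0, 570.0, 320.0, 180.0, 100.0, 57.0])

# ln(I_x) = ln(I_0) - mu*x, so the slope of the fitted line is -mu.
a, b = linearRegression(thickness, np.log(counts))

mu  = -a               # attenuation coefficient [1/mm]
x_h = np.log(2) / mu   # half thickness [mm]

rho     = 11.35            # density of lead [g/cm^3]
rho_x_h = rho * x_h / 10   # half thickness in g/cm^2 (mm -> cm)

print(mu, x_h, rho_x_h)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~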

Conclusion
========================================================================

Fitting a line is in many cases the first _line_ of attack when you start analyzing
your data. Many important real world processes and relationships are
actually linear, or close enough to linear, that this simple way of modeling things
becomes a powerful tool.

Further, we have indirectly introduced an even more versatile tool.
The way we formulated the linear regression problem works when
fitting any curve described by some parameters, although in general we
will not be able to find a closed form solution and have to settle for
one obtained through non-linear optimization.

!!! Warning:
    **Test for correlation**

    One thing we have not covered is when linear regression should not be used. If you don't have
    a well established model in the form of a linear relationship, or something
    that can be made linear through a simple transformation as in Equation [eq:att],
    you should really consider performing a test for correlation before you fit a
    line to your data.

    A test such as the Pearson correlation [#pearson] will measure the linear correlation in the
    data while also providing some more qualitative information, such as whether the slope
    is positive or negative.
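
As a sketch of what such a check could look like, SciPy provides the Pearson
correlation coefficient together with a p-value. The data below are made up
just to illustrate the call.

**Correlation test**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
import numpy as np
from scipy.stats import pearsonr

# Made-up data, only to illustrate the call.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# r close to +1 or -1 indicates a strong linear relationship and its sign
# tells us whether the slope is positive or negative. A small p-value
# suggests the correlation is unlikely to be due to chance alone.
r, p = pearsonr(x, y)
print(r, p)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~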

The linear regression line minimizes the residual $\epsilon$, and in the animation
below I have attempted to illustrate what happens to it if other values for $a$ and $b$ are picked.

![Figure [manim]: Linear regression by minimization of the squared distance.](video/LinearRegression2D.mp4 width=100%)

The relative residual is defined as
\begin{equation}
    \epsilon_{\text{rel}} = \frac{\epsilon}{\epsilon_{0}} - 1
\end{equation}
where $\epsilon_{0}$ is the residual of the regression line.
We can see that $\epsilon_{\text{rel}}$ never goes below 0 which
means that $\epsilon \geq \epsilon_{0}$ for all the values of $a$ and $b$
shown.
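
A small sketch of the check behind the animation: compute the residual for the
fitted line and for a few perturbed parameter values, and verify that the
relative residual never goes below zero. The data and the perturbations are
arbitrary choices.

**Relative residual**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
import numpy as np

def residual(a: float, b: float, x: np.ndarray, y: np.ndarray) -> float:
    """Sum of squared vertical distances between the line and the points."""
    return float(np.sum((a*x + b - y)**2))

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 1.5*x + 2.0 + rng.normal(scale=1.0, size=x.size)

a0, b0 = linearRegression(x, y)
eps0 = residual(a0, b0, x, y)

# Perturb the regression parameters and check the relative residual.
for da, db in [(0.1, 0.0), (-0.3, 0.5), (0.0, -1.0)]:
    eps_rel = residual(a0 + da, b0 + db, x, y) / eps0 - 1
    print(eps_rel)  # never negative
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~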

!!! Tip:
    **Manim**

    This animation was created using manim, a Python package that is a
    fork of the tool that [#3blue1brown] uses in his YouTube videos.
    It's very intuitive and the documentation is also very good, but
    getting the results that you want might require some effort.

    What I like best so far is that it does what it's supposed
    to do very well. That is, animations, smooth transitions and
    interpolations of different shapes.

+++++

**Bibliography**:

[#wiki]: Linear regression
https://en.wikipedia.org/wiki/Linear_regression

[#pearson]: Pearson correlation
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

[#Half-value]: Half-value layer
https://en.wikipedia.org/wiki/Half-value_layer

[#manim]: Manim
https://www.manim.community/

[#3blue1brown]: 3blue1brown
https://www.youtube.com/c/3blue1brown

    


Code for all figures and examples:

https://bitbucket.org/nils_olovsson/linear_regression/src/master/


