To learn a function $f_{true}: \mathcal{X} \to \mathbb R$
from data $(x_1, y_1),\dots,(x_N, y_N)\in \mathcal{X}\times\mathbb R$,
kernel ridge regression uses a hypothesis of the form
$f(x)=\sum_{i=1}^N \alpha_i k(x_i, x)$,
where $k:\mathcal X \times \mathcal X \to \mathbb R$ is a positive semidefinite kernel function.
The optimal coefficients solve the linear system $G\alpha = y$, i.e. $\alpha=G^{-1}y$,
where $G_{ij}:=k(x_i, x_j)+\delta_{ij}\lambda$ with regularization parameter $\lambda>0$, and $y = (y_1,\dots,y_N)^\top\in\mathbb R^N$.
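For concreteness, here is a minimal NumPy sketch of these formulas (this is not the Dex code from this example; the RBF kernel, `lengthscale`, and `lam` are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(xs, zs, lengthscale=1.0):
    # k(x, z) = exp(-|x - z|^2 / (2 * lengthscale^2)) on 1-D inputs
    d = xs[:, None] - zs[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def krr_fit(xs, ys, lam=1e-3):
    # Solve (K + lam * I) alpha = y rather than forming G^{-1} explicitly.
    G = rbf_kernel(xs, xs) + lam * np.eye(len(xs))
    return np.linalg.solve(G, ys)

def krr_predict(xs_train, alpha, xs_test):
    # f(x) = sum_i alpha_i k(x_i, x)
    return rbf_kernel(xs_test, xs_train) @ alpha
```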
GP regression (kriging) works in a similar way. Compared with kernel ridge regression, GP regression places a Gaussian process prior on the unknown function and assumes Gaussian observation noise. Combined with Bayes' rule, this yields not only a posterior mean prediction (of the same form as the kernel ridge estimate, with the noise variance playing the role of $\lambda$) but also a predictive variance.
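Under those assumptions (zero-mean prior, noise variance in place of $\lambda$), a NumPy sketch of the posterior mean and variance looks like the following; the names are illustrative rather than the Dex API, and `kernel` can be any positive semidefinite kernel such as the `rbf_kernel` above:

```python
import numpy as np

def gp_posterior(xs_train, ys, xs_test, kernel, noise=1e-3):
    # K: train/train covariances, Ks: test/train, kss: test/test diagonal
    K = kernel(xs_train, xs_train) + noise * np.eye(len(xs_train))
    Ks = kernel(xs_test, xs_train)
    kss = np.diag(kernel(xs_test, xs_test))
    alpha = np.linalg.solve(K, ys)
    mean = Ks @ alpha  # same form as the kernel ridge predictor
    # Variance of the latent function value:
    # k(x*, x*) - k*^T (K + noise * I)^{-1} k*
    v = np.linalg.solve(K, Ks.T)
    var = kss - np.sum(Ks * v.T, axis=1)
    return mean, var
```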
In this implementation, the conjugate gradient solver is replaced with the
Cholesky solver from lib/linalg.dx for efficiency.
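The NumPy/SciPy analogue of that choice, assuming the Gram matrix is symmetric positive definite (which the $\lambda$ or noise term guarantees), is to factor once with Cholesky and back-substitute instead of iterating with conjugate gradients. This mirrors the idea behind the lib/linalg.dx solver, not its actual interface:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def solve_spd(G, b):
    # Factor G = L L^T once, then do two triangular solves:
    # L z = b (forward substitution), L^T x = z (back substitution).
    L = cholesky(G, lower=True)
    z = solve_triangular(L, b, lower=True)
    return solve_triangular(L.T, z, lower=False)
```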