We begin with a simple problem that can be solved using a Seldonian algorithm. This is the Hello World! of Seldonian machine learning. When first solving this problem, we will hold off on providing a slick interface, and first focus on the core computational components of a Seldonian algorithm.

This is a *regression* problem. Let \(X \in \mathbb R\) and \(Y \in \mathbb R\) be two dependent random variables. Our goal is to estimate \(Y\) given \(X\) and training data consisting of \(m\) independent and identically distributed samples, \(\{(X_i,Y_i)\}_{i=1}^m\). Although the Seldonian algorithm will not know the distributions of \(X\) and \(Y\), we will be writing the code to generate the training data samples, and so we must know how to generate samples of \(X\) and \(Y\):
$$
X \sim N(0,1)\quad\text{ and } \quad Y \sim N(X,1),
$$
where \(N(\mu,\sigma^2)\) denotes the normal distribution with mean \(\mu\) and variance \(\sigma^2\).

The plot below shows 50,000 points sampled in this way (with transparency to better see the distribution):
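Sampling data in this way is straightforward. Below is a minimal sketch using NumPy; the function name `generate_data` is our own choice here, not a fixed part of any interface:

```python
import numpy as np

def generate_data(m, seed=None):
    """Sample m i.i.d. points with X ~ N(0,1) and Y ~ N(X,1)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, 1.0, size=m)       # inputs from a standard normal
    Y = X + rng.normal(0.0, 1.0, size=m)   # Y ~ N(X,1): the input plus unit-variance noise
    return X, Y
```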

Let \(\Theta = \mathbb R^2\) and let the estimate of \(Y\) given \(X\) and \(\theta\) be \(\hat y(X,\theta)=\theta_1 X + \theta_2\). The *mean squared error* (MSE) of a solution \(\theta\) is
$$
\operatorname{MSE}(\theta)=\mathbf E\left [ (\hat y(X,\theta)-Y)^2\right ].
$$
Our goal is to find an algorithm that minimizes the mean squared error, and so our sample objective is the negative sample mean squared error:
$$
\hat f(\theta,D)=-\frac{1}{m}\sum_{i=1}^m (\hat y(X_i,\theta)-Y_i)^2.
$$
We use the *negative* sample MSE because our algorithm will attempt to maximize \(\hat f\), while we want to minimize MSE.
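In code, the linear model and the sample objective above might look as follows (a sketch; the names `predict` and `fhat` are our own):

```python
import numpy as np

def predict(theta, X):
    """Linear estimate: yhat(X, theta) = theta[0]*X + theta[1]."""
    return theta[0] * X + theta[1]

def fhat(theta, X, Y):
    """Negative sample MSE: the objective our algorithm will maximize."""
    return -np.mean((predict(theta, X) - Y) ** 2)
```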

We consider the problem of creating a linear regression algorithm that ensures that, with probability at least 0.9, the mean squared error of its predictions is below 2.0, and also with probability at least 0.9, the mean squared error of its predictions is above 1.25. That is, \(n=2\) (there are two behavioral constraints) and:

- \(g_1(\theta)=\operatorname{MSE}(\theta)-2.0\) and \(\delta_1=0.1\).
- \(g_2(\theta)=1.25 - \operatorname{MSE}(\theta)\) and \(\delta_2=0.1\).

For now we will consider an interface that is not particularly easy to use, as it requires the user to write code to define undesirable behavior: we assume that the user provides a function \(\hat g_i\) for each behavioral constraint such that \(\hat g_i(\theta,D)\) is a vector of independent and identically distributed unbiased estimates of \(g_i(\theta)\). That is, if \(\hat g_{i,j}(\theta,D)\) is the \(j^\text{th}\) output of \(\hat g_i(\theta,D)\), then for every \(i\) and \(j\): $$ g_i(\theta)=\mathbf E[\hat g_{i,j}(\theta,D)]. $$ In our case, $$ \hat g_{1,j}(\theta,D) = (\hat y(X_j,\theta)-Y_j)^2-2.0, $$ and $$ \hat g_{2,j}(\theta,D) = 1.25-(\hat y(X_j,\theta)-Y_j)^2. $$
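These two estimate functions translate directly into code. A sketch (function names `ghat1` and `ghat2` are our own): each returns one unbiased estimate per data point, so the sample mean of each vector is an unbiased estimate of the corresponding \(g_i(\theta)\).

```python
import numpy as np

def ghat1(theta, X, Y):
    """Per-sample unbiased estimates of g1(theta) = MSE(theta) - 2.0."""
    return (theta[0] * X + theta[1] - Y) ** 2 - 2.0

def ghat2(theta, X, Y):
    """Per-sample unbiased estimates of g2(theta) = 1.25 - MSE(theta)."""
    return 1.25 - (theta[0] * X + theta[1] - Y) ** 2
```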

This synthetic example was constructed to be simple, and yet also to test the ability of a Seldonian algorithm to handle behavioral constraints that are in conflict with the objective function (the solution that minimizes the MSE will violate the second behavioral constraint). Once we have created a quasi-Seldonian algorithm for this problem, extending our algorithm to include a simple interface and to the general regression and classification settings is relatively simple and discussed in a later tutorial.
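We can verify this conflict numerically. Since \(\mathbf E[Y|X]=X\), the MSE-minimizing parameters are \(\theta=(1,0)\), whose MSE equals the noise variance, \(1.0\), which is below \(1.25\). A quick Monte Carlo check (a sketch, not part of the algorithm):

```python
import numpy as np

# Under X ~ N(0,1) and Y ~ N(X,1), the MSE-minimizing linear model is
# theta = (1, 0), since E[Y|X] = X; its MSE equals the noise variance, 1.0.
rng = np.random.default_rng(0)
m = 100_000
X = rng.normal(0.0, 1.0, size=m)
Y = X + rng.normal(0.0, 1.0, size=m)
mse = np.mean((1.0 * X + 0.0 - Y) ** 2)  # Monte Carlo estimate of MSE((1,0))
print(mse)  # close to 1.0, i.e., below 1.25, so the second constraint is violated
```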

In summary, we will be creating a linear regression algorithm that guarantees with probability at least \(0.9\) that the MSE of its predictions is below \(2.0\), and with probability at least \(0.9\) that the MSE of its predictions is above \(1.25\). By Boole's inequality, this implies that with probability at least \(0.8\) the MSE of its predictions is in the interval \([1.25,2.0]\). We will test our algorithm on a simple data set where the inputs come from a standard normal distribution and the outputs are equal to the inputs, plus noise with a standard normal distribution.