We begin with a simple problem that can be solved using a Seldonian algorithm. This is the "Hello World!" of Seldonian machine learning. When first solving this problem, we will hold off on providing a slick interface and instead focus on the core computational components of a Seldonian algorithm.
In this tutorial we will consider a regression problem. Let \(X \in \mathbb R\) and \(Y \in \mathbb R\) be two dependent random variables. Our goal is to estimate \(Y\) given \(X\), based on training data consisting of \(m\) independent and identically distributed samples, \(\{(X_i,Y_i)\}_{i=1}^m\). In this example we construct synthetic data where the inputs \(X\) come from a standard normal distribution and where the outputs \(Y\) are equal to the inputs plus noise drawn from a standard normal distribution: $$ X \sim N(0,1)\quad\text{ and } \quad Y \sim N(X,1), $$ where \(N(\mu,\sigma^2)\) denotes the normal distribution with mean \(\mu\) and variance \(\sigma^2\). Notice that Seldonian algorithms will not know the distributions of \(X\) and \(Y\) in advance.
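To make the setup concrete, here is a minimal sketch of how such a dataset could be generated with NumPy (the sample size and random seed below are arbitrary illustrative choices, not part of the problem specification):

```python
import numpy as np

def generate_data(m, seed=0):
    """Sample m i.i.d. points with X ~ N(0,1) and Y ~ N(X,1)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, 1.0, size=m)      # inputs: standard normal
    Y = X + rng.normal(0.0, 1.0, size=m)  # outputs: inputs plus standard normal noise
    return X, Y

X, Y = generate_data(50000)
```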
The plot below shows 50,000 points sampled in this way:
Let \(\Theta = \mathbb R^2\) and let \(\hat y(X,\theta)=\theta_1 X + \theta_2\) be the estimate of \(Y\) given \(X\) and \(\theta\). The mean squared error (MSE) of a solution \(\theta\) is $$ \operatorname{MSE}(\theta)=\mathbf E\left [ (\hat y(X,\theta)-Y)^2\right ]. $$ In our example, the designer's goal is to construct an algorithm that
In other words, the designer wishes to create an algorithm \(a\) that maximizes \(f(a)\) (where \(f(a)\) here corresponds to the negative MSE of the solutions that \(a\) returns) and that satisfies \(n=2\) behavioral constraints, \(g_1\) and \(g_2\), with probabilities \(1-\delta_1\) and \(1-\delta_2\), respectively. The behavioral constraints, then, are
Because the algorithm does not have prior access to the distributions of \(X\) and \(Y\), it cannot compute \(\operatorname{MSE}(\theta)\) analytically for any given candidate solution \(\theta\). More generally, because we do not know the user's application (that is, the datasets to which he or she will apply the algorithm), we cannot compute the value of \(f(a)\) for any \(a\). We can, however, define a sample objective function, \(\hat f\), that estimates the utility of a given algorithm, \(a\), when \(a\) returns the solution \(\theta\) given input data \(D\). In our particular example, the sample objective function \(\hat f\) is the negative sample mean squared error of a given candidate solution \(\theta\): $$ \hat f(\theta,D)=-\frac{1}{m}\sum_{i=1}^m (\hat y(X_i,\theta)-Y_i)^2. $$

Furthermore, notice that the behavioral constraints, \(g_1\) and \(g_2\), are also defined in terms of the MSE, which cannot be computed exactly, so \(g_1\) and \(g_2\) cannot be computed analytically either. Instead of requiring the user to produce \(g_1\) and \(g_2\), we allow the user to specify functions \(\hat g_i\) that provide unbiased estimates of each corresponding constraint \(g_i\). In particular, each function \(\hat g_i(\theta,D)\) uses the data to compute a vector of independent and identically distributed (i.i.d.) unbiased estimates of \(g_i(\theta)\). Let \(g_{i,j}(\theta,D)\) be the \(j^\text{th}\) unbiased estimate returned by \(\hat g_i(\theta,D)\). In our regression application, we compute these unbiased estimates of the behavioral constraints as follows:
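Before those estimates are stated formally, it may help to see the sample quantities in code. The following is a minimal sketch, assuming the NumPy arrays X and Y from the data-generation snippet above; the function names are illustrative, not part of any particular toolkit. It computes \(\hat f(\theta,D)\) and the vector of per-point squared errors, each of which is an i.i.d. unbiased estimate of \(\operatorname{MSE}(\theta)\); if a constraint \(g_i\) is, say, the MSE shifted by a threshold, then applying the same shift to each per-point squared error yields the i.i.d. unbiased estimates that \(\hat g_i(\theta,D)\) returns.

```python
import numpy as np

def predict(theta, X):
    """Linear estimate: y_hat(X, theta) = theta_1 * X + theta_2 (theta[0] plays the role of theta_1)."""
    return theta[0] * X + theta[1]

def sample_objective(theta, X, Y):
    """f_hat(theta, D): the negative sample mean squared error."""
    return -np.mean((predict(theta, X) - Y) ** 2)

def squared_errors(theta, X, Y):
    """Vector of i.i.d. per-point squared errors, each an unbiased estimate of MSE(theta).

    If g_i is an affine function of the MSE (for example, the MSE minus a threshold),
    applying the same transformation entrywise gives i.i.d. unbiased estimates of g_i(theta).
    """
    return (predict(theta, X) - Y) ** 2

theta = np.array([0.5, 0.0])          # an arbitrary candidate solution, for illustration only
print(sample_objective(theta, X, Y))  # estimate of -MSE(theta)
```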