What I would like to do is train a first model $f_{1}(\underline{x})$, where $\underline{x}$ is a set of features, fix what model 1 has learned, and then train a second model $f_{2}(\underline{y})$, where $\underline{y}$ is a second set of features.

(It’s not really the emphasis of this post, but in case you’re curious why I want to do this, see the bottom of the post.)

My target variable is binary, and I want to optimise for (binary) cross-entropy rather than accuracy.

While there is nothing intrinsic to this problem that dictates I should use XGBoost, XGBoost is performing favourably compared to other models on the problem of predicting the target when using only the external variables, so I would like to find a way of getting XGBoost to do this.

It seems to me that this will require a custom cost function, which in turn requires using the generic booster interface rather than `xgb.XGBClassifier`. When using the booster class, it outputs a real number with no constraint of lying in $[0,1]$, so one needs to define $P(z_{i}=1|x_{i})=\frac{1}{1+e^{-f_{1}(x_{i})}}$ and then implement binary cross-entropy accordingly. (Because I’ve got two sets of features denoted by $x$ and $y$, I’ve somewhat criminally used $z$ to refer to the target.)
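As a concrete sketch of that mapping, from a raw booster output $f$ to a probability and the corresponding cross-entropy (plain NumPy; the toy arrays are purely for illustration):

```
import numpy as np

def sigmoid(f):
    # map a raw booster output f to P(z=1 | x)
    return 1 / (1 + np.exp(-f))

def binary_cross_entropy_loss(f, z):
    # mean negative log-likelihood of binary labels z under sigmoid(f)
    p = sigmoid(f)
    return -np.mean(z * np.log(p) + (1 - z) * np.log(1 - p))

# toy margins and labels
f = np.array([2.0, -1.0, 0.0])
z = np.array([1.0, 0.0, 1.0])
loss = binary_cross_entropy_loss(f, z)
```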

I then train a second custom booster, $f_{2}(\underline{y})$, in which $P(z_{i}=1|x_{i}, y_{i})=\frac{1}{1+e^{-(f_{1}(x_{i})+f_{2}(y_{i}))}}$.

This requires a second custom-implemented cost function. (It’s a bit hacky: I don’t think xgboost allows the custom cost function to be passed any arguments other than `preds` and `dmatrix`, so after training the first classifier, I save the training predictions in a global variable which I then read inside the custom cost function of the second classifier. Not the main point of this post, but if anybody knows a way around this, I’d be super grateful.)
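A closure would at least tidy up the global variable (a sketch only; `make_boosted_objective` is my own name, and xgboost just needs any callable with the `(preds, dmatrix)` signature):

```
import numpy as np

def make_boosted_objective(model_1_preds):
    # model_1_preds: fixed train-set margins from the first booster,
    # captured in the closure instead of read from a global
    def boosted_objective(preds, dmat):
        labels = dmat.get_label()
        f_exp = np.exp(preds + model_1_preds)
        grad = 1 - labels - 1 / (1 + f_exp)
        hess = f_exp / np.power(1 + f_exp, 2)
        return grad, hess
    return boosted_objective
```

One would then pass `obj=make_boosted_objective(model_1_preds_train)` to `xgboost.train`; `functools.partial` would do the same job.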

What happens when I do this is a little odd. (I’m using a validation set with verbose progress printing and early stopping, so I can watch my classifier “get better” as it iterates.) Note that for debugging purposes I am not using genuinely external and controllable features; I’m just using a single set of features, $X$, which I artificially subdivide into $(x, y)$, where I know that $x$ is a suboptimal set of features (that is, a classifier trained on only $x$ performs somewhat worse than a classifier trained on the full $X$). I would therefore expect the combination of classifiers to perform better than the first one alone (although not necessarily as well as a single classifier trained on the full $X$).

Let’s say classifier 1 finishes with a validation cross-entropy of $K$. Classifier 2 starts training, and its validation cross-entropy does indeed drop for a substantial number of iterations before early stopping kicks in. The problem is that at the zeroth iteration of classifier 2, the validation loss is already substantially lower than it was when classifier 1 exited.

I have a hypothesis as to why this happens: in xgboost, the way the first tree is generated is special. For example, one possible way to do xgboost regression is for the first learner in the ensemble to simply output the mean of the target variable, and for each subsequent learner to learn a correction to this. In general, the theory behind xgboost assumes that corrections to the output are small, so that a second-order Taylor expansion is valid; for this to hold, the first learner needs to be relatively good. Subsequent trees are multiplied by a “learning rate” to ensure that they only make small corrections, but the first tree is not.

This hypothesis is backed up by the fact that changing the learning rate to something ridiculously small barely changes the amount by which the loss drops between the final iteration of classifier 1 and the zeroth iteration of classifier 2.

**My Actual Question(s)**

Is there a good way around this? I have two ideas but do not know how to implement them/whether it’s possible.

1: Can I force XGBoost to also multiply the first tree in the ensemble by the learning rate?

2: The XGBoost booster class can take `xgb_model` as an argument and continue boosting an already existing model (allegedly this doesn’t even have to be an XGB model, so I hear, but I’ve only played around with an xgb model). The problem here is that I think the features the two models take need to be the same, because under the hood XGBoost will be calling `model_1.predict(dmatrix_train)` and then using that same dmatrix to train model 2, but of course I want to do this with different dmatrices.

If you’ve made it this far, thanks for reading, and an even bigger thanks if you can help. Below I’ll provide the actual maths/code details.

**Cost Functions**

The first classifier $f_{1}(\underline{x})$ takes a set of features $\underline{x}$ and maps it to a real number. We associate this with a probability as discussed above. The corresponding cost function is:

$C = \sum_{i}z_{i}\ln\frac{1}{1+e^{-f_{1}(x_{i})}}+(1-z_{i})\ln\left(1-\frac{1}{1+e^{-f_{1}(x_{i})}}\right)$

which rearranges (using $\ln\frac{1}{1+e^{-f}}=-\ln(1+e^{-f})$ and $\ln\left(1-\frac{1}{1+e^{-f}}\right)=-f-\ln(1+e^{-f})$, so that the $-\ln(1+e^{-f})$ terms collect with coefficient $z_{i}+(1-z_{i})=1$) to

$\sum_{i}f_{1}(x_{i})(z_{i}-1)-\ln\left(1+e^{-f_{1}(x_{i})}\right)$

Similarly, the second classifier takes a set of features $\underline{y}$ and maps them to a real number $f_{2}(\underline{y})$. The cost associated with the output of this second classifier is given by:

$\sum_{i}(f_{1}(x_{i})+f_{2}(y_{i}))(z_{i}-1)-\ln\left(1+e^{-(f_{1}(x_{i})+f_{2}(y_{i}))}\right)$

XGBoost doesn’t need to be passed the actual cost functions; it needs to be passed the first and second derivatives, as vectors. That is, if $C=\sum_{i}c_{i}$, XGBoost requires $\frac{\partial c_{i}}{\partial f_{1}(x_{i})}$ and $\frac{\partial^{2} c_{i}}{\partial f_{1}(x_{i})^{2}}$ for the first cost function, and $\frac{\partial c_{i}}{\partial f_{2}(y_{i})}$ and $\frac{\partial^{2} c_{i}}{\partial f_{2}(y_{i})^{2}}$ for the second.

I calculate that

$\frac{\partial c_{i}}{\partial f_{1}(x_{i})}=z_{i}-1+\frac{1}{1+e^{f_{1}(x_{i})}}$

and

$\frac{\partial^{2}c_{i}}{\partial f_{1}(x_{i})^{2}}=-\frac{e^{f_{1}(x_{i})}}{\left(1+e^{f_{1}(x_{i})}\right)^{2}}$

and similar expressions for the second cost function.
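These derivatives can be sanity-checked numerically with central finite differences (a standalone sketch, not part of my pipeline):

```
import numpy as np

def log_likelihood(f, z):
    # per-sample log-likelihood c_i = f*(z-1) - ln(1 + e^{-f})
    return f * (z - 1) - np.log1p(np.exp(-f))

def analytic_grad(f, z):
    # dc_i/df = z - 1 + 1/(1 + e^f)
    return z - 1 + 1 / (1 + np.exp(f))

def analytic_hess(f):
    # d^2 c_i/df^2 = -e^f/(1 + e^f)^2
    ef = np.exp(f)
    return -ef / (1 + ef) ** 2

rng = np.random.default_rng(0)
f = rng.normal(size=100)
z = rng.integers(0, 2, size=100).astype(float)
eps = 1e-5

# central finite differences of c_i, and of the analytic gradient
num_grad = (log_likelihood(f + eps, z) - log_likelihood(f - eps, z)) / (2 * eps)
num_hess = (analytic_grad(f + eps, z) - analytic_grad(f - eps, z)) / (2 * eps)
```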

**Code**

The code for my first cost function looks like this:

```
import numpy as np

def binary_cross_entropy(preds, dmat):
    labels = dmat.get_label()
    f_exp = np.exp(preds)
    grad = 1 - labels - 1 / (1 + f_exp)
    hess = f_exp / np.power(1 + f_exp, 2)
    return grad, hess
```

(Note: grad and hess have been multiplied by $-1$ relative to the derivatives above, because XGBoost minimises loss rather than maximising it. Conveniently, this grad simplifies to $\sigma(f_{1}(x_{i}))-z_{i}$, the familiar logistic-regression form.)

The cost function for the second classifier looks like:

```
def boosted_binary_cross_entropy(preds, dmat):
    labels = dmat.get_label()
    # model_1_preds_train is a global variable holding f_1's train predictions
    exponent = preds + model_1_preds_train
    f_exp = np.exp(exponent)
    grad = 1 - labels - 1 / (1 + f_exp)
    hess = f_exp / np.power(1 + f_exp, 2)
    return grad, hess
```

(note the hack of requiring the global variable).

Similarly, in order to watch the classifier’s performance on an evaluation set, a rather odd quirk of xgboost is that you need to re-implement the cost function so that it outputs the cost as a scalar (rather than its gradients as vectors). I’ll provide these here for completeness (another weird quirk is that you need to give the metric a name, which is what the strings are for):

```
def custom_metric(preds, dmat):
    labels = dmat.get_label()
    return 'custom_cross_ent', np.mean(preds * (labels - 1) - np.log(1 + np.exp(-preds)))
```

and

```
def boosted_custom_metric(preds, dmat):
    labels = dmat.get_label()
    # model_1_preds_eval is a global variable holding f_1's eval predictions
    boosted_preds = preds + model_1_preds_eval
    return 'boosted_cross_entropy', np.mean(boosted_preds * (labels - 1) - np.log(1 + np.exp(-boosted_preds)))
```

Then the code which actually trains the first model (there are three DMatrix objects: dtrain, deval and dtest):

```
params = {'max_depth': 7, 'n_jobs': -1, 'learning_rate': 0.1, 'reg_lambda': 1}
bst_to_boost = xgboost.train(
    params, dtrain,
    obj=binary_cross_entropy,
    feval=custom_metric,
    num_boost_round=3000,
    early_stopping_rounds=5,
    evals=[(deval, 'eval')],
    maximize=True,
)
```

After training the first model, I assign `model_1_preds_train` and `model_1_preds_eval`, as required by `boosted_binary_cross_entropy` and `boosted_custom_metric` respectively:

```
model_1_preds_train = bst_to_boost.predict(dtrain)
model_1_preds_eval = bst_to_boost.predict(deval)
```
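For completeness, at inference time the combined probability would be computed per the definition above (a sketch; `combined_probability` is my own name, and it assumes both boosters were trained with custom objectives so that `predict` returns raw margins):

```
import numpy as np

def combined_probability(model_1, model_2, dmat_x, dmat_y):
    # P(z=1 | x, y) = sigmoid(f_1(x) + f_2(y)), per the definition above
    raw = model_1.predict(dmat_x) + model_2.predict(dmat_y)
    return 1 / (1 + np.exp(-raw))
```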

Next, I re-assign `dtrain`, `deval` and `dtest` (I won’t include this code, it’s just data manipulation). Maybe this is sloppy and I should call them `dtrain_2` etc., but I haven’t…

And the code which trains the second model:

```
boosted_model = xgboost.train(
    params, dtrain,
    obj=boosted_binary_cross_entropy,
    feval=boosted_custom_metric,
    num_boost_round=3000,
    early_stopping_rounds=5,
    evals=[(deval, 'eval')],
    maximize=True,
)
```

I think that’s about all of the (useful) code I can provide.

**Why am I interested in this?**

I’m looking to learn how a set of actions in the past has changed an outcome (e.g. marketing attribution). One way of doing this, which is what I’m investigating here, is to train a model to learn how the target variable is related to external factors which I have no control over, denoted by $\underline{x}$, and only once all of this dependence has been learned, to train a second model which learns how the factors which I can control (interventions), $\underline{y}$, have affected the outcome.

Implicit is the assumption that the external factors have a (much) larger effect than the ones I have control over, and that in the historical data $\underline{x}$ and $\underline{y}$ might be inter-correlated (interventions have not been independent of environmental factors).

If you train one model with all the features at once, you can’t control to what extent the model learns the information contained in both $\underline{x}$ and $\underline{y}$ (because they are correlated) from each. Of course you want the model to learn as much from $\underline{x}$ as possible, since you have no influence on the external factors.