Boosting an XGBoost classifier with another XGBoost classifier using different sets of features

What I would like to do is train a first model $f_{1}(\underline{x})$, where $\underline{x}$ is a set of features, fix what model 1 has learned, and then train a second model $f_{2}(\underline{y})$, where $\underline{y}$ is a second set of features.

(It’s not really the emphasis of this post, but in case you’re curious as to why I want to do this, see the bottom of the post.)

My target variable is binary, and I want to optimise (binary) cross-entropy rather than accuracy.

While nothing intrinsic to this problem dictates that I should use XGBoost, it performs favourably compared to other models at predicting the target from the external variables alone, so I would like to find a way of getting XGBoost to do this.

It seems to me that this will require a custom cost function, which in turn requires using the generic booster class rather than xgb.XGBClassifier. The booster class outputs a real number, with no constraint of lying in $[0,1]$, so one needs to define $P(z_{i}=1|x_{i})=\frac{1}{1+e^{-f_{1}(x_{i})}}$

and then implement binary cross-entropy accordingly. (Because I’ve got two sets of features denoted by $x$ and $y$, I’ve somewhat criminally used $z$ to refer to the target.)
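In case it’s useful, that link function is just the logistic sigmoid; a small numerically stable numpy helper (my own function, not something XGBoost provides) looks like:

```python
import numpy as np

def sigmoid(margin):
    # P(z=1 | x) = 1 / (1 + exp(-f(x))), computed in a numerically stable way:
    # branching on the sign avoids overflow in exp() for large |margin|.
    out = np.empty_like(margin, dtype=float)
    pos = margin >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-margin[pos]))
    exp_m = np.exp(margin[~pos])
    out[~pos] = exp_m / (1.0 + exp_m)
    return out
```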

I then train a second custom booster, $f_{2}(\underline{y})$, in which $P(z_{i}=1|x_{i}, y_{i})=\frac{1}{1+e^{-(f_{1}(x_{i})+f_{2}(y_{i}))}}$

This requires a new custom cost function. (It’s a bit hacky: I don’t think XGBoost allows the custom cost function to be passed any arguments other than preds and the DMatrix, so after training the first classifier I save its train predictions in a global variable, which I then read inside the custom cost function of the second classifier. Not the main point of this post, but if anybody knows a way around this, I’d be super grateful.)
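On the global-variable hack: one pattern that avoids it (a sketch, assuming only that XGBoost calls the objective as obj(preds, dmatrix)) is to build the objective with a closure that captures the first model’s predictions:

```python
import numpy as np

def make_boosted_objective(model_1_preds):
    # Capture model_1_preds in a closure instead of a global; XGBoost only
    # ever calls the returned function with (preds, dmat), but the offset
    # travels with it.
    def boosted_binary_cross_entropy(preds, dmat):
        labels = dmat.get_label()
        f_exp = np.exp(preds + model_1_preds)  # exp(f_1 + f_2)
        grad = 1 - labels - 1 / (1 + f_exp)
        hess = f_exp / np.power(1 + f_exp, 2)
        return grad, hess
    return boosted_binary_cross_entropy
```

One would then pass obj=make_boosted_objective(model_1_preds_train) to xgboost.train; functools.partial achieves the same thing.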

What happens when I do this is a little odd. (I’m using a validation set, verbose progress printing, and early stopping, so I can watch my classifier “get better” as it iterates.) Note that for debugging purposes I am not using a set of external and controllable features; I’m just using a set of features, X, which I artificially subdivide into (x, y), where I know that x is a suboptimal set of features (a classifier trained on x alone performs somewhat worse than a classifier trained on the full X). I would thus expect the combination of classifiers to perform better than the first one alone (although not necessarily as well as a single classifier trained on the full X).

Let’s say classifier 1 finishes with a cross-entropy of K. Classifier 2 starts training, and its validation cross-entropy does indeed drop for a substantial number of iterations before early stopping kicks in. The problem is that at classifier 2’s zeroth iteration, the validation loss is already substantially lower than it was when classifier 1 exited.

I have a hypothesis as to why this happens, which is that in XGBoost the way the first tree is generated is special. For example, one possible way to do XGBoost regression is for the first learner in the ensemble to simply output the mean of the target variable, with each subsequent learner learning a correction to it. In general, the theory behind XGBoost assumes that corrections to the output are small, so that a second-order Taylor expansion is valid; for this to hold, the first learner needs to be relatively good. Subsequent trees are multiplied by a “learning rate” to ensure that they only make small corrections, but the first tree is not.

This hypothesis is backed up by the fact that changing the learning rate to something ridiculously small barely changes the amount by which the loss decreases between the final iteration of classifier 1 and the zeroth iteration of classifier 2.

My Actual Question(s)

Is there a good way around this? I have two ideas, but I don’t know how to implement them, or whether they are possible at all.

1: Can I force XGBoost to also multiply the first tree in the ensemble by the learning rate?
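A neighbouring knob that may be worth ruling out here (an assumption of mine, not a confirmed fix): with the native API, predictions start from the base_score parameter, which defaults to 0.5, and my understanding is that with a custom objective this value is applied untransformed, i.e. as a raw-margin offset. Zeroing it would at least make the second booster start exactly from $f_{1}$:

```python
# Hypothetical tweak (untested claim): zero the initial raw prediction so
# that booster 2's zeroth iteration starts at f_1 rather than f_1 + 0.5.
# 'base_score' is a real XGBoost parameter; its raw-margin interpretation
# under a custom objective is my assumption here.
params = {'max_depth': 7, 'n_jobs': -1, 'learning_rate': 0.1,
          'reg_lambda': 1, 'base_score': 0.0}
```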

2: The XGBoost booster class can take xgb_model as an argument and continue boosting an already existing model (allegedly this doesn’t even have to be an XGBoost model, or so I hear, but I’ve only played around with an XGBoost one). The problem here is that I think the two models need to take the same features, since under the hood XGBoost will call model_1.predict(dmatrix_train) and then use that same DMatrix to train model_2, whereas of course I want to use different DMatrices.

If you’ve made it this far, thanks for reading, and an even bigger thanks if you can help. Below I’ll provide the actual maths/code details.

Cost Functions

The first classifier $f_{1}$ takes a set of features $\underline{x}$ and maps them to a real number. We associate this with a probability as discussed above. The corresponding cost function is:

$C = \sum_{i}z_{i}\ln\frac{1}{1+e^{-f_{1}(x_{i})}}+(1-z_{i})\ln\left(1-\frac{1}{1+e^{-f_{1}(x_{i})}}\right)$

which rearranges to

$C=\sum_{i}f_{1}(x_{i})(z_{i}-1)-\ln\left(1+e^{-f_{1}(x_{i})}\right)$
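This rearrangement can be checked numerically (a self-contained numpy check on random inputs):

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=1000)           # raw scores f_1(x_i)
z = rng.integers(0, 2, size=1000)   # binary targets z_i

# Original form: sum of z*ln(p) + (1-z)*ln(1-p) with p = sigmoid(f).
p = 1 / (1 + np.exp(-f))
original = np.sum(z * np.log(p) + (1 - z) * np.log(1 - p))

# Rearranged form: sum of f*(z-1) - ln(1 + exp(-f)).
rearranged = np.sum(f * (z - 1) - np.log(1 + np.exp(-f)))
```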

Similarly, the second classifier takes a set of features $\underline{y}$ and maps them to a real number $f_{2}(\underline{y})$. The cost associated with the output of this second classifier has the same form, with $f_{1}(x_{i})$ replaced by $f_{1}(x_{i})+f_{2}(y_{i})$:

$C=\sum_{i}\left(f_{1}(x_{i})+f_{2}(y_{i})\right)(z_{i}-1)-\ln\left(1+e^{-(f_{1}(x_{i})+f_{2}(y_{i}))}\right)$
XGBoost doesn’t need to be passed the actual cost functions; it needs the first and second derivatives, as vectors. That is, if $C=\sum_{i}c_{i}$, XGBoost requires $\frac{\partial c_{i}}{\partial f_{1}(x_{i})}$ and $\frac{\partial^{2} c_{i}}{\partial f_{1}(x_{i})^{2}}$ for the first cost function, and $\frac{\partial c_{i}}{\partial f_{2}(y_{i})}$ and $\frac{\partial^{2} c_{i}}{\partial f_{2}(y_{i})^{2}}$ for the second.

I calculate that

$\frac{\partial c_{i}}{\partial f_{1}(x_{i})}=z_{i}-1+\frac{1}{1+e^{f_{1}(x_{i})}}$


$\frac{\partial^{2}c_{i}}{\partial f_{1}(x_{i})^{2}}=-\frac{e^{f_{1}(x_{i})}}{\left(1+e^{f_{1}(x_{i})}\right)^{2}}$
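These expressions can be sanity-checked against finite differences of $c_{i}$ (a standalone numpy check, independent of XGBoost):

```python
import numpy as np

def c_i(f, z):
    # Per-sample log-likelihood: c_i = f*(z-1) - ln(1 + exp(-f)).
    return f * (z - 1) - np.log(1 + np.exp(-f))

f = np.array([-2.0, -0.5, 0.0, 1.3])
z = np.array([0.0, 1.0, 1.0, 0.0])
eps = 1e-5

# Analytic first and second derivatives from the formulas above.
grad = z - 1 + 1 / (1 + np.exp(f))
hess = -np.exp(f) / (1 + np.exp(f)) ** 2

# Central finite differences.
num_grad = (c_i(f + eps, z) - c_i(f - eps, z)) / (2 * eps)
num_hess = (c_i(f + eps, z) - 2 * c_i(f, z) + c_i(f - eps, z)) / eps ** 2
```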

and the analogous expressions hold for the second cost function, with $f_{1}(x_{i})$ replaced by $f_{1}(x_{i})+f_{2}(y_{i})$ in the exponentials (only $f_{2}$ is varied, so the derivatives take the same form).


The code for my first cost function looks like this:

import numpy as np

def binary_cross_entropy(preds, dmat):
    labels = dmat.get_label()

    f_exp = np.exp(preds)

    # gradient and hessian of the (sign-flipped) per-sample log-likelihood
    # with respect to the raw score f_1(x_i)
    grad = 1 - labels - 1/(1 + f_exp)
    hess = f_exp/np.power(1 + f_exp, 2)

    return grad, hess

(Note: grad and hess have been multiplied by $-1$, since XGBoost minimises the loss rather than maximising it.)

The cost function for the second classifier looks like:

def boosted_binary_cross_entropy(preds, dmat):
    labels = dmat.get_label()

    # model_1_preds_train is a global variable holding f_1 on the training set
    exponent = preds + model_1_preds_train

    f_exp = np.exp(exponent)

    grad = 1 - labels - 1/(1 + f_exp)
    hess = f_exp/np.power(1 + f_exp, 2)

    return grad, hess

(note the hack of requiring the global variable).

Similarly, in order to watch the classifier’s performance on an evaluation set, a rather odd quirk of XGBoost is that you need to re-implement the cost function so that it outputs the cost as a scalar (rather than its gradients as a vector). I’ll provide these here for completeness. (Another weird quirk is that you need to give the metric a name, which is what the strings below are for.)

def custom_metric(preds, dmat):
    labels = dmat.get_label()

    return 'custom_cross_ent', np.mean(preds*(labels - 1) - np.log(1 + np.exp(-preds)))


def boosted_custom_metric(preds, dmat):
    labels = dmat.get_label()

    boosted_preds = preds + model_1_preds_eval

    return 'boosted_cross_entropy', np.mean(boosted_preds*(labels - 1) - np.log(1 + np.exp(-boosted_preds)))
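As a sanity check that the metric really is the (sign-flipped) binary cross-entropy, it can be called outside XGBoost with a minimal stand-in for the DMatrix (the stub class below is mine, just enough to satisfy get_label; the metric is repeated so the snippet runs on its own):

```python
import numpy as np

def custom_metric(preds, dmat):
    labels = dmat.get_label()
    return 'custom_cross_ent', np.mean(preds * (labels - 1) - np.log(1 + np.exp(-preds)))

class _FakeDMatrix:
    # Minimal stand-in: the metric only ever calls get_label().
    def __init__(self, labels):
        self._labels = labels
    def get_label(self):
        return self._labels

preds = np.array([-1.0, 0.0, 2.0])
labels = np.array([0.0, 1.0, 1.0])
name, value = custom_metric(preds, _FakeDMatrix(labels))

# Compare with the textbook log-likelihood mean(z*ln(p) + (1-z)*ln(1-p)).
p = 1 / (1 + np.exp(-preds))
textbook = np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
```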

Then the code which actually trains the first model (there are three DMatrices: dtrain, deval and dtest):

import xgboost

params = {'max_depth': 7, 'n_jobs': -1, 'learning_rate': 0.1, 'reg_lambda': 1}
bst_to_boost = xgboost.train(params, dtrain, obj=binary_cross_entropy,
                             feval=custom_metric, num_boost_round=3000,
                             early_stopping_rounds=5, evals=[(deval, 'eval')],
                             maximize=True)

After training the first model, I assign model_1_preds_train and model_1_preds_eval, as required by boosted_binary_cross_entropy and boosted_custom_metric respectively:

model_1_preds_train = bst_to_boost.predict(dtrain)
model_1_preds_eval = bst_to_boost.predict(deval)

Next, I re-assign dtrain, deval and dtest (I won’t include this code; it’s just data manipulation). Maybe this is sloppy and I should call them dtrain_2 etc., but I haven’t…

And the code which trains the second model:

boosted_model = xgboost.train(params, dtrain, obj=boosted_binary_cross_entropy,
                              feval=boosted_custom_metric, num_boost_round=3000,
                              early_stopping_rounds=5, evals=[(deval, 'eval')],
                              maximize=True)

I think that’s about all of the (useful) code I can provide.

Why am I interested in this

I’m looking to learn how a set of actions in the past has changed an outcome (e.g. marketing attribution). One way of doing this, which is what I’m investigating here, is to train a model to learn how the target variable is related to external factors I have no control over, denoted by $\underline{x}$, and only once all of this dependence has been learned, train a second model which learns how the factors I can control (interventions), $\underline{y}$, have affected the outcome.

Implicit is the assumption that the external factors have a (much) larger effect than the ones I have control over, and that in the historical data $\underline{x}$ and $\underline{y}$ might be inter-correlated (interventions have not been independent of environmental factors).

If you train one model on all the features at once, you can’t control the extent to which the model learns the information contained in both $\underline{x}$ and $\underline{y}$ (because they are correlated) from each. Of course you want the model to learn as much from $\underline{x}$ as possible, since you have no influence on external factors.
