Finding the p-value, r and adjusted r^2 for an SVR regression model

I want to use Support Vector Regression (SVR) for regression, as it seems quite powerful when I have several features. Since I found a very easy-to-use implementation in scikit-learn, I’m using that one. My questions below are about this Python package in particular, but if you have a solution in any other language or package, please let me know as well.

So, I’m using the following code:

from sklearn.svm import SVR
from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold

svr_rbf = SVR(kernel='rbf')
scoring = ['neg_mean_absolute_error', 'neg_mean_squared_error', 'r2']


scores = cross_validate(svr_rbf, X, y, cv=KFold(10, shuffle=True), scoring=scoring, return_train_score=False)

score = -1 * scores['test_neg_mean_absolute_error']
print("MAE: %.4f (%.4f)" % (score.mean(), score.std()))

score = -1 * scores['test_neg_mean_squared_error']
print("MSE: %.4f (%.4f)" % (score.mean(), score.std()))

score = scores['test_r2']
print("R^2: %.4f (%.4f)" % (score.mean(), score.std()))

As you can see, I can easily run 10-fold cross validation by splitting my data into 10 shuffled folds and getting the MAE, MSE and r^2 for each fold.

However, my big question is: how can I get the p-value, r and adjusted r^2 for my SVR regression model specifically, just as I can in other Python packages such as statsmodels for linear regression?

I guess I will have to implement the cross-validation with KFold myself to achieve this, which is not a big problem. The issue is that I’m not sure how to get these scores from sklearn’s implementation of SVR itself.
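
To make the idea concrete, here is a minimal sketch of what I mean, assuming X and y are NumPy arrays and using the p-value attached to the Pearson correlation between the observed values and the out-of-fold predictions (which may or may not be the notion of p-value that applies to SVR):

import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
from sklearn.svm import SVR

# collect out-of-fold predictions over the same 10 shuffled folds
preds = np.empty_like(y, dtype=float)
for train_idx, test_idx in KFold(10, shuffle=True, random_state=0).split(X):
    model = SVR(kernel='rbf').fit(X[train_idx], y[train_idx])
    preds[test_idx] = model.predict(X[test_idx])

r, p_value = pearsonr(y, preds)        # correlation r and its p-value
r2 = r2_score(y, preds)                # R^2 of observed vs. predicted
n, p = X.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # usual adjusted-R^2 formula
print("r: %.4f  p-value: %.4g  adjusted R^2: %.4f" % (r, p_value, adj_r2))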


Interaction term in multivariate polynomial regression

I’m looking for an answer to a question about multivariate polynomial regression.
I can’t find a clear explanation of when an interaction term is necessary.

Some sources say that the estimated model of a complete second degree polynomial regression model in two variables $x_{1}$, $x_{2}$ may be expressed as

$$\hat{y} = b_{0} + b_{1}x_{1} + b_{2}x_{2} + b_{3}x_{1}^{2} + b_{4}x_{2}^{2} + b_{5}x_{1}x_{2}$$

and others don’t include the interaction term $x_{1}x_{2}$…

And how does it look when I have, for example, $d=4$ and $p=3$, i.e. a third-degree polynomial in four independent variables?
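
For what it’s worth, one way I can enumerate the terms of the complete model is with scikit-learn’s PolynomialFeatures (a sketch, assuming a recent scikit-learn with get_feature_names_out; the names x1..x4 are just placeholders):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.rand(10, 4)                               # d = 4 independent variables (dummy data)
poly = PolynomialFeatures(degree=3, include_bias=False)  # p = 3, complete expansion
X_poly = poly.fit_transform(X)
print(poly.get_feature_names_out(['x1', 'x2', 'x3', 'x4']))
# prints every monomial of total degree <= 3: pure powers such as x1^3,
# plus interaction terms such as x1 x2, x1^2 x4 and x1 x2 x3 (factors separated by spaces)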


Continuous and binary variables for linear regression

I am trying to use a combination of binary and continuous input variables in a linear regression. In my studies I used only continuous variables in linear regression. Should I do anything special now that I am including binary input variables?

Also, does the answer change depending on whether I am fitting a model to predict a continuous outcome vs. a binary outcome?
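
A tiny sketch of what I mean (the arrays are made up; the point is just that the binary column enters as 0/1 alongside the continuous one, and that the binary-vs-continuous distinction for the outcome changes the model family, not the inputs):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# one continuous predictor and one 0/1 binary predictor (made-up data)
X = np.array([[2.3, 1], [1.1, 0], [3.7, 1], [0.9, 0]])
y_continuous = np.array([5.1, 2.0, 7.4, 1.8])   # continuous outcome -> linear regression
y_binary = np.array([1, 0, 1, 0])               # binary outcome -> logistic regression

LinearRegression().fit(X, y_continuous)
LogisticRegression().fit(X, y_binary)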


Multiple Regression, good P-value, but Low R2

I am trying to build a model in R to predict conversion rate (CR) based on age, gender and interest (and also the campaign ID, xyz_campaign_id):

The CR values look like this:

[plot of the CR values]

The correlation coefficients are not very promising:

rcorr(as.matrix(data.numeric))

correlations with CR:

xyz_campaign_id (-0.19), age (-0.1), gender (-0.04), interest (-0.03)

So, the model below:

library(caret)
set.seed(100)
TrainIndex <- sample(1:nrow(data), 0.8*nrow(data))
data.train <- data[TrainIndex,]
data.test <- data[-TrainIndex,]
nrow(data.test)
model <- lm(CR ~ age + gender + interest + xyz_campaign_id , data=data.train)

will not have a good adjusted r-squared (0.04):

Call:
lm(formula = CR ~ age + gender + interest + xyz_campaign_id, 
    data = data.train)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.636 -11.858  -4.087   0.115  96.421 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     47.231250   6.287738   7.512  1.4e-13 ***
age35-39         1.214713   1.916649   0.634  0.52639    
age40-44        -1.971037   1.986316  -0.992  0.32131    
age45-49        -3.064858   1.866713  -1.642  0.10097    
genderM          3.709192   1.412311   2.626  0.00878 ** 
interest         0.030384   0.027617   1.100  0.27154    
xyz_campaign_id -0.037856   0.006076  -6.231  7.1e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.16 on 907 degrees of freedom
Multiple R-squared:  0.05237,   Adjusted R-squared:  0.04611 
F-statistic: 8.355 on 6 and 907 DF,  p-value: 7.81e-09

I also understand that I should probably convert "interest" from numeric to a factor (I have tried that too, although using all 40 interest levels is not ideal).

So, based on the provided information, is there any way to improve the model? What other models should I try besides linear models to make sure I end up with a good predictive model?

If you need more information, the challenge is available Here


SlidingEstimator with multi-class logistic regression

I am trying to solve an issue with not being able to use roc_auc as the scoring method when evaluating a classifier based on ordinal (multi-class) logistic regression.
The logistic regression takes as input an independent variable X1 that is 3D and a dependent variable y1 that is 1D. I have tried to solve the problem in several ways, but the alternatives I found do not let me fit the data, because X1 is 3-dimensional rather than 2-dimensional. Does anyone know how I could solve this? The script is the following:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from mne.decoding import SlidingEstimator, cross_val_multiscore

X1 = epochs_awake.get_data()  # MNE epochs data, shape (n_epochs, n_channels, n_times)
y1 = c_level_awake
clf1 = make_pipeline(StandardScaler(), LogisticRegression(solver='newton-cg', multi_class='ovr'))
time_decod = SlidingEstimator(clf1, n_jobs=1)  # scoring??
scores = cross_val_multiscore(time_decod, X1, y1, cv=5, n_jobs=1)
scores = np.mean(scores, axis=0)
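
One thing I have been considering (not sure it is the right fix) is passing a multi-class AUC scorer string directly to SlidingEstimator; as far as I understand, scikit-learn >= 0.22 ships a 'roc_auc_ovr' scorer that uses predict_proba, which LogisticRegression provides:

# hypothetical variant of the lines above, with an explicit multi-class AUC scorer
time_decod = SlidingEstimator(clf1, scoring='roc_auc_ovr', n_jobs=1)
scores = cross_val_multiscore(time_decod, X1, y1, cv=5, n_jobs=1)
scores = np.mean(scores, axis=0)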


Regression and ARIMA transfer functions in R

I’m currently following the “Forecasting Product Demand in R” (https://www.datacamp.com/courses/visualizing-time-series-data-in-r) course on Datacamp, and I’ve been stuck trying to understand transfer functions for way too long.

This is the mathematics background given in the course (shown as an image in the original post):

My understanding was that transfer functions insert the ARIMA model into the error term of the regression model at the estimation stage, but here it seems that the regression prediction and the forecast of the ARIMA residuals are simply multiplied together. I don’t understand where this comes from.
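
To make my understanding concrete, the formulation I have in mind is the usual regression with ARIMA errors, something like

$$y_t = \beta_0 + \beta_1 x_t + \eta_t, \qquad \eta_t \sim \text{ARIMA}(p, d, q),$$

where the forecast is the regression prediction plus (not times) the ARIMA forecast of $\eta_t$.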

Would anyone have a good resource to help me understand better? I hate going forward in a course without understanding the core concepts correctly.


CNN architectures for regression?

I’ve been working on a regression problem where the input is an image and the label is a continuous value between 80 and 350. The images show some chemicals after a reaction has taken place. The resulting color indicates the concentration of another chemical that is left over, and that concentration is what the model should output. The images can be rotated, flipped or mirrored, and the expected output should stay the same. This sort of analysis is done in real labs (very specialized machines output the concentration of the chemicals using color analysis, just like I’m training this model to do).

So far I’ve only experimented with models roughly based on VGG (multiple sequences of conv-conv-conv-pool blocks). Before experimenting with more recent architectures (Inception, ResNets, etc.), I thought I’d check whether there are other architectures more commonly used for regression from images.

The dataset looks like this:

[sample images from the dataset]

The dataset contains about 5,000 250×250 samples, which I’ve resized to 64×64 so training is easier. Once I find a promising architecture, I’ll experiment with larger resolution images.

So far, my best models have a mean squared error on both training and validation sets of about 0.3, which is far from acceptable in my use case.

My best model so far looks like this:

// pseudo code
x = conv2d(x, filters=32, kernel=[3,3])->batch_norm()->relu()
x = conv2d(x, filters=32, kernel=[3,3])->batch_norm()->relu()
x = conv2d(x, filters=32, kernel=[3,3])->batch_norm()->relu()
x = maxpool(x, size=[2,2], stride=[2,2])

x = conv2d(x, filters=64, kernel=[3,3])->batch_norm()->relu()
x = conv2d(x, filters=64, kernel=[3,3])->batch_norm()->relu()
x = conv2d(x, filters=64, kernel=[3,3])->batch_norm()->relu()
x = maxpool(x, size=[2,2], stride=[2,2])

x = conv2d(x, filters=128, kernel=[3,3])->batch_norm()->relu()
x = conv2d(x, filters=128, kernel=[3,3])->batch_norm()->relu()
x = conv2d(x, filters=128, kernel=[3,3])->batch_norm()->relu()
x = maxpool(x, size=[2,2], stride=[2,2])

x = dropout()->conv2d(x, filters=128, kernel=[1, 1])->batch_norm()->relu()
x = dropout()->conv2d(x, filters=32, kernel=[1, 1])->batch_norm()->relu()

y = dense(x, units=1)

// loss = mean_squared_error(y, labels)
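
For reference, a minimal Keras sketch of roughly this architecture (assuming 3-channel 64×64 inputs, 'same' padding, a dropout rate of 0.5 and a global average pool before the dense head; none of those details are fixed in the pseudocode above):

from tensorflow.keras import layers, models

def conv_block(x, filters):
    # three conv-batchnorm-relu layers followed by 2x2 max pooling
    for _ in range(3):
        x = layers.Conv2D(filters, (3, 3), padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
    return layers.MaxPooling2D((2, 2), strides=(2, 2))(x)

inputs = layers.Input(shape=(64, 64, 3))
x = conv_block(inputs, 32)
x = conv_block(x, 64)
x = conv_block(x, 128)

x = layers.Dropout(0.5)(x)
x = layers.Conv2D(128, (1, 1))(x)
x = layers.BatchNormalization()(x)
x = layers.Activation('relu')(x)

x = layers.Dropout(0.5)(x)
x = layers.Conv2D(32, (1, 1))(x)
x = layers.BatchNormalization()(x)
x = layers.Activation('relu')(x)

x = layers.GlobalAveragePooling2D()(x)   # reduce to a vector before the regression head
outputs = layers.Dense(1)(x)             # single continuous output

model = models.Model(inputs, outputs)
model.compile(optimizer='adam', loss='mean_squared_error')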

Question

What is an appropriate architecture for regression output from an image input?

Edit

I’ve rephrased my explanation and removed mentions of accuracy.

Edit 2

I’ve restructured my question so that it is hopefully clear what I’m after.


Why doesn’t deep learning work as well in regression as in classification?

There is a lot of research showing deep learning working very well for classification, but not so much for regression.

SVR and tree-based approaches still seem competitive, and I couldn’t find a well-established deep architecture for regression.

I know there are some schemes you have to follow when implementing deep regression, but I want to know why it doesn’t work as well as classification.

Is there any good theory or explanation for why deep learning (MLPs) doesn’t work as well for regression as it does for classification?

I want to dig into this problem.


Keras Loss Function for Multidimensional Regression Problem

I am new to DL and Keras. I am trying to solve a regression problem with a multivariate output (y shape (?, 2)) using Keras (TensorFlow backend). I am confused about how the loss is calculated. I use mean absolute error as the loss function. However, since my target has 2 dimensions, is the loss value calculated as the mean reduced over all dimensions (a scalar)? I checked the Keras source code: it uses K.mean(…, axis=-1) for the MAE calculation. If K.mean is the same as numpy.mean, axis=-1 takes the mean over the last dimension, so in my case it should return a tensor of shape (?,) (one value per sample), not a scalar. If that is the case, how can the loss value be a single number (as output in the training log)?

If the MAE is indeed reduced to a scalar, this gives me another problem: the two dimensions of my target are not on the same scale, so a plain mean would be biased toward the dimension with the larger values. Should I switch to a multi-task learning model instead?
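
To illustrate the second point, a sketch of a per-dimension weighted MAE I have been considering (the weights [1.0, 10.0] are made up, and `model` stands for my compiled Keras model; rescaling or normalizing the targets would be an alternative). As far as I understand, Keras then averages these per-sample values over the batch to produce the single number shown in the training log:

import tensorflow as tf

def weighted_mae(weights):
    w = tf.constant(weights, dtype=tf.float32)
    def loss(y_true, y_pred):
        # per-sample mean over the output dimensions, each error rescaled by its weight
        return tf.reduce_mean(tf.abs(y_true - y_pred) * w, axis=-1)
    return loss

model.compile(optimizer='adam', loss=weighted_mae([1.0, 10.0]))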

Thanks a lot for your help on this.

L.


Dealing with new Factor Levels in a Regression in R

I originally posted this on Stack Overflow (as given here) but was told to try here since it might be more relevant. I am very new to statistics and R in general, so my question might be a bit dumb, but since I cannot find a solution online I thought I should try asking it here.

I have a data frame, dataset, with a lot of different variables, very similar to the following:

 Item | Size   | Value    | Town
----------------------------------
A     |  10    |   800   | 1
B     |  11    |   100   | 2
A     |  17    |   900   | 2
D     |  13    |   200   | 3
B     |  15    |   500   | 1
C     |  12    |   250   | 3
E     |  14    |    NA   | 2
A     |        |   800   | 1
C     |        |   800   | 2

Basically, I have to ‘guess’ the Size based on the type of Item, its Value, and the Town it was sold in, so I think a regression method would be a good idea.

I now use a regression as follows:

lm(Size ~ factor(Item) + factor(Town) + Value,...)

The problem however, occurs when I try and predict the Size using this model. I have the following lines of code:

pmodel <- lm(Size ~ factor(Item) + factor(Town) + Value,...)
prediction <- predict(pmodel, dataset2)

(where dataset2 is the subset of dataset containing all the rows with empty "Size" values, which I want to predict)

But this now comes up with the error:

Error in model.frame.default(Terms, newdata, na.action=na.action, xlev = object$xlevels): factor factor(Town) has new levels

Is there any way around this, or any way I can get the model to make predictions based on the other variables when it encounters a level that is not in the original dataset? Thanks!
