How to use categorical explanatory variables in a multiple regression with an ordinal “score” response

I have limited statistical experience from my undergraduate coursework, running simple linear regressions and performing chi-square tests. I have some data: ~5,000 survey results on individuals, each with a score on a scale of 1-12 for how security conscious they are (determined by their answers to earlier security-related questions). We also asked multiple-choice questions on income (USD 0-USD 19,999; USD 20,000-USD 39,999; etc.), age (21-30, 31-40, etc.), and level of education (High School, Undergraduate, Masters, and Doctoral). I want to know how to set this up to determine which is the biggest factor in determining security consciousness, with statistical significance. Here is a pivot table from my Excel file with random data. I have all the individual responses as well.

Should I be using dummy variables (one for each category within a group)? I also tried replacing each range with a single number (USD 10,000 for USD 0-USD 19,999, or 26 for age 21-30), but I am still not sure how that would work for education. I have run some tests and am unsure what the results mean (e.g., regressing security consciousness against each of the income brackets, but this doesn’t seem to make sense, since no one can be in more than one bracket, yet each bracket was given its own coefficient). I am fairly certain my chi-square tests make sense: they tell me that income, age, and education all play a “significant role” in the variation of security consciousness (all the p-values were well below .05, around .001-.003). But how do I quantify this “significant role” within the variation?

Can anyone let me know how best to go about forming the correct conclusions (e.g., “income is the largest factor in determining level of security consciousness” or “age has no statistically significant effect on security consciousness”)? How do I use my categorical explanatory variables in a multiple regression with my ordinal security-consciousness “score” as the response variable?
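
To make the dummy-variable idea concrete, here is a sketch of the setup I think I need, in Python with statsmodels for illustration (the data frame below is randomly generated as a stand-in for my survey fields):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "score":     rng.integers(1, 13, n),  # security consciousness, 1-12
    "income":    rng.choice(["0-19k", "20-39k", "40-59k", "60k+"], n),
    "age":       rng.choice(["21-30", "31-40", "41-50", "51+"], n),
    "education": rng.choice(["HS", "Undergrad", "Masters", "Doctoral"], n),
})

# C() expands each categorical column into dummy variables, one per level,
# automatically dropping a reference level so the design matrix is full rank.
model = smf.ols("score ~ C(income) + C(age) + C(education)", data=df).fit()
print(model.summary())  # per-level coefficients plus t-tests and the overall F-test

Is this roughly the right setup, given that my response is really an ordinal score rather than a continuous one?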

Determining which variables to use in a regression model

So I’m trying to fit some binary-outcome data with a logistic regression model. Besides the binary outcome, I have several different metrics (numeric and integer variables as well as factors) associated with each case. The idea, as usual, is to get the best model describing the data without overfitting.

I’m using R for this; to try things out and keep the data well organized, I use the glm function. I can use it to create a model with all the variables (not a good one), or I can choose which ones to include. But how does one determine which ones should be used? I know I can use AIC values to see whether one model is better than another, but I have many metrics to choose from, so that would mean a lot of different models to try out, and I don’t think that is the intended way to use AIC.

So yeah, what is the basic approach in situations like this? Do I run the glm function on a single variable at a time, see whether it has any significance, and choose from there, or are there other, more effective approaches?
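
To illustrate, here is the kind of one-model-at-a-time AIC comparison I mean, sketched in Python with statsmodels for concreteness (I am doing the equivalent with glm() and AIC() in R; all the variables are made up):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "y":  rng.integers(0, 2, 200),           # binary outcome
    "x1": rng.normal(size=200),              # a numeric metric
    "x2": rng.normal(size=200),              # another metric
    "g":  rng.choice(["a", "b", "c"], 200),  # a factor
})

# Fit a few candidate logistic models and compare their AICs.
candidates = ["y ~ x1", "y ~ x1 + x2", "y ~ x1 + x2 + C(g)"]
for formula in candidates:
    fit = smf.logit(formula, data=df).fit(disp=0)
    print(f"{formula:<22} AIC = {fit.aic:.1f}")

With many metrics the number of such candidate formulas explodes, which is exactly my problem.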

Multivariate clustering, dimensionality reduction and data scaling for regression

I have a dataset with approximately 20,000 observations consisting of 40 independent variables and 1 dependent variable. My initial objective is to develop a model that predicts the dependent variable. I have tried several models, applying linear regression and other algorithms such as random forests, splitting the dataset into training and testing sets, of course. Unfortunately I cannot get any meaningful results; I have very large errors. I believe there is something “messy” with the dataset, so I have decided to do some clustering first and then apply regression within each cluster. Considering that my dependent variable may exhibit a lot of variation, I believe I should cluster on all variables (dependent and independent). I have tried to apply k-means and faced several problems. First of all, it seems I cannot identify the right number of clusters: the “elbow” method gives an unclear answer, and when I use it with less data (about 2,000 observations) I get something like this:

[elbow plot: within-cluster sum of squares against number of clusters, with no clear bend]

I also had similar problems with hierarchical clustering. I have already tried to apply regressions within the clusters identified, but the results are still very poor.

Right now I believe I should apply some kind of “weight” to my data, in order to put more emphasis on the dependent variable when I cluster, since I believe that is the problem. Hence, my questions here are:

1) Is there a way/algorithm to adjust the weights of the variables being clustered?

Moreover, I am confused about two more issues:

2) Data scaling: is it necessary to scale the data before clustering? Does it guarantee more accurate results? When should the data be scaled?

3) Dimensionality reduction: I have read a lot about principal component analysis (PCA) and dimensionality reduction, but I am still confused. Again: is it necessary? How many variables are too many before PCA is worth applying? Are 10 variables too many? 20? 50? When should dimensionality reduction be applied? One problem with PCA is that I would still need my original variables to interpret the coefficients after regression, and to my understanding, with PCA I cannot do that.

This question is more about understanding some particular concepts and finding a solution to my problem than about coding issues, but any examples and/or references would be appreciated; a sketch of what I mean follows below. I am coding in R.
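
To show what I mean by weighting, scaling, and reducing, here is a sketch in Python with scikit-learn purely for concreteness (I am doing the equivalent in R with scale(), kmeans(), and prcomp()); the data and the weight value are placeholders:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 40))  # stand-in for my 40 independent variables
y = rng.normal(size=2000)        # stand-in for my dependent variable

# Standardize so no variable dominates the distance metric; a crude "weight"
# on the dependent variable is to multiply its standardized column by w > 1.
w = 3.0
Z = StandardScaler().fit_transform(np.column_stack([X, y]))
Z[:, -1] *= w

# Elbow-method data: within-cluster sum of squares for k = 1..10.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(Z).inertia_
            for k in range(1, 11)]
print(inertias)

# PCA on the standardized predictors alone, keeping ~90% of the variance.
pca = PCA(n_components=0.90).fit(StandardScaler().fit_transform(X))
print(pca.n_components_)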

Regression based on a series of depths (similar to a time series)

I have a dataset with a set of independent variables and a target variable. The target variable decays roughly exponentially with depth.

Is there a way to identify a general depth function for my target variable? This is similar to a time series, but I am not sure whether time-series methods apply.

I am only looking for suggestions on how I can perform regression based on the depth.

# Sample data to show what my data looks like
import pandas as pd

area = ['area_1', 'area_1', 'area_1', 'area_2', 'area_2', 'area_2', 'area_3', 'area_3', 'area_3']
depth = [10, 20, 30, 10, 20, 30, 10, 20, 30]
target = [9, 7.201, 0.005, 27, 20, 3, 3, 1.1, 0.5]
x1 = [0.8, 0.8, 0.8, 0.5, 0.4, 0.3, 0.22, 0.214, 0.21]
x2 = [3, 2.8, 0.4, 6, 4, 2.2, 2, 1, 0.003]

df = pd.DataFrame(data={'area': area, 'depth': depth, 'x1': x1, 'x2': x2, 'target': target})
df


I did perform multiple regression, but it doesn’t take depth into account. I was wondering whether I could use MARS (multivariate adaptive regression splines), but it would still consider the data as a whole and would not fit a separate regression for every area.
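
For instance, here is a sketch of the per-area exponential fit I am wondering about: since the target seems to decay exponentially with depth, regress log(target) on depth separately within each area (this assumes all target values are positive, and reuses the sample frame df from above):

import numpy as np

for name, g in df.groupby('area'):
    # log-linear fit: log(target) = a + b*depth, i.e. target = e^a * e^(b*depth)
    b, a = np.polyfit(g['depth'], np.log(g['target']), deg=1)
    print(f"{name}: target ~ {np.exp(a):.3f} * exp({b:.4f} * depth)")

Is something along these lines reasonable, or is there a more standard way to handle depth?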

Appreciate your help. Thanks in advance.

Linear Regression Coefficient Calculation

import numpy as np

class LR:

    def __init__(self, x, y):
        self.x = np.asarray(x)
        self.y = np.asarray(y)
        self.xmean = np.mean(self.x)
        self.ymean = np.mean(self.y)
        self.x_xmean = self.x - self.xmean
        self.y_ymean = self.y - self.ymean
        # unnormalized covariance and variance (sums of cross-products/squares)
        self.covariance = sum(self.x_xmean * self.y_ymean)
        self.variance = sum(self.x_xmean * self.x_xmean)
        # compute the coefficients up front so getYhat works immediately
        self.getCoefficients()

    def getYhat(self, input_x):
        input_x = np.array(input_x)
        return self.intercept + self.slope * input_x

    def getCoefficients(self):
        self.slope = self.covariance / self.variance
        self.intercept = self.ymean - (self.xmean * self.slope)
        return self.intercept, self.slope

I am using the above class to calculate the intercept and slope for a simple linear regression. However, I would like to tweak it to make it work for multiple linear regression as well, but WITHOUT using the matrix formula $(X^TX)^{-1}X^Ty$.

Please suggest an approach.
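
For example, would something like the following gradient-descent sketch be a reasonable direction? It minimizes the mean squared error iteratively instead of solving the normal equations (the learning rate and iteration count are placeholders I would need to tune, and the features may need scaling first):

import numpy as np

def fit_gd(X, y, lr=0.01, n_iter=5000):
    # X: (n_samples, n_features) design matrix; y: (n_samples,) targets
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    beta = np.zeros(p)  # one slope per feature
    intercept = 0.0
    for _ in range(n_iter):
        resid = X @ beta + intercept - y
        beta -= lr * (2.0 / n) * (X.T @ resid)     # gradient w.r.t. the slopes
        intercept -= lr * (2.0 / n) * resid.sum()  # gradient w.r.t. the intercept
    return intercept, beta

# e.g. X = np.column_stack([x1, x2]); intercept, slopes = fit_gd(X, y)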

Error in rstanarm with beta regression using stan_glmer

After fitting a model with the stan_glmer or stan_glm function using mgcv::betar as the family, I get an error when I try to call posterior_predict on it. R says:

Error in exp(eta) : non-numeric argument to mathematical function

A minimal example:

library(rstanarm)
library(loo)
library(mgcv)

a <- rnorm(100, 0.5, 0.1)
b <- a+rnorm(100, 0.6, 0.01)
d <- data.frame(a=a, b=b)

fit <- stan_glm(a ~ b,
                data = d,
                family = betar,
                chains = 10,
                seed = 1)

posterior_predict(fit)

Multinomial logistic regression

I would like to get help from an expert or anyone who knows about this. As a beginner in SPSS, I’ve googled the steps for doing multinomial regression. I’ve decided to use multinomial because, from what I googled and understood, binomial is for outcomes that have only 2 categories, like yes or no, which you could code as 0 for yes and 1 for no. That is my understanding of when to use binomial.

My situation is that I have a set of questionnaire items with more than 2 answer options. For example, level of income: 1. $2000, 2. $3000, 3. $4000, 4. $5000. That is the kind of question I have. So, naturally, I chose multinomial regression.

My IVs are 1) social capital, 2) training, and 3) credit/loan; my DV is effectiveness, which is measured by 1) income, 2) savings, and 3) repayment rate.

In my questionnaire, I have only two kinds of questions: “yes or no,” and multiple-choice items like the income example above (“1 or 2 or 3 or 4”). So my questions are:

  1. Did I choose the right regression to use, that is multinomial, or do I have to use binomial? (A small sketch of the distinction follows this list.)
  2. If it’s multinomial, I need help with the steps. Do I have to combine all my DV questions into one score, creating a new “item” (column) in SPSS, and then put that in the dependent box of the multinomial regression dialog? I have approximately 15 questions measuring my DV. Then, in the factors box, what do I put: each of my IV questions separately (about 30 questions) or just the total of the 30 questions? And what should I put in covariates?
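
To illustrate what I mean in question 1, here is a small sketch of the distinction (in Python with statsmodels, purely for illustration, since I will be doing this in SPSS; the data are random placeholders): a multinomial model has a DV with more than two categories and yields one coefficient set per non-reference category.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income_level": rng.integers(1, 5, 300),  # DV with 4 categories (1..4)
    "training":     rng.integers(0, 2, 300),  # an IV coded yes/no
    "credit":       rng.normal(size=300),     # an IV on a continuous scale
})

X = sm.add_constant(df[["training", "credit"]])
fit = sm.MNLogit(df["income_level"], X).fit(disp=0)
print(fit.summary())  # one block of coefficients per non-reference category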

http://sdrv.ms/JtOHu3

Above is a link to my sample questionnaire. Income level, savings, and repayment rate measure my DV (effectiveness of microfinance). The others are my IVs: credit/capital, training, and social capital. Feel free to suggest improvements to my questionnaire.