How to Interpret a Multiple Regression Analysis Model

We have come a long way this semester (in spite of weather cancellations!) and each of you should be congratulated on your QBA performance to date.  Just one more hurdle to go – Business Statistics Forum #6!

The last section of Part II (see "BSF #6 Guidelines" on the left side of this site) has you analyze the potential for employment discrimination at IBM using hypothetical employee data and an SPSS procedure that produces a multiple regression model based on the data.  We have been reviewing the relationship between correlation (r and r²) and regression (R, R², and 1 − R²) in class, through lectures, blog .pdfs, and a PowerPoint presentation (also found on the left side of the site), and have integrated SPSS procedures into the discussion.  Though an important part of the discussion has centered on what to do when the numbers are in, I thought I'd continue this part of our in-class dialogue here to help sharpen your data analytic skills.

Specifically, you are to test and analyze the following regression equation and include the implications in the recommendations section written for your boss at IBM:

[Figure: the regression equation to be tested; in general form, Ŷ (salary) = b₀ + b₁X₁ + b₂X₂ + … + b₇X₇]

Since your task is to interpret MRA output, I will assume you have already reviewed the assigned textbook readings, class notes and .pdfs, and the PowerPoint presentation and are ready to roll up your sleeves and get to work.  For the purpose of this discussion, I will not run and analyze all seven independent variables on the dependent variable (salary) that are required in this MRA.  Instead, I will run only four: age, education, previous experience in months, and months since being hired (see Figure 1: Explaining Salary from Four Predictors using MRA).  Notice that there are three tables in this figure: the Model Summary, the ANOVA table, and the Coefficients table.  I will discuss the role of each table in your MRA in turn.

The model summary output stems from a default forced entry procedure in SPSS (i.e., METHOD=ENTER), rather than FORWARD, STEPWISE, or some other useful method.  The implication of METHOD=ENTER is that all predictor variables are entered into the regression equation at one time and subsequent analysis then follows.  Note: The decision to choose a specific method will depend on the theoretical interests and understandings of the researcher.  As an example, refer to our fifth BSF (Laurie Burney and Nancy Swanson, "The Relationship between Balanced Scorecard Characteristics and Managers' Job Satisfaction," Journal of Managerial Issues, 22(2), Summer 2010: 166-181), which used MRA to explain managers' job satisfaction from multiple sets of predictor variables.
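
If you would like to reproduce this kind of run yourself, the syntax below is a minimal sketch of a forced-entry regression.  The variable names (salary, age, educ, prevexp, jobtime) are my assumptions for the hypothetical IBM file, so substitute whatever names appear in your own dataset.

    * Forced-entry MRA sketch; variable names are assumed, not taken from the BSF #6 file.
    REGRESSION
      /STATISTICS COEFF OUTS R ANOVA
      /DEPENDENT salary
      /METHOD=ENTER age educ prevexp jobtime.

METHOD=ENTER forces all four predictors into the equation at once, which is exactly the default behavior described above; swapping in /METHOD=STEPWISE would let SPSS choose the order of entry instead.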
There are two essential pieces of information in the Model Summary table: R and R².  The multiple correlation coefficient (R) is a measure of the strength of the relationship between Y (salary) and the four predictor variables selected for inclusion in the equation.  In this case, R = .667, which tells us there's a moderate-to-strong relationship.  By squaring R, we identify the value of the coefficient of multiple determination (i.e., R²).  This statistic enables us to determine the amount of explained variation (variance) in Y from the four predictors on a range from 0 to 100 percent.  Thus, we're able to say that 44.5 percent of the variation in Y (salary) is accounted for through the combined linear effects of the predictor variables.  THIS IS MISLEADING, however, since we don't yet know which of the predictors has contributed significantly to our understanding of Y and which ones have not.  We will address this important issue in the last table (Coefficients) when we explore each predictor's beta (i.e., standardized regression coefficient) and its level of significance.
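
The arithmetic linking the two statistics is worth making explicit:

    R² = (.667)² ≈ .445, i.e., 44.5 percent of the variance in salary.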
Question: How do we know if the MRA model itself is statistically significant, or if we're just wasting our time staring at non-significant output?  The answer is found in the ANOVA table.  Because R² is not a test of statistical significance (it only measures explained variation in Y from the predictor Xs), the F-ratio is used to test whether or not an R² this large could have occurred by chance alone.  In short, the F-ratio found in the ANOVA table tests whether the variation explained by the regression model is larger than we would expect from chance departures from a straight line.  On review of the output found in the ANOVA table, we can address the above-cited question in one sentence: the overall equation was statistically significant (F = 93.94, p < .001).  (A note on reporting: SPSS prints "Sig. = .000" when the p-value rounds to zero at three decimals; report this as p < .001, since a p-value is never exactly zero.)
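
For those curious where the F-ratio comes from, it can be written directly in terms of R², the number of predictors k, and the sample size n (a textbook identity, not something printed in the SPSS output):

    F = (R² / k) / ((1 − R²) / (n − k − 1))

Plugging in the reported R² = .445 and k = 4 gives F ≈ .2004 × (n − 5), so an F of 93.94 is just what we would expect from a sample of roughly n ≈ 474 cases.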
Figure 1: Explaining Salary from Four Predictors using MRA

Finally, as I previously pointed out, we need to identify which predictors are significant contributors to the 44.5 percent of explained variance in Y (i.e., R² = .445) and which ones are not, and in what way(s) the significant ones help us to explain Y.  Answers to these important questions are found in the Coefficients table shown above.  In terms of our focus, note that for each predictor variable in the equation, we are only concerned with its associated (1) standardized beta and (2) the t-test statistic's level of significance (Sig.).  As always, whenever p < .05, we find the results statistically significant.  For the purpose of understanding MRA output, this means that when a predictor's p-value (Sig.) is less than .05, the corresponding beta is a significant contributor to the equation.  So what did we find?

From this equation, the level of education was found to be the only independent variable with a significant impact on employee salary (β = .673, p < .001) when all of the variables were entered into the regression equation.  We found that the higher the level of employee education at IBM, the greater the salary level.  The employee's age, previous experience in months, and time on the job since being hired did not meet the necessary criteria to significantly impact employee salary, so they played no role at this stage of the analysis.  We should note, however, that an employee's previous experience (measured in months) did approach significance (β = .107, p = .06).
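
To put that standardized beta into words (this is the standard textbook reading of a beta weight, not an extra result from the output):

    β = .673 means that a one-standard-deviation increase in education predicts a .673 standard-deviation increase in salary, holding age, previous experience, and months since hire constant.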

I hope this helps!

Professor Ziner

Calculating the Expected Frequencies (E) in a Chi-Square Test of Independence

I have had several students ask me to review how to calculate a cell's "E," the value of the expected frequency under the null hypothesis, across all cells in a Chi-Square Test of Independence.  The process is simple: to calculate each cell's E, multiply that cell's row marginal total by its column marginal total, then divide by n (the sample size).  The worked 2×2 example below illustrates the procedure.
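
To see the calculation in action, consider a hypothetical 2×2 table (the counts below are invented purely for illustration):

                      Col 1   Col 2   Row total
    Row 1               20      40          60
    Row 2               10      30          40
    Column total        30      70     n = 100

    E(Row 1, Col 1) = (60 × 30) / 100 = 18
    E(Row 1, Col 2) = (60 × 70) / 100 = 42
    E(Row 2, Col 1) = (40 × 30) / 100 = 12
    E(Row 2, Col 2) = (40 × 70) / 100 = 28

A handy check: the expected frequencies in each row and column must add up to the same marginal totals as the observed counts (18 + 42 = 60, 18 + 12 = 30, and so on).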


Note that this procedure differs from the one-variable Chi-Square "Goodness of Fit" test.  To obtain the expected frequency for each cell in the "Goodness of Fit" test, divide the sample size by the number of categories (this assumes the null hypothesis specifies equal proportions across all categories, which is the usual classroom case).
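
If you would rather have SPSS print the expected frequencies than compute them by hand, the following is a minimal sketch; row_var and col_var are placeholder names for your own two categorical variables.

    * Request observed and expected counts plus the chi-square test; variable names are placeholders.
    CROSSTABS
      /TABLES=row_var BY col_var
      /STATISTICS=CHISQ
      /CELLS=COUNT EXPECTED.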

I hope this helps!

Professor Ziner