Multiple regression

About the “What researchers mean by...” series

This research term explanation first appeared in a regular column called “What researchers mean by…” that ran in the Institute for Work & Health’s newsletter At Work for over 10 years (2005-2017). The column covered over 35 common research terms used in the health and social sciences. The complete collection of defined terms is available online or in a guide that can be downloaded from the website.

Published: February 2017

In another column, we talked about the term simple regression – a statistical method used to describe the relationship between two factors. We asked you to take on the role of a researcher for a real estate agency trying to find a way to accurately price clients’ homes based on house size. Using simple regression, you came up with an equation to do so. However, you didn’t advise the real estate agency to price clients’ homes based on house size alone. You knew other factors also affect selling price. This is where multiple regression comes in.

Instead of looking at a one-to-one relationship, multiple regression looks at a one-to-many relationship. It is a statistical technique that allows researchers to examine two or more factors (called independent variables) at the same time and analyze the extent to which each predicts or explains variations in the outcome of interest (called the dependent variable). The end result is a model (which, in essence, is a mathematical formula) that can be used to explain or predict outcomes based on the values of the different factors.
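As a rough illustration of what such a formula looks like, a linear multiple regression model is usually written in the form below. The symbols are our own shorthand, not taken from the original column: the predicted outcome is written as y with a hat, x_1 to x_k are the independent variables, b_0 is the intercept and b_1 to b_k are the weights the analysis estimates for each variable.

```latex
\[
\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k
\]
```

In the real estate example, y would be the selling price and each x would be one factor, such as house size or lot size. Each b then describes how much the predicted price changes when that factor goes up by one unit and the other factors stay the same.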

Main steps in multiple regression

Multiple regression analysis is not easy. It’s an elaborate process, involving many steps and usually requiring sophisticated software. Let’s go back to our example to take a look at some of the main steps in doing a multiple regression. Most of them are preparatory, to ensure you are feeding the best information into the software program.

1. Determine the independent variables you want to include in your model. These variables need to make sense. Drawing on your understanding of the real estate market, you decide to include house size, neighbourhood average income, proximity to good schools, lot size, and number of bedrooms and bathrooms.

2. Collect information on each of the variables. You now randomly select, say, 100 houses that recently sold in the city. For each, you collect information on its size, neighbourhood income, proximity to good schools, lot size, number of bedrooms and bathrooms and, of course, its selling price.

3. Explore the relationship between each independent variable being considered and the dependent variable. Using the information collected, you look at the relationship between house size and house price, average neighbourhood income and price, proximity to good schools and price, and so on. You use statistical techniques to determine if a clear (i.e. statistically significant) relationship exists between the factor and house price. If yes, you are more likely to keep the factor in your model. If not, you may or may not decide to use it depending on the nature of the problem you are trying to address.

4. Explore the relationship among the independent variables. Using the same methods as above, you may decide to look at how the different factors relate to each other; e.g. house size and neighbourhood income, neighbourhood income and proximity to good schools, and so on. You may find two factors are so closely related that it would be hard to tell which is contributing to differences in house prices. This is called “multicollinearity.” Again, depending on the nature of the problem you are trying to address, you may or may not decide to keep both factors. You may also decide to look at how each factor relates to house price when the other factors are taken into account. If a factor is no longer related to price, you may decide to remove it from your model.

5. Perform the multiple regression. For the factors you’ve included in your model, you enter the related information into your software program, do a lot of other statistical prep work (to take into account errors, deviations and so on), then run your program. You end up with an equation that lets you answer questions like: To what extent does each of the factors (house size, neighbourhood income, proximity to good schools, lot size, number of bedrooms and bathrooms) account for variations in home price? What is the predicted price of a particular home knowing the value of all the variables in the model? Multiple regression lets you answer these questions and more. That’s why it’s a powerful tool.
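To make steps 3 to 5 more concrete, here is a minimal sketch in Python using the NumPy library. Everything in it is an assumption made for illustration: the 100 houses are randomly generated rather than real sales, the variable names and the numbers behind the made-up prices are invented, and the helper functions make_synthetic_houses, fit_ols and vif are our own, not part of any statistical package. It reports simple correlations in place of formal significance tests, and it uses the variance inflation factor (VIF) as one common way to flag closely related factors.

```python
import numpy as np

def make_synthetic_houses(n=100, seed=0):
    """Return (X, price, names) for n made-up houses; all numbers are invented."""
    rng = np.random.default_rng(seed)
    size = rng.normal(1800, 400, n)            # house size, square feet
    income = rng.normal(90, 20, n)             # neighbourhood income, $ thousands
    school_km = rng.uniform(0.2, 5.0, n)       # distance to a good school, km
    lot = 2.5 * size + rng.normal(0, 600, n)   # lot size, deliberately tied to house size
    bedrooms = np.clip(np.round(size / 600 + rng.normal(0, 0.5, n)), 1, None)
    bathrooms = np.clip(np.round(size / 900 + rng.normal(0, 0.5, n)), 1, None)
    X = np.column_stack([size, income, school_km, lot, bedrooms, bathrooms])
    # Invented relationship plus random noise, used only to create example prices.
    price = (50_000 + 150 * size + 2_000 * income - 10_000 * school_km
             + 10 * lot + 5_000 * bedrooms + 8_000 * bathrooms
             + rng.normal(0, 20_000, n))
    names = ["size", "income", "school_km", "lot", "bedrooms", "bathrooms"]
    return X, price, names

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns (coefficients, R-squared)."""
    A = np.column_stack([np.ones(len(X)), X])   # first column of ones = intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return coef, r2

def vif(X):
    """Variance inflation factor of each column: 1 / (1 - R-squared) from
    regressing that column on all the other columns."""
    factors = []
    for j in range(X.shape[1]):
        _, r2 = fit_ols(np.delete(X, j, axis=1), X[:, j])
        factors.append(1 / (1 - r2))
    return np.array(factors)

X, price, names = make_synthetic_houses()

# Step 3: how strongly is each independent variable related to price on its own?
for name, column in zip(names, X.T):
    r = np.corrcoef(column, price)[0, 1]
    print(f"{name:>10}: correlation with price = {r:+.2f}")

# Step 4: are any independent variables closely related to each other?
for name, factor in zip(names, vif(X)):
    print(f"{name:>10}: VIF = {factor:.1f}")

# Step 5: fit the model, then use it to explain and to predict.
coef, r2 = fit_ols(X, price)
print("intercept:", round(float(coef[0])))
for name, b in zip(names, coef[1:]):
    print(f"{name:>10}: estimated weight = {b:,.1f}")
print(f"share of price variation explained (R-squared): {r2:.2f}")

# Predicted price for one hypothetical house (values in the same order as names).
new_house = np.array([2000, 95, 1.0, 5000, 3, 2])
predicted = coef[0] + new_house @ coef[1:]
print(f"predicted price: {predicted:,.0f}")
```

A VIF close to 1 suggests a factor shares little information with the others, while larger values point to the kind of overlap described in step 4. In practice, a dedicated statistics package would also report standard errors and significance tests, which this sketch leaves out.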

Source: At Work, Issue 87, Winter 2017: Institute for Work & Health, Toronto