Nice to meet you.

Enter your email to receive our weekly G2 Tea newsletter with the hottest marketing news, trends, and expert opinions.

What Is Linear Regression? How It's Used in Machine Learning

July 16, 2021

linear regression

Isn't linear regression part of statistics?

Undoubtedly, yes.

In fact, most machine learning (ML) algorithms are borrowed from various fields, mainly statistics. Anything that can help models predict better will eventually become part of ML. So, it's safe to say that linear regression is both a statistical and a machine learning algorithm.

Linear regression is a popular and uncomplicated algorithm used in data science and machine learning. It's a supervised learning algorithm and the simplest form of regression used to study the mathematical relationship between variables.

More precisely, linear regression is used to determine the character and strength of the association between a dependent variable and a series of other independent variables. It helps create models to make predictions, such as predicting a company's stock price.

Before trying to fit a linear model to the observed dataset, one should assess whether or not there is a relationship between the variables. Of course, this doesn't mean that one variable causes the other, but there should be some visible correlation between them.

For example, higher college grades don't necessarily mean a higher salary package. But there can be an association between the two variables.

Did you know? The term “linear” means resembling a line or pertaining to lines.

Creating a scatter plot is ideal for determining the strength of the relationship between explanatory (independent) and dependent variables. If the scatter plot doesn't show any increasing or decreasing trends, applying a linear regression model to the observed values may not be beneficial.

Correlation coefficients are used to calculate how strong a relationship is between two variables. It's usually denoted by r and has a value between -1 and 1. A positive correlation coefficient value indicates a positive relationship between the variables. Likewise, a negative value indicates a negative relationship between the variables.

Tip: Perform regression analysis only if the correlation coefficient is either positive or negative 0.50 or beyond.

If you were looking at the relationship between study time and grades, you’d probably see a positive relationship. On the other hand, if you look at the relationship between time on social media and grades, you'll most likely see a negative relationship.

Here, “grades” is the dependent variable, and time spent studying or on social media is the independent variable. This is because grades depend t on how much time you spend studying.

If you can establish (at least) a moderate correlation between the variables through both a scatter plot and a correlation coefficient, then the said variables have some form of a linear relationship.

In short, linear regression tries to model the relationship between two variables by applying a linear equation to the observed data. A linear regression line can be represented using the equation of a straight line:

y = mx + b

In this simple linear regression equation:

  • y is the estimated dependant variable (or the output)
  • m is the regression coefficient (or the slope)
  • x is the independent variable (or the input)
  • b is the constant (or the y-intercept)

Finding the relationship between variables makes it possible to predict values or outcomes. In other words, linear regression makes it possible to predict new values based on existing data.

An example would be predicting crop yields based on the rainfall received. In this case, rainfall is the independent variable, and crop yield (the predicted values) is the dependent variable.

Independent variables are also referred to as predictor variables. Likewise, dependent variables are also known as response variables.

Key terminologies in linear regression

Understanding linear regression analysis would also mean getting familiar with a bunch of new terms. If you have just stepped into the world of statistics or machine learning, having a fair understanding of these terminologies would be helpful.

  • Variable: It's any number, quantity, or characteristic that can be counted or measured. It's also called a data item. Income, age, speed, and gender are examples.
  • Coefficient: It's a number (usually an integer) multiplied with the variable next to it. For instance, in 7x, the number 7 is the coefficient.
  • Outliers: These are data points significantly different from the rest.
  • Covariance: The direction of the linear relationship between two variables. In other words, it calculates the degree to which two variables are linearly related.
  • Multivariate: It means involving two or more dependent variables resulting in a single outcome.
  • Residuals: The difference between the observed and predicted values of the dependent variable.
  • Variability: The lack of consistency or the extent to which a distribution is squeezed or stretched.
  • Linearity: The property of a mathematical relationship that is closely related to proportionality and can be graphically represented as a straight line.
  • Linear function: It's a function whose graph is a straight line.
  • Collinearity: Correlation between the independent variables, such that they exhibit a linear relationship in a regression model.
  • Standard deviation (SD): It's a measure of the dispersion of a dataset relative to its mean. In other words, it's a measure of how spread out numbers are.
  • Standard error (SE): The approximate SD of a statistical sample population. It's used to measure variability.

Types of linear regression

There are two types of linear regression: simple linear regression and multiple linear regression.

The simple linear regression method tries to find the relationship between a single independent variable and a corresponding dependent variable. The independent variable is the input, and the corresponding dependent variable is the output.

Tip: You can implement linear regression in various programming languages and environments, including Python, R, MATLAB, and Excel.

The multiple linear regression method tries to find the relationship between two or more independent variables and the corresponding dependent variable. There's also a special case of multiple linear regression called polynomial regression.

Simply put, a simple linear regression model has only a single independent variable, whereas a multiple linear regression model will have two or more independent variables. And yes, there are other non-linear regression methods used for highly complicated data analysis.

Logistic regression vs. linear regression

While linear regression predicts the continuous dependent variable for a given set of independent variables, logistic regression predicts the categorical dependent variable.

Both are supervised learning methods. But while linear regression is used to solve regression problems, logistic regression is used to solve classification problems.

linear regression vs. logistic regression

Of course, logistic regression can solve regression problems, but it's mainly used for classification problems. Its output can only be 0 or 1. It's valuable in situations where you need to determine the probabilities between two classes or, in other words, calculate the likelihood of an event. For example, logistic regression can be used to predict whether it’ll rain today.

Assumptions of linear regression

While using linear regression to model the relationship between variables, we make a few assumptions. Assumptions are necessary conditions that should be met before we use a model to make predictions.

There are generally four assumptions associated with linear regression models:

  • Linear relationship: There's a linear relationship between the independent variable x and the dependent variable y.
  • Independence: The residuals are independent. There's no correlation between consecutive residuals in time-series data.
  • Homoscedasticity: The residuals have equal variance at all levels.
  • Normality: The residuals are normally distributed.

Methods to solve linear regression models

In machine learning or statistics lingo, learning a linear regression model means guessing the coefficients' values using the data available. Several methods can be applied to a linear regression model to make it more efficient.

Tip: Use machine learning software to eliminate monotonous tasks and make accurate predictions.

Let’s look at the different techniques used to solve linear regression models to understand their differences and trade-offs.

Simple linear regression

As mentioned earlier, there are a single input or one independent variable and one dependent variable in simple linear regression. It's used to find the best relationship between two variables, given that they're in continuous nature. For example, it can be used to predict the amount of weight gained based on the calories consumed.

Ordinary least squares

Ordinary least squares regression is another method to estimate the value of coefficients when there is more than one independent variable or input. It's one of the most common approaches for solving linear regression and is also known as a normal equation.

This procedure tries to minimize the sum of the squared residuals. It treats data as a matrix and utilizes linear algebra operations to determine the optimal values for each coefficient. Of course, this method can be applied only if we have access to all data, and there should also be enough memory to fit the data.

Gradient descent

Gradient descent is one of the easiest and commonly used methods to solve linear regression problems. It's useful when there are one or more inputs and involves optimizing the value of coefficients by minimizing the model's error iteratively.

Gradient descent starts with random values for every coefficient. For every pair of input and output values, the sum of the squared errors is calculated. It uses a scale factor as the learning rate, and each coefficient is updated in the direction to minimize error.

The process is repeated until no further improvements are possible or a minimum sum of squares is achieved. Gradient descent is helpful when there's a large dataset involving large numbers of rows and columns that won't fit in the memory.

Regularization

Regularization is a method that attempts to minimize the sum of the squared errors of a model and, at the same time, reduce the complexity of the model. It reduces the sum of squared errors using the ordinary least squares method.

Lasso regression and ridge regression are the two famous examples of regularization in linear regression. These methods are valuable when there's collinearity in the independent variables.

Adam’s method

Adaptive moment estimation, or ADAM, is an optimization algorithm used in deep learning. It's an iterative algorithm that performs well on noisy data. It's straightforward to implement, computationally efficient, and has minimal memory requirements.

ADAM combines two gradient descent algorithms – root mean square propagation (RMSprop) and adaptive gradient descent. Instead of using the entire dataset to calculate the gradient, ADAM uses randomly selected subsets to make a stochastic approximation.

ADAM is suitable for problems involving a large number of parameters or data. Also, in this optimization method, the hyperparameters generally require minimal tuning and have intuitive interpretation.

Singular value decomposition

Singular value decomposition, or SVD, is a commonly used dimensionality reduction technique in linear regression. It's a preprocessing step that reduces the number of dimensions for the learning algorithm.

SVD involves breaking down a matrix as a product of three other matrices. It's suitable for high-dimensional data and efficient and stable for small datasets. Due to its stability, it's one of the most preferred approaches for solving linear equations for linear regression. However, it's susceptible to outliers and might get unstable with a huge dataset.

Preparing data for linear regression

Real-world data, in most cases, are incomplete.

Like any other machine learning model, data preparation and preprocessing is a crucial process in linear regression. There will be missing values, errors, outliers, inconsistencies, and a lack of attribute values.

Here are some ways to account for incomplete data and create a more reliable prediction model.

  • Linear regression thinks that the predictor and response variables aren't noisy. Due to this, removing noise with several data clearing operations is crucial. If possible, you should remove the outliers in the output variable.
  • If the input and output variables have Gaussian distribution, linear regression will make better predictions.
  • If you rescale input variables using normalization or standardization, linear regression will generally make better predictions.
  • If there are many attributes, you need to transform the data to have a linear relationship.
  • If the input variables are highly correlated, then linear regression will overfit the data. In such cases, remove collinearity.

Advantages and disadvantages of linear regression

Linear regression is one of the most uncomplicated algorithms to comprehend and simplest to implement. It's a great tool to analyze relationships between variables.

Here are some notable advantages of linear regression:

  • It's a go-to algorithm because of its simplicity.
  • Although it's susceptible to overfitting, it can be avoided with the help of dimensionality reduction techniques.
  • It has good interpretability.
  • It performs well on linearly separable datasets.
  • Its space complexity is low; therefore, it's a high latency algorithm.

However, linear regression isn't generally recommended for the majority of practical applications. It's because it oversimplifies real-world problems by assuming a linear relationship between variables.

Here are some disadvantages of linear regression:

  • Outliers can have negative effects on the regression
  • Since there should be a linear relationship among the variables to fit a linear model, it assumes that there's a straight-line relationship between the variables
  • It perceives that the data is normally distributed
  • It also looks at the relationship between the mean of the independent and dependent variables
  • Linear regression isn't a complete description of relationships between variables
  • The presence of a high correlation between variables can significantly affect the performance of a linear model

First observe, then predict

In linear regression, it's crucial to evaluate whether the variables have a linear relationship. Although some people do try to predict without looking at the trend, it's best to ensure there's a moderately strong correlation between variables.

As mentioned earlier, looking at the scatter plot and correlation coefficient are excellent methods. And yes, even if the correlation is high, it's still better to look at the scatter plot. In short, if the data is visually linear, then linear regression analysis is feasible.

While linear regression lets you predict the value of a dependent variable, there's an algorithm that classifies new data points or predicts their values by looking at their neighbors. It's called the k-nearest neighbors algorithm, and it's a lazy learner.


Get this exclusive AI content editing guide.

By downloading this guide, you are also subscribing to the weekly G2 Tea newsletter to receive marketing news and trends. You can learn more about G2's privacy policy here.