What Is Logistic Regression? A Detailed Guide with Examples

Table of Contents

Linear regression vs. logistic regression
How logistic regression works
What is a logistic function?
What are the types of logistic regression?
When to use logistic regression?
When to avoid logistic regression?
What are the advantages of logistic regression?
What are the common pitfalls in logistic regression, and how do you avoid them?
How to build a logistic regression model
Logistic regression: Frequently asked questions

Life is full of binary choices — have that slice of pizza or skip it, carry an umbrella or take a chance. Some decisions are simple, others unpredictable.

Predicting the right outcome requires data and probability. These are skills that computers excel at. Since computers speak in binary, they’re great at solving yes-or-no problems.

That’s where machine learning software, specifically the logistic regression algorithm, comes in. It helps predict the likelihood of events based on historical data, like whether it’ll rain today or if a customer will make a purchase.

What is logistic regression?

Logistic regression is a statistical method used to predict the outcome of a dependent variable based on previous observations. It's a type of regression analysis and is a commonly used algorithm for solving binary classification problems.

Regression analysis is a predictive modeling technique that finds the relationship between a dependent variable and one or more independent variables.

For example, time spent studying and time spent on Instagram (independent variables) affect grades (the dependent variable) — one positively, the other negatively.

Logistic regression builds on this concept to predict binary outcomes, like whether you’ll pass or fail a class. Although it is technically a regression model (it estimates a probability), it’s primarily used for classification problems.

For instance, logistic regression can predict whether a student will be accepted into a university based on factors such as SAT scores, GPA, and extracurricular activities. Using past data, it classifies outcomes as “accept” or “reject.”

Also known as binomial logistic regression, it becomes multinomial when predicting more than two outcomes. Borrowed from statistics, it’s one of the most widely used binary classification algorithms in machine learning.

Logistic regression measures the relationship between the dependent variable (the outcome) and one or more independent variables (the features). It estimates the probability of an event using a logistic function.

TL;DR: Everything you need to know about logistic regression

What is logistic regression? Logistic regression is a statistical algorithm used to predict binary outcomes (e.g., yes/no, 0/1) based on one or more input variables.
What does logistic regression do? It models the probability of a categorical dependent variable using the logistic (sigmoid) function and outputs class labels based on a decision threshold.
Why does logistic regression matter? It's widely used in fields such as finance, healthcare, and marketing due to its simplicity, speed, and interpretability in classification tasks.
What are the benefits of logistic regression? It provides calibrated probabilities, is easy to implement, performs well on small datasets, and helps uncover relationships between features.
What types of logistic regression are there? The main types are binary logistic regression, multinomial logistic regression (for 3+ classes), and ordinal logistic regression (for ordered classes).
What are the best practices for logistic regression? Ensure a sufficient sample size, check for multicollinearity, use regularization to prevent overfitting, and interpret coefficients in terms of log-odds or odds ratios.

Linear regression vs. logistic regression

While logistic regression predicts a categorical variable from one or more independent variables, linear regression predicts a continuous variable. In other words, logistic regression provides a discrete output, whereas linear regression offers a continuous output.

Since the outcome is continuous in linear regression, there are an infinite number of possible values for the outcome. But for logistic regression, the number of possible outcome values is limited.

In linear regression, the dependent and independent variables should be linearly related. In the case of logistic regression, the independent variables should be linearly related to the log odds, log(p/(1-p)).
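To make the log-odds idea concrete, here is a minimal Python sketch. The function names `logit` and `inverse_logit` are illustrative choices, not taken from any particular library.

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))

def inverse_logit(z):
    """Map a log-odds value back to a probability."""
    return 1 / (1 + math.exp(-z))

# A probability of 0.5 means even odds, so the log-odds are 0.
print(logit(0.5))                 # 0.0
# Round trip: probability -> log-odds -> probability.
print(inverse_logit(logit(0.8)))  # approximately 0.8
```

Logistic regression assumes the left-hand quantity, `logit(p)`, changes in a straight line as each input changes, even though `p` itself follows a curve.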

Tip: Logistic regression can be implemented in any programming language used for data analysis, such as R, Python, Java, and MATLAB.

While linear regression is estimated using the ordinary least squares method, logistic regression is estimated using the maximum likelihood estimation approach.

Both logistic and linear regression are supervised machine learning algorithms, and they are two of the most common types of regression analysis. While logistic regression is used to solve classification problems, linear regression is primarily used for regression problems.

Feature	Logistic Regression	Linear Regression
Output Type	Categorical (0 or 1)	Continuous (any number)
Algorithm Type	Classification	Regression
Function Used	Sigmoid (logistic)	Linear equation
Output Interpretation	Probability of class	Predicted value
Common Use Cases	Fraud detection, churn, spam	Sales forecasting, pricing, trends

Returning to the example of time spent studying, linear regression and logistic regression can be used to predict different outcomes. Logistic regression can help predict whether the student passed an exam or not. In contrast, linear regression can predict the student's score.

Key terms in logistic regression

The following are some of the common terms used in regression analysis:

Variable: Any number, characteristic, or quantity that can be measured or counted. Age, speed, gender, and income are examples.
Coefficient: A number multiplied by the variable that it accompanies. For example, in 12y, the number 12 is the coefficient. In regression, coefficients are usually real numbers estimated from the data.
EXP: Short form of exponential.
Outliers: Data points that significantly differ from the rest.
Estimator: An algorithm or formula that generates estimates of parameters.
Chi-squared test: Also called the chi-square test, it's a hypothesis testing method to check whether the data is as expected.
Standard error: The estimated standard deviation of a statistic's sampling distribution. It indicates how precisely a quantity, such as a coefficient, has been estimated.
Regularization: A method for reducing overfitting by adding a penalty on large coefficients when fitting the model to the training data set.
Multicollinearity: Occurrence of intercorrelations between two or more independent variables.
Goodness of fit: Description of how well a statistical model fits a set of observations.
Odds ratio: A measure of the strength of association between two events.
Log-likelihood functions: Evaluate a statistical model's goodness of fit.
Hosmer–Lemeshow test: A test that assesses whether the observed event rates match the expected event rates.

How logistic regression works

You can think of logistic regression as a process or pipeline:

Raw data ➝ clean data ➝ feature engineering ➝ model training ➝ probability prediction ➝ yes/no output

Here’s how it works behind the scenes:

Step 1: Take the input data (independent variables)

These are the factors that influence the outcome, such as age, income, and time on site. In technical terms, these are your independent variables (often labeled X₁, X₂, X₃, and so on).

Step 2: Assign importance (coefficients or weights)

The model learns the importance of each input. It gives each one a weight — kind of like saying, “This factor matters more than that one.”

Step 3: Combine everything into a single score

The model multiplies each input by its weight and adds them together. This yields a score, technically referred to as a “linear combination.” Think of it like mixing ingredients to make a dish. The model combines your inputs into one value.

Step 4: Convert the score to a probability (sigmoid function)

The combined score could be any number, positive or negative. But we want a probability between 0 and 1. To achieve this, logistic regression employs a special formula called the logistic (sigmoid) function, which compresses the number into a value such as 0.2 (low chance) or 0.9 (high chance).

Step 5: Make a prediction (yes or no)

Finally, the model draws a line in the sand, usually at 0.5. If the probability is above 0.5, it predicts yes (e.g., the customer will buy). If it’s below 0.5, it predicts no.

This whole process transforms raw data into a clear, interpretable result — plus it gives you the probability behind the decision, so you know how confident the model is.
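Steps 3 to 5 can be sketched in a few lines of Python. This is a minimal illustration, not a trained model: the `predict` helper, the weights, and the bias are all made up for the example.

```python
import math

def predict(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one observation."""
    # Step 3: weighted sum of the inputs (the linear combination).
    score = bias + sum(w * x for w, x in zip(weights, features))
    # Step 4: squash the score into a probability with the sigmoid.
    probability = 1 / (1 + math.exp(-score))
    # Step 5: compare against the decision threshold.
    label = "yes" if probability > threshold else "no"
    return probability, label

# Hypothetical customer: age 35, income 60 (thousands), 12 minutes on site.
# The weights and bias are invented for illustration, not learned from data.
weights = [0.02, 0.01, 0.15]
bias = -3.0
probability, label = predict([35, 60, 12], weights, bias)
print(round(probability, 3), label)
```

The threshold is a parameter here because 0.5 is only the usual default; a stricter or looser cutoff may suit a particular problem better.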

Now that you understand how logistic regression makes predictions, let’s explore the key assumptions that must be met for it to work correctly.

What are the key assumptions behind logistic regression?

When using logistic regression, several assumptions are made. Meeting these assumptions is integral to using logistic regression correctly for making predictions and solving classification problems. The following are the main assumptions:

There is little to no multicollinearity between the independent variables.
The independent variables are linearly related to the log odds, log(p/(1-p)).
The dependent variable is dichotomous, meaning it falls into two distinct categories. This applies only to binary logistic regression, which is discussed later.
There are no non-meaningful variables, as they might lead to errors.
The sample size is sufficiently large, which is integral for reliable results.
There are no extreme outliers.

What is a logistic function?

Logistic regression is named after the function used at its heart, the logistic function. Statisticians initially used it to describe the properties of population growth. The sigmoid function and the logit function are some variations of the logistic function. The logit function is the inverse of the standard logistic function.

[Figure: the logistic function]

In effect, it's an S-shaped curve capable of mapping any real number into a value between 0 and 1, but never precisely at those limits. It's represented by the equation:

f(x) = L / (1 + e^(-k(x - x0)))

In this equation:

f(x) is the output of the function
L is the curve’s maximum value
e is the base of the natural logarithms
k is the steepness of the curve
x is the real number
x0 is the x value of the sigmoid midpoint

If the input is a large negative value, the output is close to zero. On the other hand, if the input is a large positive value, the output is close to one.
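The general logistic function is short enough to write out directly. In this sketch, the function name `logistic` and the default values (L = 1, k = 1, x0 = 0, which give the standard sigmoid) are choices made for the example.

```python
import math

def logistic(x, L=1.0, k=1.0, x0=0.0):
    """General logistic function: L / (1 + e^(-k * (x - x0)))."""
    return L / (1 + math.exp(-k * (x - x0)))

print(logistic(0))     # 0.5, the midpoint when x0 = 0 and L = 1
print(logistic(-10))   # close to 0, but never exactly 0
print(logistic(10))    # close to 1, but never exactly 1
```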

Logistic regression is represented similarly to linear regression, which is defined using the equation of a straight line. A notable difference is that the linear expression is passed through the logistic function, so the output is a probability between 0 and 1 (which can then be mapped to a binary class, 0 or 1) rather than an unbounded numerical value.

Here’s an example of a logistic regression equation:

y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))

In this equation:

y is the predicted value (or the output)
b0 is the bias (or the intercept term)
b1 is the coefficient for the input
x is the predictor variable (or the input)

The dependent variable generally follows the Bernoulli distribution. The values of the coefficients are estimated using maximum likelihood estimation (MLE), which is typically computed with an iterative optimizer such as gradient descent or stochastic gradient descent.
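As a rough illustration of what maximum likelihood means here, the sketch below scores two candidate coefficient pairs by their Bernoulli log-likelihood. The data and the candidate values are invented, and `log_likelihood` is a hypothetical helper rather than a library function.

```python
import math

def log_likelihood(b0, b1, xs, ys):
    """Bernoulli log-likelihood of (b0, b1) for inputs xs and 0/1 labels ys."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))  # predicted P(y = 1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

# Made-up data: hours studied and whether the student passed (1) or not (0).
hours = [1, 2, 3, 4, 5, 6]
passed = [0, 0, 1, 0, 1, 1]

# Baseline: coefficients that ignore hours, so p = 0.5 for everyone.
baseline = log_likelihood(0.0, 0.0, hours, passed)
# Candidate: the probability of passing rises with hours studied.
candidate = log_likelihood(-3.5, 1.0, hours, passed)
print(baseline, candidate)
```

The candidate gets the higher (less negative) log-likelihood, so it explains this data better. Estimation amounts to searching for the coefficients that push this number as high as possible.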

As with other classification algorithms, like the k-nearest neighbors, a confusion matrix is used to evaluate the accuracy of the logistic regression algorithm.

Did you know? Logistic regression is a part of a larger family of generalized linear models (GLMs).

Just like evaluating the performance of a classifier, it's equally important to know why the model classified an observation in a particular way. In other words, we need the classifier's decision to be interpretable.

Although interpretability isn't easy to define, its primary intent is that humans should know why an algorithm made a particular decision. In the case of logistic regression, it can be combined with statistical tests, such as the Wald test or the likelihood ratio test, for enhanced interpretability.

What are the types of logistic regression?

Logistic regression can be divided into different types based on the number of outcomes or categories of the dependent variable.

When we think of logistic regression, we most probably think of binary logistic regression. Throughout most of this article, when we referred to logistic regression, we were specifically referring to binary logistic regression.

The following are the three main types of logistic regression.

Binary logistic regression

Binary logistic regression is the most commonly used form of logistic regression. It models the probability of a binary outcome — that is, an outcome that can take on only two possible values, such as “yes” or “no,” “spam” or “not spam,” “pass” or “fail,” and “0” or “1.” The algorithm estimates the likelihood that a certain input belongs to one of the two classes, based on one or more predictor variables.

This method is widely used across industries:

Healthcare: Predicting whether a patient has a disease (yes/no).
Finance: Estimating loan default risk (default/no default).
Marketing: Predicting whether a customer will respond to a campaign.

Mathematically, binary logistic regression calculates the log-odds of the dependent variable as a linear combination of the independent variables. This log-odds is then transformed into a probability using the logistic (sigmoid) function.

Binary logistic regression is ideal when:

The target variable is categorical with two levels.
Predictor variables can be continuous, categorical, or a mix.
Interpretability and probability outputs are important.

Example: A spam classifier that takes features like email length, keyword frequency, and sender reputation and outputs a probability of the message being spam.

Multinomial logistic regression

Multinomial logistic regression is an extension of binary logistic regression used when the dependent variable has more than two unordered categories. Unlike binary logistic regression, where the outcome is either 0 or 1, multinomial regression handles three or more nominal (non-ordinal) classes.

Common examples include:

Retail: Predicting customer choice among product categories (e.g., buying shoes, electronics, or clothing).
Education: Determining a student’s choice of major (e.g., STEM, Humanities, Business).
Healthcare: Classifying types of disease (e.g., viral, bacterial, fungal).

The model estimates the probabilities of each class relative to a reference category. Conceptually, this resembles fitting a binary logistic model for each class versus the reference class, then combining the results to make a final prediction. In practice, the coefficients are usually estimated jointly.
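A minimal sketch of how per-class scores can be turned into probabilities relative to a reference category. The helper name, the class labels, and the scores are hypothetical. The reference class is assumed to have a score of 0.

```python
import math

def multinomial_probabilities(scores, reference_label):
    """Turn per-class linear scores into probabilities.

    `scores` maps each non-reference class to its linear score; the
    reference class has an implicit score of 0.
    """
    exp_scores = {label: math.exp(s) for label, s in scores.items()}
    denominator = 1.0 + sum(exp_scores.values())  # the 1.0 is exp(0) for the reference
    probabilities = {label: e / denominator for label, e in exp_scores.items()}
    probabilities[reference_label] = 1.0 / denominator
    return probabilities

# Hypothetical scores for one patient, with "viral" as the reference class.
probs = multinomial_probabilities({"bacterial": 0.4, "fungal": -1.2}, "viral")
print(probs)
predicted = max(probs, key=probs.get)
print(predicted)  # the class with the highest probability
```

The probabilities always sum to 1, and the prediction is simply the most probable class.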

Multinomial logistic regression works best when:

The outcome categories are nominal (no inherent order).
The dataset has enough observations per class.
Maximum likelihood estimation is feasible.

It is also more computationally intensive than binary regression, but highly useful when there are multiple outcome choices that don’t follow a natural order.

Ordinal logistic regression

Ordinal logistic regression, also called ordinal regression, is used when the dependent variable consists of three or more ordered categories. The categories have a meaningful order, but the spacing between them is unknown — for instance, “Poor,” “Average,” “Good,” or customer ratings such as 1 to 5 stars.

Use cases include:

Surveys: Analyzing customer satisfaction (Very Unsatisfied to Very Satisfied).
HR: Assessing performance levels (Below Expectations, Meets Expectations, Exceeds Expectations).
Healthcare: Rating symptom severity (Mild, Moderate, Severe).

Unlike multinomial logistic regression, ordinal regression relies on the proportional odds assumption, which means that the effect of each predictor is the same across every split between lower and higher outcome categories. This assumption helps simplify the model and interpretation of coefficients.

Ordinal logistic regression is appropriate when:

The outcome variable is categorical with a natural order.
The predictor variables can be categorical or continuous.
You want to preserve ordinal structure rather than flatten it into unrelated categories.

The model calculates the log-odds of being at or below a certain category and then applies the logistic function to compute cumulative probabilities. This makes it a powerful tool when both classification and ranking are important.
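The cumulative-probability idea can be sketched as follows. This uses one common parameterization, P(Y <= j) = sigmoid(cutpoint_j - score). The cutpoints, the score, and the function names are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def ordinal_probabilities(score, cutpoints):
    """Category probabilities under a proportional-odds model.

    `score` is the linear combination of predictors (no intercept) and
    `cutpoints` are increasing thresholds, one fewer than the categories.
    """
    # Cumulative probabilities P(Y <= j); the last category closes at 1.
    cumulative = [sigmoid(c - score) for c in cutpoints] + [1.0]
    # Each category's probability is the gap between neighboring cumulatives.
    probabilities = [cumulative[0]]
    for lower, upper in zip(cumulative, cumulative[1:]):
        probabilities.append(upper - lower)
    return probabilities

# Hypothetical cutpoints separating Mild | Moderate | Severe.
probs = ordinal_probabilities(score=0.5, cutpoints=[-1.0, 1.0])
print(probs)
```

Because the same `score` is compared against every cutpoint, a predictor shifts all the cumulative odds by the same amount, which is the proportional odds assumption in action.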

Did you know? An artificial neural network (ANN) representation can be seen as stacking together a large number of logistic regression classifiers.

When to use logistic regression?

Logistic regression is applied to predict a categorical dependent variable. In other words, it's used when the prediction is categorical, for example, yes or no, true or false, 0 or 1. The model outputs a probability between 0 and 1, but the final predicted class is one or the other, with no middle ground.

In the case of predictor variables, they can be part of any of the following categories:

Continuous data: Data that can be measured on an infinite scale. It can take any value between two numbers. Examples are weight in pounds or temperature in Fahrenheit.
Discrete, nominal data: Data that fits into named categories. A quick example is hair color: blond, black, or brown.
Discrete, ordinal data: Data that fits into some form of order on a scale. An example is rating your satisfaction with a product or service on a scale of 1 to 5.

Logistic regression analysis is valuable for predicting the likelihood of an event. It helps determine the probability that an observation belongs to one of two classes.

In a nutshell, by looking at historical data, logistic regression can predict whether:

An email is spam
It’ll rain today
A tumor is fatal
An individual will purchase a car
An online transaction is fraudulent
A candidate will win an election
A group of users will buy a product
An insurance policyholder will die before the policy term expires
A promotional email receiver is a responder or a non-responder

In essence, logistic regression helps solve problems related to probability and classification. In other words, you can expect only classification and probability outcomes from logistic regression.

For example, it can be used to find the probability of something being “true or false” and also for deciding between two outcomes like “yes or no”.

A logistic regression model can also help classify data for extract, transform, and load (ETL) operations. Logistic regression shouldn't be used if the number of observations is less than the number of features. Otherwise, it may lead to overfitting.

Top 3 statistical analysis software

These tools can help you run logistic regression models, visualize results, and interpret coefficients, all without writing complex code.

IBM SPSS Statistics for complex statistical data analysis in social sciences ($1069.2/year/user)
SAS Viya for data mining, predictive modeling, and machine learning (pricing available on request)
Minitab Statistical Software for quality improvement and educational purposes ($1851/year/user)

*These statistical analysis software solutions are top-rated in their category, according to G2's Fall 2025 Grid Reports.

When to avoid logistic regression?

Logistic regression is simple, fast, and easy to interpret, but it’s not always the best tool. Here are situations where it might fall short:

1. When the relationship isn’t linear

Logistic regression assumes that the inputs relate linearly to the log-odds of the outcome. If the actual relationship is complex or nonlinear, the model might oversimplify and make inaccurate predictions.

2. With high-dimensional or small datasets

If you have more features (inputs) than observations (rows of data), logistic regression can overfit — meaning it performs well on training data but poorly on new data. It needs a reasonable number of examples to generalize well.

3. When features are highly correlated

Multicollinearity (when two or more features move in tandem) can compromise the model’s ability to assign proper weights. It becomes hard to know which variable is actually influencing the outcome.

4. When you need to capture complex patterns

Logistic regression creates straight-line (linear) decision boundaries. If your data is complicated, one of the following models may be better suited:

Decision Trees: for branching, rule-based decisions
Support Vector Machines (SVMs): for separating complex data
Neural Networks: for deep, layered learning

Bottom line: Logistic regression is a great starting point, but it's not a one-size-fits-all solution. Know its limits before choosing it as your go-to model.

What are the advantages of logistic regression?

Many of the advantages and disadvantages of the logistic regression model apply to the linear regression model. One of the most significant advantages of the logistic regression model is that it not only classifies but also provides probabilities.

The following are some of the advantages of the logistic regression algorithm.

Simple to understand, easy to implement, and efficient to train
Performs well when the classes are close to linearly separable
Good accuracy for smaller datasets
Doesn't make any assumptions about the distribution of classes
It offers the direction of association (positive or negative)
Useful to find relationships between features
Provides well-calibrated probabilities
Less prone to overfitting in low-dimensional datasets
Can be extended to multi-class classification

What are the common pitfalls in logistic regression, and how do you avoid them?

Logistic regression is a great first step into machine learning — it’s simple, fast, and often surprisingly effective. But like any tool, it has its quirks. Here are some common pitfalls beginners run into, and what you can do about them:

1. Assuming it works well with non-linear data

Logistic regression draws straight-line (linear) boundaries between classes. If your data has curves or complex patterns, it may underperform. While you can try transforming features (e.g., log, polynomial), models like decision trees or neural nets are better for non-linear problems.
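The feature-transformation option can be illustrated with a squared term. In this sketch the coefficients are hand-picked rather than learned, and `expand` and `probability` are hypothetical helpers. The point is only that an added x² feature lets the model say "yes" in the middle of a range and "no" at both ends, which a single linear term cannot do.

```python
import math

def expand(x):
    """Add a squared term so the model can bend its decision boundary."""
    return [x, x * x]

def probability(x, weights, bias):
    score = bias + sum(w * f for w, f in zip(weights, expand(x)))
    return 1 / (1 + math.exp(-score))

# Illustrative (not learned) coefficients: score = -(x - 5)^2 + 4,
# which is positive only for x between 3 and 7.
weights = [10.0, -1.0]
bias = -21.0

for x in [1, 5, 9]:
    print(x, round(probability(x, weights, bias), 3))
```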

2. Using it with high-dimensional data and small datasets

If you have more features than observations, the model might overfit — meaning it learns noise instead of useful patterns. This results in poor performance on new data. Use regularization (such as L1 or L2) or reduce the number of features through feature selection or dimensionality reduction.

3. Complete separation of classes

When one feature perfectly predicts the outcome (e.g., all “yes” values have income > $100K), logistic regression struggles — the weight for that feature goes toward infinity. This is called complete separation, and the model may fail to converge. Use regularization or Bayesian logistic regression to constrain the weights.
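A small experiment shows the effect. The sketch below fits a single weight (no intercept) by plain gradient descent on perfectly separated toy data. `fit_weight` and its `l2` penalty strength are assumptions made for this example, not a library API.

```python
import math

def fit_weight(xs, ys, l2=0.0, steps=2000, learning_rate=0.1):
    """Fit a single weight (no intercept) by gradient descent.

    Minimizes the average negative log-likelihood plus (l2 / 2) * w^2.
    """
    w = 0.0
    for _ in range(steps):
        gradient = l2 * w  # gradient of the L2 penalty term
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-w * x))
            gradient += (p - y) * x / len(xs)
        w -= learning_rate * gradient
    return w

# Perfectly separated toy data: negative x is always class 0, positive x always class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w_short = fit_weight(xs, ys, steps=2000)
w_long = fit_weight(xs, ys, steps=20000)
w_regularized = fit_weight(xs, ys, l2=0.1, steps=20000)
print(w_short, w_long, w_regularized)
```

Without a penalty, the weight is still climbing after ten times as many steps, because a larger weight always fits separated data slightly better. With the L2 penalty, it settles at a finite value.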

4. Ignoring multicollinearity

If two or more input features are highly correlated, the model can’t distinguish their individual effects clearly. This makes the coefficient estimates unstable and hard to interpret. Check for multicollinearity using the Variance Inflation Factor (VIF) and drop or combine redundant features.
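For intuition, VIF is defined as 1 / (1 - R²), where R² comes from regressing one feature on the others. With exactly two features, R² is the squared Pearson correlation, which keeps the sketch below short. The data and helper names are made up for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    spread_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return covariance / (spread_a * spread_b)

def vif_two_features(a, b):
    """VIF = 1 / (1 - R^2); with exactly two features, R^2 is the squared correlation."""
    r = pearson(a, b)
    return 1 / (1 - r * r)

# Made-up data: height in cm and a noisy copy of it in inches are nearly redundant.
height_cm = [150, 160, 170, 180, 190]
height_in = [59.0, 63.5, 66.5, 71.0, 75.0]
shoe_size = [42, 38, 44, 39, 41]

vif_redundant = vif_two_features(height_cm, height_in)
vif_unrelated = vif_two_features(height_cm, shoe_size)
print(vif_redundant, vif_unrelated)
```

A VIF near 1 suggests little overlap with the other feature. A common rule of thumb treats values above roughly 5 to 10 as a warning sign, though the exact cutoff is a judgment call.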

5. Sensitive to outliers

Logistic regression can be thrown off by extreme values, especially in small datasets. Outliers can disproportionately influence the weights, leading to skewed predictions. Use robust scaling methods, or remove/transform extreme values.
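One possible way to do this is sketched below: scale with the median and interquartile range, which are barely affected by a single extreme value, then clip what remains. The helper names and the clipping bounds of -3 and 3 are arbitrary choices for the example.

```python
import statistics

def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    return [(v - median) / (q3 - q1) for v in values]

def clip(values, lower=-3.0, upper=3.0):
    """Cap values that still fall outside [lower, upper] after scaling."""
    return [min(max(v, lower), upper) for v in values]

# Made-up incomes (in thousands) with one extreme value.
incomes = [40, 45, 50, 55, 60, 65, 1000]
scaled = robust_scale(incomes)
print(scaled)
print(clip(scaled))
```

The typical incomes keep a sensible spread after scaling because the outlier does not inflate the median or the IQR the way it would inflate a mean and standard deviation.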

6. Misinterpreting probabilities and coefficients

It’s easy to mistake log-odds for probabilities or read coefficients as direct impacts on the output. Logistic regression coefficients affect the log-odds, not the raw probability. Convert coefficients to odds ratios to understand the direction and strength of effect.
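Converting a coefficient to an odds ratio is a single exponentiation. The coefficient values below are hypothetical, chosen only to show one positive and one negative effect.

```python
import math

# Hypothetical fitted coefficients (change in log-odds per one-unit increase).
coefficients = {"hours_studied": 0.8, "hours_on_social_media": -0.4}

odds_ratios = {name: math.exp(c) for name, c in coefficients.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.2f}")
```

An odds ratio above 1 means the odds of the outcome rise as the feature increases (here, each extra hour studied multiplies the odds by about 2.23), while a ratio below 1 means the odds fall. Neither number is a change in probability.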

Many of these pitfalls are easy to miss when you’re starting out. The good news? Once you know what to watch for, logistic regression becomes a reliable and interpretable workhorse for many binary classification problems.

How to build a logistic regression model

If you're curious about how logistic regression works in practice, here's a beginner-friendly walkthrough of how a simple model is built — from raw data to prediction.

1. Data cleaning to fix missing or messy data

Before anything else, data needs to be cleaned. This includes:

Removing missing or incorrect values
Standardizing formats (e.g., “yes” vs. “YES” vs. “Y”)
Ensuring the outcome variable is binary (0 or 1)
Clean data is essential — even the best algorithm can't fix bad input.
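A minimal sketch of these cleaning steps in plain Python. The column names (`age`, `purchased`), the accepted spellings, and the choice to drop incomplete rows instead of filling them in are all assumptions made for the example.

```python
def clean_outcome(value):
    """Map messy yes/no text to 1/0; return None when the value is unusable."""
    if value is None:
        return None
    text = str(value).strip().lower()
    if text in {"yes", "y", "1", "true"}:
        return 1
    if text in {"no", "n", "0", "false"}:
        return 0
    return None

raw_rows = [
    {"age": 34, "purchased": "YES"},
    {"age": None, "purchased": "no"},   # missing feature
    {"age": 51, "purchased": "maybe"},  # unusable label
    {"age": 28, "purchased": " Y "},
]

clean_rows = []
for row in raw_rows:
    label = clean_outcome(row["purchased"])
    if row["age"] is None or label is None:
        continue  # drop rows with missing or invalid values
    clean_rows.append({"age": row["age"], "purchased": label})

print(clean_rows)
```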

2. Feature engineering to prepare inputs for the model

Features are the inputs to your model (like age, income, or purchase history). Some may need to be transformed:

Categorical variables like “country” or “subscription type” need to be encoded into numbers
Scaling may be applied to numerical variables so that one feature doesn't overpower the others
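Both transformations can be sketched with the standard library. The category list and income values are hypothetical, and `one_hot` and `standardize` are illustrative helpers.

```python
import statistics

def one_hot(value, categories):
    """Encode a category as a list of 0/1 indicators."""
    return [1 if value == category else 0 for category in categories]

def standardize(values):
    """Rescale to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    return [(v - mean) / spread for v in values]

plans = ["free", "pro", "enterprise"]   # hypothetical subscription types
print(one_hot("pro", plans))            # [0, 1, 0]

incomes = [30, 50, 70]                  # hypothetical incomes in thousands
scaled_incomes = standardize(incomes)
print(scaled_incomes)
```

When the model includes an intercept, it is common to drop one of the indicator columns so the remaining ones are not perfectly collinear.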

3. Model fitting to train the algorithm

Once the data is ready, you train (fit) the model to learn patterns. This involves:

Feeding the data to the logistic regression algorithm
The model calculates weights (coefficients) for each input
It uses those weights to estimate the probability of each outcome
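To show what "fitting" involves, here is a bare-bones training loop using gradient descent on one feature. It is a teaching sketch under simplifying assumptions (fixed learning rate, fixed number of steps, made-up data), not how a production library implements it.

```python
import math

def fit(xs, ys, steps=20000, learning_rate=0.1):
    """Fit an intercept b0 and slope b1 by gradient descent on the log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_b0, grad_b1 = 0.0, 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))  # current predicted probability
            grad_b0 += (p - y) / n
            grad_b1 += (p - y) * x / n
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1

# Made-up data: hours studied and pass (1) / fail (0). The classes overlap.
hours = [1, 2, 3, 4, 5, 6]
passed = [0, 0, 1, 0, 1, 1]

b0, b1 = fit(hours, passed)
p_five_hours = 1 / (1 + math.exp(-(b0 + b1 * 5)))
print(b0, b1, p_five_hours)
```

The learned slope is positive, so more hours studied means a higher estimated chance of passing, and the final line turns the learned weights into a probability for a student who studied five hours.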

4. Evaluation to test how well the model performs

After training, we test how well the model performs. Common metrics include:

Accuracy: How often it predicts correctly
Precision/Recall: Precision is the share of predicted “yes” cases that are truly “yes”; recall is the share of actual “yes” cases the model finds
Confusion matrix: A table showing correct vs incorrect predictions
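These metrics all come from the same four counts. The sketch below computes them by hand on invented labels and predictions.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts

# Made-up ground truth and model predictions for eight observations.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

m = confusion_matrix(actual, predicted)
accuracy = (m["tp"] + m["tn"]) / len(actual)
precision = m["tp"] / (m["tp"] + m["fp"])
recall = m["tp"] / (m["tp"] + m["fn"])
print(m, accuracy, precision, recall)
```

Accuracy alone can mislead when one class is rare, which is why precision and recall are usually reported alongside it.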

No need to be a data scientist to start experimenting; tools like Python’s scikit-learn make it easier than ever to build simple models.

Logistic regression: Frequently asked questions

Q: Is logistic regression supervised or unsupervised?

Logistic regression is a supervised learning algorithm. It learns from labeled training data to classify outcomes.

Q: What’s the difference between logistic and linear regression?

Linear regression predicts continuous outcomes (like price), while logistic regression predicts binary outcomes (like yes/no).

Q: What are the assumptions of logistic regression?

The key assumptions include the linearity of log-odds, the absence of multicollinearity, and a binary dependent variable (for binary logistic regression).

Q: Can logistic regression handle more than two classes?

Yes, that’s called multinomial or ordinal logistic regression, depending on whether the classes have an order.

Q: How do I know if logistic regression is the right choice?

It’s ideal when your target variable is categorical (yes/no) and your data meet assumptions such as linearity in log-odds and low multicollinearity.

When life gives you options, use logistic regression

Some might say life isn’t binary, but more often than not, it is. Whether you're deciding to send that email campaign or skip dessert, many choices boil down to simple yes-or-no decisions. That’s exactly where logistic regression shines.

It helps us make sense of uncertainty by using data, rather than relying on gut instinct. From predicting customer churn to flagging fraudulent transactions, logistic regression enables businesses to make smarter, more informed decisions.

Discover the top predictive analytics tools that simplify logistic regression and help you build, train, and deploy prediction models in less time.

This article was originally published in 2021. It has been updated with new information.

Devyani Mehta

Devyani Mehta is a content marketing specialist at G2. She has worked with several SaaS startups in India, which has helped her gain diverse industry experience. At G2, she shares her insights on complex cybersecurity concepts like web application firewalls, RASP, and SSPM. Outside work, she enjoys traveling, cafe hopping, and volunteering in the education sector.