What Is A/B Testing? A Beginner’s Guide for 2019

February 18, 2019

When you need an easy and powerful way to improve the conversion rate of a web page, A/B testing is your best bet. Right?

That’s the perception of many marketers today. And, while it’s partly true -- A/B testing can, indeed, help you greatly optimize web design -- it’s nowhere near as easy as it may seem.

What is A/B testing?

A/B testing is a method of data collection that informs optimization. It compares two versions of a design: the original, “A” version, known as the “control,” and a secondary, “B” version, known as a “variation.”

If the rules of sound experimental design are followed, by the conclusion of a test the tester will have an idea of which design performs better for a specific goal.

TIP: Learn about the best A/B testing tools and compare the most popular vendors from real-user reviews.

How does A/B testing work?

Based on the previous definition, and the countless A/B testing case studies on the web, you might think A/B testing is simple. But while the concept is fairly easy to grasp, testing your way to valuable results is much more difficult.

So, how does A/B testing work? The short version is this: You drive equal traffic to two pages with the same goal, and the one that performs better is the winner:

[Image: an A/B testing example]

And if you test based on the short version, you might find you’ve created what seems like an improvement on your original design. Chances are, though, that improvement will be imaginary.

For an accurate test result, you need to know the bigger picture. Here are the most important aspects of A/B testing to consider.

When to use A/B split testing

A/B testing is a powerful methodology capable of informing meaningful design improvements. But it’s often misused. Does A/B split testing make sense for you? Consider the following.

Only test when your business is ready

For many businesses, A/B testing shouldn’t be a priority. There are more important things to focus on than A/B testing, like traffic, for example. Derek Halpern explains:

“If I get 100 people to my site, and I have a 20 percent conversion rate, that means I get 20 people to convert... I can try to get that conversion rate to 35 percent and get 35 people to convert, or, I could just figure out how to get 1,000 new visitors, maintain that 20 percent conversion, and you’ll see that 20 percent of 1,000 (200), is much higher than 35 percent of 100 (35).”
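Halpern’s arithmetic is easy to check for yourself. Here’s a minimal sketch in Python using the hypothetical numbers from the quote:

```python
# Hypothetical numbers from Halpern's example: compare improving the
# conversion rate against simply growing traffic.
def conversions(visitors, conversion_rate):
    return visitors * conversion_rate

print(conversions(100, 0.20))    # today: 20 conversions
print(conversions(100, 0.35))    # better conversion rate: 35 conversions
print(conversions(1000, 0.20))   # 10x the traffic, same rate: 200 conversions
```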

So, how do you know when you’re ready to A/B test? In a blog post for Instapage, Alex Birkett shares a benchmark:

Roughly speaking, if you have less than 1,000 transactions (purchases, signups, leads, etc.) per month — you’re going to be better off putting your effort in other stuff. Maybe you could get away with running tests around 500 transactions for months — but you’re going to need some big lifts to see an effect.

What are alternatives to A/B testing?

Do you generate at least 500 transactions per month? If not, you may be better off using qualitative feedback, rather than quantitative data, to inform your design. Qualitative methods include, but aren’t limited to:

1. Survey technology

Surveys are a great way to discover, straight from your users, what your business is doing wrong. Net Promoter Score (NPS) is just one example of a popular survey often used to inform business decisions. There are many survey software apps available online for brands looking to run an NPS survey or other questionnaires.
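If you go the NPS route, the score itself is simple to compute: the percentage of promoters (scores of 9 or 10) minus the percentage of detractors (scores of 0 through 6). Here’s a minimal sketch, assuming you’ve already collected the 0-10 responses:

```python
def net_promoter_score(responses):
    """Compute NPS from a list of 0-10 survey responses."""
    promoters = sum(1 for r in responses if r >= 9)
    detractors = sum(1 for r in responses if r <= 6)
    return 100 * (promoters - detractors) / len(responses)

# Example with ten hypothetical responses: 5 promoters, 2 detractors
print(net_promoter_score([10, 9, 9, 8, 7, 7, 6, 5, 9, 10]))  # 30.0
```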

2. Live chat

Like surveys, live chat software lets you hear straight from your customers what they dislike about your website. Unlike surveys, live chat has the added benefit of being contextual, which means you get extra detail about when and where your prospects run into a problem.

3. Eye tracking and heat maps

Eye tracking studies and heat maps can also be great sources of qualitative data when you don’t generate a lot of traffic. Eye tracking follows where users look, while heat map software visualizes where their cursors move and click, and can even track scroll depth on a given page.

The heat map below, for example, helped this business realize that high-quality images should be a focus of their product pages. 

[Image: heat map of a product page]

4. User recordings

These can be especially helpful because watching them is a little like looking over the user’s shoulder. On a chosen web page, you see where the visitor’s mouse moves, what they hover over, and what they click on.

Qualitative data like this can help inform design beyond best practices when you don’t have enough traffic to rely on. As valuable as A/B testing can be, in low-traffic situations methods such as session replay can be even more so.

How not to A/B test

Many blog posts and case studies recommend testing one element against another: say, headline one vs. headline two on an otherwise identical page, or image vs. video. That way, at the end of the test, you’ll know exactly what caused the change in conversion rate.

However, A/B testing is best for finding what’s called the global maximum: in basic terms, the best general design for what you’re trying to accomplish. That means you’re better off A/B testing drastically different designs. Multivariate testing, on the other hand, is better suited to finding the best combination of elements within a given design.

Take this example, from the MarketingExperiments team, on an Investopedia signup page. Here’s the control:

[Image: the control page]

Compared to the variation:

[Image: the variation page]

The variation produced a nearly 90 percent boost in conversion rate. Now, because multiple changes were made between the control and variation, there’s no way for MarketingExperiments to know exactly why this page performed that much better. But, if you boosted conversion rate by 90 percent, would you care? Or would you simply take this design and run with it?

Likely, the latter. Then, you’d use multivariate testing to determine which combination of elements improves this general design even further.

Steps to running an A/B test

Think you’re ready to A/B test? Below are some steps to follow.

Bear in mind, however, that this list is not exhaustive. There is a lot to learn before you embark on your first test, and each one of these sections could be its own blog post. Think of this as a basic overview of the methodology.

1. Start with data

Never test without a reason to. This is one of the biggest mistakes beginners make. They test different headlines or CTAs because they saw that it worked for another business. But, that business is not your business.

You face unique challenges, indicated by unique data. What do your analytics tell you?

For example, say you use a multi-step form. People are converting on page one, but on page two, there’s a big drop-off. In this case, you might find you need to rearrange the order in which you ask for information or, if a piece of information isn’t strictly necessary (like a phone number), eliminate it altogether.

Other businesses may have already tested this, and you might get some ideas for optimizations to make based on their results. However, you should not simply test something because it worked for someone else. Your business problems and your data should form the basis of the test, not someone else’s.

2. Generate a hypothesis

From this data-backed reason to test, you can form your hypothesis. What is your goal, and how are you trying to improve it?

In this case, you might say, “After observing that visitors abandon our sign-up flow after step one, I believe that requesting a phone number at the end of the flow, rather than at the start, may improve the likelihood the form gets filled out and ultimately lead to more signups.” At the conclusion of your test, you should be able to accept or reject this hypothesis.

3. Adjust your variation

Now, you’re ready to translate your new hypothesis to your variation page. If you’re rearranging the signup flow, do it on your variation. If you’re testing a video versus long-form copy, add the video, etc.

4. Determine your required sample size

This is where things start to get a little more complicated. There’s more to it than simply driving traffic to your pages and declaring a winner when you see a difference in conversion rate.

Before you can conclude your test, you have to drive enough visitors to ensure your data is as close to accurate as possible. Think of it this way: If you drive just three visitors to your control and variation, it’s possible that your control converts all three, and your variation converts none of them. The conversion rate for your original would be 100 percent, and your variation, 0 percent.

Does that mean that your variation is doomed and your original is destined for perfection? No.

It means that you need to collect more visitors to get more accurate data. The point at which you can start to trust the data you’re seeing is when you reach what’s called statistical significance. How many visitors that takes depends on a few factors: how confident you want to be in your results (confidence level), the smallest difference in conversion rate you want to be able to detect between pages (minimum detectable effect), and your original conversion rate.

The more accurate you want to be, the more visitors you’ll need to drive to your page. Original conversion rates and the minimum detectable effect will vary from team to team, but you should be wary of testing with a confidence level below 95 percent. Below that, you might as well skip the testing process and just guess.

To determine your sample size, you can use one of the many great calculators out there, like this one from Evan Miller.
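If you’d rather script it, the sketch below implements the standard two-proportion sample-size formula those calculators are based on, assuming a 95 percent confidence level and 80 percent statistical power (exact results will differ slightly between calculators depending on the approximation used):

```python
from math import sqrt, ceil
from statistics import NormalDist

def sample_size_per_variation(baseline_rate, min_detectable_effect,
                              confidence=0.95, power=0.80):
    """Visitors needed per page for a two-proportion test.

    min_detectable_effect is relative, e.g. 0.20 = a 20% lift
    over the baseline conversion rate.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_detectable_effect)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Example: 5% baseline conversion rate, detecting a 20% relative lift
print(sample_size_per_variation(0.05, 0.20))  # roughly 8,000 visitors per page
```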

5. Account for validity threats

Your tests aren’t run in a lab. They’re run in the real world, on real people. Because of that, they face threats to their validity.

For example:

  • Regression to the mean refers to the tendency of extreme early results to settle toward the average as your test runs longer. Think of the earlier example, in which two landing pages generate conversion rates of 0 percent and 100 percent. As you drive more visitors to each page, the 100 percent conversion rate will drop and the 0 percent conversion rate will rise; both will move closer to their true averages if you let the test run (the quick simulation after this list illustrates the effect). If you call your test early, though, you may mistakenly conclude that your original is better than the variation.

  • The novelty effect refers to results that can be attributed to newness. Think, for example, of a marketer changing the color of CTA buttons on a website. The new color may get more attention, and more clicks, but the reason could be the novelty of the change. This also can be controlled for by running your test longer.
  • The selection effect refers to a validity threat that arises when a test doesn’t use an accurate representation of your audience. This would be like driving traffic from a social media platform with a young audience, like Snapchat, to a landing page aimed at older professionals, say, one that offers disability insurance.
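To see regression to the mean in action, here’s the quick simulation promised above. It assumes a hypothetical “true” conversion rate of 5 percent and shows how wildly the observed rate swings early on before settling down:

```python
import random

random.seed(42)
true_rate = 0.05          # hypothetical "true" conversion rate of the page
conversions = 0

for visitor in range(1, 5001):
    if random.random() < true_rate:
        conversions += 1
    if visitor in (3, 10, 100, 1000, 5000):
        # early observed rates swing wildly; later ones hover near 5%
        print(f"after {visitor:>5} visitors: observed rate {conversions / visitor:.1%}")
```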

There are many more threats you’ll have to account for throughout your test, and at the beginning, too. The instrumentation effect, for example, is one of the most common; it refers to an issue with your testing tools.

To protect against it, ensure your landing pages look the way they should across all devices and browsers. Make sure analytics are working, your pixels are firing, etc., all before you start your test.

Some even conduct A/A tests to calibrate their testing tools, though others argue this is a waste of time.
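Both camps have a point: the math says that even two identical pages will produce a “statistically significant” difference about one time in twenty at a 95 percent confidence level. Here’s a rough simulation of that, assuming a hypothetical 5 percent conversion rate on both identical pages:

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(0)

def significant(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Two-proportion z-test: is the observed difference significant?"""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = abs(conv_a / n_a - conv_b / n_b) / se
    return z > NormalDist().inv_cdf(1 - (1 - confidence) / 2)

false_positives = 0
for _ in range(1000):                       # 1,000 simulated A/A tests
    a = sum(random.random() < 0.05 for _ in range(2000))
    b = sum(random.random() < 0.05 for _ in range(2000))
    if significant(a, 2000, b, 2000):
        false_positives += 1

print(false_positives / 1000)               # hovers around 0.05
```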

6. Drive traffic to your pages

Once you’re through those steps, you’re ready to drive traffic. Remember the selection effect: The traffic to both pages should be an accurate sample of your audience, and it should be from the same source. Audiences can vary greatly between platforms.
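How you split that traffic matters, too. One common way to keep the split clean is deterministic bucketing: hash a stable visitor ID so each visitor always sees the same version and the traffic divides roughly 50/50. Here’s a minimal sketch (the test name and visitor ID are hypothetical examples):

```python
import hashlib

def assign_variation(visitor_id: str, test_name: str = "signup-flow-test") -> str:
    """Deterministically assign a visitor to 'control' or 'variation'."""
    digest = hashlib.md5(f"{test_name}:{visitor_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100          # a number from 0 to 99
    return "control" if bucket < 50 else "variation"

print(assign_variation("visitor-12345"))    # same ID always gets the same page
```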

Once you’ve picked your sources, drive the sample size you identified in your pre-testing calculations to both landing pages (control and variation). And if you hit that number in less than a week, keep the test running.

Why?

Days of the week have a significant impact on conversions. There are some days your visitors will be more receptive to your marketing messages than others.

If you’ve hit your sample size and run your test for at least a full week, all the while accounting for confounding variables that might poison your data, it’s time to look at the results.

7. Analyze and optimize

When you’ve hit statistical significance at your chosen confidence level, it’s time to analyze. Which page performed better? Why?

Remember, if you set your minimum detectable effect at 10 percent, you can’t draw conclusions about any difference smaller than that.

If there does happen to be a difference of 10 percent or more between the two pages, you can be 95 percent confident your adjustment is the reason why (if you accounted for all those validity threats).
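If you want to sanity-check what your testing tool reports, the underlying math is a standard two-proportion z-test. Here’s a minimal sketch with hypothetical conversion numbers:

```python
from math import sqrt
from statistics import NormalDist

def z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Return the z-score and two-sided p-value for a two-proportion test."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical results: control converts 400 of 8,000, variation 480 of 8,000
z, p = z_test(400, 8000, 480, 8000)
print(f"z = {z:.2f}, p = {p:.3f}")  # about z = 2.77, p = 0.006 here;
                                    # p below 0.05 is significant at 95% confidence
```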

So, how does this form the basis for a new test?

It’s your job to figure that out. Keep testing. There’s always a better version of your design.

Ready to learn more about A/B testing methods? Learn about the best CRO software in 2019 from real users.
