Mozilla Persona A/B Interface Tests

Nick Parlante, Oct 22 2013

This is a writeup of an A/B user interface test I ran on the codingbat.com site, testing a part of the Mozilla Persona user interface. It's a good example of a simple A/B interface test, and it provides some insights about buttons and tests and whatnot. In this case I used an A/B test strategy based on sessions. Implementation notes on how the tests were coded are at the very end of this doc, for anyone interested in running their own A/B tests.

Mozilla Persona is a vendor-neutral universal log-in system for web sites, so you don't have to make up a new password for every different site. I implemented some A/B tests in part just for fun, and also to help out this open source project (disclosure: my wife Katie Parlante works on Persona). For my first test, I generated data for Hannah Quay-de la Vallee, who was interning at Mozilla (Talk Video). Then I ran a couple of follow-up tests, so I have the benefit of more data than Hannah. I'll gather all the results here.

The question for the tests was: what should the log-in button look like? One popular choice is a single "Sign In" button to handle both the sign-in and create-account cases. Using just one button has an attractive minimalism. Another approach is separate "Sign In" and "Create Account" buttons. Which will perform better?

Background

The create-account case is crucial for any internet service to get momentum, but it's hard to measure. The user might just glance at the page, get confused, and leave without creating an account or clicking on anything. How do you measure that?

An A/B test provides a great solution. We randomly give each user either the A UI or the B UI, doing this around the clock. It's important to show each particular user a consistent (if random) UI all the time, or it's confusing (see the implementation notes at the end of this doc). If the A UI produces on average 1000 new accounts per day, and the B UI produces 600 new accounts per day, you know that the B UI has real problems. In a neat way, you have in effect measured the 400 accounts that the B UI is failing to create each day. You don't know why it performs poorly, but at least you know there's a problem, and you can do further tests to narrow things down.
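To make the mechanics concrete, here's a minimal sketch of the per-session assignment in Java servlet style. This is just an illustration of the idea, not the actual CodingBat code; the class, attribute, and variant names are made up.

    import java.util.Random;
    import javax.servlet.http.HttpSession;

    // Sketch of per-session A/B assignment (illustrative; names are made up).
    public class AbChoice {
        private static final Random rand = new Random();

        // Returns "a" or "b" for this session, flipping a coin on first use
        // and then returning the same answer for the life of the session,
        // so a given user sees a consistent UI.
        public static String variantFor(HttpSession session) {
            String variant = (String) session.getAttribute("ab-variant");
            if (variant == null) {
                variant = rand.nextBoolean() ? "a" : "b";
                session.setAttribute("ab-variant", variant);
            }
            return variant;
        }
    }

With the choice stored per session, the app can tag each widget-shown and account-created line in the logs with "a" or "b", which is what makes the later counting trivial.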

I'll present the three experiments in the order I did them. The experiments are based on, all told, about 1.3 million login-widget views spread over about 2 months.

Experiment 1: A= Log In B= 2-Button

For the first experiment, A is a single "Log In" button.

A: [screenshot of the single "Log In" button]

We titled the button "Log In" to match the existing UI, but there's evidence that "Sign In" performs similarly.

B is the "2-Button" UI which quite literally spells out the "create" vs. "log in" cases:

B: [screenshot of the 2-Button UI with its "Create Account" and "Log In" buttons]

The A/B experiment buttons showed up below the existing log-in UI of CodingBat, as you can see above. We're just comparing A and B to each other, so the existing CodingBat UI factors out, as it's the same for both A and B.

Here are the results, courtesy of "grep -c" on the log files. We know whether each user got the A or B UI, and we just count how many accounts the A users and B users created. The experiment ran 24/7, so the data is spread across all types of users, time zones, etc.

widgets-shown:226104 old-create:1404
a:17 b:212   ab percent:8 1247

What the above means is that 226k log-in widgets of one sort or another were shown to end users. The site works pretty well without creating an account, so there are a lot of widgets shown per account created. Also, the widget is not shown when someone is logged in, so this is just a fraction of all page views. In that time, 1404 accounts were created with the old UI, 17 accounts were created from the (A) Log In button, and 212 accounts were created from the (B) 2-Button UI. That is a 12.4x performance difference! I'll save conclusions for the end.
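As an aside, the "ab percent" pair in the summaries appears to be a as a percent of b, then b as a percent of a; that reading matches all three experiments. A quick check in Java against experiment 1's counts:

    // Reproduces the "ab percent" pair from experiment 1's raw counts.
    public class AbPercent {
        public static void main(String[] args) {
            int a = 17, b = 212;                                   // raw create counts
            System.out.println("a as % of b: " + (100 * a / b));   // prints 8
            System.out.println("b as % of a: " + (100 * b / a));   // prints 1247
        }
    }

That 1247 is just the 12.4x figure written as a percent.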

Statistics note: I have not run a statistical significance test, but the results seemed quite robust as I monitored them. Basically, the pattern established within the first 6 hours would continue pretty much unchanged for the whole 2 weeks, which suggested to me that the measurement was not suffering from a lot of noise. The greater source of uncertainty lies in the peculiarities of the CodingBat CS-student population. That said, 12.4x is such a strong signal that changing it even by a factor of 2 does not much change the conclusions. If someone really cares, I can break out the data for a significance test.

Experiment 2: A= 2-Button B= Wide Button

For this experiment, A is the 2-Button interface as above. Then for B, I tried a single button made wide enough to put more words on it, including the word "create":

B: [screenshot of the wide "Create Account / Log In" button]

Results:

widgets-shown:327626 old-create:4655
a:653 b:383   ab percent:170 58

The wide button performs much better than the Log In button, coming in at 58% of the performance of the 2-Button. Across all the experiments, the 2-Button UI was by far the best performer, so you can rank any UI by what % of the 2-Button performance it achieves. I would say that 58% is pretty good considering that it only uses one button. No doubt the exact phrasing could be tuned and beautified.

Experiment 3: A= Sign in with your Email B= 2-Button

Here A is the "Sign in with your Email" button, taken right from the Persona docs where it is an officially suggested UI, and B is the 2-Button UI.

A: [screenshot of the "Sign in with your Email" button]

This one ran for a lot longer than necessary, since I got busy with other stuff and just left it in production. Results:

widgets-shown:748445 old-create:9255
a:202 b:1157   ab percent:17 572

So here the "Sign in with your Email" button performs at 17% of the level of the 2-Button UI, aka 5.7x worse. I was hoping this one-button design would work well, but it doesn't.

Conclusions

Here are the button designs organized from best performing to worst, which is maybe the more sensible way to look at them.

(1) The 2-Button "Create Account" Performs Great

The 2-Button UI with spelled-out Create Account and Log In buttons performs the best by a significant margin, and it should be mentioned in the docs and on the example sites as a good UI choice for sites where having 2 buttons makes sense. We can only laugh since the 2 buttons, in fact, lead to the exact same next screen. Nonetheless, the data shows clearly the big gain a site gets by giving the user that first "Create Account" button to click.

It looks like the word "create" is the key. The buttons without the word "create" performed 12.4x and 5.7x worse than the 2-Button UI. This makes some intuitive sense: the user is thinking "I want to create an account" and it's a big help if there's an effortless match on screen for their idea. Apparently the thought train "Well, it says Sign In, but I'll bet that really prompts me to create an account" does not work well with actual users. As a bit of supporting evidence, note that Gmail and Facebook have very prominent Create Account buttons, and you just know they've done a ton of testing.

(2) The Wide "Create" Button Performs Pretty Well

The wide "Create Account / Log In" button performs pretty well, at 58% of the level of the 2-Button. This design, or the variant of "Create Account / Sign In" is another good design to mention in the docs, demo sites etc, since many sites will like the simplicity of a single button. In reality, the text "Create Account / Log In" was something I thought up in 12 seconds. I just wanted a single wide button with the word "create". No doubt a better looking and performing design can be worked out in the longer term. I suspect something as terse as "Create / Log In" might perform pretty well.

In particular, the 58% performance of the wide button is more than 3x better than the 17% performance of the "Sign in with your Email" button.

Docs, Talks, Sample Sites

The docs should devote some space to listing a few button choices and their pros and cons, since this appears to be a surprisingly important and unintuitive decision. The current quick start docs, obviously written before there was any A/B testing data, say to create a single "login" button, which the data now shows to be a bad choice. I also suspect that half the time a site implements Persona, the developers just grab whatever button they see first and don't read any docs. Therefore, the buttons routinely shown in demos, in the docs, and on example sites should be from the set of buttons with at least reasonable performance.

Caveats

How wrong might these numbers be?

1. The CodingBat population is computer science students, so that skews the results in who knows what way. However, the data signals here, e.g. 5.7x worse, are so strong I'm extremely confident that the basic conclusions will hold up for a more general audience.

2. The CodingBat UI shows prominent alternative ways to log in. Instead of using Persona, users could create an account another way or just not create an account. We are measuring A and B in the presence of these alternatives. However, since we are measuring A and B relative to each other, the alternatives should not be much of a factor. In reality, most internet sites operate with alternatives too.

3. The Persona buttons tested do not fit in with the rather primitive UI look of CodingBat. As above, we are always comparing A and B to each other, so the CodingBat UI is just this constant in the background. Nonetheless, the combination of styles just looks a little weird.

A/B Test Implementation Ideas

For the curious, here's the implementation strategy I worked out. The basic scheme is the session-based assignment described earlier: when a session first needs it, pick A or B at random, store the choice in the session, and tag the log lines accordingly. I was quite happy with this strategy in the end -- it didn't take much time, it required minimal changes to the existing code, and it seemed reliable.

There's one remaining issue: when the user clicks the logout button, that redirects to a new session, so there's a 50% chance they get a different UI. This was actually kind of jarring to the end user, so it seemed worth fixing. I did not want to change the way the app did login/logout, as that was all working quite reliably. My solution was to have a 1-deep global buffer in the app. When a user logs out, we store their A/B choice in there, evicting any previous value. The next time we need to generate a random A/B choice, we consume the value in the buffer first, making a real random choice only if the buffer is empty. Because the log-out/new-session cases happen within a few milliseconds of each other, this de facto makes the right thing happen, and you can convince yourself that the 50/50 A/B split is maintained. You can also, of course, audit the logs at the end to verify that you did in fact show roughly equal numbers of the A and B UIs.
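Here's a sketch of what that 1-deep buffer might look like, again illustrative rather than the actual CodingBat code:

    import java.util.Random;

    // Sketch of the 1-deep hand-off buffer (illustrative; names are made up).
    // On logout we stash the departing user's A/B choice; the next new session
    // consumes it, so a user who logs out and immediately lands in a fresh
    // session keeps the UI they had before.
    public class AbBuffer {
        private static final Random rand = new Random();
        private static String pending = null;  // at most one stashed choice

        // Called on logout: remember this user's variant, evicting any old value.
        public static synchronized void stash(String variant) {
            pending = variant;
        }

        // Called when a new session needs a variant: consume the stashed value
        // if present, otherwise make a genuinely random 50/50 choice.
        public static synchronized String next() {
            if (pending == null) {
                return rand.nextBoolean() ? "a" : "b";
            }
            String variant = pending;
            pending = null;
            return variant;
        }
    }

Note that consuming a stashed value neither creates nor destroys an "a" or "b" -- it just moves one from the departing session to the new one -- which is why the overall 50/50 split is preserved, as argued above.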