Top Five Lessons from Running A/B Tests on the World's Largest Professional Network

Tuesday, November 3, 2015

We run thousands of experiments each year at LinkedIn. Thousands of features, big and small, have been launched, refined, or discarded through our robust A/B testing platform. As a result of these experiments, we have learned a tremendous amount about how to make our products better and our members happier.

At the same time, we have accumulated knowledge on how to run better experiments and how to leverage experimentation for better decision making. Here are the top five lessons on my ever-growing list.

#1. Measure one change at a time.

This is not to say that you can only test one thing at a time, but that you have to design your experiment so that each change can be measured on its own. At LinkedIn, a product launch usually involves multiple features or components. One big upgrade to LinkedIn Search in 2013 introduced unified search across different product categories. With this functionality, the search box is smart enough to figure out query intent without explicit input on categories such as “People,” “Jobs,” or “Companies.”

However, that was not all. Almost every single component on the search landing-page was touched, from the left rail navigation to snippets and action buttons. The first experiment was run with all changes lumped together. To our surprise, many key metrics tanked. It was a lengthy process to bring back one feature at a time in order to figure out the true culprit.

In the end, we realized that several small changes, not unified search itself, were responsible for bringing down clicks and revenue. Once these features were restored, unified search was shown to improve the user experience and was deployed to everyone.
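One way to keep bundled changes separable is to randomize each feature independently, so that every change has its own treatment and control split. The following is a minimal sketch of that idea, not a description of LinkedIn's platform: the feature names, the `assign` helper, and the hash-based bucketing are all assumptions made for illustration.

```python
import hashlib

# Hypothetical feature names, loosely based on the search example above.
FEATURES = ["unified_search", "left_rail_nav", "snippets"]


def assign(member_id: str, feature: str, treatment_share: float = 0.5) -> str:
    """Deterministically bucket a member for a single feature.

    Hashing the feature name together with the member id gives each
    feature its own split, independent of the splits for other features.
    """
    digest = hashlib.sha256(f"{feature}:{member_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def variants_for(member_id: str) -> dict:
    """Return the variant a member sees for every feature in the launch."""
    return {feature: assign(member_id, feature) for feature in FEATURES}
```

Because the splits are independent, the effect of any one feature can be estimated by comparing its treatment and control groups while the other features are balanced across both.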

#2. Decide on triggered users, but report on all users.

It is very common that an experiment only impacts a small fraction of your user base. For example, we want to automatically help people fill in their patents on their LinkedIn profiles, but not every member has a patent, so the experiment affects only the ~5% of members who have filed patents.

To measure how much benefit this brings to our members, we have to focus on this small subsegment, the “triggered” users. Otherwise, the signal from that 5% of users would be lost in the noise from the other 95%. However, once we have determined that the patents feature is beneficial, we need a “realistic” estimate of the overall impact. How is LinkedIn’s bottom line going to change once this feature is rolled out universally? Having such a “site-wide” impact estimate not only makes it possible to compare impacts across experiments, but also makes it easy to quantify ROI.
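As a rough illustration of going from a triggered result to a site-wide number, the sketch below dilutes a per-member effect by the trigger rate. It assumes that members who are not triggered see no change at all, and every number in it is made up for the example rather than taken from a real experiment.

```python
from typing import Tuple


def site_wide_impact(
    triggered_delta: float,
    trigger_rate: float,
    site_control_mean: float,
) -> Tuple[float, float]:
    """Dilute an effect measured on triggered members to all members.

    triggered_delta:   per-member change in the metric among triggered members
    trigger_rate:      fraction of all members who are triggered (e.g. 0.05)
    site_control_mean: per-member metric value across all members in control

    Assumes untriggered members are completely unaffected by the feature.
    Returns (absolute site-wide change per member, relative site-wide lift).
    """
    if not 0.0 <= trigger_rate <= 1.0:
        raise ValueError("trigger_rate must be between 0 and 1")
    absolute = triggered_delta * trigger_rate
    relative = absolute / site_control_mean
    return absolute, relative


# Hypothetical inputs: +0.40 per triggered member, 5% triggered,
# and a site-wide control average of 2.0 per member.
abs_delta, rel_lift = site_wide_impact(0.40, 0.05, 2.0)
```

With these made-up inputs, a sizeable change among triggered members shrinks to about 0.02 per member, or roughly a 1% lift, once it is spread over the whole site.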

#3. The experimental group should not be influenced by the experiment outcomes.

The fundamental assumption of A/B testing is that the difference between the A and B groups is only caused by the treatment we impose. It may be obvious that we need to make sure the users in A and B are similar enough to begin with. The standard approach to check for any pre-existing differences is to run an A/A test before the actual A/B test, where both groups of users receive identical treatments. However, it is equally important to make sure the user groups stay “similar” during the experiment, especially in the online world, because the experimental population is usually “dynamic”.

As an example, we tested a new feature where members received a small banner on their LinkedIn profile page to encourage them to explore our new homepage. Only users who had not visited the homepage recently were eligible to be in the experiment, and the eligibility was dynamically updated after a user visited the homepage. Because the banner brought more users in the treatment group to visit the homepage, more treatment users became ineligible over time. Because these “additionally” removed users tend to be more active than the rest, we artificially created a difference between users in A and B as the test continued.

In general, if the experimental population is directly influenced by the experiment outcomes, we are likely to see a bias. Such bias could void the experiment results because it usually overwhelms any real signal resulting from the treatment itself.
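The banner example can be mimicked with a toy simulation. This is only a sketch under invented assumptions: each member has an activity level drawn uniformly between 0 and 1, the daily chance of visiting the homepage is proportional to that activity, and the banner simply doubles that chance. None of the numbers come from the actual experiment.

```python
import random


def simulate(daily_visit_scale: float, n_members: int = 20000,
             days: int = 14, seed: int = 0):
    """Return (members still eligible, their mean activity) after `days`.

    A member stays eligible only if they never visit the homepage.
    The daily visit probability is daily_visit_scale * activity.
    """
    rng = random.Random(seed)
    activity = [rng.random() for _ in range(n_members)]
    still_eligible = []
    for a in activity:
        p_visit = daily_visit_scale * a
        visited = any(rng.random() < p_visit for _ in range(days))
        if not visited:
            still_eligible.append(a)
    return len(still_eligible), sum(still_eligible) / len(still_eligible)


# Control: no banner. Treatment: the banner doubles the visit probability.
control_n, control_activity = simulate(daily_visit_scale=0.10, seed=1)
treatment_n, treatment_activity = simulate(daily_visit_scale=0.20, seed=2)
```

In this toy setup the treatment group ends up with fewer eligible members, and those who remain are less active on average than the remaining control members, even though both groups started out alike. Comparing only the still-eligible members would therefore mix the banner's effect with a difference in who is left.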

#4. Avoid coupling a marketing campaign with an A/B test.

We recently revamped the Who Viewed My Profile page. The product team wanted to measure through an A/B test whether the changes were indeed better, and if so, by how much. The marketing team wanted to create buzz around the new page with an email campaign.

This is a very common scenario, but how can the A/B test and the email campaign coexist? Clearly, we can only send campaign emails to the treatment group, since there is nothing new for members in control. However, such a campaign would contaminate the online A/B test because it encourages more members from the treatment group to visit. These additional users tend to be less engaged, so we are likely to see an artificial drop in key metrics. It is best to measure the A/B test first, before launching the campaign.

#5. Use a simple rule of thumb to address multiple testing problems.

Multiple testing problems are extremely prevalent in online A/B testing. The symptom is that irrelevant metrics appear to be statistically significant. The root cause is usually that too many metrics are examined simultaneously (keep in mind that we compute over 1000 metrics for each experiment).
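A quick back-of-the-envelope calculation shows why this happens. The sketch below assumes, purely for illustration, that none of the metrics is truly affected and that the metrics are independent of one another, which real metrics rarely are.

```python
def expected_false_positives(n_metrics: int, alpha: float) -> float:
    """Expected number of unaffected metrics that still fall below the cutoff."""
    return n_metrics * alpha


def prob_at_least_one(n_metrics: int, alpha: float) -> float:
    """Chance that at least one unaffected metric looks significant,
    assuming the metrics are independent."""
    return 1.0 - (1.0 - alpha) ** n_metrics


# With 1000 metrics, compare a 0.05 cutoff against a stricter 0.001 cutoff.
for alpha in (0.05, 0.001):
    print(alpha,
          expected_false_positives(1000, alpha),
          round(prob_at_least_one(1000, alpha), 3))
```

Under these assumptions, a 0.05 cutoff on 1000 unaffected metrics yields about 50 spurious "significant" results per experiment, while a 0.001 cutoff yields about one.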

Even though we have tried to educate people on the topic of multiple testing, many are still clueless about what they should do when a metric is unexpectedly significant. Should they trust it or treat it as noise?

Instead of relying on education alone, we have found it very effective to introduce a simple rule of thumb: use the standard 0.05 p-value cutoff for metrics that are expected to be impacted, but use a smaller cutoff, say 0.001, for metrics that are not. The rule of thumb is based on an interesting Bayesian interpretation. It boils down to how much we believe a metric will be impacted before we even run the experiment. In particular, if using 0.05 reflects a prior probability of 50%, then using 0.001 means a much weaker belief, at about 2%.
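The derivation behind the 50% and 2% figures is not spelled out here. One reading that reproduces them, offered as an assumption rather than as the author's exact reasoning, is to require that a significant result be equally trustworthy under both cutoffs. If a metric is impacted with prior probability π and tests have the same power either way, the chance that a significant result is a false alarm depends on α(1 − π)/π, so holding that ratio fixed links each cutoff to a prior. The function name and defaults below are made up for this sketch.

```python
def implied_prior(alpha: float, ref_alpha: float = 0.05,
                  ref_prior: float = 0.5) -> float:
    """Prior probability of impact at which cutoff `alpha` is as trustworthy
    as cutoff `ref_alpha` is under prior `ref_prior`.

    Assumes equal power for both kinds of metric, so that matching
    alpha * (1 - prior) / prior matches the posterior false-alarm rate.
    """
    ref_ratio = ref_alpha * (1.0 - ref_prior) / ref_prior
    # Solve alpha * (1 - prior) / prior == ref_ratio for prior.
    return alpha / (alpha + ref_ratio)


print(round(implied_prior(0.05), 3))   # the reference case itself
print(round(implied_prior(0.001), 3))  # prior implied by the stricter cutoff
```

Under this reading, a 0.001 cutoff corresponds to a prior of 1/51, or just under 2%, which matches the figure quoted above.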

These are only a few best practices for experimentation, but they've proven crucial for product development at LinkedIn. As I’ve said before, A/B testing and data-driven decision making through experimentation are an extremely important part of the culture at LinkedIn. They guide how and why we build products for our users by giving us crucial data on how they actually use our services.

By following these five lessons, developers across all companies and industries can not only make more informed decisions about their products, but also create a better experience for the people using them.

This content is made possible by a guest author, or sponsor; it is not written by and does not necessarily reflect the views of App Developer Magazine's editorial staff.