A/B Testing Apps: The Complete Guide
Why intuition beats data in your mind but not in reality
You're certain the blue button will get more clicks than the orange button. Your designer swears the new onboarding flow is more intuitive. Your PM believes the feature is exactly what users want. You're probably wrong.
Humans are terrible at predicting behavior. We are biased toward our own preferences, our design aesthetic, and our internal assumptions. The only way to know what actually works is to test it.
A/B testing isn't about proving yourself right — it's about discovering what's true. The best product teams are wrong 60% of the time. They ship faster, learn faster, and iterate based on data instead of confidence.
Core A/B testing framework
An A/B test compares two versions of the same feature and measures which one performs better.
Define the metric you care about
Everything else follows from this choice. Are you optimizing for:
• Conversion: did the user complete the desired action (sign up, purchase, share)?
• Engagement: how long did the user spend in a particular flow?
• Retention: did this change affect whether users return tomorrow?
• Revenue: did this change increase lifetime value?
Pick one metric per test. If you optimize for five metrics simultaneously, you'll find reasons to ship the version you prefer regardless of data.
Create version A (control) and version B (variant)
Version A is your current implementation. Version B is the change you want to test. The only difference should be the element you're testing. Don't change button color AND text AND size simultaneously — you won't know which change worked.
Split users randomly 50/50
Half your users see version A, half see version B. The split must be random at the user level, not by day or device (if Monday is all version A and Tuesday is all version B, day-of-week effects will contaminate your results).
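One common way to get a stable, user-level random split is to hash the user ID. A minimal sketch in Python (the function name and experiment label are illustrative):

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "button-color") -> str:
    """Deterministically assign a user to A or B by hashing their ID.

    Hashing (rather than calling random.random() per session) guarantees
    the same user always sees the same variant, across sessions and devices.
    """
    # Salt with the experiment name so different tests get independent splits.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # a number from 0 to 99
    return "A" if bucket < 50 else "B"
```

Because the assignment depends only on the user ID and experiment name, a returning user lands in the same bucket every time, which is exactly the property a day-based or device-based split lacks.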
Run the test long enough to reach significance
If you run a test with 100 users per version and version B gets 55 conversions to version A's 45, that's not a win; that's random noise. You need 10,000+ users per version for most tests to reach statistical significance (usually 95% confidence).
Small teams: run tests for 2–4 weeks minimum. Run multiple tests in parallel (button color, headline copy, feature placement). Don't stop a test early even if results look good — wait for statistical significance.
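The "55 out of 100" point is easy to verify with a quick A/A simulation, where both variants are identical by construction. A sketch (the gap threshold and trial count are arbitrary choices for illustration):

```python
import random

random.seed(7)
# Simulate two IDENTICAL variants, 100 users each, true conversion rate 50%.
# Any observed gap is pure noise, yet large gaps show up regularly.
trials = 1000
big_gaps = 0
for _ in range(trials):
    a = sum(random.random() < 0.5 for _ in range(100))
    b = sum(random.random() < 0.5 for _ in range(100))
    if abs(a - b) >= 10:  # e.g. 55 conversions vs. 45
        big_gaps += 1
print(f"{big_gaps / trials:.0%} of A/A tests showed a 10+ conversion gap")
```

A meaningful fraction of these no-difference tests produce a "55 vs. 45" split, which is why a gap that size at 100 users per arm proves nothing.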
Analyze results: did B beat A?
Tools like Amplitude, Mixpanel, or Firebase report whether the difference between versions is statistically significant. If version B has a 5% higher conversion rate with a p-value < 0.05, ship version B. If the difference is not statistically significant, either version works: ship whichever you prefer, or keep the test running to collect more users.
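If you'd rather check significance yourself instead of relying on a tool, a two-proportion z-test needs only the standard library. A minimal sketch (the function and the example counts are illustrative, not from any particular product):

```python
import math

def two_proportion_z_test(conversions_a, users_a, conversions_b, users_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a = conversions_a / users_a
    p_b = conversions_b / users_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    p_pool = (conversions_a + conversions_b) / (users_a + users_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: A converted 300/1000 (30%), B converted 360/1000 (36%).
z, p = two_proportion_z_test(300, 1000, 360, 1000)
```

With these example numbers the p-value comes out well under 0.05, so you would ship B; at 100 users per arm the same rates would not reach significance.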
Common A/B testing mistakes
• Running a test on too few users. You'll get incorrect winners 40% of the time with underpowered tests.
• Stopping early when results look good. If you peek at results and stop, you introduce bias. Let it run.
• Running too many tests simultaneously. Multiple tests interact. If you're running 10 tests on the same flow, results become uninterpretable.
• Testing the wrong metric. Testing if more users click the sign-up button instead of testing if more users actually sign up and return.
What to A/B test
High-leverage tests across the user lifecycle:
• Onboarding: try different first-screen messaging, skip options, number of steps.
• Core features: button placement, terminology, interaction pattern.
• Pricing: price points, payment methods, subscription vs. one-time.
• Messaging: notification text, email subject lines, in-app copy.
• Visual: colors, layouts, typography.
Avoid testing: deeply personal preferences (logo design), things you've already tested, features that 90%+ of users love.
From test to shipped feature
When you have a winner:
1. Ship it to 100% of users.
2. Continue monitoring the metric. Sometimes post-launch behavior differs from test behavior.
3. Document the win in your playbook: "Removing the optional email field on signup increased signup rate 12%."
4. Build on winners: if a messaging change worked, test related messaging changes.
5. Celebrate, then move on: the best product teams are already testing the next thing.
Frequently Asked Questions
How many users do I need to run a valid test?
It depends on the expected effect size and your metric's baseline. For a typical 5–10% improvement on a 30% baseline metric, you need roughly 5,000 users per version to reach significance. Use an A/B testing calculator to estimate based on your specifics.
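You can reproduce what those calculators do with the standard two-proportion power formula. A back-of-the-envelope sketch (the function name and defaults of 95% confidence and 80% power are my own choices):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    # Sum of the variances of the two observed proportions.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
```

For a 30% baseline, a 10% relative lift needs roughly 3,800 users per arm, while a 5% lift needs roughly 15,000, which is why halving the effect you're chasing roughly quadruples the traffic you need.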
Can I A/B test with 1,000 users total?
Yes, but only if you're testing a large effect (25%+ improvement). Most meaningful improvements are 5–15%; these require 10,000–50,000 users per version for statistical power.
What if the test is a tie?
If A and B are statistically equivalent, neither won. You can ship whichever is cheaper (less work to maintain), or run more users if it's a critical decision. Most ties mean the variant doesn't matter — user behavior is driven by something else you haven't tested.