Last Tuesday I sat in on a growth review with a skincare brand in Greenpoint. The founder pulled up a spreadsheet titled "test tracker." It had 47 rows. New ad creative. Landing page variant. Email subject line. Checkout button color. I asked which tests had produced clear learnings they could build on. She scrolled for a minute, then said, "Honestly, I can't remember what most of these were testing."

This happens everywhere. Teams run experiments constantly but learn almost nothing. They test a bit of everything and nothing compounds. Six months later, someone suggests testing the same thing again because nobody documented what happened last time.

Running tests isn't the same as building knowledge. Most ecommerce teams treat experimentation like throwing spaghetti at a wall. Launch something. Check if revenue went up. Move to the next thing. No hypothesis about why it might work. No clear method for deciding what to test first. No system for capturing what you learned so the next experiment is smarter than the last one.

I've spent the last eight years helping ecommerce brands in New York build experimentation systems that actually compound. Not fancy. Not complex. Just a lightweight process that turns random testing into systematic learning. This article breaks down exactly how to do it.
Why most ecommerce tests produce noise instead of knowledge
Walk into any ecommerce team and ask what they're testing this month. You'll get a list. New Facebook creative. Product page layout. Abandoned cart email timing. Free shipping threshold. All running at once. All loosely connected to revenue but not to each other.

This is testing as activity, not as learning. You ship stuff. You watch dashboards. Maybe something moves. You declare victory or failure, then start over with a completely different idea. Nothing connects. Nothing builds.

I worked with a home goods brand in Tribeca last year. They ran 23 experiments in Q3. When I asked what they learned, they said, "Well, conversion rate went up 4 percent overall." Great, but which experiments caused that? They didn't know. They had changed too many things at once and documented nothing, so the knowledge evaporated.

The problem is structural. Most teams have no hypothesis format. They say "let's test new product photos" instead of "we believe lifestyle photography will increase add to cart rate by 15 percent because customer interviews revealed people struggle to visualize scale." The first is a task. The second is a testable belief you can learn from even if you're wrong.

Without a hypothesis you can't capture a learning. You can only say it worked or it didn't. That doesn't help you six months from now when you're trying to decide what to test next.
The hypothesis format that forces clarity
Here's the format I use with every brand I work with. It's one sentence: we believe changing X will improve metric Y by Z percent because reason.

We believe adding customer reviews above the fold on product pages will increase add to cart rate by 12 percent because exit surveys show trust concerns as the top objection.

We believe reducing our free shipping threshold from 75 to 50 dollars will increase average order value by 8 percent because cart analysis shows 40 percent of customers adding a second item, then removing it at checkout.

We believe launching a 7 day post purchase email series will increase 90 day repeat purchase rate by 10 percent because our current first repeat purchase happens at day 62 and we have no structured touchpoints before then.

See the structure? What you're changing. Which metric you expect to move. By how much. And why you think it'll work. This format does three things. It forces you to pick one metric. It makes you state your expected impact so you know whether the test succeeded. It requires you to articulate the underlying insight so you can learn even if you're wrong.
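If you track hypotheses in a tool or script rather than a doc, the one-sentence format maps cleanly onto a small data structure. Here's a minimal sketch in Python; the class name, fields, and example values are my own illustration, not part of any particular platform:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """One testable belief: change X, move metric Y by Z percent, because reason."""
    change: str                # what you're changing
    metric: str                # the single metric you expect to move
    expected_lift_pct: float   # by how much
    because: str               # the underlying insight

    def sentence(self) -> str:
        """Render the hypothesis in the standard one-sentence format."""
        return (f"We believe {self.change} will improve {self.metric} "
                f"by {self.expected_lift_pct:g} percent because {self.because}.")

h = Hypothesis(
    change="adding customer reviews above the fold",
    metric="add to cart rate",
    expected_lift_pct=12,
    because="exit surveys show trust concerns as the top objection",
)
print(h.sentence())
```

The point of structuring it this way is that every field is required: you can't create a hypothesis without naming one metric, one expected lift, and one reason.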
A jewelry brand in SoHo I advise started using this format six months ago. Before that they would just say "test new email subject lines." Now they write "we believe using urgency language in subject lines will increase open rate by 8 percent because our best performing emails historically have scarcity messaging." When the test only lifts open rate by 3 percent, they don't just move on. They document why the gap happened, and that insight informs the next email test.

The hypothesis is the foundation. If you skip this step you're just launching random changes and hoping. That might accidentally work, but it doesn't compound. Knowledge compounds. Activity doesn't.
How to prioritize when everything feels urgent
Once you start writing hypotheses you'll have too many. That's good. It means you're thinking systematically. But you can't run them all. You need a prioritization method that doesn't rely on whoever yells loudest or whatever sounds exciting today.

I use ICE scoring: impact, confidence, effort. Rate each hypothesis on a 1 to 3 scale for all three dimensions. Impact is how much this could move the needle. Confidence is how sure you are it'll work. Effort is how hard it is to execute. Multiply impact times confidence, divide by effort. Highest score wins.

This takes about two minutes per hypothesis. A cookware brand in the West Village uses this in their weekly growth review. They score 5 to 8 hypotheses every Monday. The top two get resourced for that week. Everything else stays in the backlog. Simple. Fast. Removes politics.
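The scoring math is simple enough to sanity check in a few lines. A sketch with a hypothetical backlog; the 1 to 3 ratings are judgment calls you supply, and the backlog entries here are made up:

```python
def ice_score(impact: int, confidence: int, effort: int) -> float:
    """ICE score: impact times confidence, divided by effort. Each rated 1-3."""
    for name, value in (("impact", impact), ("confidence", confidence), ("effort", effort)):
        if value not in (1, 2, 3):
            raise ValueError(f"{name} must be 1, 2, or 3, got {value}")
    return impact * confidence / effort

# Hypothetical backlog: (hypothesis, impact, confidence, effort)
backlog = [
    ("reviews above the fold", 3, 2, 1),
    ("lower free shipping threshold", 2, 2, 2),
    ("post purchase email series", 3, 1, 3),
]

# Highest score wins; the top entries get resourced this week
ranked = sorted(backlog, key=lambda row: ice_score(*row[1:]), reverse=True)
for name, i, c, e in ranked:
    print(f"{ice_score(i, c, e):.1f}  {name}")
```

The range check matters in practice: once people start gaming scores with 4s and half points, the ranking stops meaning anything.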
Impact is about magnitude, not ease. A test that could increase conversion rate by 20 percent has high impact even if it's hard. A test that might lift email open rate by 2 percent has low impact even if it's easy. Don't confuse quick wins with meaningful ones.

Confidence comes from data, not gut feel. If you've run similar tests before and they worked, your confidence is high. If this is a brand new idea based on one customer comment, your confidence is low. Low confidence doesn't mean don't test it. It just means don't bet everything on it. Test it small. Learn. Then scale if it works.

Effort is time and resources. Does this need dev work? Does it require new creative assets? Does it take two hours or two weeks? Be honest. A men's apparel brand in Chelsea kept scoring tests as low effort when they actually required design contractor time they didn't have budgeted. Their backlog got clogged with medium priority tests they couldn't execute. We recalibrated effort scores and suddenly priorities got real.

Once you score everything, run one clean test at a time per funnel stage. If you're testing acquisition, don't simultaneously test activation. You need clean signal. I know this feels slow. It's not. One definitive learning beats five inconclusive maybes.
The learning log that makes experiments compound
Here's where most teams completely fail. They run a test. It finishes. Someone says "cool, it worked" or "damn, it didn't." Then everyone forgets about it and moves on. Three months later, someone suggests the same test again because there's no institutional memory.

You need a learning log. Not a test tracker. A learning log. The difference matters. A test tracker lists what you launched. A learning log captures what you learned.

I use a simple spreadsheet. Five columns: what we tested, what we predicted, what actually happened, why we think it happened, what we'll do next. That's it. Every completed experiment gets one row. Takes five minutes to fill out. Saves hours of repeated mistakes.
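If you'd rather append rows from a script than by hand, the five-column log is a few lines of Python over a CSV file. A sketch; the column names and the example entry are illustrative, not a required schema:

```python
import csv
import os
import tempfile

COLUMNS = ["what we tested", "what we predicted", "what happened",
           "why we think it happened", "next action"]

def log_learning(path, tested, predicted, actual, why, next_action):
    """Append one completed experiment as a row; write the header on first use."""
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(COLUMNS)
        writer.writerow([tested, predicted, actual, why, next_action])

# Usage with a throwaway file; in practice, point this at a shared location
log_path = os.path.join(tempfile.mkdtemp(), "learning_log.csv")
log_learning(
    log_path,
    tested="product page videos above the fold",
    predicted="+18% add to cart rate",
    actual="+6% add to cart rate",
    why="videos too long; most viewers dropped off by 15 seconds",
    next_action="test 30 second videos answering the top application question",
)
```

A CSV works fine for this, but the storage is the least important part. What matters is the discipline of one row per completed test.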
A beauty brand in the Financial District started doing this last year. They ran an experiment on product page videos. The hypothesis was that videos would increase add to cart rate by 18 percent because their customer research showed confusion about application methods. The actual result: add to cart went up 6 percent. Why the gap? They documented that the videos they used were too long and most people didn't watch past 15 seconds. The next action was to test 30 second application videos focused on the single most common question. The second test hit the 18 percent lift.

Without the learning log, they would have said "videos don't work" and moved on. With the learning log, they said "this version of videos didn't work, but we know why and we know what to try next." That's knowledge compounding.

The log also prevents repeated failures. A furniture brand in Dumbo kept testing free shipping offers every few months. They would try it. Revenue would spike. Profit would tank. They would turn it off. Then forget why they turned it off. Six months later, someone would suggest trying free shipping again. I made them document every free shipping test in the learning log with clear math on unit economics. Now when someone suggests it, they pull up the log and see exactly why it doesn't work at their price point and margin structure.

The learning log is your institutional memory. It's especially critical if you have team turnover. A new growth hire can read six months of learning logs and get up to speed in a few hours instead of repeating every mistake the last person made.

Store it somewhere accessible. Notion. Google Sheets. Airtable. Doesn't matter. What matters is that you fill it out after every test and reference it before starting new tests. That's the system.
Building a weekly experiment review rhythm
All of this falls apart without cadence. You write great hypotheses. You score them. You run clean tests. But if you only review experiments when someone remembers to check the numbers, nothing compounds.

I run weekly experiment reviews with every brand I work with. 20 minutes. Same day every week. Same agenda. What tests are currently running? What data do we have? Should we extend, call it, or change course? What did we learn? What goes in the log? What do we prioritize next?

This isn't a big strategic planning session. It's a tactical checkpoint. The goal is to keep experiments moving and capture learnings while they're fresh. If you wait until quarterly reviews to document what you learned from a test you ran in week three, you've already forgotten the details.

A fragrance brand in Nolita runs their experiment review every Friday at 10am. It takes 15 to 25 minutes. They've done it for 18 months straight. They now have 60+ documented learnings in their log. When they hire someone new, that person reads the log and immediately understands what works for this specific business and audience. That's a competitive advantage you can't buy.

The weekly rhythm also prevents zombie experiments. You know the ones: tests that launched, then nobody checked for six weeks. If you're reviewing weekly, you catch problems fast. Maybe the test setup was wrong. Maybe external factors changed. Maybe you need more time. Weekly reviews let you course correct instead of wasting months.

Block it on your calendar right now. Pick a day and time. Make it recurring. Don't skip it when things get busy. That's when you need it most.
Frequently asked questions
How many experiments should we run at once?

One per funnel stage, maximum. If you're testing acquisition, like new ad creative, don't simultaneously test activation, like checkout flow. You need clean attribution. More experiments don't equal more learning if you can't tell which one caused which result. Most solo founders or small teams should run one test at a time, period. Better to learn something definitive from one test than to learn nothing from three overlapping tests.
What if we don't have enough traffic for statistical significance?

Then test bigger changes and extend your testing window. Statistical significance is ideal, but not always realistic for smaller brands. If you're doing 10k visits per month, you can't run a button color test. You can test entirely different product page layouts or major pricing changes. Also consider using directional data. If conversion rate goes from 2 percent to 3.5 percent over four weeks and nothing else changed, you probably learned something even if it's not statistically perfect. Just document your confidence level in the learning log.
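If you want a rough check on whether a lift is signal or noise without an A/B testing platform, a standard two-proportion z-test fits in Python's standard library. This is a sketch for sanity checking, not a substitute for proper test design, and the visit and conversion counts below are hypothetical:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_a / n_a: conversions and visits for the control,
    conv_b / n_b: conversions and visits for the variant.
    Returns (z statistic, two-sided p-value).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the normal CDF, via the error function
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 2% vs 3.5% conversion with 5,000 visits per arm: a big enough lift
# that even modest traffic gives a clear signal
z, p = two_proportion_z(conv_a=100, n_a=5000, conv_b=175, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

This is exactly why bigger swings work for smaller brands: a 2 percent vs 2.04 percent difference on the same traffic would come back nowhere near significant, while a 2 to 3.5 percent jump does.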
Should we test on mobile and desktop separately?

Only if they behave completely differently and you have enough volume in each to get signal. Most brands should test across all traffic first. If you see a meaningful result, you can then segment the data to see whether one platform drove it. Don't start with segmentation. It fragments your learning and slows everything down. Get the overall result first. Dig into segments only when it matters for implementation.
Conclusion
Experimentation in ecommerce isn't about running more tests. It's about building a system that turns tests into knowledge. Hypothesis format. Prioritization method. Learning log. Weekly review rhythm. That's the complete loop.

When you have this system, experiments stop being random activity and start compounding. Six months from now you're not starting from zero. You're building on 20 documented learnings that tell you exactly what works for your specific business and audience.

Most ecommerce brands treat growth like a slot machine. Pull the lever. Hope for wins. Repeat. You can do better. Write one clear hypothesis this week. Score it. Run it clean. Document what happens. Then do it again next week. That's how systematic growth works.

Ready to stop guessing and start learning? Build your hypothesis. Run your test. Fill out the log. Review it next week. Repeat until you have a system that actually teaches you something.