This smells fishy. If you're experimenting with a typical significance threshold of p < 0.05 (which is often too lenient for ecommerce optimisation), surely you'd expect roughly 1 in 20 tests to fail by chance, even if your product is better?
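To put a rough number on the 1-in-20 intuition, here is a minimal simulation sketch. It runs many A/A tests, where both arms share the same true conversion rate, and counts how often a pooled two-proportion z-test reports p < 0.05 anyway. The function names, the 10% conversion rate, and the sample sizes are all illustrative assumptions, not anything from a real tool. Strictly speaking, this measures the false positive rate when there is no real difference. How often a genuinely better product fails to reach significance depends on statistical power (sample size and effect size), not on the 0.05 threshold alone.

```python
import math
import random


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_a / n_a - conv_b / n_b) / se
    # erfc(|z| / sqrt(2)) is the two-sided tail area of the standard normal
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(trials=2000, n=1000, rate=0.10, alpha=0.05, seed=0):
    """Fraction of A/A tests (identical arms) that come out 'significant'.

    All defaults are hypothetical values chosen only for illustration.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Both arms draw from the same true conversion rate.
        conv_a = sum(rng.random() < rate for _ in range(n))
        conv_b = sum(rng.random() < rate for _ in range(n))
        if two_proportion_p_value(conv_a, n, conv_b, n) < alpha:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(f"Share of A/A tests with p < 0.05: {false_positive_rate():.3f}")
```

The printed share should land close to 0.05, which is the 1-in-20 figure. Lowering `alpha` reduces these chance "wins" at the cost of needing more traffic to detect a real improvement.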
If you are more technical, we have a documented API that you can integrate with directly!