Scientist: Measure Twice, Cut Over Once (githubengineering.com)
122 points by jesseplusplus on Feb 3, 2016 | 18 comments



If you ever wondered why pure functions might be practical, this is an example: you can run two pure (side-effect-free) functions in parallel and compare their results and performance, which is exactly what `science` does.
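For those who haven't seen it, the basic shape from the Scientist README looks roughly like this (the model/user permission checks are the README's placeholder example):

    require "scientist"

    class MyWidget
      include Scientist

      def allows?(user)
        science "widget-permissions" do |experiment|
          experiment.use { model.check_user(user).valid? } # old way (control)
          experiment.try { user.can?(:read, model) }       # new way (candidate)
        end # always returns the control's value
      end
    end

Scientist randomizes which block runs first, records the results and timings of both, and always returns the control's value to the caller.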


Twitter's Diffy goes further: it first computes a diff between two instances of the control (the same old code deployed twice) to detect non-deterministic output (e.g. transaction IDs) that would otherwise show up as false positives. That noise is omitted from the metrics, and the candidate's output is then diffed against the control with the noisy fields excluded.

https://blog.twitter.com/2015/diffy-testing-services-without...
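A back-of-the-envelope sketch of that noise-filtering step, in Ruby rather than Diffy's actual implementation; primary_res, secondary_res, and candidate_res are assumed to be hash-like responses:

    # primary and secondary run the *same* old code, so any field that
    # differs between them is non-deterministic noise (IDs, timestamps).
    def noisy_keys(primary_res, secondary_res)
      primary_res.keys.select { |k| primary_res[k] != secondary_res[k] }
    end

    # Diff the candidate against the primary control, ignoring the noise.
    def meaningful_diff(primary_res, candidate_res, noise)
      (primary_res.keys | candidate_res.keys)
        .reject { |k| noise.include?(k) }
        .select { |k| primary_res[k] != candidate_res[k] }
    end

    noise = noisy_keys(primary_res, secondary_res)
    meaningful_diff(primary_res, candidate_res, noise) # => real regressions only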


Awesome. I'm a huge fan of new and innovative tools that help improve the process of refactoring existing code. This looks like a really promising tool for Ruby developers, and I'm always grateful when companies and their employees invest the time and effort to release their tools to the community. I especially liked the point about "buggy data" as opposed to just buggy code; I think that's a really important distinction.

A few reactions from reading through the release:

Scientist appears to be largely limited to situations where the code has no "side effects". I think this is a pretty big caveat, and it would have been helpful to see it mentioned in the introduction/summary. Similarly, I think it would be nice to point out that Scientist is a Ruby-only framework :)

You don't mention "regression test" at any point in the article, which is the language I'm most familiar with for this sort of testing. How does a Scientist "experiment" compare to a regression test over the same block of code?

Anyway, thanks again for writing this up, I'll be thinking more about the Experiment testing pattern for my own projects.


> Scientist appears to be largely limited to situations where the code has no "side effects".

That's one of the things I was initially thinking too, but then as I thought about where I could have used it in the past, I could only come up with a few cases where it wouldn't have been possible to keep the side effects isolated.

For example, run your new code against a non-live data store: when a user changes permissions, the old code changes the live DB while the new code changes a temporary DB. Later (or continuously), you can compare the databases to ensure they're identical (easier if the storage schema stays constant, a bit harder if you're changing to a different schema or product). A rough sketch of that idea follows.
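Sketched with Scientist's own blocks (the live_db/shadow_db handles and the writer classes are made up for illustration):

    science "permission-change" do |experiment|
      # Old code path writes to the live database, as it does today.
      experiment.use { OldPermissions.new(live_db).update(user, perms) }
      # New code path writes to a temporary copy instead.
      experiment.try { NewPermissions.new(shadow_db).update(user, perms) }
    end

    # Later (or continuously), reconcile the two stores:
    drift = live_db.permissions_table.to_a - shadow_db.permissions_table.to_a
    warn "stores diverged: #{drift.inspect}" unless drift.empty?

Scientist would still compare the two return values; the out-of-band store comparison catches divergence the return values don't show.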

Where it gets difficult to impossible is when you're touching external services that don't let you copy state and set up a secondary test instance, but even in that case you could record the requests made (or that would have been made) and verify that both versions would have done the same thing.

Definitely an interesting concept overall.


I've been thinking about the naming of this library, and I don't think "science" is a good metaphor for what it does.

You can only test hypotheses of the form "A is exactly like B" - no bug fixes are allowed, because they will show up as differences.

So a more accurate (but less cool) name might be "Refactoring" - you assert that all your tests still pass, where your tests are your production data.


> You can only test hypotheses of the form "A is exactly like B" - no bug fixes are allowed, because they will show up as differences.

Of course they will. That's a feature. The system doesn't intrinsically know whether a change in behaviour is a bug being added or removed; it can only report that there's a difference in behaviour. It's your job to investigate whether the control or the experiment is correct.


This is also called a "test oracle" in property-based testing. The "oracle" is your known-good implementation, and you test your subject function by comparing its output against the oracle's. It works great with property tests because you don't need to specify the inputs manually; just generate them and check that the equality holds.


Of course, property-based testing is richer, and also works without a reference implementation. E.g. you can test properties like commutativity:

   f(x, y) = f(y, x)
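Both ideas fit in a few lines of plain Ruby, no property-testing library required (slow_sum/fast_sum are stand-in implementations):

    # Oracle test: a trusted (if slow) implementation checks the new one.
    def slow_sum(xs)
      xs.inject(0) { |acc, x| acc + x } # known-good oracle
    end

    def fast_sum(xs)
      xs.sum # subject under test
    end

    100.times do
      xs = Array.new(rand(0..20)) { rand(-1000..1000) }
      raise "oracle mismatch for #{xs.inspect}" unless fast_sum(xs) == slow_sum(xs)
    end

    # Property test with no oracle at all: check commutativity directly.
    f = ->(x, y) { x * y + x + y }
    100.times do
      x, y = rand(-100..100), rand(-100..100)
      raise "not commutative at #{x}, #{y}" unless f.call(x, y) == f.call(y, x)
    end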


Overall, I love this type of approach. We've begun doing something similar at work as well.

However, I don't get the restriction on code with side effects.

Would it not be possible to introduce another abstraction layer around those side effects, to allow comparison between the old code's side effects and the refactored code's side effects?


I don't think this would work, for a number of reasons. If it's a database you're modifying, a lot of operations (increment, delete, etc.) will do the wrong thing if they're called twice. And if the operations themselves are idempotent, you wouldn't be able to verify that the intended side effect was correct. This is one reason developers spend a lot of time building mock objects: to capture "side effects".
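To make the double-execution problem concrete (a toy counter standing in for a real datastore):

    counter = 0
    increment = -> { counter += 1 } # the side-effecting operation

    increment.call # the control runs the operation...
    increment.call # ...and then the candidate runs it again
    counter        # => 2, where a single code path would have left it at 1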


How robust do you imagine it would be to just record the call / response pairs of the mutable objects in the new code and then replay them when running the experiment on the old code?

For example, suppose you have a db object and two versions of the code, new_code and old_code. You call something like:

   experiment.run(new_code, old_code, mutables: [db])

Then the infrastructure runs new_code normally, but records the arguments and return value of every call to db (and to any other object declared as mutable). Next, it runs old_code, but whenever a method of db is called, it tries to match it with a call made by new_code and directly returns the recorded return value. If it can't match the call, it signals an error; it never actually calls db, thus eliminating the risk of double side effects.

Obviously this would fail when the two versions perform different operations against the database, even semantically equivalent but non-identical ones (say one retrieves a value and increments it inside a transaction, while the other uses a stored procedure to increment without fetching). But it still relaxes the constraint: now you can use it both for code with no side effects and for code whose side effects are the exact same call/return pairs to mutable objects.
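A minimal sketch of that record/replay idea, using the db, new_code, and old_code objects from above (hypothetical; real matching would need to handle call ordering, argument normalization, and blocks):

    # Wraps the mutable object while new_code runs, logging every call.
    class RecordingProxy
      attr_reader :log

      def initialize(target)
        @target = target
        @log = []
      end

      def method_missing(name, *args, &block)
        result = @target.public_send(name, *args, &block)
        @log << [name, args, result]
        result
      end

      def respond_to_missing?(name, include_private = false)
        @target.respond_to?(name, include_private) || super
      end
    end

    # Stands in for the mutable object while old_code runs: it never touches
    # the real target, it only replays recorded results for matching calls.
    class ReplayingProxy
      def initialize(log)
        @log = log.dup
      end

      def method_missing(name, *args)
        i = @log.index { |(n, a, _)| n == name && a == args }
        raise "unmatched call: #{name}(#{args.inspect})" unless i
        @log.delete_at(i).last
      end
    end

    recorder = RecordingProxy.new(db)
    new_result = new_code.call(recorder)                          # side effects happen once
    old_result = old_code.call(ReplayingProxy.new(recorder.log))  # replayed, no side effects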


> How robust do you imagine it would be to just record the call / response pairs of the mutable objects in the new code and then replay them when running the experiment on the old code? […] Obviously this would fail when the two versions perform different operations in the database, even semantically equivalent

I'd think the latter would be the common expectation, by virtue of it being a different implementation.


For that you would need:

* An STM implementation (Ruby doesn't have one, AFAIK)

* All network services you connect to (e.g. databases, APIs, etc.) to support transactions for write operations


Wow. We are literally working on the same problem described in this post.

Looks like a great tool! We'll give it a spin.


Sadly, the only project I can think to use this on is still on Ruby 1.8.7, whereas the gem only works on 1.9+.


I wonder if there could be a way to abstract this for testing gem version upgrades.


At first I thought tests would cover this, but it would be pretty cool to compare performance across different gem versions. Unfortunately, you can't really load two versions of the same gem in the same runtime, so you'd probably have to fall back to other benchmarking methods.


Will you run into trademark issues with that name? https://www.micromath.com/ has a product with the same name.



