Flake Factories
In the last few years, every client Rails application I've worked on has had factory_bot factories that look like this:
FactoryBot.define do
  factory :user do
    first_name { Faker::Name.first_name }
    last_name { Faker::Name.last_name }
    email { Faker::Internet.email }
    status { %i[active inactive banned].sample }
    date_of_birth { Faker::Date.birthday(min_age: 18, max_age: 65) }
  end
end
That is, most of the attribute values are randomly generated, with heavy use of faker. We'll call these 'random factories'. In these projects, random factories are used for setting up data in unit tests and every other type of test; they aren't only used for demo purposes. I don't know the origin of this practice or how widespread it is, but it is apparently the default way to write factories in my recent circle of Ruby developers.
After about 3 years of working in such projects, I'm convinced that it's a bad practice that should be avoided. It's good for generating flaky tests and not much else.
This might be obvious to those seeing this for the first time. "Your shared code for setting up test data is nondeterministic? Or at best, it's deterministic with respect to a random seed? Of course you're going to have flaky tests." But when I ask why we don't have simple, static values in our factories, which can of course be overridden where needed, like this:
FactoryBot.define do
  factory :user do
    first_name { "Test" }
    last_name { "User" }
    email { "testuser@email.com" }
    date_of_birth { Date.new(2000, 1, 1) }
    active

    trait :active do
      status { :active }
    end

    # traits for other statuses, etc....
  end
end
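Where a test needs a specific value, the static defaults are overridden as usual; a quick illustration:

  user = create(:user, first_name: "Ana", status: :banned)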
I hear arguments like:
If we pick a single value, how do we know that our tests aren't passing spuriously? It could be that they only pass for that specific value or a small subset of possible values like it. Thus our tests would be missing bugs.
Before examining this argument, I will say that, in practice, I have never observed a single concrete instance of the putative advantage of this approach. I have never found the reason for a failing test (that passed the last N times it ran, mind you) to be a genuine bug that occurs for a particular, obscure combination of randomly generated factory values.
I have, however, wasted dozens of hours rerunning CI pipelines and investigating and fixing flaky tests. Perhaps I'm unlucky or I am selectively forgetting things. But, in my experience and with all due caveats, the empirical case for random factories stinks. And my lawyer tells me that I am contractually obligated to blog about bad stenches (see my forthcoming Protein Shakes Left in Hot Cars series).
The missing upside for random factories
There are reasons why the imagined benefits of random factories don't pan out in practice, or at least I have speculations about them. Consider what proponents of random factories are actually advocating.
Random factories do not increase the number of cases tested per test run; each test still only creates a single User instance with a single first_name value. There's no particular reason to think that the value picked on a given test run is "better" for the given test than the arbitrary static value we would have used.
Instead, random factories merely increase the number of test cases we are sampling from for each run. This is in contrast to useful testing techniques that generate a bunch of random test cases per run, like fuzzing. The advantage of random factories, then, is not that any given test run will be more effective at detecting defects. It's that over time and many test runs, the sample of test runs from the larger distribution will be more effective since they will have seen a larger variety of data.
But this is not how PRs' test results are evaluated. A PR can pass CI and get into production once all of the tests pass a single time. Random factories do not really guard against false test successes for a given PR. Their guarantee is that, if there is some bug lurking somewhere in the distribution of random test cases, as the number of test runs approaches infinity, the probability that a relevant test case will run and (hopefully) fail a test approaches 1. This is a rather weak guarantee for projects with a finite lifetime and a finite number of test runs against frequently changing code. In other words, the supposed benefits of random factories are realized over a large sample of test runs, but n=1 when deciding whether a PR goes into production.
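To put rough numbers on that weak guarantee (illustrative arithmetic with an assumed bug probability, not data from any real project):

  # Suppose a bug only surfaces for an improbable combination of random
  # factory values -- say with probability p = 1/1000 on any given run.
  p = 1.0 / 1000

  1 - (1 - p)**1    # => 0.001 -- odds that the single CI run gating the PR catches it
  1 - (1 - p)**500  # => ~0.39 -- detection only becomes likely over many runs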
There are at least three more facts that diminish the utility of random factories. These may help explain why I've never actually seen a random factory I was happy to have.
First, the combination of developers, code reviewers, code coverage tools, and QA (at least the ones I work with) is competent enough to ensure that the obviously relevant cases are covered by tests. Consider a method that formats user names. One would write tests with data representing each significant case according to the method's intended logic--missing first name, extra whitespace, etc. And these would be defined explicitly in the tests themselves; you wouldn't leave them up to the random factory to generate. The imagined scenario where the tests only pass for a user with first_name: 'Test' and last_name: 'User' just isn't realistic in my experience. Thus, there's little benefit in complicated factory code to prevent it.
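For a hypothetical NameFormatter (a name I'm inventing for illustration), explicit case coverage looks like this, rather than hoping a random factory stumbles onto the right inputs:

  RSpec.describe NameFormatter do
    it "handles a missing first name" do
      expect(NameFormatter.format(first_name: nil, last_name: "User")).to eq("User")
    end

    it "strips extra whitespace" do
      expect(NameFormatter.format(first_name: " Test ", last_name: "User")).to eq("Test User")
    end
  end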
Second, developers tend to ignore test failures that happen randomly and infrequently, dismissing them as flakes. This is especially true if your test suite has earned the reputation of being flaky, which is more likely if you're using random factories. And frankly, ignoring flaky tests often makes sense. I'm trying to ship some important feature and CI fails due to an unrelated test that I suspect is flaky. Why would I delay the important feature for an indefinite amount of time to track down the cause of the flake and fix it? I could instead run the tests in CI again, which has a high probability of unblocking me and will take a known amount of time (probably on the order of a few minutes). But notice that this reduces the effectiveness of random factories for preventing actual bugs. If the bug occurs only for some improbable combination of random attribute values and causes an accurate test failure, it still looks like a flake. Thus, it will probably be ignored until some developer has some free time or is fed up enough to investigate.
Third, the chosen methods for generating random values have little to no information about the system under test. They typically sample from a uniform distribution of possible values when possible (e.g., %i[active inactive banned].sample) or from whatever is in Faker's data sets. The latter are not designed to probe edge cases or find uncovered branches in your code--they're designed to "look realistic". And sampling from a uniform distribution of possible values means that the most common or happy path values are just as likely to be picked as rare or unhappy path values. But if we're looking for lacunae in our tests, wouldn't we want to bias toward unhappy path values and edge cases if we know nothing else about how the factory is used?
This gets worse when we sample from a uniform distribution of, say, a range of integers or dates. Presumably, the most interesting values for testing our code will be at the edges of the range. But if we define it as something like rand(1..100), we are much more likely to pick an uninteresting value in the middle than one at the edge, especially as the range size increases.
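Concretely, a back-of-the-envelope illustration:

  # With rand(1..100), each endpoint comes up once per 100 draws on average,
  # so a given run exercises a boundary only ~2% of the time.
  2.0 / (1..100).size     # => 0.02

  # Widen the range and the boundaries get rarer still.
  2.0 / (1..10_000).size  # => 0.0002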
To summarize, in practice, random factories don't provide additional help with obvious test coverage gaps and they don't really help with improbable data scenarios--at least not in a timely manner. They may help with some scenarios in the middle (if they exist). But the techniques used to generate values aren't particularly well-suited for finding interesting cases with respect to the code under test. Hence, it's no surprise that I've never been lucky enough to encounter an actual random factory that pulls its weight.
Given this, it seems that the best the advocate of random factories can say is:
All else equal, wouldn't it be better if tests failed some of the time randomly for real, improbable bugs rather than not at all? It could be that some of those flaky tests that you ignored or did not have time to fix had real bugs behind them.
Perhaps. But all else is not equal.
The downside of random factories
One of the main causes of flaky tests is test setup that is not what was intended some of the time. You're testing that a User can't be created without a last_name. But sometimes, randomly, when you initialize User it in fact has a last_name and saves successfully, failing your assertion. That's a flake.
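Here's a sketch of how that plays out, assuming (hypothetically) a factory attribute that is only sometimes present:

  # Hypothetical factory attribute, invented for illustration:
  #   last_name { [Faker::Name.last_name, nil].sample }

  it "can't be created without a last_name" do
    user = build(:user)              # only *sometimes* lacks a last_name
    expect(user.save).to be(false)   # fails whenever a last_name was sampled
  end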
The problem with random factories in this regard is twofold. They make test setup code more complicated and hence easier to get wrong. And their results are random; mistakes do not make themselves known reliably.
Consider a very simple example (ignore potential flakiness issues caused by relative dates). Assume there's a validation on Invoice preventing due dates greater than 60 days in the future.
FactoryBot.define do
  factory :invoice do
    # ...
    due_date { Faker::Date.between(from: 5.days.ago, to: 61.days.from_now) }
  end
end
Is that date range inclusive or exclusive? Turns out it's inclusive. But in this case, the writer of the factory interpreted it as exclusive. Now, ~1.5% of the time, it will create an invalid Invoice (the range spans 67 possible days, exactly one of which--61 days out--violates the validation, so 1/67 ≈ 1.5%). It's likely that the person who wrote this factory and ran it against some tests never encountered a failure due to this error. And so it passes CI and gets committed. Any tests that assume they're creating a valid Invoice will now be flaky.
You might say, "OK, that's just a silly mistake. It should have been caught in code review". Perhaps. But I've fixed almost this exact line of code and ones that involved misinterpreting Faker's API multiple times. These sat latent for weeks or months causing flaky tests before they were fixed.
To clarify, I'm not saying that this was due to some unforgivable negligence by the writer or the reviewer. The point is that this mistake should not even be humanly possible. If it is made, it should be blindingly obvious and fail reliably. Pick a single arbitrary date (e.g., 30.days.from_now) that is unambiguously valid, instead of overcomplicated random nonsense, and it will never happen. As we established in the last section, making that attribute value block more complicated by randomly sampling the range of values does almost nothing to make your tests more effective. But it makes it easier to commit silly mistakes that are difficult to detect up front.
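The boring, reliable version is a one-line change (sketch):

  FactoryBot.define do
    factory :invoice do
      # ...
      due_date { 30.days.from_now }  # unambiguously within the 60-day limit
    end
  end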
Another illustrative example is when, on one project, I found a bug in Faker's method for generating random bank account numbers. The method takes an argument for the desired string length. It generates a random float and converts it to a string, appending the substring of everything after the decimal point to the result. It does that in a loop until it constructs a string of the desired length. However, sometimes the generated number would be less than 10^-4, whose string representation would be converted to scientific notation, like '6.28394857374234e-05'. Thus, if the desired length was long enough, it would include an 'e' in the output. This would then fail our model's validation for account number since it included non-digits, causing tests that created bank account records to fail. This happened about 1 in every 1,000 runs of that factory method.
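The mechanism is easy to reproduce (illustrative Ruby; this isn't Faker's exact implementation):

  # Ruby stringifies floats below 1e-4 in scientific notation:
  0.0000628394857374234.to_s
  # => "6.28394857374234e-05"

  # So naively taking "everything after the decimal point" picks up
  # the 'e' and '-' characters:
  0.0000628394857374234.to_s.split(".").last
  # => "28394857374234e-05"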
The infuriating part about the above is that no application logic even considered the content of the bank account number at that time, besides the only-digits requirement! There was no reason to randomly generate the account number--we could have just picked any valid string. All of the complexity involved with using Faker to generate it, which happened to include a bug for our use case, was completely unnecessary. But it wasted hours of developer time rerunning tests and tracking down the cause of the flakiness.
These simple one-liners are just the tip of the iceberg, though. In any non-trivial application, factories will create a relatively large graph of associations. The most insidious, and probably the most common, causes of flaky tests are when there are interactions between factories that randomly result in unintended data conditions.
Here's a contrived, but representative example of this sort of thing. Let's take our :user factory above and suppose that Users can have zero or more Reports against them for posting content that violates terms of use. A Report can then have a Judgment created by a moderator, which has an outcome, which can be :pending, :confirmed, or :rejected.
The corresponding random factories might look like this (omitting irrelevant details):
FactoryBot.define do
  factory :user do
    # ...
    reports do
      [0, 1, 2].sample.times.map { association(:report, user: instance) }
    end

    trait :active do
      status { :active }
    end
  end

  factory :report do
    user
    judgment { association(:judgment, report: instance) }
  end

  factory :judgment do
    report { association(:report, judgment: instance) }
    outcome { %i[pending confirmed rejected].sample }
  end
end
Suppose further that there's a rule that automatically changes a user's status to :banned if they have 2 or more reports with :confirmed judgments (probably triggered in a before_save callback in User).
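As a sketch (a hypothetical implementation; the point only requires that such a rule exists somewhere), the callback might look like:

  class User < ApplicationRecord
    has_many :reports

    before_save :apply_auto_ban

    private

    # Ban the user once 2+ of their reports have :confirmed judgments.
    # Assumes Report has_one :judgment, as in the factories above.
    def apply_auto_ban
      confirmed = reports.joins(:judgment)
                         .where(judgments: { outcome: :confirmed })
                         .count
      self.status = :banned if confirmed >= 2
    end
  end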
Careful readers will notice that 1/3 of users created with these factories will have 2 reports. Each report's judgment independently samples one of 3 outcomes, one of which is :confirmed, so the probability that both judgments come up :confirmed is (1/3)^2 = 1/9. So 1/3 * 1/9 = 1/27 of users (or ~3.7%) will have their statuses changed to :banned when the second judgment is saved. Any tests that assume that they are creating an active user, even when explicitly requested (i.e., create(:user, :active)), are now potentially flaky.
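A quick sanity check of that figure (illustrative arithmetic):

  p_two_reports    = 1.0 / 3       # [0, 1, 2].sample returns 2 a third of the time
  p_both_confirmed = (1.0 / 3)**2  # each judgment is :confirmed with probability 1/3

  p_two_reports * p_both_confirmed  # => 0.037037... (1/27, ~3.7%)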
Each factory by itself looks more or less innocuous. They could have been implemented independently, and typically they would be in separate files (I've combined them here for brevity). It would be very easy to miss this behavior by eye alone, especially in an actual codebase with a lot more code. Moreover, even if you run the tests that use these factories, about 96% of the time you would still get the expected results. It's easy to see how this would be missed, pass CI, and continue to cause flaky tests for some time, since it would be a pain to track down.
Interactions like this are commonplace in nontrivial applications. Creating reliable test data with them is complicated enough on its own. Adding randomness makes it significantly more difficult. You will create more flaky tests, and that will waste time, money, and possibly cause hair loss, for no real benefit. Please stop.