Ratchets

For when you'd really prefer not to backslide

Sep 26, 2023

Ratchet (n)
a part of a machine that allows movement in one direction only.

- Cambridge Dictionary

Ratchets (physical mechanisms aside) are systems, processes, or utilities for supporting forward progress, and discouraging regression. In software development, a variety of things (when employed with intent) can act as ratchets -- unit tests, linters, even compilers can be considered ratchets. In the context of migrations, however, they can have a more subtle use: measuring and incrementally enforcing new constraints.

Making sweeping changes to a large codebase is difficult, especially one that is rapidly changing and continuously deployed. Long-lived feature branches have a bad reputation in industry (rightfully so; the pain is real!), and few developers relish a 10k line pull request. When starting a large in-place migration, it's helpful to enumerate as much of the work as possible, as quickly as possible, and then add ratchets to the codebase to prevent that list from getting longer.

Delivering incremental changes can be achieved with some simple tactics:

Scope the work via code instrumentation
Track required changes in code
Collect observations in production

Source code best reflects the truth of your progress. Coupling your open issues with test failures (or code annotations) allows you to clearly enumerate, estimate, and execute observable work. While the code is truth, production behavior is reality, and safely collecting observations there will help you find where regressions have slipped through the cracks between your tests.

I'll be leaning heavily on some tools and techniques I picked up at Stripe, though I've personally used and observed similar tactics at scale as early as 2015. For those looking for a broader treatise on large-scale migrations, I can't recommend Kill It with Fire strongly enough.

An Example Project

In this context, let's imagine that we're introducing a database sharding scheme into an existing server-side application.

We have a database client:

class Client(object):

    def exec(self, operation) -> Result:
        # something something
        pass

We'll assume that the client currently sends all operations to a single database, and we'd like it to instead (somehow) route the operation according to a new sharding strategy, informed by the operation. We'll also assume that our migration will not require data movement, simply that new data be created in a new location, and queries routed appropriately. If you feel like this is becoming a contrived example, you've got me there!

Instrument the changing behavior

It's likely that your programming language of choice has some concept of optional parameters -- Python makes this easy with keyword arguments, so we'll use that.

class Client(object):
    def exec(self, operation, shard_key=None) -> Result:
        # something something
        pass

If you're operating a small codebase, you might try something like this:

class Client(object):
    def exec(self, operation, shard_key=None) -> Result:
        if shard_key is None:
            raise Exception("Missing shard key")
        # something something
        pass

This will generate a list of call-sites for you to fix, and you'll have a (hopefully) small-ish change to review/submit. At this point we haven't changed any behavior, but we aren't guaranteed to have perfect test coverage. If there's some novel call-site in production (maybe shard_key is None sometimes?), you might only find it when it fails. Instead, introduce a tool that lets you maintain failures in development and testing, but only measure in a production environment.

Introduce Soft Assertions

Soft assertions are something I first observed at Stripe, and are at a glance an inversion of common safety checks. A soft assertion fails dramatically in development, CI, and staging, but only notifies in production. This allows us to enforce desired behavior before it becomes required behavior, without drifting from main.

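import os

# Assumption: the deployment environment is exposed via an APP_ENV variable,
# and `observability` is whatever error-reporting client you already use.
env = os.environ.get("APP_ENV", "development")

class Assert(object):
    @staticmethod
    def soft(condition, message):
        if condition:
            return
        if env != "production":
            # Fail loudly in development, CI, and staging.
            raise Exception(message)
        else:
            # In production, only report the violation.
            observability.report(message)

    @staticmethod
    def hard(condition, message):
        if not condition:
            raise Exception(message)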

With this new tool, we can update our database client.

class Client(object):
    def exec(self, operation, shard_key=None) -> Result:
        Assert.soft(shard_key is not None, "Missing shard key")
        # something something
        pass

When new invocations at development time or in CI violate our new shard_key constraint, this assertion will fail. In production, however, we'll simply get a report that we can use to reproduce and track.
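To make the difference between environments concrete, here is a minimal, self-contained sketch. The `soft_assert` helper, the `reports` list (standing in for an observability client), and the `env` parameter are all assumptions for illustration, not the tooling described above.

```python
reports = []  # hypothetical stand-in for an observability client


def soft_assert(condition, message, env):
    """Raise outside production; only record a report in production."""
    if condition:
        return
    if env != "production":
        raise AssertionError(message)
    reports.append(message)


def exec_operation(operation, shard_key=None, env="development"):
    soft_assert(shard_key is not None, "Missing shard key", env)
    return f"ran {operation}"


# Development: the violation is loud.
try:
    exec_operation("SELECT 1", env="development")
    raised = False
except AssertionError:
    raised = True

# Production: the call still succeeds and the violation is recorded.
result = exec_operation("SELECT 1", env="production")
```

The same call-site fails in development but keeps working in production, leaving a report behind for you to track down.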

But what if your codebase (and initial list of violations) is large? A new strategy may be required, often called incremental enforcement.

Adopt Incrementally, opt-in or opt-out

Adopting all at once may not be an option for you -- the change might be too large, conflict with a myriad of in-flight changes, or require expertise you do not have. There are multiple ways to construct gradual adoption of your new condition, but they primarily boil down to opt-in (an area of logic or context is explicitly marked as enforcing the new constraint), or opt-out (a list of known violations is intentionally excluded from enforcement). Which path you choose will vary based on your codebase and preferences, and you may even combine the two strategies.

Enforcement Context (opt-in or opt-out)

Many languages have a concept of runtime context, which can be used to store information about current behavior. In Go you have context.Context threaded through most code by convention; in Java you have ThreadLocals; at Stripe we had a Ruby utility called LSpace (whose name I never understood, but is apparently a Discworld reference). All of these allow you to annotate a scope dynamically, and you can use that to either enforce or decline to enforce new constraints. This is often referred to as "defaults with escape hatches".

import threading

context = {}

class ThreadContext(object):
    """
    ThreadContext
       Store context related to the current thread using python's
       ContextManager conventions.
    """
    def __init__(self, **kwargs):
        self.ctx = kwargs

    def __enter__(self):
        context[threading.get_native_id()] = self.ctx
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        del context[threading.get_native_id()]

    @staticmethod
    def get(key, _default=None):
        # Fall back to an empty dict when no context is active on this thread.
        return context.get(
            threading.get_native_id(),
            {}
        ).get(key, _default)


class Client(object):
    def exec(self, operation, shard_key=None) -> Result:
        if not ThreadContext.get("allow_missing_shard_key"):
            Assert.soft(shard_key is not None, "Missing shard key")
        # something something
        pass


def main():
    client = Client()
    operation = ...
    try:
        client.exec(operation) # this is bad and we explode!
    except Exception as e:
        print(e)

    with ThreadContext(allow_missing_shard_key=True):
        client.exec(operation) # oh, this is fine!

In the above example, we choose not to enforce at all from within the ThreadContext (opt-out/escape hatch). We could instantiate that context surrounding a specific call-site, or at the beginning of an HTTP request. We could invert the business logic and require the presence of the context to enforce (opt-in). We can use this tool (and ones like it) to enforce our progress through the migration in code (fully merged into the main branch), and ratchet that behavior into place.
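The opt-in inversion could look something like the sketch below. It swaps the hand-rolled thread dictionary for Python's standard `contextvars` module; the `enforce_shard_key` scope and `exec_operation` function are hypothetical names used only for illustration.

```python
import contextvars
from contextlib import contextmanager

# Hypothetical flag: enforcement is off unless a scope opts in.
_enforce = contextvars.ContextVar("enforce_shard_key", default=False)


@contextmanager
def enforce_shard_key():
    """Mark the enclosed scope as enforcing the new constraint."""
    token = _enforce.set(True)
    try:
        yield
    finally:
        _enforce.reset(token)  # restore the previous value on exit


def exec_operation(operation, shard_key=None):
    if _enforce.get() and shard_key is None:
        raise AssertionError("Missing shard key")
    return "ok"


outside = exec_operation("op")  # not enforced: no scope has opted in

with enforce_shard_key():
    try:
        inside = exec_operation("op")
    except AssertionError:
        inside = "rejected"  # enforced inside the opted-in scope

after = exec_operation("op")  # enforcement ends with the scope
```

With opt-in, finishing the migration means widening the enforcing scopes until they cover everything, then flipping the default.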

Combined with additional instrumentation (metrics for how many call-sites have opted-in vs -out), you can measure your progress. With more required structure than a boolean, you can require that exclusions be tracked in separate systems (like an issue tracker), or notify migration owners when new exclusions are added.
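As a rough sketch of "more structure than a boolean", an escape hatch might demand a ticket reference and count how often each path is taken. Everything here is an assumption for illustration: the `metrics` counter stands in for a real metrics client, and the `MIG-123` ticket id and metric names are made up.

```python
import contextvars
from collections import Counter
from contextlib import contextmanager

metrics = Counter()  # hypothetical stand-in for a metrics client
_exemption = contextvars.ContextVar("shard_key_exemption", default=None)


@contextmanager
def allow_missing_shard_key(ticket):
    """Escape hatch that must point at a tracked piece of work."""
    if not ticket:
        raise ValueError("an exemption must reference a tracking ticket")
    token = _exemption.set(ticket)
    try:
        yield
    finally:
        _exemption.reset(token)


def check_shard_key(shard_key):
    ticket = _exemption.get()
    if ticket is not None:
        # Opted out: count it against the ticket so progress is measurable.
        metrics[f"shard_key.opted_out.{ticket}"] += 1
        return
    metrics["shard_key.enforced"] += 1
    if shard_key is None:
        raise AssertionError("Missing shard key")


check_shard_key("user-42")
with allow_missing_shard_key(ticket="MIG-123"):
    check_shard_key(None)
```

The ratio of enforced to opted-out calls gives you a progress dashboard, and each exemption carries the ticket that will eventually remove it.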

Allow-Lists (opt-out)

While dynamic scopes allow you to enforce behavior with great flexibility, the creation of an allow-list grants a centralized source of truth, and can serve multiple purposes.

Act as a backlog

An allow-list is an unimpeachable list of work to tackle. It is known, and if its size is greater than zero, you aren't done yet. When possible, integrate directly with your issue tracker!

Act as a ratchet

An allow-list allows you to prevent new, undesirable behavior from popping up across the codebase, while providing an escape hatch for known-but-not-executed work. Tools like CODEOWNERS (GitHub) can notify you when an unexpected change is proposed to the allow-list, creating an interaction point between disparate team members to collaborate and work towards a solution.

Act as a filter

An allow-list allows you to filter out noise in reports (test/build reports, dashboards, logs, etc) -- a short list of actionable failures is preferable to a long mixed list of known and unknown failure modes.
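One way to picture the allow-list doing all three jobs is a check that compares observed violations against the list. This is a sketch under assumptions: the call-site identifiers are invented, and how you collect `observed` (static analysis, test runs, production reports) is up to you.

```python
# Hypothetical allow-list of call-sites known to omit a shard key.
ALLOWED_VIOLATIONS = {
    "billing/invoices.py:generate",
    "reports/nightly.py:export",
}


def check_ratchet(observed, allowed):
    """Return (new, stale).

    new:   violations not on the allow-list (a regression; fail the build)
    stale: allow-list entries no longer observed (remove them to tighten
           the ratchet)
    """
    new = sorted(set(observed) - set(allowed))
    stale = sorted(set(allowed) - set(observed))
    return new, stale


observed = {"billing/invoices.py:generate", "users/signup.py:create"}
new, stale = check_ratchet(observed, ALLOWED_VIOLATIONS)
# new   -> ["users/signup.py:create"]
# stale -> ["reports/nightly.py:export"]
```

Failing CI when either list is non-empty means the allow-list can only shrink: new violations are blocked, and fixed ones must be removed so they can't quietly return.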

Listen to Production

The final step here is to listen in production for the faults reported by your soft assertion framework. Where test coverage is never perfect, observed behavior in production will (in the fullness of time) enumerate the accessible branches in your business logic. Observed violations in production can be fed back into the allow-list and tracked, providing a more realistic picture of work remaining and finding potential bugs before they impact users.
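Feeding observations back is easier when each report carries the offending call-site. A minimal sketch, assuming an in-memory `violations` counter in place of a real reporting pipeline and hypothetical function names throughout:

```python
import traceback
from collections import Counter

violations = Counter()  # hypothetical stand-in for aggregated production reports


def record_violation(message, depth=3):
    """Attribute a violation to the function that called the client.

    With the stack ordered oldest to newest, [-1] is this function,
    [-2] is the client method, and [-3] is the call-site we care about.
    """
    caller = traceback.extract_stack()[-depth]
    violations[(message, caller.name)] += 1


def exec_operation(operation, shard_key=None):
    if shard_key is None:
        record_violation("Missing shard key")
    return "ok"


def nightly_export():
    return exec_operation("SELECT 1")


def signup():
    return exec_operation("INSERT 1", shard_key="user-42")


for _ in range(3):
    nightly_export()
signup()

# Compare what production saw against the known allow-list.
allow_list = {"legacy_import"}
untracked = sorted({site for (_, site) in violations} - allow_list)
```

Anything in `untracked` is a call-site your tests never exercised; add it to the allow-list (or fix it) before moving to strict enforcement.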

When determining how long to listen to production, consider things that happen infrequently, and ensure you've given them time to be flexed:

Low-volume code paths
Asynchronous processing events
Periodic reporting systems

Strict Enforcement (a.k.a. the new normal)

It's up to you to determine what is an appropriate length of time to listen for faults -- this will be based on your own knowledge of the applications and products in play, as well as what are critical vs non-critical failure modes from the perspective of your users. When you have reached this point, you can start to make the actual changes in your migration that rely on these new constraints for correctness.

This can be a very long process. The older/larger the codebase, likely the longer the migration will take. With quality static-analysis tooling, faults will likely be detected earlier on than observed in production. A migration of this style at Stripe took a team of 8 engineers 9 months to execute, with many discoveries along the way. A team of 2-3 engineers did the same at my current gig in roughly 8 weeks, with most surprises originating in internal tooling with low test coverage.

I hope that in your next great migration adventure, you can find use of these tools (or ideas). May you achieve the win-condition every platform engineer hopes for -- no one complains!

Sean’s Substack
