Scaling Square Register

Over the six-year history of Square Register, the codebase and the company have undergone significant changes, as the app has grown from a simple credit card terminal to a full-featured point-of-sale system. The company has grown from 10 people to more than 1,000, and we’ve had to scale quickly. Here are some of the things we’ve learned and the processes we’ve implemented along the way.

The Team

As we grew, we realized that once an engineering team reaches a certain size, it’s ineffective to organize the team by platform. Instead, we have “full-stack” teams that are responsible for specific sets of features within the app. These teams consist of iOS engineers, Android engineers, and server engineers. This gives teams more freedom and creates improved focus on building a deeper, more comprehensive product. We’ve organized feature-oriented teams around restaurants, retail stores, international support, hardware, and core components (to name a few). Giving full vertical ownership to a group allows those engineers to make more holistic technical decisions, and it gives them a tangible sense of ownership over the product.

Our Release Process

Before 2014, Register releases followed the waterfall methodology; we decided on a large set of features to build, set a future deadline (three to six months), and then worked to build these features.

This process did not scale well. Waterfall became laborious and slow as we added features and engineers to the product. Since all features developed during the release had to ship together, a single delayed or buggy feature would delay the entire release. To ensure that teams continued to stay autonomous, we looked for a different, more efficient approach.

All Aboard the Release Train

To stay efficient, we always want to make sure our processes match our size. Starting in 2014, we adopted a new model consisting of “release trains.” Release trains optimize for feature team autonomy while enabling continuous shipping. This means individual features can be released when they’re ready, without having to wait for other work to be completed.

Switching to release trains required changes to our workflow:

  • Incremental Development — Features are developed incrementally, rather than in long-lived isolated feature branches.

  • Isolation & Safety — New features must be behind a server-controlled feature flag. The feature flag remains disabled outside of development until the feature is ready to ship.

  • No Regressions — If a change introduces a regression in an existing feature, the regression must be fixed immediately.

  • Target a Timeframe — Instead of attempting to ship a feature in a specific version, teams instead target a release timeframe that contains two to three features.

This means that our master branch stays in a stable state. This is where the train part comes in.

  1. Branch — At the beginning of every month, a release branch is created off of the master branch.

  2. Board the Train — If a feature is ready to ship (very few issues remaining), its feature flag is enabled. If it is not, it must wait for the next train.

  3. Test and Fix — The rest of the month is spent fixing bugs in the release branch. If a team is not shipping anything in the train, it will continue to work on the master branch.

  4. Merge — The changes in the release branch are continuously merged back down into the master branch.

  5. Ship — We ship that release branch to the App Store at the end of the month.

  6. Repeat — We repeat the process every month after that.

This has several benefits:

  • There is no more than one month of code change between each release, leading to fewer bugs.

  • Only bug fixes go into the train’s release branch. This means longer “bake time” to prove that changes are stable.

  • There’s less need to ship bug fix releases; most bug fixes can wait until the next train.

  • By building and merging features incrementally, we avoid large disruptive merges that destabilize the master branch.

  • Regressions or high-priority bugs on the master branch are not acceptable. Fixing these are the team’s highest priority.

  • There’s less pressure to ship features on a specific date. Instead of having to wait months for the next release, the feature can be included in next month’s train. This means that teams don’t need to rush through their work. They simply ship when they’re comfortable with the quality of their features. This improves team productivity, morale, and code quality.

At the beginning of 2015, we refined this process even more: Release branches are now cut and shipped on two-week intervals. That means teams will have 26 opportunities to ship this year. Compared with just three or four releases per year in 2013 and earlier, this is a huge win. More opportunities to ship means more features delivered to customers.

Our Engineering Process

Square merchants rely on Register to run their businesses. As such, it needs to be reliable at all times. We have rigorous processes for ensuring quality at the design, implementation, and testing phases.

Large Engineering Changes Require Design and Review

“Writing is nature’s way of letting you know how sloppy your thinking is” –Guindon

This is one of my favorite quotes, and it applies to building software too! If the software you’re building exists only in your head, that software will be flawed. The image in your head is very ambiguous and ephemeral; it’s constantly changing, and thus needs to be written down to be clarified and perfected.

Every large change at Square must go through an engineering design review. This sounds daunting if you’ve never done it before, but it’s actually quite easy! The process generally involves writing up the following in a design document:

  • Goals — What are you trying to accomplish? What are the customer-facing effects of your change?

  • Non-Goals — What aren’t you trying to accomplish? What are your boundaries?

  • Metrics — How will you measure success or failure?

  • Investigations — What other solutions (if any) did you investigate? Why didn’t you choose them?

  • Choice — What have you decided to do? Why have you decided to do it?

  • Design — What are the details of your chosen design? Include an API overview, technical details, and (potentially) some example headers, along with anything else you think will be useful. This is where you sell the design to yourself and your fellow engineers.

We then include two to four reviewers who should review the document, ask questions, and make a final decision. These reviewers should be familiar with the system you’re extending.

This may seem like a lot of work, but it’s well worth it. The end result will be a design that’s more robust and easier to understand. We consistently see fewer bugs and less complexity when a change goes through design review. Plus, as a side effect, we end up with peer-reviewed documentation of the system. Neat!

Our Code Review Process

Our code review process is rigorous for a few reasons:

  • App Store Timing — If we do ship a bug, the App Store review process delays delivering fixes to customers by about a week.

  • Register Is Critical — Finding bugs is important because Register is a critical piece of restaurants, retail shops, and so on.

  • Large App — Catching bugs post-merge in a large application like Register is difficult.

What is our process for pull requests? Every PR must:

  • Be Tracked — Pair a PR with a JIRA issue.

  • Be Described — There must be a clear description of the what and why behind the change.

  • Be Consumable — Pull request diffs must be 500 lines or less. No large diffs are allowed. Reviewers will overlook bugs if a change is much larger than 500 lines.

  • Be Focused — Do not pair a refactor or rename with a new feature. Do them separately.

  • Be Self-Reviewed — All PR authors are expected to do a self-review of their change before adding outside reviewers. This is meant to catch stray NSLogs, missing tests, incomplete implementations, and so on.

  • Have Two Specific Approvers — One of these reviewers must be an owner of the component being changed. We require explicitly listed reviewers to ensure engineers know exactly what is and isn’t in their review queue.

  • Be Tested — Include tests that demonstrate the stability and correctness of the change. Non-tested pull requests are rejected.

Similarly, reviewers are expected to:

  • Be Clear — Comments must be clear and concise. For new engineers, reviewers should include examples to follow.

  • Explain — Don’t just say “change X to Y”; also explain why the change should occur.

  • Not Nitpick Code Style — This is what automated style formatters are for (more on this later).

  • Document — Each code review comment must be marked as one of the following: — Required (“This must be fixed before merge.”) — Nice to have (“This should be fixed eventually.”) — A personal preference (“I would do this, but you don’t have to.”) — A question (“What does this do?”)

  • Be Helpful — Reviewers must enter a code review in a helpful mindset. It is the job of a reviewer to help code be merged safety, not to block it.

Before merging, all tests must pass. We block pull requests from being merged until a successful run of our unit tests and our automated integration tests (which use KIF).

Some Process Tips

We’ve begun doing the following things to help streamline and accelerate the Register development process.

Document Common Processes as Much as Possible

One thing we learned as the Register team grew was how poorly “word-of-mouth” knowledge scales. This isn’t a problem if you’re only onboarding a few engineers a year, but it quickly becomes time-consuming if you’re onboarding a few engineers a month, especially if they’re only on the project temporarily (e.g. a server engineer helping to build a particular feature). It becomes important to have an up-to-date document containing the standards and practices of the team. What should this document include?

  • Code review guidelines (for submitters and reviewers) — “How many reviewers do I need? When can I merge this?”

  • Design review guidelines — “How should I design this feature?”

  • Testing guidelines — “How do I test this? What testing frameworks do we use?”

  • Module/component owners — “Who can I talk to about X? Who built it?”

  • Issue tracking guidelines — “Where do I look up and track what I have to do?”

You’ll likely notice a pattern here: anything that can be answered in 10 minutes or less should be clearly documented.

Automate as Many Inefficiencies as Possible

Manual processes that take a couple of minutes with a few engineers can take much longer with many engineers. Any time you see something trivial that eats up a lot of time, automate it if possible.

We Automated Our Style Guide

One of our biggest “automate it” wins recently has been our Objective-C style guide: We now use clang-format to automatically format all code committed into Register and its submodules. This eliminates code review comments along the lines of “missing newline” or “too much whitespace,” meaning reviewers can focus on things that actually improve the quality of the product.

We merge many pull requests each day. These “style nit” comments used to take anywhere from 10–20 minutes per pull request (between the reviewer and the author). That means we’re saving two or more hours a day from style guide automation alone. That’s 10 hours a week. It adds up quickly!

We Automated Visibility into Code Reviews

Another example of automation saving time is our new “Pull Request Status” email that gets sent out daily.

Before this email existed, 10 to 15 of us would crowd around a stand-up table for 10 minutes each morning and assign reviewers to open pull requests. Instead, we now send out a morning email containing a list of all open PRs, along with who is assigned to review them. No more meeting needed. This means we’re getting back more than 2 hours of engineering time per day, or 10 hours per week.

Another benefit of this daily PR status email is that we can easily track what’s happening with reviews: How long they take, which engineers are contributing the most, and which are reviewing the most. This helps to reveal time allocation issues which may be slowing the team down (e.g. Is one engineer doing half of the team’s reviews?).

Centralize on a Single Bug Tracker

It’s impossible to ship a bug-free product if your bugs are split across multiple trackers. It’s incredibly important to have one place where we can go to see everything pertaining to the current release: the number of bugs, the number of open bugs per engineer (is anyone overloaded?), and the overall trend of bugs (are we fixing them faster than they’re being opened?).

Maintaining Quality in a Shared Codebase

When only a few engineers are working on a project, it’s easy to maintain quality: all engineers know the codebase well, and they all feel strong ownership over it. As a team scales to 5, 10, 20, or more engineers, maintaining this quality bar becomes more difficult. It’s important to ensure every component and feature has an explicit owner who is responsible for maintaining its quality.

Every Component Needs an Owner

In Register, we recently decided to have explicit owners for each logical component of the app. These owners are documented in a list for anyone to easily look up. What is a component? It might be a framework, it might be a customer-facing feature, or it might be some combination of the two. The exact separation isn’t important; what’s important is to ensure that every line of code in your app is owned by someone. What do these owners do?

  • They review and approve code changes and design documents.

  • They know the “hard parts” and how to work around them.

  • They can provide an overview for engineers new to the component.

  • They ensure quality is always increasing.

We’ve seen great results from electing explicit owners for components: code quality is consistently higher (and the bug rate is lower) in components which have owners versus those that are implicitly owned by everyone.

Keep the Master Branch Shippable

This is another recent change for us: We’ve started enforcing a strict “no regressions” rule on the master branch. The benefit of this? Our master branch is now always very stable. If anyone finds a bug, there’s no question if it should be reported or not. It also reduces QA load, as less time is spent figuring out if issues should be filed, if they’re duplicates, etc. If a bug is found, an issue is filed.

This policy goes hand in hand with the release train model: At nearly any time, we can cut a release branch from the master branch and be just a few days from shipping to the App Store. This is incredibly valuable for an app as large as Register, and it helps us move as fast as possible.

Keeping the master branch in a shippable state also helps avoid the “broken windows” problem as we scale; fixing bugs as they’re discovered ensures engineers hold themselves to a higher standard.

Build for Testability from the Beginning

It’s incredibly important to ensure every component within Register is built and designed with testability in mind. Without this, we would need to expand manual QA efforts exponentially: two features can interact in four ways, three features can interact in eight ways, etc. Obviously, this isn’t reasonable, reliable, or scalable.

As we’re working through the engineering design for a feature, we’re constantly asking ourselves: “Is this testable? Am I making automated testing easy?”

Building for testability also has an additional benefit: It introduces a second consumer of all APIs (i.e. the tests themselves). This means engineers are forced to spend more time thinking through the design of an API, making sure it’s useful in more than one case. The result is that it will be easier for other engineers to reuse the API, saving time in the future.

For us, testing isn’t an option; it’s a requirement. If you’re committing code to Register, you need to include tests.

The Importance of CI on Pull Requests

A mental exercise: If an engineering organization has 365 engineers, each engineer only has to break the master branch once a year for it to be broken every single day. This obviously wouldn’t be acceptable, and would slow down and frustrate the engineering team greatly.

What’s an easy way to prevent the master branch from breaking? By not merging broken code in the first place, of course! This is where pull request CI comes in. Every Register pull request has a CI job that is kicked off for new commits. Around 15 minutes later, the engineer submitting the PR can feel confident that he or she is not introducing any regressions.

This has been incredibly valuable as we onboard new engineers into the codebase. They can commit code without worrying that they’re going to introduce master-breaking changes.

Some Observations as the Team Has Grown

These are some personal observations I’ve made as the Register iOS team has grown and expanded around me over the last three years.

Not Everything Will Be Perfect

In a large app, you’ll have a lot of code. Some of this code will be old. But old doesn’t have to mean bad. As long as you have good test coverage, old code will continue to work just fine. Don’t spend time “cleaning up” code that is fulfilling its needs and isn’t slowing anyone down. The best you’d be able to do during this cleanup is not break anything. Spend this time building new features instead.

Make Time to Learn Outside of Your Codebase

In a big codebase, it’s very easy to spend all your time working within it, and never learning from outside influences.

How do you fix this? Take time during the week (I set aside an hour every day) to learn from resources outside of your codebase. What can you learn from? Watch talks that sound interesting. Read papers on subjects you find interesting. You’ll be surprised by the parallels and benefits you’ll begin drawing back into your day-to-day work. Sometimes these little things make the biggest difference.

Addressing Tech Debt Takes Time

There’s rarely an immediate solution to anything, and this includes technical debt. Don’t let yourself get frustrated if addressing tech debt takes a long time, especially in a large codebase.

Think about accumulating tech debt like gaining weight: you don’t gain 100 pounds overnight; it happens gradually. Like losing weight, it also takes a great deal of time and effort to eliminate tech debt — there is never an instantaneous solution. Track your progress while paying it off, and make sure it’s progressing downward at a reasonable pace.

That's All, Folks

If you have any questions, feel free to reach out to me at k@squareup.com. Thanks for reading!

(Thanks to Connor Cimowsky, Charles Nicholson, Shuvo Chatterjee, Ben Adida, Laurie Voss, and Michael White for reviewing.)