Auto-shrink-wrapping in PayPal Checkout, and how it’s saved us more than once
The problem we face is this: PayPal Checkout is a fairly sizeable Node app, with an equally sizeable front-end. All of this is managed through npm and bower: the whole thing is split up into a growing number of npm modules containing REST APIs, along with a suite of bower modules containing front-end Angular components and in-house libraries.
On top of all of this, we have a metric ton of node dependencies we don’t own as a team, including internal PayPal modules built by our infrastructure team, and 3rd party dependencies from the jungle that is npmjs.com. Not to mention our third-party bower dependencies.
So how do you even start to manage branching, packaging and releasing all of this?
The first pass: three branch model
When we first started out with node and npm, we went with a fairly bog-standard master/develop/release branching model, where:
- develop is the bleeding edge, all feature branches are merged in here
- release is anything from develop that is ready to be tested, packaged, and released
- master is currently-in-production, stable code
This works kinda well for small/medium projects where your dependencies are mostly public third-party modules, locked to patch version updates. Your master branch ends up being pretty stable, and if push comes to shove you can patch master and release it without a huge number of problems. Maybe some npm dependencies updated, but they probably didn’t introduce any breaking changes.
The problem is, once you start adding into the mix even one or two node modules that are owned by your team, which are changing regularly as more features are added, this whole model breaks down.
You could start applying the same master/develop/release branching to each one of your team-owned modules, but very quickly things start to get out of sync, and then all of a sudden you’re having to do an urgent security fix in production, and you find yourself manually trying to figure out stable versions and hand-editing your package.json, and it all goes to hell.
My advice: skip this step. It doesn’t scale, and you end up frequently pushing code you had no idea you were pushing, because your npm dependencies are constantly changing.
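To make that drift concrete, here's a minimal sketch of what a patch-locked range does over time. It assumes a simplified reading of the tilde range (`~1.2.3` meaning `>=1.2.3 <1.3.0`) rather than the full node-semver rules, and the registry contents are made up for illustration.

```python
def parse(version):
    """Turn "1.2.3" into a comparable tuple (1, 2, 3)."""
    return tuple(int(part) for part in version.split("."))


def resolve_tilde(range_spec, published):
    """Return the highest published version matching a range like "~1.2.3".

    Simplified assumption: "~1.2.3" means >=1.2.3 and <1.3.0. Pre-release
    tags and the other range operators are deliberately not handled.
    """
    major, minor, patch = parse(range_spec.lstrip("~"))
    matching = [
        v for v in published
        if parse(v)[:2] == (major, minor) and parse(v)[2] >= patch
    ]
    return max(matching, key=parse) if matching else None


# Hypothetical registry contents on two different days
monday = ["1.2.3", "1.2.4"]
friday = ["1.2.3", "1.2.4", "1.2.5", "1.3.0"]

# The same package.json entry installs different code on each day
print(resolve_tilde("~1.2.3", monday))  # 1.2.4
print(resolve_tilde("~1.2.3", friday))  # 1.2.5
```

Nothing in your own repo changed between Monday and Friday, but the build did. Multiply that by every dependency in the tree and you get the "code you had no idea you were pushing" problem.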
The second pass: throw in npm-shrinkwrap
From the first rudimentary approach, we moved to a workflow where if we wanted to create a release package, we’d first manually shrinkwrap the develop branch, and then push it to release.
This is basic, but it solves a world of problems from the outset. In fact, if you’re releasing any kind of node app using npm dependencies, and you’re not already using shrinkwrap, you’re doing yourself a disservice. We gained the ability to safely rebuild our app with any hotfixes, knowing that all of our dependencies would be locked to their stable, released versions, and there was less chance of anything terrible happening in prod.
“Yes, but there’s no bower shrinkwrap” I hear you say, and it’s true, it’s been a feature request for 3 years and it’s still not implemented. Which definitely sucks.
(NB: you might also ask why even use bower — isn’t it dead already? But personally, I’m still a huge fan of the entirely flat dependency tree, reliance on github tags, the extremely lightweight registry, and the separation between front and back end dependencies. So yeah, sue me, I still really like bower)
So — back to the question — if you’re using bower and bower doesn’t have shrinkwrap, you still run the risk of pushing unstable or untested code whenever you release, right? Well, true enough; but actually, implementing bower shrinkwrap turns out to be the easiest thing ever. Here’s a version in 10 lines of python:
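import json
import os

app = json.load(open("bower.json"))
dependencies = {}
# bower records each installed module's resolved version in its .bower.json
for dep in os.listdir("bower_modules"):
    module = json.load(open("bower_modules/%s/.bower.json" % dep))
    # fall back to the requested target when there is no version field
    dependencies[module["name"]] = module.get("version") or module["_target"]
# pin both dependencies and resolutions to exactly what is installed
app["dependencies"] = dependencies
app["resolutions"] = dependencies
json.dump(app, open("bower.json", "w"), indent=4, sort_keys=True)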
(Yeah, javascript engineers are allowed to write python shell scripts, ok? The point is, implement it in whatever language you want, it’s dead easy)
The downside to all of this is, of course, doing shrinkwraps manually on one branch means you limit yourself to just one ‘stable’ set of code on which to apply hotfixes in future. What if you need to jump multiple releases back, for any reason? Believe me, this can happen.
The third pass: one branch, smart build job (holy grail)
This is where we are now, and this is what I’d absolutely recommend.
- Ditch all of the branches, except for develop (obviously feature branches are still fine)
- Always build from develop
- Have your build job automatically shrinkwrap your npm and bower modules, when it’s done installing
- Then have it create a new tag in your repo, with the shrinkwrap files attached
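The steps above can be sketched as a small build script. This is only an outline under assumptions, not our actual build job: the tag name is whatever your CI passes in, `bower_shrinkwrap.py` is a hypothetical filename for a bower-pinning script like the one earlier in this post, and committing the pinned files before tagging is just one way to get them "attached" to the tag.

```python
import subprocess


def build_commands(tag):
    """Return the commands for one build, in the order they should run."""
    return [
        ["npm", "install"],
        ["bower", "install"],
        # Pin everything only once installation has finished
        ["npm", "shrinkwrap"],  # writes npm-shrinkwrap.json
        ["python", "bower_shrinkwrap.py"],  # hypothetical: pins bower.json
        ["git", "add", "npm-shrinkwrap.json", "bower.json"],
        # --allow-empty keeps the job going if the pins did not change
        ["git", "commit", "--allow-empty", "-m", "Shrinkwrap for %s" % tag],
        ["git", "tag", tag],
        # Push only the tag; develop itself is left untouched
        ["git", "push", "origin", tag],
    ]


def run_build(tag, dry_run=True):
    """Print each command, and execute it unless dry_run is set."""
    for command in build_commands(tag):
        print(" ".join(command))
        if not dry_run:
            subprocess.check_call(command)


if __name__ == "__main__":
    # Dry run by default, so nothing is installed, committed or pushed
    run_build("build-2016-02-09")
```

The shrinkwrap commit never lands on develop. It only exists behind the tag, so day-to-day development still floats on the latest dependencies while every build leaves a frozen snapshot behind.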
What’s the rationale behind this? There are a load of benefits:
- Unlike branches, there’s absolutely no risk of any tags ever getting ‘out of sync’ with anything, be it with what’s stable on production, what’s ready to be released, or anything else. The tag ends up being a snapshot in time which you can always go back to.
- Every release can be traced back to a single, immutable tag. Not only is the tag itself immutable in the git sense (in that you’d have to force-push to overwrite it) - it’s also immutable in a dependency sense, in that every time you create a build from that tag, all of the code and all of the npm and bower dependencies will be exactly the same as they were when you created the tag.
- Nobody ever has to merge any branches manually before a release, they just build from develop and everything else is taken care of for them by the build job. There’s less mental overhead and less room to mess up.
- Nobody can be lulled into a false sense of security — if you’re building from develop branch, you’re getting the bleeding edge. If you’re building from a tag, you’re getting stable code (at least, code that’s as stable as what you have in production). There’s no risk of getting into the mindset of “I’ll build from release branch, I know code in the branch is stable” (but in reality maybe you’ll pick up a few new npm dependencies or versions hitchhiking on your build…)
- You can jump back any number of releases and be able to build from any tag, at any point in time. Each tag is a unique snapshot in time, code and dependency versions.
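Since every tag carries its own shrinkwrap file, comparing two releases comes down to diffing two JSON documents. Here's a rough sketch, assuming the nested `dependencies` / `version` layout of npm-shrinkwrap.json; the module names and versions in the sample data are invented.

```python
def flatten(shrinkwrap, prefix=""):
    """Map each dependency path (e.g. "parent/child") to its pinned version."""
    versions = {}
    for name, info in shrinkwrap.get("dependencies", {}).items():
        path = prefix + name
        versions[path] = info.get("version")
        # Recurse into nested dependencies, extending the path
        versions.update(flatten(info, path + "/"))
    return versions


def diff_snapshots(old, new):
    """Return {path: (old_version, new_version)} for every pin that differs."""
    before, after = flatten(old), flatten(new)
    return {
        path: (before.get(path), after.get(path))
        for path in sorted(set(before) | set(after))
        if before.get(path) != after.get(path)
    }


# Invented shrinkwrap contents for two tags
old_tag = {"dependencies": {
    "some-api-module": {"version": "2.1.0", "dependencies": {
        "some-helper": {"version": "0.4.1"}}},
}}
new_tag = {"dependencies": {
    "some-api-module": {"version": "2.1.1", "dependencies": {
        "some-helper": {"version": "0.4.1"}}},
    "some-new-module": {"version": "1.0.0"},
}}

for path, (was, now) in diff_snapshots(old_tag, new_tag).items():
    print("%s: %s -> %s" % (path, was, now))
```

A `None` on either side means the dependency was added or removed between the two tags.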
What if I need to do a hotfix?
In that case, your life got a whole lot simpler: just check out your live, stable tag, cherry pick a hotfix commit, create a new tag, and generate your build from that tag:
git pull --tags
git checkout live-stable-tag-2016-02-09
git cherry-pick some-fix-on-develop-branch
git tag live-stable-tag-2016-02-09-hotfix
git push origin live-stable-tag-2016-02-09-hotfix
If you need to make a hotfix change to a dependency, you’ll have to publish a new version of that dependency. After that, the only thing you need to do in your parent app is make sure the npm-shrinkwrap.json (or bower.json) is pointing to the correct hotfix version of your dependency.
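Re-pointing a pin can be scripted too. The sketch below only handles a top-level dependency, and it assumes the `version` / `from` / `resolved` fields of npm-shrinkwrap.json. Dropping the stale `from` and `resolved` fields is my own assumption about what's safe; re-running `npm shrinkwrap` after installing the hotfix version is the more cautious route. The module name and URL are made up.

```python
import copy


def pin_dependency(shrinkwrap, name, version):
    """Return a copy of the shrinkwrap with top-level `name` set to `version`."""
    if name not in shrinkwrap.get("dependencies", {}):
        raise KeyError("%s is not pinned in this shrinkwrap" % name)
    updated = copy.deepcopy(shrinkwrap)  # leave the caller's data untouched
    entry = updated["dependencies"][name]
    entry["version"] = version
    # These fields describe the old release; remove them rather than keep
    # a tarball URL that no longer matches the version (an assumption)
    entry.pop("from", None)
    entry.pop("resolved", None)
    return updated


# Invented example: move some-checkout-lib from 1.4.2 to the 1.4.3 hotfix
live = {"dependencies": {"some-checkout-lib": {
    "version": "1.4.2",
    "from": "some-checkout-lib@~1.4.0",
    "resolved": "https://registry.example/some-checkout-lib-1.4.2.tgz",
}}}
hotfixed = pin_dependency(live, "some-checkout-lib", "1.4.3")
print(hotfixed["dependencies"]["some-checkout-lib"]["version"])  # 1.4.3
```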
Isn’t it a lot of process?
Nope. This approach optimizes for the happy case — most of the time you’ll just build from develop branch, and release, and you won’t even care about the tags that are being auto-created by your build job. They’ll be created for you automatically, and they’ll be ready when you need them. Which you definitely, definitely will.
Then, when all hell breaks loose at some distant point in the future and you have to immediately push a fix because OH CRAP, THE LOG-IN BUTTON IS THE WRONG COLOR!, you’re already prepared. Thanks to shrinkwrap tags, you can make your fix and push it to production without any risk of introducing any other horrendous new live bugs in the process.
Great, so I can make hotfixes. What else?
There are a few examples of where the shrinkwrap tagging approach has been invaluable for us, aside from generally just making hotfixes.
Fast Iteration
Personally, I’m working on a small infrastructure team which oversees the entire Checkout codebase. My team is continually releasing new experiments and code improvements (be it for site speed, conversion improvements, high priority issues or new tech), and we want to be able to move as fast as possible, releasing to prod and iterating with as little overhead as possible.
The caveat is — we have a lot of feature teams in checkout, all contributing code back daily and hourly into the develop branch (and all of those other child dependencies I mentioned earlier).
If we were to include all of this new feature code in our releases, as an infra team, we would spend our whole lives debugging new production issues and verifying new features in prod. But with the shrinkwrap-tag workflow, we can quickly roll out new code, verify it in production, and continue on. It allows us to work much faster in parallel with the feature teams, on a large codebase, in ways we wouldn’t be able to otherwise.
Solving impossible issues
At the end of last year, we ended up with a memory leak somewhere in our node app. It was a slow leak — our servers would die over the course of 3–4 days, and because we were releasing so frequently and rebooting our servers often enough, it went under our radar for an embarrassingly long time. By the time we realized the leak was there, we didn’t have enough historical logs to identify the exact release build which caused the issue. And no matter what we tried, we couldn’t reproduce it in our development environment. We looked at v8 snapshots, and saw very little that stood out.
Cue historical shrinkwrap tags.
Our first approach was to come up with about 15 potential reasons for the leak, and try to fix them all in parallel. Some of them were potential bugs, others were new integrations, others were updated code paths. Because we had a shrinkwrap tag for the current live build, we could generate 15 new builds on top of that tag, each with incremental fixes, then push them all out to production in parallel, each on a different server.
It was a nice idea. But it didn’t work. Each server showed the same slow leak.
Eventually, we ended up taking all of our historical shrink-wrap tags for the past few months, and re-deployed them incrementally, effectively doing something between a brute-force and a binary-search in production to find the tag where the issue first surfaced.
Once we’d found the offending tag, we did the same process using each commit and dependency version from one tag to the other, creating new shrink-wrapped tags each time, and isolating each set of changes to a different production server, narrowing the issue down to a single commit.
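The binary-search half of that hunt looks roughly like the sketch below. It assumes the tags are ordered oldest to newest, and that once the problem appears it stays in every later tag. The `leaks` check is a stand-in: in our case each check meant deploying a tag to a production server and watching its memory, and the tag names here are invented.

```python
def first_bad_tag(tags, is_bad):
    """Binary search for the earliest tag where is_bad(tag) is True.

    Assumes tags are ordered oldest to newest, and that every tag after
    the first bad one is also bad. Returns None if no tag is bad.
    """
    low, high = 0, len(tags)
    while low < high:
        middle = (low + high) // 2
        if is_bad(tags[middle]):
            high = middle      # culprit is this tag or an earlier one
        else:
            low = middle + 1   # culprit is strictly later
    return tags[low] if low < len(tags) else None


# Invented history of 16 tags, where the leak arrived in build-11
tags = ["build-%02d" % n for n in range(1, 17)]
checked = []


def leaks(tag):
    """Stand-in for deploying a tag and watching it for a leak."""
    checked.append(tag)
    return tag >= "build-11"


culprit = first_bad_tag(tags, leaks)
print(culprit)       # build-11
print(len(checked))  # 4
```

Four checks instead of sixteen matters a lot when a single check is a deployment that has to run for days before the leak shows. The same function works for the second stage too: swap the list of tags for the ordered list of commits and dependency bumps between the last good tag and the first bad one.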
Eventually we found the commit which caused the problem, and without going into detail, boy did it not make sense. Even though we knew the code change that caused the issue, we struggled a great deal to understand how it was actually causing a leak.
The moral of the story: sometimes the only solution is brute force, and without having set up shrinkwrap tags well in advance of the problem arising, we wouldn’t have even had that as an option.
—
So — if you take anything from this, it’s that auto-shrinkwrap-tagging is a great idea, and may just save you in more ways than one.