How we upgraded a big Ruby on Rails monolith with Zero Downtime

When Upgrade Becomes Inevitable.

December 14, 2015

In September 2015, we started taking the Rails upgrade of our monolith seriously. Up to that point we had been running Rails 3.2, and with Rails 5 on its way, we risked ending up on an unmaintained version of the web framework. This was not an option, so the upgrade to a current Rails 4 version was inevitable.

Before going into details, it’s always good to know what we are dealing with. The monolith has around 115,000 lines of code and 30,000 spec examples, uses around 250 different gems (including dependencies), and serves 10,000 requests per minute (RPM). All in all, over 40 people have worked on the application since 2007.

First things first

Before our initiative even started, at the end of 2014, a team of engineers had already adapted the code to use strong parameters. Additionally, CanCan was replaced with Pundit. At that point, there were multiple known issues with CanCan in combination with Rails 4, and its future was uncertain.

Furthermore, this team discovered that we were affected by a Single Table Inheritance bug in Rails 4. The fix for this bug was removed again from Rails 4 because it does not play nicely with the deprecated_finders gem. Luckily, we do not use those finder methods and were able to apply the fix in our codebase.

How we organized as a team

One of our cross-functional teams took care of the Rails upgrade. The first stories for the upgrade were created in the team’s backlog. Its three backend engineers started working on the upgrade while, at the same time, the work for the mission of the entire team continued. After a few weeks we saw that this was not a good setup, for multiple reasons:

we were frequently disturbed by the team’s work and its communication
we disturbed the rest of the team with updates on the Rails upgrade
the information provided was not valuable for those who were not involved
we also had the strong feeling that we were not working together as a team – we sat together in one room, but did not work on the same topics

In the end, we decided to split the team into two parts. The first part of the team continued working on the team’s mission. The second part of the team worked on the Rails upgrade. The latter we named the “Rails Task Force”.

We knew from earlier upgrade initiatives that estimating the effort of a Rails upgrade is difficult, if not impossible. Hence, we did not estimate any tasks during the upgrade. Usually, at Babbel, engineering teams use Scrum or Kanban and estimate stories accordingly. When the Rails Task Force was founded, we switched from Scrum to Kanban, which made more sense for this unpredictable kind of work.

How we organized the work

During the upgrade, several other teams had to continue working on the codebase of the monolith. Our goal for the upgrade process was to have an application that works with Rails 3 and 4 in parallel. This would allow us to work with only one codebase. Additionally, it would avoid a codebase that is drifting apart for both versions of Rails.

We achieved this with backports and monkey patches. However, simple conditional statements like the following also helped in this regard:

module Kernel
  def rails4?
    Rails::VERSION::MAJOR == 4
  end
end

if rails4?
  # Do something in Rails 4
else
  # Do it differently in Rails 3
end

Furthermore, this approach enabled us, after flicking the switch, to easily remove all the boilerplate code for supporting both major Rails versions.
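The backports mentioned above can follow a similarly guarded pattern. The following is a minimal sketch, not the code we actually shipped: `FrameworkThing` is a stand-in class and `backport` is a hypothetical helper that only adds a method when the class does not already provide it, so the patch turns into a no-op once the framework ships the method itself.

```ruby
# Stand-in for a framework class; this is not a real Rails class.
class FrameworkThing
  def old_api
    :old
  end
end

# Hypothetical helper: define the method only if the class lacks it.
# Returns true when the backport was applied, false when it was skipped.
def backport(klass, method_name, &implementation)
  return false if klass.method_defined?(method_name)

  klass.send(:define_method, method_name, &implementation)
  true
end

# Applied: FrameworkThing has no new_api yet, so it is added.
backport(FrameworkThing, :new_api) { old_api }

# Skipped: old_api already exists and is left untouched.
backport(FrameworkThing, :old_api) { :patched }
```

Because the guard checks for the method rather than for a version number, such a patch can stay in place during the switch and be deleted afterwards without changing behavior.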

To support our goal of a Rails 3 and 4 compatible codebase, all changes made in Rails 4 land had to be merged back into our Rails 3 development branch. Therefore, we used the Rails 3 branches develop and master for the staging and production environments as usual, and a rails4 branch in which Rails 4 was enabled.

We concentrated on making small incremental changes which helped us in identifying any newly introduced issues. The process we applied during the upgrade is as follows:

implement changes locally on a feature branch that is based on the develop branch and test code on Rails 3 and 4
merge the feature branch to develop
deploy the develop branch to the staging environment and test changes on Rails 3 as usual
merge develop to master
deploy the master branch to the production environment
merge develop to rails4
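The first step of this process, testing a change on both Rails versions, could be scripted roughly as follows. This is a sketch under assumptions: the `RAILS4` environment variable, the `bundle exec rspec` command, and the `run_specs_on_both` helper are all hypothetical, and how the bundle for each version gets installed is left out.

```ruby
# Hypothetical switch: a RAILS4 environment variable selects the Rails version.
RAILS_VARIANTS = {
  "Rails 3" => { "RAILS4" => "0" },
  "Rails 4" => { "RAILS4" => "1" }
}.freeze

# Assumed spec command; adjust to whatever the project actually uses.
SPEC_COMMAND = %w[bundle exec rspec].freeze

# Runs the suite once per variant and returns { label => passed? }.
# The runner is injectable so the logic can be exercised without Bundler.
def run_specs_on_both(runner = ->(env, cmd) { system(env, *cmd) })
  RAILS_VARIANTS.each_with_object({}) do |(label, env), results|
    results[label] = runner.call(env, SPEC_COMMAND) == true
  end
end

# Usage (runs the real suite twice):
#   results = run_specs_on_both
#   abort "Specs failed: #{results.inspect}" unless results.values.all?
```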

Upgrading the Rails monolith

Before even starting to touch any code, we had a look into the Rails upgrade guide. This is generally a good idea to get an overview of the major changes in the framework (i.e. API and configuration changes) and possible incompatibilities.

Get the Rails console to boot

Now we tried to achieve the first goal: booting the Rails console. First, we adapted the Gemfile to support both Rails versions. Additionally, some gems like the strong parameters gem were no longer necessary, while other gems needed to be updated. In attempting to boot the console, the error messages we received along the way helped us step-by-step to achieve this first goal.
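A Gemfile that serves both versions can be built around a single switch. The sketch below is illustrative only: the `RAILS4` environment variable, the helper names, and the exact version constraints are assumptions, not our real configuration. The helpers are plain Ruby; the comment at the end shows how a Gemfile might call them.

```ruby
# Hypothetical switch: RAILS4=1 selects the Rails 4 bundle.
def rails4_bundle?(env = ENV)
  env["RAILS4"] == "1"
end

# Version constraints are assumptions for illustration.
def rails_requirement(env = ENV)
  rails4_bundle?(env) ? "~> 4.2.0" : "~> 3.2.0"
end

# Gems only the Rails 3 bundle needs, such as the strong parameters gem,
# whose functionality is built into Rails 4.
def rails3_only_gems(env = ENV)
  rails4_bundle?(env) ? [] : ["strong_parameters"]
end

# Inside the Gemfile these helpers would be used roughly like this:
#   gem "rails", rails_requirement
#   rails3_only_gems.each { |name| gem name }
```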

Achieving a green build

The second step involved getting all specs to pass. As mentioned above, with around 30,000 spec examples, the monolith is tested pretty well. However, these tests are mainly unit tests with some request specs sprinkled in. Sadly, integration tests – testing the core business functionality – are missing.

The issues we ran into were diverse:

some tests broke because of subtle changes behind the scenes in the Rails API
other tests required a huge setup on our side and were hard to understand – it just took a long time to actually discern what these tests were supposed to test
we also ran into incompatibilities with gems in Rails 4
finally, we had tests that only worked by mistake in Rails 3; Rails 4 is stricter, so those tests fell apart

Full regression test

After the green build was achieved we pushed the rails4 branch to our staging environment. This was also the time when we decided that we needed a code-freeze for the monolith. With multiple teams working on such a big codebase, there was a good chance that new incompatibilities would be introduced sooner rather than later.

The QA department went hands-on and put the monolith through its paces. In the course of this process, more bugs were found. These originated from the monolith communicating with other Babbel applications and external services. Furthermore, certain parts of the monolith were simply not covered by tests, and deprecated technologies like Flash gave us a hard time debugging.

Going live

We pushed Rails 4 to the production environment after getting the green light from QA. While doing so, we monitored the behavior of the application on Rollbar and New Relic. We ran into some smaller issues that we were able to fix within two days. Most of them were not customer-facing. All in all, the deployment went smoothly and most people did not notice the switch.

One key takeaway is that it’s hard to foresee everything. Our staging system is close to the production system. However, the traffic and the amount of data, of course, differ. This is especially true for user-generated data. Some of our MySQL tables had size restrictions on numeric attributes. Rails 3 seemed to be more lax and ignored those restrictions. With Rails 4, this led to exceptions that unfortunately were only discovered in production.
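A check like the following could help audit values against column limits before such a switch. It is a standalone sketch, not how Rails itself validates values: the helper names are hypothetical, and the only facts it relies on are the byte sizes of MySQL's signed integer types.

```ruby
# Storage sizes in bytes of MySQL's signed integer column types.
MYSQL_INTEGER_BYTES = {
  tinyint: 1, smallint: 2, mediumint: 3, int: 4, bigint: 8
}.freeze

# Range of values a signed two's complement integer of the given size can hold.
def signed_range(bytes)
  max = 2**(bytes * 8 - 1) - 1
  (-max - 1)..max
end

# True when the value fits into a signed column of the given type.
def fits_column?(value, column_type)
  signed_range(MYSQL_INTEGER_BYTES.fetch(column_type)).cover?(value)
end

# Usage:
#   fits_column?(2_147_483_647, :int)  # largest value an INT can hold
#   fits_column?(2_147_483_648, :int)  # one too large for an INT
```

Running such a check over production-sized data, rather than staging data, is what would have surfaced the problem earlier.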

The cleanup

Obviously, we introduced quite a lot of technical debt while preserving the possibility to run and develop the application with Rails 3 and 4 in parallel. After a few days had passed, we started the cleanup which mainly involved removing all the rails4? conditionals and the monkey patches that were introduced over the past weeks.
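Finding the leftover conditionals is easy to automate. The sketch below is one possible approach, with `leftover_rails4_checks` as a hypothetical helper: it takes a hash of file paths to file contents and reports every line that still mentions `rails4?`, including the definition of the helper itself.

```ruby
# Returns [path, line_number, stripped_line] for each line that still
# references the rails4? helper.
def leftover_rails4_checks(sources)
  sources.flat_map do |path, content|
    content.each_line.with_index(1)
           .select { |line, _number| line =~ /\brails4\?/ }
           .map { |line, number| [path, number, line.strip] }
  end
end

# Usage (the directory list is an assumption about the project layout):
#   sources = Dir.glob("{app,lib,config}/**/*.rb").map { |p| [p, File.read(p)] }.to_h
#   leftover_rails4_checks(sources).each { |hit| puts hit.join(":") }
```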

Done Done

From start to finish, the upgrade took us roughly three months. At the beginning, we only worked on it part-time. Later, with the Rails Task Force, we dedicated the time of three engineers to it. All Babbel engineers, however, were willing to help where they could, which was highly appreciated. We took them up on that offer more than once.

Learnings

We started the upgrade as part of a feature team, and splitting up to found the Rails Task Force proved to be a good idea. It helped everyone to focus and move forward more quickly.

The synvert gem might be a good tool to automate adapting your code to changes in the Rails API as much as possible.

If you have a staging and a production environment, keep in mind that every difference between them will make your life harder. Additionally, having multiple staging environments will definitely make your life easier, as you can deploy your changes in a Rails 3 and a Rails 4 version and test them in parallel.

Photo by Karsten Winegeart on Unsplash
