Babbel Bytes

Insights from the Babbel engineering team

Dynamically Built Regular Expressions

Jan

TL;DR: If you need to build dynamically regular expressions memoize them. Compiling them is expensive.

I don’t like memoizing methods, or caching in general. Obviously, caching adds more complexity to your source code and might lead to more errors, or even worse, errors which are harder to reproduce because they only happen due to a specific history. For memoized methods, you have to ensure that there are no other dependencies for the result than the passed arguments. Or that the result of a memoized method does not reference a larger object tree ending with a memory leak. There might be even more pitfalls.


Thus, it should be worth the trouble. Caching larger page fragments in order to avoid hunderds of method calls or even database queries might be a great boost for the performance. Memoizing smaller objects, just to avoid composing them again and again seems usually not to be the best idea to start optimizing.

Anyway, today, we were able to kick a performance killer from one of our data analysis processors by memoizing quite small objects: regular expressions, more precisely dynamically built regular expressions.

The impact on the response time of the data analysis processor at 9:08 pm:

Response Time After Memoizing

Example

In our data analysis processor, we are trying to make some educated guess about the browser or the mobile app of a request based on the User-Agent header. We are calling some helper methods building the regular expressions to extract that information for each known browser or app.

Let’s take the Safari browser. The version of that browser in the User-Agent string is not after the browser name, but after the string Version/. The call of the following helper method

user_agent_browser('Safari', version_prefix: 'Version/')

will build two simple regular expressions like /Safari/i and /Version\/([a-z0-9.]+)/i.

Unfortunately that expressions have not been hard coded literals in Ruby source file, but have been built from /#{identifier}/i and /#{version_prefix}([a-z0-9.]+)/i. And that was the performance killer.

The solution was to extract the building of the regular expressions into memoized class methods:

class << self
  extend Memoist

  def user_agent_name_regexp(identifiers)
    /#{Regexp.union(identifiers).source}/i
  end
  memoize :user_agent_name_regexp

  def user_agent_version_regexps(identifiers, version_prefix)
    prefixes = Array(version_prefix || identifiers.map { |i| ["#{i}/", "#{i} "] }.flatten)
    prefixes.map { |prefix| /#{Regexp.escape(prefix)}([a-z0-9.]+)/i }
  end
  memoize :user_agent_version_regexps

end

This had reduced the time for processing the browser and mobile app guessing based on the User-Agent header by half; and the entire processing step after all by one quarter up to one third.

Facebook Twitter Google+ Reddit EMail
Comments