Aug 24, 2017

Waiting in Parallel - Parallel Processing in Ruby

Tomáš Čerevka
Tomáš Čerevka
ruby Backend
Waiting in Parallel - Parallel Processing in Ruby

A lot of people think that Ruby is a slow language and its parallelism sucks - mainly due to Global Interpreter Lock (GIL). But I want to show you that even in plain Ruby you can achieve incredible results with proper usage of parallelization.

At Showmax, we use microservices following the golden rule “divide and conquer”. Thus our clients (iOS, Android, TVs, …) sometimes have to send quite a lots of requests to various services to fulfill users’ desire for watching new content. Because of this we created a service called “UI Fragments” that serves as a Facade and lowers the complexity of our clients. UI Fragments are responsible for assembling a personalized home page for every user. The home page has multiple rows, where every row holds selected assets (movies, tv series or other objects relevant to the user).

This approach gives us tremendous flexibility. The server side is capable of deciding what content should be shown to the user and augmenting its placement to suit the personal needs of the user. The key thing is that this functionality can then be shared across all supported platforms without the need for complex implementation on their side. It looks like a straightforward win.

Internally, UI Fragments assembled the home page via a builder interface, which figures out the order of rows and then their content. Everything worked fine when we served only a few static rows. But over time the number of rows grew, and some dynamically generated rows were added. As a result, our latency has grown significantly as well.

First, we tried to fetch only data that we really needed to build the response. We reduced the size of fetching JSONs in nested requests. That improved the situation, but we were looking for more.

The team figured out that our approach at that time was not sustainable. All rows are independent of each other, so they ought to be fetched and built in parallel. It became clear that solving the problem would require parallelization of the builder step.

But preconceptions about parallelism in Ruby were very strong. On the Internet you can find a lot of articles about MRI, and why it is not good to process data in parallel (mainly because of GIL).

Alternative VMs

There are some projects trying to provide a Ruby VM that is fast, removes GIL and is able to utilize all available CPU cores (MagLevRubiniusJRuby). So we have been facing some big questions - are these VMs production ready? Which one to choose? How hard is migration going to be? Can we optimize the current code in a way, that we will not need to migrate from MRI to other VM?

But before trying different VMs, we decided to tune our code in MRI first.

To make our situation a bit more complicated, we use Goliath quite heavily. Goliath is an eventmachine based server. Using Ruby Threads is not a good idea at all. It uses Fibers - primitives for implementing light weight cooperative concurrency in Ruby. The author of Goliath, Ilya Grigorik, also wrote a collection of tools to manage Fibers in a similar way as Threads, but in the eventmachine - FiberIterator.

Making Requests Parallel

Evil tongues say that Ruby is slow - yes, Ruby is not as fast as C or Java, but we don’t need to compute complex mathematical equations. We just need to make multiple requests at once, as we know there is a lot of I/O wait involved - it takes some time to establish a connection to a server, transfer data, the server has to process the requests, and finally transfer a response back to us.

We don’t need the CPU when waiting. Let’s wait in parallel!

This leads us to the first step. Using Fibers for making a lot of requests to other services in parallel. The previous builder had one method for each type of row. So we needed to add plumbing to run these methods in parallel. I prepared the following wrapper for each method:

class ParallelBuilder < OriginalBuilder
  def build_promo_row(title:, slug:)
    proc = Proc.new { super }
    @procs << { position: bump_position, proc: proc }

The wrapper acts as a common interface for executing the real code. Execution is a two step process. First we prepare all wrappers, then we just fire them all at once:

EM::Synchrony::FiberIterator.new(@procs, 20).each do |proc|
  @results[proc[:position]] = proc[:proc].call

FiberIterator takes care of the execution in parallel, and after this block you have all the results prepared in the @results variable, ready for another processing.

We saw very impressive results even during the testing phase. It turned out that our initial consideration was right - all requests are fired at almost the same time (we still have GIL, right?) and as soon as the last request finishes, we have all the data that we need. In the following image you can compare sequential (reddish) latency with parallel (blueish) latency in our staging environment.

Post image

Let’s say we need to invoke three calls with following average latencies:

  • Static promo row - 250 ms
  • Recommendation row - 600 ms
  • Recently watched row - 400 ms

In the previous solution we had to call all these requests sequentially so the total latency was:

250 ms + 600 ms + 400 ms + overhead (50 ms) = 1300 ms

But the new parallel solution is able to process all these requests (almost) at once. Notice that I also added estimation of parallel overhead to make the result more realistic (see below). The latency dropped to:

max(250 ms, 600 ms, 400 ms) + overhead (50 ms) + parallel overhead (100 ms) = 750 ms

Although in theory this looks amazing, the real world is cruel and we have to take other circumstances into consideration:

  • The network is not as reliable as we would wish, so slow requests appear and slow down the whole batches.
  • With higher traffic, the worker pool/reactor can saturate easily - it can’t scale up infinitely.
  • The level of parallelism is limited (20 workers per request in the example above). If more rows are needed, it falls back to sequential processing.
  • EventMachine can’t easily use the benefits of multicore CPUs. Hardware limitations still apply.

The picture below shows the latency in the Showmax production environment before (reddish) and after (blueish) the deployment. As you can see, the results are not as impressive as in the laboratory conditions, but we still sped up the service very nicely without too much effort.

Post image


Obviously, there is nothing magical in getting this performance boost. One would expect it as we have moved from sequential processing (which needs the sum of all latencies) to parallel (where the total time is driven solely by the slowest request). After all, that’s the same reason why your standard browser does parallel requests as well.

For me, the beauty is in the process of how we got here.

We started with naïve implementation, which was totally sufficient at the beginning. It was also easy to implement and fairly simple, so we could focus on getting the business logic right and kept optimization for the time when it was needed. Fibers turned out to be quite robust, so even with MRI and GIL we got the expected performance boost.

However, we didn’t stop there. The response is as fast as the slowest request in the parallel batch, and thus we improved surrounding services as well. In the following image you can judge how much we have been successful - we are able to process the whole homepage response within 500 ms on average, even in peak hours.

Post image
Share article via: