Stratus3D

A blog on software engineering by Trevor Brown

Effective Debugging of Memory Leaks in Ruby

Recently at work I was tasked with finding a memory leak in a legacy Ruby application I was unfamiliar with. We had been monitoring the application's memory usage for several weeks and it was clear there was a slow memory leak somewhere in the application. When I began to look at the source code I realized I didn't know where to begin looking for the leak. The application used a framework and several gems I was unfamiliar with, and it was structured differently from other Ruby projects I had worked on. I also found that I really missed Erlang's excellent runtime_tools, which I rely on for diagnosing issues in Erlang applications. After a lot of trial and error I finally found and fixed the memory leak. In this blog post I share the steps I devised to find the memory leak in the unfamiliar source code.

Steps to Find a Memory Leak

The list of steps below assumes you have determined there is a memory leak in your Ruby application but don't know what is causing it. Your first instinct might be to reach for a memory profiling tool and begin looking at memory allocation over time. Some of these steps might seem like a waste of time, but in practice I've found them to be the most effective use of time, as I often overlook simple things. The first two steps focus on gems. Because third-party code is more widely used, its pitfalls and memory leaks are more widely known.

1. Check for any unused gems in the Gemfile and remove them

There are numerous tools out there to help you find memory leaks in Ruby applications, but with a large codebase even the best tools still produce a lot of noise. If you find even one unused gem it will greatly reduce the amount of code you need to analyze to find the memory leak. I have seldom reviewed the Gemfile of a legacy Ruby application without finding at least one or two unused gems. Removing unused gems sometimes has the added benefit of reducing overall memory usage.
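
There is no foolproof way to detect unused gems automatically, but a rough heuristic can narrow down the candidates for manual review. Here is a minimal sketch, assuming gems are referenced by name somewhere under app/, lib/, or config/ (gems loaded implicitly, like Rails railties, won't be caught, so verify each candidate by hand before removing it):

require 'bundler'

# Build the definition from the Gemfile and lockfile in the current directory
definition = Bundler::Definition.build('Gemfile', 'Gemfile.lock', nil)

definition.dependencies.each do |dep|
  # Grep the codebase for any mention of the gem's name
  hits = `grep -rl --include='*.rb' '#{dep.name}' app lib config 2>/dev/null`
  puts "possibly unused: #{dep.name}" if hits.empty?
end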

2. Check the issue tracker of each gem still present in the Gemfile for reports of memory leaks

A gem may contain memory leaks that have already been reported on the gem's issue tracker or mailing list. If you find a ticket or thread that describes something similar to the leak you are experiencing, you may have found your memory leak. If a newer version of the gem contains a fix for the memory leak, upgrade to the latest version. If you found a ticket but a fix is not available, you may have to work with the maintainers of the gem to get it fixed, or fork the project and fix it yourself. If you don't find anyone on the issue tracker describing your issue, the changelogs will show whether any released versions of the gem contain fixes for memory leaks. Even if a leak does not seem like the one you are experiencing, it's best to upgrade to a version that doesn't have any known leaks. If you do not find the source of the leak, continue to step 3.

3. Run Rubocop with the rubocop-performance extension

This isn't likely to find the cause of the memory leak, but it will alert you to general performance issues, which may give you clues as to where the leak is. If there are any Rubocop performance warnings, correct the code and see if the memory leak is still present. It likely will be; if so, continue to step 4.
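
If you haven't used the extension before, the setup is just a gem, a config line, and a command. Something like this should work (check the rubocop-performance README for the version of Rubocop you are using):

# Gemfile
gem 'rubocop-performance', require: false

# .rubocop.yml
require:
  - rubocop-performance

# Run only the performance cops
$ bundle exec rubocop --only Performance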

4. Visually review the Ruby code for possible memory leaks

Read through the application code and look for class attributes that grow in size, arrays that grow but never shrink, and long-lived objects. Creating a memory leak in Ruby is surprisingly easy. Fix any obvious issues, but don't spend a ton of time on this; just read through the code quickly. On applications with very large codebases you may need to skip this step, as it will be too time consuming. If you haven't found the cause of the leak, continue to step 5.
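
As an example of what to look for, here is a contrived sketch of one of the most common patterns: a class-level cache that only ever grows (ReportRenderer and expensive_render are hypothetical names):

class ReportRenderer
  # Shared across all instances. Entries are added for every unique
  # argument but never evicted, so this Hash grows for the lifetime
  # of the process - a classic slow leak.
  @@cache = {}

  def render(params)
    @@cache[params] ||= expensive_render(params)
  end
end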

5. If you still haven't found the issue, use Ruby's ObjectSpace module to find the leak

Now things get a little more involved. Follow the steps in the sections below to profile memory usage.

Profiling Memory Usage

We can use Ruby's ObjectSpace module to dump the heap formatted as JSON. Doing so allows us to see how memory was allocated at the time the dump was generated. A single dump may be enough to diagnose the cause of the leak, but slow leaks will likely not be apparent in a single dump. As Peter Wagenet explains in his excellent blog post on skylight.io, comparing three dumps is best because it lets you find memory that was allocated between the first and second dumps but still not freed by the third.

Generating Memory Dumps

Generating a memory dump is quite easy with the ObjectSpace module, which is provided by the objspace standard library. To generate an accurate dump, we first start tracing object allocations, run garbage collection so we get more accurate data, then dump all the allocations to a file as newline-delimited JSON (NDJSON). Here is a helper function that does it all:

require 'objspace'

def dump_memory(file)
  # Start tracing object allocations
  ObjectSpace.trace_object_allocations_start
  # Run garbage collection so the dump only contains live objects
  GC.start
  # Dump every live object to the file, one JSON object per line
  ObjectSpace.dump_all(output: file)
end
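
Each line of the resulting file is a JSON object describing a single live object. The exact keys vary between Ruby versions, but a line looks roughly like this (values made up for illustration):

{"address":"0x7f9d8c0a1b20", "type":"STRING", "file":"app/models/user.rb", "line":42, "value":"hello world", "bytesize":11, "memsize":40}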

Now all that we need to do is invoke this function three times while our application is running to get the memory allocation dumps we need.

Hooking Into the Application

Since generating a memory dump is as easy as calling the helper above, there are many ways to hook it into an application. The three methods I use most often are listed below.

Add Directly to the Source Code

For debugging small applications or command line scripts, adding dump_memory calls directly to the source code may be easiest. Look for several places in the source code where dumping memory would make sense. The further apart in time the dump_memory calls are made, the easier it will be to identify memory leaks.
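
For example, a long-running worker could dump memory at regular intervals. A minimal sketch, assuming a hypothetical process_next_job method:

jobs_processed = 0
loop do
  process_next_job # hypothetical unit of work
  jobs_processed += 1
  # Dump memory every 10,000 jobs so the dumps are spaced out in time
  if (jobs_processed % 10_000).zero?
    File.open("tmp/dump-#{jobs_processed}.json", 'w') { |f| dump_memory(f) }
  end
end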

Pry

If you are already using Pry and have binding.pry calls placed in your code, you can just invoke the dump_memory function from the Pry shell.
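
For example, from the Pry prompt you could write a dump to a timestamped file:

[1] pry(main)> File.open("tmp/dump-#{Time.now.to_i}.json", 'w') { |f| dump_memory(f) }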

Rack

For larger applications it might not be clear when or where to dump memory, and it may take more time for a slow leak to manifest. For the most flexibility and ease of use add a simple Rack endpoint that invokes the function. If you are using something like Rack::URLMap adding a new endpoint is easy:

run Rack::URLMap.new(
  '/dump-memory' => MemoryEndpoint.new
)

Then define the MemoryEndpoint class:

class MemoryEndpoint
  def call(env)
    # Open temp file
    time_str = Time.now.strftime('%Y-%m-%d_%H-%M-%S')
    File.open("tmp/dump-#{time_str}.json", 'w') do |file|
      # Use the helper function we defined above
      dump_memory(file)
    end
    [200, { 'Content-Type' => 'text/plain' }, ['Dumped Memory']]
  end
end

Now whenever you send a request to the /dump-memory endpoint the memory will be dumped to a timestamped file in the application's tmp/ directory.
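
Triggering a dump is then just an HTTP request away. For example, assuming the application is listening locally on port 9292:

$ curl http://localhost:9292/dump-memory
Dumped Memory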

Finding Leaks by Comparing Dumps

Using one of the methods listed above, generate at least three memory dumps. Once you have three or more dumps you can use the script below to generate a diff of them. The script was created by Peter Wagenet at Skylight and is taken from the same blog post I mentioned earlier.

Memory Dump Diff Script

#!/usr/bin/env ruby

require 'set'
require 'json'

if ARGV.length != 3
  puts "Usage: detect_leaks [FIRST.json] [SECOND.json] [THIRD.json]"
  exit 1
end

first_addrs = Set.new
third_addrs = Set.new

# Get a list of memory addresses from the first dump
File.open(ARGV[0], "r").each_line do |line|
  parsed = JSON.parse(line)
  first_addrs << parsed["address"] if parsed && parsed["address"]
end

# Get a list of memory addresses from the last dump
File.open(ARGV[2], "r").each_line do |line|
  parsed = JSON.parse(line)
  third_addrs << parsed["address"] if parsed && parsed["address"]
end

diff = []

# Get a list of all items present in both the second and
# third dumps but not in the first.
File.open(ARGV[1], "r").each_line do |line|
  parsed = JSON.parse(line)
  if parsed && parsed["address"]
    if !first_addrs.include?(parsed["address"]) && third_addrs.include?(parsed["address"])
      diff << parsed
    end
  end
end

# Group items
diff.group_by do |x|
  [x["type"], x["file"], x["line"]]
end.map do |x,y|
  # Collect memory size
  [x, y.count, y.inject(0){|sum,i| sum + (i['bytesize'] || 0) }, y.inject(0){|sum,i| sum + (i['memsize'] || 0) }]
end.sort do |a,b|
  b[1] <=> a[1]
end.each do |x,y,bytesize,memsize|
  # Output information about each potential leak
  puts "Retained #{y} #{x[0]} objects of size #{bytesize}/#{memsize} at: #{x[1]}:#{x[2]}"
end

# Also output total memory usage, because why not?
memsize = diff.inject(0){|sum,i| sum + (i['memsize'] || 0) }
bytesize = diff.inject(0){|sum,i| sum + (i['bytesize'] || 0) }
puts "\n\nTotal Size: #{bytesize}/#{memsize}"

Run this script, passing in the three dump files as arguments, oldest first. Assuming you saved the script as detect_leaks.rb:
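
$ ruby detect_leaks.rb dump1.json dump2.json dump3.json

The script will output lines like this: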

Retained 77 STRING objects of size 10947/10961 at: ~/.asdf/installs/ruby/2.6.6/lib/ruby/gems/ruby-2.6.6/gems/actionview-4.1.6/lib/action_view/template.rb:297
Retained 15 DATA objects of size 0/51808 at: ~/.asdf/installs/ruby/2.6.6/lib/ruby/gems/ruby-2.6.6/gems/actionview-4.1.6/lib/action_view/template.rb:297
Retained 1 DATA objects of size 0/0 at: ~/.asdf/installs/ruby/2.6.6/lib/ruby/gems/ruby-2.6.6/gems/puma-2.7.1/lib/puma/client.rb:35
...

Each line contains information on objects that were allocated between the first and second dumps, but not freed between the second and third dumps. Objects that are allocated and not freed are possible memory leaks.

Review the output and look at the type and number of retained objects as well as where the objects came from. It may be obvious from the output where the memory leak is but don’t be discouraged if it is not immediately clear. Look for objects that are retained in higher numbers than you would expect, objects from files that are seldom used, or files that should not be creating new objects at all. If you think you’ve found the source of the leak see if you can confirm the leak by either fixing it or removing the leaky code altogether.

It may take a while for a slow leak to manifest as a large number of objects in the diff output from this script. If you aren't able to find the source of the leak, you may need to increase the time between the memory dumps to make the leak more obvious. You may need to generate memory dumps several times before you are able to identify the leak in the diff output.

Conclusion

To make the best use of our time, we should resort to profiling only after we have exhausted the easier methods of finding the leak. Even with the ObjectSpace module, memory profiling is time consuming. With this list of steps we start with the things that are most likely to help us find the source of the memory leak, like checking for gems with known memory leaks, before moving on to memory profiling.

Running Dialyzer in Jenkins Builds

Writing tests is an important way of validating the correctness of the software you are building, and it is common practice for Erlang and Elixir projects. Type checking is another important way of validating correctness, and Dialyzer remains a valuable tool for finding type errors in Erlang and Elixir source code. If you aren't running Dialyzer on your Erlang and Elixir code, your build is likely missing bugs that could easily be found and fixed. Running Dialyzer automatically during Jenkins builds is a good way of ensuring obvious type errors are found and fixed.

In this blog post I will explain how I set up Jenkins to run Dialyzer during a Docker build for an Elixir project. I will also show how to cache Dialyzer’s PLT (persistent lookup table) files between builds using Jenkins artifacts and a Docker volume to speed up Dialyzer runs.

The instructions in this article are for Elixir 1.10 projects running in Docker, but the same steps can be applied to Erlang projects with minimal changes. The primary difference between the Elixir and Erlang builds would be the build paths.

Running Dialyzer

Running Dialyzer is pretty easy.

Add the dialyxir package to your project's mix.exs file. Make sure to specify the environments in which the dialyzer command will be needed. In my case I needed it in the dev and test environments:

defp deps do
  [
    ...
    {:dialyxir, "~> 1.0", only: [:dev, :test], runtime: false}
  ]
end

Running Dialyzer in Docker is as simple as adding the Dialyzer command right below the place where you run your Elixir tests:

RUN mix test
RUN MIX_ENV=test mix dialyzer

While this works, running Dialyzer during the Docker build like this isn’t going to allow us to reuse PLT files. There is a better way.

Reusing PLT Files

Building the type information is time consuming, so Dialyzer writes the data to PLT files on disk. If you run Dialyzer during the Docker build in Jenkins, the PLT files will not be present and Dialyzer will recreate them for every build, which will likely add 10-15 minutes to your build time. To speed up builds, it is possible to cache files as Jenkins artifacts and reuse them on subsequent builds. Since we are using Docker, the PLT files will be located inside the Docker container, so we will need a way to get the PLT files in and out of the container.
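
As a side note, dialyxir can also be told to write its PLT files to a directory of your choosing through the :dialyzer options in mix.exs, which can remove the need to copy files out of _build/ later on. Here is a sketch, with option names taken from the dialyxir README (verify them against the dialyxir version you are using):

def project do
  [
    app: :my_app,
    version: "0.1.0",
    deps: deps(),
    # Write the PLT files to a directory we control so they are easy to cache
    dialyzer: [
      plt_core_path: "dialyxir",
      plt_file: {:no_warn, "dialyxir/my_app.plt"}
    ]
  ]
end

The rest of this post assumes the default PLT location and copies the files out manually instead.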

Sharing PLT Files Between the Container and the Host

Putting files into a Docker container is easy. A simple COPY <source> <destination> command in the Dockerfile is all that’s needed. But we need to access the same PLT files both inside and outside of the container. Inside the container the PLT files will be used by Dialyzer to speed up the type analysis. After Dialyzer has finished inside the container the updated files will be needed outside of the container in the Jenkins build directory so they can be cached as Jenkins artifacts.

The easiest way of sharing files between the container and the host file system is a Docker bind mount. To use one, you'll need to first build the image with the application code, and then run the Dialyzer command with a host directory mounted into the container.

Docker Run

When running a container using docker run, you'll need to use the --mount flag like this (the source directory must already exist on the host):

$ docker run ... --mount "type=bind,src=$(pwd)/dialyxir,dst=/app/dialyxir,read_only=false"

Docker Compose

My Jenkins build actually ran mix dialyzer in a container that was started with Docker Compose. Creating the same bind mount is also easy with Docker Compose:

volumes:
  - type: bind
    source: ./dialyxir
    target: /app/dialyxir
    read_only: false

Putting PLT Files in the Volume

When Dialyzer runs, it's not going to write the PLT files to dialyxir/; it's going to write them to _build/test/ (since we run it with MIX_ENV=test). After Dialyzer runs we will need to copy the PLT files into the mounted directory so they can be accessed on the host file system. We need to add two commands after the mix dialyzer command for this. The easiest way to do all this is to create a Bash script in the container that you can invoke with docker run. The script should look something like this. I named my script run_checks.sh:

run_checks.sh
#!/usr/bin/env bash
set -euo pipefail
IFS=$'\t\n' # Stricter IFS settings

# Run tests as usual if you want
mix test
# Run Dialyzer
MIX_ENV=test mix dialyzer
# Create the dialyxir directory if it doesn't already exist
mkdir -p dialyxir
# Copy the PLT files into the dialyxir directory
# (MIX_ENV=test, so Dialyzer wrote them to _build/test/)
cp -f _build/test/dialyxir* dialyxir
Note

In the past I have encountered permissions issues with files shared with the host through a Docker mount. Sometimes the host does not have permission to read the files, preventing Jenkins from uploading the PLT files as artifacts. You can work around this issue by copying the files into a new directory and setting permissions on that directory and its contents to 777:

mkdir -p dialyxir/new
cp -f _build/test/dialyxir* dialyxir/new
chmod -R 777 dialyxir/new

Then outside the container in the Jenkins build copy the files from dialyxir/new/ into the parent dialyxir/ directory before saving them as Jenkins artifacts.

Now you can run the script using the docker run command after you have built the docker image:

$ docker run ... --mount "type=bind,src=$(pwd)/dialyxir,dst=/app/dialyxir,read_only=false" <container>:<tag> ./run_checks.sh

After running this command you should see the PLT files in a dialyxir directory inside the same directory that you ran the command in. You now have something that you can cache using Jenkins artifacts.

Saving PLT Files as Jenkins Build Artifacts

If you are using Jenkins pipelines, uploading the PLT files as Jenkins artifacts can be done in a stage that runs after the docker run command that generates the PLT files. You only want to save PLT files for builds of the master branch. Feature branches may generate PLT files that contain new dependencies that are not used by the master branch or other feature branches, resulting in PLT files that aren't suitable for use in later builds.

    stage('Upload Build Artifacts') {
      when {
        branch 'master'
      }
      steps {
        // Upload the PLT files as Jenkins artifacts
        archiveArtifacts artifacts: 'dialyxir/*', onlyIfSuccessful: true
      }
    }

Downloading PLT File Artifacts for Reuse

Now that the build is storing PLT files in Jenkins artifacts subsequent builds can download and use them instead of rebuilding the PLT files from scratch. This can be easily done with a Jenkins stage that downloads the files before running the run_checks.sh script in Docker:

  stages {
    stage('Download Artifacts') {
      steps {
        // download dialyzer artifacts from jenkins
        copyArtifacts(projectName: '<project name>/master', target: 'dialyxir', optional: true, flatten: true)
      }
    }
    ...

Summary

With these two new stages in place your builds should now be able to upload generated PLT files after running Dialyzer on your project's master branch and download the cached PLT files on subsequent builds of any branch. PLT files will rarely need to be re-created from scratch and will only need to be updated when dependencies change. With PLT file caching my Elixir build times improved from an average of 15-20 minutes to around 4 minutes.

Example Files

Below is a Dockerfile and a Jenkinsfile so you can see what the final files might look like after all these changes.

Example Dockerfile:

FROM elixir:1.10.2-alpine AS base

LABEL project="<project name>"
ENV APP_ROOT=/app

WORKDIR $APP_ROOT

# Add dependencies
FROM base AS deps

RUN apk add coreutils

# Copy in application code
FROM deps AS application

COPY . $APP_ROOT

# Generate build artifacts
FROM application AS build

# Fetch the Elixir dependencies and compile and release the application
RUN MIX_ENV=prod mix do deps.get, compile, release

# Generate final image
FROM base AS release

# Copy release into the rel directory for convenience
COPY --from=build $APP_ROOT/_build/prod/rel rel

ENTRYPOINT ["/app/rel/<app name>/bin/<app name>"]
CMD ["start"]

Example Jenkinsfile:

pipeline {
  agent any

  stages {
    stage('Download Artifacts') {
      steps {
        // Download PLT files from Jenkins
        copyArtifacts(projectName: '<project name>/master', target: 'dialyxir', optional: true, flatten: true)
      }
    }

    stage('Build Docker Image') {
      steps {
        sh 'docker build -t <project name> .'
      }
    }

    stage('Run Tests and Dialyzer') {
      steps {
        // Run Dialyzer with a bind mount so we can share PLT files between Jenkins and the container
        sh 'docker run --mount "type=bind,src=$(pwd)/dialyxir,dst=/app/dialyxir,read_only=false" <project name> ./run_checks.sh'
      }
    }

    stage('Upload Build Artifacts') {
      when {
        branch 'master'
      }

      steps {
        // Upload the PLT files as Jenkins artifacts
        archiveArtifacts artifacts: 'dialyxir/*', onlyIfSuccessful: true
      }
    }
  }
}

The Let It Crash Philosophy Outside Erlang

Abstract

Erlang is known for its use in systems that are fault tolerant and reliable. One of the ideas at the core of the Erlang runtime system's design is the Let It Crash error handling philosophy. The Let It Crash philosophy is an approach to error handling that seeks to preserve the integrity and reliability of a system by intentionally allowing certain faults to go unhandled. In this talk I will cover the principles of Let It Crash and then talk about how I think the principles can be applied to other software systems. I will conclude the talk by presenting a couple of real-world scripts written in various languages and then showing how they can be improved using the Let It Crash principles.

What is Let It Crash?

Let It Crash is a fault-tolerance design pattern. Like every design pattern, there are some variations on what it actually means. I looked around online and didn't find a simple and authoritative definition of the term. The closest thing I found was a quote from Joe Armstrong.

only program the happy case, what the specification says the task is supposed to do

That's a good, terse description of what Let It Crash is. It's about coding for the happy path and acknowledging that there are faults we won't be able to handle properly, because we don't know how the program ought to handle them.

So how does this approach improve our programs? And how do we apply it in practice?

Let It Crash principles

While I didn't find an authoritative definition of the Let It Crash philosophy, I have several years of experience applying it while developing software in Erlang. The core tenet of Let It Crash is to code for the happy path. There are two other guiding principles that I follow as well, so here are my three principles of Let It Crash.

  • Code for the happy path

    Focus on what your code should do in the successful case. Code that first, and only deal with the unhappy paths if necessary. You write software to DO something. Most of the code you write should be for doing that thing, not handling faults when they occur.

  • Don’t catch exceptions you can’t handle properly

    Never catch an exception when you don't know how to remedy it in the current context. Trying to work around something you can't remedy often leads to other problems and makes things worse at inopportune times. Catching exceptions at a higher level is always possible; re-throwing an exception without losing information is much harder. Catch-all expressions that trap exceptions you didn't anticipate are usually bad.

  • Software should fail noisily

    The worst thing software can do when encountering a fault is to continue silently. Unexpected faults should result in verbose exceptions that get logged and reported to an error tracking system. (A short Bash sketch contrasting the last two principles follows this list.)
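
To make the principles concrete, here is a small Bash sketch; backup_database is a hypothetical command standing in for any operation that can fail:

# Anti-pattern: a catch-all that swallows a failure we can't remedy.
# The script carries on in an unknown state and the fault is hidden.
backup_database || true

# Let It Crash: with errexit set, the script stops at the failing
# command and the non-zero exit status propagates noisily.
set -e
backup_database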

Benefits

  • Less code

    Less code means less work and fewer places for bugs to hide. Error handling code is seldom exercised and often contains bugs of its own because it is poorly tested.

  • Less work

    Faster development in the short term. More predictable program behavior in the long term. Programs that simply crash when a fault is encountered are predictable. Simple restart mechanisms are easier to understand than specialized exception handling logic.

  • Reduced coupling

    Catching a specific exception by name is connascence of name. When you catch a lot of exceptions by name, your exception handling code is coupled to the code that raises the exceptions; if the raising code changes, you may have to change your exception handling code as well. Being judicious about handling exceptions reduces coupling.

  • Clarity

    Smaller programs are easier to understand, and the code more clearly indicates the programmer's intent: unexpected faults are not handled by this code. Catch-all expressions obscure the developer's assumptions about what could happen, making their intent unclear to anyone reading the code.

  • Better performance

    An uncaught exception is going to terminate the program or process almost immediately. Custom exception handling logic will use additional CPU cycles and may not be able to recover from the fault it encountered. Retry logic can also use a lot of resources and doesn’t always succeed.

Applying Let It Crash

Rather than try to explain more about the Let It Crash principles, I will show how they can be applied in other programming languages. While the application of these principles may vary from language to language, Let It Crash is applicable everywhere.

Let It Crash in Bash

I'm going to use Bash scripting as an example because it's something many developers are vaguely familiar with, and it's radically different from Erlang as far as language semantics go.

Original Code

Here is a simple Bash script that reads from STDIN on the command line and uploads the input to Hastebin. I took this script from my own dotfiles.

#!/usr/bin/env bash

input_string=$(cat);
curl -X POST -s -d "$input_string" $HASTEBIN_URL/documents \
       | awk -v HASTEBIN_URL=$HASTEBIN_URL -F '"' '{print HASTEBIN_URL"/"$4}';

While this may seem like simple code, there are at least three potential faults in this script that would result in unexpected behavior:

  1. If the cat command that reads data into the input_string variable fails, the script will continue and upload an empty snippet to Hastebin.

  2. If the curl command fails, the awk command that parses out the Hastebin URL will still be executed. Since there is no response body, awk will not print the snippet URL, and it might print some other malformed URL instead.

  3. If the HASTEBIN_URL environment variable is not set, all of these commands will still be executed and will fail because of the empty variable.

Improved Code

This simple script can be greatly improved by setting some Bash flags. The improved code is shown below.

#!/usr/bin/env bash

# Unofficial Bash "strict mode"
# http://redsymbol.net/articles/unofficial-bash-strict-mode/
set -euo pipefail
IFS=$'\t\n' # Stricter IFS settings

input_string=$(cat);
curl -X POST -s -d "$input_string" "$HASTEBIN_URL/documents" \
  | awk -v HASTEBIN_URL="$HASTEBIN_URL" -F '"' '{print HASTEBIN_URL"/"$4}';

Before executing the commands we set the -e (errexit) flag to tell Bash to exit when a non-zero (unsuccessful) exit code is returned by any command. This fixes the first and second issues pointed out above. We also set the -u flag, so any time an expression tries to use an unset variable the script exits with an error. The third flag we use is -o pipefail, which tells Bash to use the exit code of the first unsuccessful command as the exit code for the whole pipeline, instead of the exit code of the last command. This prevents a pipeline from returning a successful status code when a command in the pipeline fails. The script now exits with an error if HASTEBIN_URL is not set, so the user knows what the script expects to be set in the environment.
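
If you want a friendlier message than Bash's default "unbound variable" error, the :? parameter expansion lets you supply one. An optional tweak, added near the top of the script:

# Fail fast with a descriptive message if HASTEBIN_URL is unset or empty
: "${HASTEBIN_URL:?must be set to the URL of your Hastebin instance}"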

Note About the Name

I don't think Let It Crash is a good name for the philosophy Erlang has about handling faults. Yes, letting things crash is often the right thing to do, but it is a form of defensive programming (offensive programming, actually) that allows software to recover to a known good state when faults are encountered. I also think defensive programming is a poor name for a design philosophy intended to improve software reliability, as it seems to suggest the programmer should be defensive and proactively guard against unexpected faults. But that is a topic for another blog post.