Stratus3D

Software Engineering, Web Development and 3D Design

Elixir Flame Graphs

Flame graphs are a way of visualizing profiled software by representing the call stack as a graph of horizontal layers with the x axis representing the time of each call and the y axis representing the nested calls on the stack as stacked layers. Flame graphs make it easy to see how long a function takes to execute as well how often each function is invoked.

Effective Debugging of Memory Leaks in Ruby

Recently at work I was tasked with finding a memory leak in a legacy Ruby application I was unfamiliar with. We had been monitoring the application’s memory usage for several weeks and it was clear that there was a slow memory leak somewhere in the application. When I began to look at the source code I realized it I didn’t know where to begin looking for the memory leak. The application was using a framework and several gems that I was unfamiliar with and was structured different than other Ruby projects I had been working on. I also found that I really missed Erlang’s excellent runtime_tools that I rely on for diagnosing issues in Erlang applications. After a lot of trial and error I finally found and fixed the memory leak. In this blog post I share the steps I devised to find the memory leak in the unfamiliar source code.

Steps to Find a Memory Leak

The list of steps below assumes you have determined there is a memory leak in your Ruby application but don’t know what is causing it. Your first instinct might be to reach for some memory profiling tool and begin looking at memory allocation over time. Some of these steps might seem like a waste of time, but in practice I’ve found them to the most of effective use of time as I often overlook simple things. These first two steps focus on gems. Often times third-party code is more widely used and pitfalls and memory leaks are more widely known.

1. Check for any unused gems in the Gemfile and remove them

There are numerous tools out there to help you find memory leaks in Ruby applications, but with a large codebase even the best tools still produce a lot of noise. If you find even one unused gem it will greatly reduce the amount of code you need to analyze to find the memory leak. Seldom have I not found at least one or two unused gems when reviewing the Gemfile of a legacy Ruby application. Removing unused gems sometimes has the added benefit of reducing overall memory usage.

2. Check the issue tracker of each gem still present in the Gemfile for reports of memory leaks

A gem may contain memory leaks that have already been reported on the gem’s issue tracker or mailing list. If you find a ticket or thread that describes something similar to the leak you are experiencing you may have found your memory leak. If a newer version of the gem contains a fix for the memory leak upgrade to the latest version. If you found an ticket but a fix is not available you may have to work with the maintainers of the gem to get it fixed or fork the project and fix it yourself. If you don’t find someone on the issue tracker describing your issue the changelogs will show if any released versions of the gem contain fixes for memory leaks. Even if a leak does not seem like the one you are experiencing it’s best upgrade to a version that doesn’t have any known leaks. If you do not find the source of the leak continue to step 3.

3. Run Rubocop with the rubocop-performance extension

This isn’t likely to find the cause of the memory leak, but it will alert you to general performance issues, which may give you clues as to where the leak is. If there are any Rubocop performance warnings correct the code and see if the memory leak is still present. The memory leak will likely still be present. If it is continue to step 4.

4. Visually review the Ruby code for possible memory leaks

Read through the application code and look for class attributes that grow in size, arrays that grow but never shrink, and long lived objects. Creating a memory leak is pretty easy. Fix any obvious issues. Don’t spend a ton of time on this; just read through the code quickly and look for any obvious issues. On applications with very large codebases you may need to skip this step as it will be too time consuming. If you haven’t found the cause of the leak continue to step 5.

5. If you still haven’t found the issue, use Ruby’s ObjectSpace class to find the leak

Now things get a little more involved. Follow the steps in the sections below to profile memory usage.

Profiling Memory Usage

We can use Ruby’s ObjectSpace class to dump the heap formatted as JSON. Doing so will allow us to see how memory was allocated at the time the heap was generated. A single dump may be enough to diagnose the cause of the leak but slow leaks will likely not be apparent in a single dump. As stated in Peter Wagenet’s excellent blog post on skylight.io comparing three dumps is best because it allows you to find memory that is allocated but not freed between the second and third dumps.

Generating Memory Dumps

Generating a memory dump is quite easy with the ObjectSpace class. To generate an accurate dump, we first start tracing object allocations, run the garbage collection so we get more accurate data, then dump all the allocations to file as NDJSON. Here is a helper function that does it all:

def dump_memory(file)
  # Start tracing object allocations
  ObjectSpace.trace_object_allocations_start
  # Run garbage collection to get a more accurate dump
  GC.start
  # Dump memory to file
  ObjectSpace.dump_all(output: file)
end

Now all that we need to do is invoke this function three times while our application is running to get the memory allocation dumps we need.

Hooking into the application

Since generating a memory dump is as easy as invoking the function above there are many different ways you can invoke the function. The three methods I use most often are listed below.

Add Directly to the Source Code

For debugging small applications or command line scripts adding dump_memory calls directly to the source code may be easiest. Just look for several places in the source code where dumping memory would make sense. The further apart in time the dump_memory calls are made the easier it will be to identify memory leaks.

Pry

If you already are using pry and have binding.pry calls placed in your code you can just invoke the dump_memory function from the Pry shell.

Rack

For larger applications it might not be clear when or where to dump memory, and it may take more time for a slow leak to manifest. For the most flexibility and ease of use add a simple Rack endpoint that invokes the function. If you are using something like Rack::URLMap adding a new endpoint is easy:

run Rack::URLMap.new(
  'dump-memory' -> MemoryEndpoint.new
  )

Then define the MemoryEndpoint class:

class MemoryEndpoint
  def call(env)
    # Open temp file
    time_str = Time.now.strftime('%Y-%m-%d_%H-%M-%S')
    File.open("tmp/dump-#{time_str}.json", 'w') do |file|
      # Use the helper function we defined above
      dump_memory(file)
    end
    [200, { 'Content-Type' => 'text/plain' }, ['Dumped Memory']]
  end
end

Now whenever you send a request to the /dump-memory endpoint the memory will be dumped to a file in the application’s current directory.

Finding Leaks by Comparing Dumps

Using one of the methods listed above generate at least three memory dumps. Once you have three or more dumps you can use the script below to generate a diff of three dumps. The script was created by Peter Wagenet at Skylight.io and taken from that same blog post I mentioned earlier.

Memory Dump Diff Script

#!/usr/bin/env ruby

require 'set'
require 'json'

if ARGV.length != 3
  puts "Usage: detect_leaks [FIRST.json] [SECOND.json] [THIRD.json]"
  exit 1
end

first_addrs = Set.new
third_addrs = Set.new

# Get a list of memory addresses from the first dump
File.open(ARGV[0], "r").each_line do |line|
  parsed = JSON.parse(line)
  first_addrs << parsed["address"] if parsed && parsed["address"]
end

# Get a list of memory addresses from the last dump
File.open(ARGV[2], "r").each_line do |line|
  parsed = JSON.parse(line)
  third_addrs << parsed["address"] if parsed && parsed["address"]
end

diff = []

# Get a list of all items present in both the second and
# third dumps but not in the first.
File.open(ARGV[1], "r").each_line do |line|
  parsed = JSON.parse(line)
  if parsed && parsed["address"]
    if !first_addrs.include?(parsed["address"]) && third_addrs.include?(parsed["address"])
      diff << parsed
    end
  end
end

# Group items
diff.group_by do |x|
  [x["type"], x["file"], x["line"]]
end.map do |x,y|
  # Collect memory size
  [x, y.count, y.inject(0){|sum,i| sum + (i['bytesize'] || 0) }, y.inject(0){|sum,i| sum + (i['memsize'] || 0) }]
end.sort do |a,b|
  b[1] <=> a[1]
end.each do |x,y,bytesize,memsize|
  # Output information about each potential leak
  puts "Retained #{y} #{x[0]} objects of size #{bytesize}/#{memsize} at: #{x[1]}:#{x[2]}"
end

# Also output total memory usage, because why not?
memsize = diff.inject(0){|sum,i| sum + (i['memsize'] || 0) }
bytesize = diff.inject(0){|sum,i| sum + (i['bytesize'] || 0) }
puts "\n\nTotal Size: #{bytesize}/#{memsize}"

Run this script and pass in the three dump files as arguments. The script will output lines like this:

Retained 77 STRING objects of size 10947/10961 at: ~/.asdf/installs/ruby/2.6.6/lib/ruby/gems/ruby-2.6.6/gems/actionview-4.1.6/lib/action_view/template.rb:297
Retained 15 DATA objects of size 0/51808 at: ~/.asdf/installs/ruby/2.6.6/lib/ruby/gems/ruby-2.6.6/gems/actionview-4.1.6/lib/action_view/template.rb:297
Retained 1 DATA objects of size 0/0 at: ~/.asdf/installs/ruby/2.6.6/lib/ruby/gems/ruby-2.6.6/gems/puma-2.7.1/lib/puma/client.rb:35
...

Each line contains information on objects that were allocated between the first and second dumps, but not freed between the second and third dumps. Objects that are allocated and not freed are possible memory leaks.

Review the output and look at the type and number of retained objects as well as where the objects came from. It may be obvious from the output where the memory leak is but don’t be discouraged if it is not immediately clear. Look for objects that are retained in higher numbers than you would expect, objects from files that are seldom used, or files that should not be creating new objects at all. If you think you’ve found the source of the leak see if you can confirm the leak by either fixing it or removing the leaky code altogether.

It may take a while for slow leak to manifest as a large number of objects in the diff output from this script. If you aren’t able to find the source of the leak you may need to increase the time between the memory dumps to make the leak more obvious. You may need to generate memory dumps several times before you are able to identify the leak in the diff output.

Conclusion

To make best use of our time it is best to resort to profiling only when we have exhausted the easier methods of finding the leak. Even with the ObjectSpace class memory profiling is time consuming. With this list of steps we start with the things that are most likely to help us find the source of the memory leak, like checking for gems with known memory leaks, before moving on to memory profiling.

Running Dialyzer in Jenkins Builds

Writing tests is an important way of validating the correctness of the software you are building and is a common practice for Erlang and Elixir projects. Type checking is also an important way of validating the correctness of the software you are working on. Dialyzer is still a valuable tool that is good at finding type errors in Erlang and Elixir source code. If you aren’t running Dialyzer on your Erlang and Elixir code your build is likely not catching bugs that could easily be found and fixed. Running Dialyzer automatically during Jenkins builds is a good way of ensuring obvious type errors are found and fixed.

In this blog post I will explain how I set up Jenkins to run Dialyzer during a Docker build for an Elixir project. I will also show how to cache Dialyzer’s PLT (persistent lookup table) files between builds using Jenkins artifacts and a Docker volume to speed up Dialyzer runs.

The instructions in this article are for Elixir 1.10 projects running in Docker, but the same steps can be applied to Erlang projects with minimal changes. The primary difference between the Elixir and Erlang builds would be the build paths.

Running Dialyzer

Running Dialyzer is pretty easy.

Add the dialyxir package to your project’s mix.exs file. Make sure to specify the environments the command will be needed in. In my case I needed the dialyzer command in dev and test environments:

defp deps do
  [
    ...
    {:dialyxir, "~> 1.0", only: [:dev, :test], runtime: false}
  ]
end

Running Dialyzer in Docker is as simple as adding the Dialyzer command right below the place where you run your Elixir tests:

RUN mix test
RUN MIX_ENV=test mix dialyzer

While this works, running Dialyzer during the Docker build like this isn’t going to allow us to reuse PLT files. There is a better way.

Reusing PLT Files

Building the type information is time consuming so Dialyzer writes the data to PLT files on disk. If you run Dialyzer during the Docker build in Jenkins the PLT files will not be present and Dialyzer will recreate them for every build. Building the PLT files will likely add 10-15 minutes to your build time. To speed up builds it is possible to cache files with Jenkins artifacts and reuse them on subsequent builds. Since we are using Docker the PLT files will be located inside the Docker container, so we will need a way to get the PLT files in and out of the Docker container.

Sharing PLT Files Between the Container and the Host

Putting files into a Docker container is easy. A simple COPY <source> <destination> command in the Dockerfile is all that’s needed. But we need to access the same PLT files both inside and outside of the container. Inside the container the PLT files will be used by Dialyzer to speed up the type analysis. After Dialyzer has finished inside the container the updated files will be needed outside of the container in the Jenkins build directory so they can be cached as Jenkins artifacts.

The easiest way of sharing files between the container and the host file system is a Docker volume. In order to use a Docker volume, you’ll need to first build the container with the application code, and then run the Dialyzer command with a volume mounted in the container.

Dockerfile

When running a container using docker run, you’ll need to use the --mount flag like this:

$ docker run ... --mount 'type=volume,src=/dialyxir,dst=/app/dialyxir,read_only=false'

Docker Compose

My Jenkins build actually ran mix dialyzer in a container that was started with Docker Compose. Creating a volume is also easy with Docker Compose:

volumes:
  - type: bind
    source: /dialyxir
    target: /app/dialyxir
    read_only: false

Putting PLT Files in the Volume

When Dialyzer runs in Elixir, it’s not going to write the PLT files to dialyxir/, it’s going to write them _build/test/. After Dialyzer runs we will need to copy the PLT files into the volume so they can be accessed on the host file system. We need to add two commands after the mix dialyxir command for this. The easiest way to do all this is to create a Bash script in the container that you can invoke with docker run. The script should look something like this. I named my script run_checks.sh:

run_checks.sh
#!/usr/bin/env bash
set -euo pipefail
IFS=$'\t\n' # Stricter IFS settings

# Run tests as usual if you want
mix test
# Run Dialyzer
MIX_ENV=test mix dialyzer
# Create the dialyxir directory if it doesn't already exist
mkdir -p dialyxir
# Copy the PLT files into the dialyxir directory
cp -f _build/dev/dialyxir* dialyxir
Note

In the past I have encountered permissions issues with files shared with the host through a Docker volume. Sometimes the host does not have permission to read the files, preventing Jenkins from uploading the PLT files as artifacts. You can work around this issue by copying the files into a new directory, and setting permission on that directory and its contents to 777:

cp -f _build/dev/dialyxir* dialyxir/new
chmod -R 777 dialyxir/new

Then outside the container in the Jenkins build copy the files from dialyxir/new/ into the parent dialyxir/ directory before saving them as Jenkins artifacts.

Now you can run the script using the docker run command after you have built the docker image:

$ docker run <container>:<tag> run_checks.sh ... --mount 'type=volume,src=/dialyxir,dst=/app/dialyxir,read_only=false'

After running this command you should see the PLT files in a dialyxir directory inside the same directory that you ran the command in. You now have something that you can cache using Jenkins artifacts.

Saving PLT Files as Jenkins Build Artifacts

If you are using Jenkins pipelines uploading the PLT files as Jenkins artifacts can be done as a stage that runs after the docker run that generates the PLT files. You only want to save PLT files for builds of the master branch. Feature branches may generate PLT files that contain new dependencies that are not used by the master branch or other feature branches and may result in PLT files aren’t suitable for use in later builds.

    stage('Upload Build Artifacts') {
      when {
        branch 'master'
      }
        stage('Upload Artifacts') {
          steps {
            // Upload the PLT files as Jenkins artifacts
            archiveArtifacts artifacts: 'dialyxir/*', onlyIfSuccessful: true
          }
        }
      }

Downloading PLT File Artifacts for Reuse

Now that the build is storing PLT files in Jenkins artifacts subsequent builds can download and use them instead of rebuilding the PLT files from scratch. This can be easily done with a Jenkins stage that downloads the files before running the run_checks.sh script in Docker:

  stages {
    stage('Download Artifacts') {
      steps {
        // download dialyzer artifacts from jenkins
        copyArtifacts(projectName: '<project name>/master', target: 'dialyxir', optional: true, flatten: true)
      }
    }
    ...

Summary

With these two new stages in place your builds should now be able to upload generated PLT files after running Dialyzer on your project’s master branch and download the cached PLT files on subsequent builds of any branch. PLT files will never need to be re-created and will only need to be updated when dependencies change. With PLT file caching my Elixir build times improved from an average of 15-20 minutes to around 4 minutes.

Example Files

Below is a Dockerfile and a Jenkinsfile so you can see what the final files might look like after all these changes.

Example Dockerfile:

FROM elixir:1.10.2-alpine AS base

LABEL project="<project name>"
ENV APP_ROOT=/app

WORKDIR $APP_ROOT

# Add dependencies
FROM base AS deps

RUN apk add coreutils

# Copy in application code
FROM deps AS application

COPY . $APP_ROOT

# Generate final image
FROM application AS build

# Fetch the Elixir dependencies and compile and release the application
RUN MIX_ENV=prod mix do deps.get, compile, release

# Copy release into the rel directory for convenience
COPY --from=build $APP_ROOT/_build/prod/rel rel

ENTRYPOINT ["/app/rel/<app name>/bin/<app name>"]
CMD ["start"]

Example Jenkinsfile:

pipeline {
  stages {
    stage('Download Artifacts') {
      steps {
        // Download PLT files from Jenkins
        copyArtifacts(projectName: '<project name>/master', target: 'dialyxir', optional: true, flatten: true)
      }
    }

    stage('Build Docker Image') {
      sh 'docker build .'
    }

    stage('Run Tests and Dialyzer') {
      // Run Dialyzer with shared volume so we can share PLT files between Jenkins and the container
      sh 'docker run --mount "type=volume,src=/dialyxir,dst=/app/dialyxir,read_only=false" <project name> ./run_checks.sh'
    }

    stage('Upload Build Artifacts') {
      when {
        branch 'master'
      }

      stage('Upload Artifacts') {
        steps {
          // Upload the PLT files as Jenkins artifacts
          archiveArtifacts artifacts: 'dialyxir/*', onlyIfSuccessful: true
        }
      }
    }
  }
}