Stratus3D

A blog on software engineering by Trevor Brown

The Let It Crash Philosophy Outside Erlang

Abstract

Erlang is known for its use in systems that are fault tolerant and reliable. One of the ideas at the core of the Erlang runtime system’s design is the Let It Crash error handling philosophy. The Let It Crash philosophy is an approach to error handling that seeks to preserve the integrity and reliability of a system by intentionally allowing certain faults to go unhandled. In this talk I will cover the principles of Let It Crash and then talk about how I think the principles can be applied to other software systems. I will conclude the talk by presenting a couple of real-world scripts written in various languages and then showing how they can be improved using the Let It Crash principles.

What is Let It Crash?

Let it crash is a fault tolerant design pattern. Like every design pattern, there are some variations on what it actually means. I looked around online and didn’t find a simple and authoritative definition of the term. The closest thing I found was a quote from Joe Armstrong.

only program the happy case, what the specification says the task is supposed to do

That’s a good, terse description of what Let It Crash is. It’s about coding for the happy path and acknowledging that faults will occur that we aren’t going to be able to handle properly, because we don’t know how the program ought to handle them.

So how does this approach improve our programs? And how do we apply this in practice?

Let It Crash principles

While I didn’t find an authoritative definition of the Let It Crash philosophy, I have several years of experience applying it while developing software in Erlang. The core tenet of Let It Crash is to code for the happy path. There are two other guiding principles that I follow as well, so here are my three principles of Let It Crash.

  • Code for the happy path

    Focus on what your code should do in the successful case. Code that first, and only if necessary deal with some of the unhappy paths. You write software to DO something. Most of the code you write should be for doing that thing, not handling faults when they occur.

  • Don’t catch exceptions you can’t handle properly

    Never catch exceptions when you don’t know how to remedy them in the current context. Trying to work around something you can’t remedy can often lead to other problems and make things worse at inopportune times. Catching exceptions at higher levels is always possible. It is harder to re-throw an exception without losing information. Catch-all expressions that trap exceptions you didn’t anticipate are usually bad.

  • Software should fail noisily

    The worst thing software can do when encountering a fault is to continue silently. Unexpected faults should result in verbose exceptions that get logged and reported to an error tracking system.

Benefits

  • Less code

    Less code means less work and fewer places for bugs to hide. Error handling code is seldom used and often contains bugs of its own due to being poorly tested.

  • Less work

    Faster development in the short term. More predictable program behavior in the long term. Programs that simply crash when a fault is encountered are predictable. Simple restart mechanisms are easier to understand than specialized exception handling logic.

  • Reduced coupling

    Catching specific exceptions is connascence of name. When you have a lot of exceptions that you are catching by name, your exception handling code is coupled to the code that raises the exceptions. If the code raising the exceptions changes, you may have to change your exception handling code as well. Being judicious about handling exceptions reduces coupling.

  • Clarity

    Smaller programs are easier to understand, and the code more clearly indicates the programmer’s intent: unexpected faults are not handled by the code. Catch-all expressions make the developer’s intent unclear to anyone reading the code by obscuring their assumptions about what could happen.

  • Better performance

    An uncaught exception is going to terminate the program or process almost immediately. Custom exception handling logic will use additional CPU cycles and may not be able to recover from the fault it encountered. Retry logic can also use a lot of resources and doesn’t always succeed.

Applying Let It Crash

Rather than try to explain more about the Let It Crash principles, I will show how they can be applied in other programming languages. While the application of these principles may vary from language to language, Let It Crash is applicable everywhere.

Let It Crash in Bash

I’m going to use Bash scripting as an example, because it’s something that many developers are vaguely familiar with, and it’s radically different from Erlang as far as language semantics go.

Original Code

Here is a simple Bash script that reads from STDIN on the command line and uploads the input to Hastebin. I took this script from my own dotfiles.

#!/usr/bin/env bash

input_string=$(cat);
curl -X POST -s -d "$input_string" $HASTEBIN_URL/documents \
       | awk -v HASTEBIN_URL=$HASTEBIN_URL -F '"' '{print HASTEBIN_URL"/"$4}';

While this may seem like simple code, there are at least three possible faults in this script that would result in unexpected behavior:

  1. If the cat command that reads data into the input_string variable fails the script will continue and will upload an empty snippet to Hastebin.

  2. If the curl command fails the awk command to parse out the Hastebin URL will still be executed. The awk command will not print out the snippet URL since there is no response body, and it might print out some other URL instead.

  3. If the HASTEBIN_URL environment variable is not set all of these commands will still be executed without it and will fail due to the empty variable.

Improved Code

This simple script can be greatly improved by setting some Bash flags. The improved code is shown below.

# Unofficial Bash "strict mode"
# http://redsymbol.net/articles/unofficial-bash-strict-mode/
set -euo pipefail
#ORIGINAL_IFS=$IFS
IFS=$'\t\n' # Stricter IFS settings

input_string=$(cat);
curl -X POST -s -d "$input_string" $HASTEBIN_URL/documents -v \
  | awk -v HASTEBIN_URL=$HASTEBIN_URL -F '"' '{print HASTEBIN_URL"/"$4}';

Before executing the commands, we set -e (errexit) to tell the script to exit when a non-zero (unsuccessful) exit code is returned by any command. This fixes the first issue pointed out above and, combined with pipefail, the second. We also set the -u flag so that any time an expression tries to use an unset variable the script exits with an error. The third flag we use is -o pipefail, which tells Bash to use the exit code of the last command that failed as the exit code for the whole pipeline, instead of always using the exit code of the last command in the pipeline. This prevents a pipeline from returning a successful status code when a command in the pipeline fails. The script now exits if HASTEBIN_URL is not set and prints a readable error message, so the user knows what the script expects to be set by the environment.
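The effect of each flag can be observed in isolation. The following is a minimal sketch, not part of the original script: every snippet runs in its own child shell so that a failure there cannot end the demo itself, and DEMO_UNSET_VARIABLE is a hypothetical name that is assumed to be unset in the environment.

```shell
#!/usr/bin/env bash

# -e: the child script stops at the first failing command, so nothing is printed.
errexit_output=$(bash -c 'set -e; false; echo "still running"' || true)

# -u: expanding an unset variable is an error instead of an empty string.
# DEMO_UNSET_VARIABLE is a hypothetical name, assumed to be unset.
nounset_status=0
bash -c 'set -u; echo "$DEMO_UNSET_VARIABLE"' 2>/dev/null || nounset_status=$?

# Without pipefail, a pipeline reports only the status of its last command.
default_status=0
bash -c 'false | true' || default_status=$?

# With pipefail, a failure anywhere in the pipeline makes the pipeline fail.
pipefail_status=0
bash -c 'set -o pipefail; false | true' || pipefail_status=$?

echo "output after a failure with errexit: '${errexit_output}'"
echo "status when expanding an unset variable with nounset: ${nounset_status}"
echo "status of 'false | true' without pipefail: ${default_status}"
echo "status of 'false | true' with pipefail: ${pipefail_status}"
```

The first value should be empty, the second non-zero, and the last two should differ, which is the whole argument for turning pipefail on.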

Note About the Name

I don’t think Let It Crash is a good name for the philosophy Erlang has about handling faults. Yes, letting things crash is often the right thing to do, but it is a form of defensive programming (offensive programming actually) that allows software to recover to a known good state when faults are encountered. I also think defensive programming is a poor name for a design philosophy intended to improve software reliability as it seems to indicate the programmer should be defensive and proactively guard against unexpected faults. This is a topic for another blog post.

Elixir 1.9 Configuration Merging

Elixir configuration can be placed in several different files. Config is often stored in config/config.exs, and config.exs is often set up to load environment-specific configuration files like config/test.exs, config/prod.exs, and so on. And now in Elixir 1.9, runtime configuration can be placed in config/releases.exs.

Configuration Evaluation

These various configuration files are evaluated at different times. config/config.exs is evaluated at compile time. Runtime configuration in config/releases.exs is evaluated inside the release at runtime.

Configuration Merging

If we use multiple configuration files, configuration data is merged recursively. We can observe how this merging works by specifying multiple configurations in the same configuration file with the same root key and then loading the configuration values from it.

# Write this to a file named config_test.exs
import Mix.Config

config :config_test,
  param1: :some_value,
  param2: :another_value,
  nested_param: [keyword: :list]

config :config_test, :nested_param, [another: :keyword]

config :config_test,
  param1: :new_value

This file makes three config calls to set the configuration for the :config_test application. These config calls will be evaluated in the order they are defined in the file, but what will the final configuration be?

Loading the configuration using Config.Reader in the iex shell shows us the final values:

iex> Config.Reader.read!("config_test.exs")

[
  config_test: [
    param1: :new_value,
    param2: :another_value,
    nested_param: [keyword: :list, another: :keyword]
  ]
]

Config.Reader recursively merges the configurations in order until they are all combined.

The first config call defines the base configuration. The second merges another: :keyword into the existing :nested_param keyword list rather than replacing it, and the third changes the value of the :param1 key to :new_value. We can also see that the whole configuration itself is just a set of nested keyword lists.

Conclusion

Elixir merges configuration spread across multiple files in a reasonable way, and makes it easy to combine configuration from config/config.exs, config/releases.exs, and any other file containing configuration data without having to duplicate any of the keys and values.

But remember, configuration files aren’t always best. Application configuration is global and therefore prevents libraries and applications from being used with different configurations in the same release. Avoiding configuration files entirely is usually the best approach for library applications.

Another important change related to configuration is that mix new will no longer generate a config/config.exs file. Relying on configuration is undesired for most libraries and the generated config files pushed library authors in the wrong direction.

Bash Errexit Inconsistency

I’m a huge fan of the unofficial Bash strict mode. I use it for all my personal scripts as well as for other projects when I can. One part of Bash strict mode is the errexit option, which is turned on by doing set -e or set -o errexit. With errexit turned on, any expression that exits with a non-zero exit code terminates execution of the script, and the exit code of the expression becomes the exit code of the script. errexit is a big improvement, since any unexpected error ends execution of a script rather than being ignored and allowing execution of the script to continue. If you have a command that you expect to fail, you can do one of two things:

# Use the OR operator
maybe_fails || true

# Use the command as an if condition
if maybe_fails; then
  # success
fi

This ensures every command must succeed, and the only commands allowed to fail are those that are part of expressions designed to handle non-zero exit codes, like if statements. If you have a command that you expect to fail and don’t care about the failure, you are best off using the OR operator with true as the second command. If you have a command that may fail, and you want to execute code conditionally based on the exit code of the command, an if statement is the best option. But as I recently learned, there is one pitfall with the way errexit works with functions invoked from if conditions.
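When different non-zero exit codes mean different things, a variation on the OR operator can record the code so that only the expected failure is handled and everything else still ends the script. This is a rough sketch using grep, which exits with 1 when no line matches and with a larger code on a real error; the sample file and the words searched for are made up for the demo.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical sample data, created only for this demo.
sample_file=$(mktemp)
printf 'alpha\nbeta\n' > "$sample_file"

results=""
for word in beta gamma; do
  status=0
  # Record the exit code instead of discarding it with "|| true".
  grep -q -- "$word" "$sample_file" || status=$?

  case "$status" in
    0) results="${results}${word}=found " ;;
    1) results="${results}${word}=missing " ;;  # no match: expected, handled here
    *) echo "grep failed with status ${status}" >&2
       exit "$status" ;;                        # anything else: fail noisily
  esac
done

rm -f "$sample_file"
echo "$results"
```

Only the one failure the script knows how to deal with is handled; an unreadable file or a bad pattern still stops execution with grep’s own exit code.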

The Problem

errexit isn’t used when executing functions inside an if condition. A function executed normally with errexit would return the exit code of the first expression that returned a non-zero exit code, and execution of the function would stop. A function executed inside an if statement with errexit would ignore non-zero exit codes from commands invoked by the function and would continue to execute until a return or exit command is encountered. Here is an example script:

#!/usr/bin/env bash

set -e

f() {
	false
	echo "after false"
}

if f; then
  echo "f was successful"
fi

f

In this code we invoke the function f twice. Once inside the if statement and once by itself as a single expression. You might think this script will not output anything, since the first expression in the function f is false, which is a command that always returns an exit code of 1. However, the output of this script is actually:

after false
f was successful

And the exit code of this script is 1, indicating a failure. When f is executed inside an if condition it is considered successful, and both lines of the function are executed. When it is executed outside of an if statement, as we would expect, only the first line of the function is evaluated and the function returns the exit code of the false command, which is always 1.

To sum up, when using errexit, any expression that returns a non-zero exit code will halt execution of the current function and return the non-zero exit code, except for expressions in the following places:

  • The command list following until or while

  • Part of the test following if or elif

  • Preceding && or ||

  • Any expression in a pipeline except the last, unless you are using set -o pipefail

In these locations non-zero exit codes are ignored.
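These cases can be checked with a short experiment. The sketch below is only an illustration: it passes a small script to a child bash process so that it always starts outside of any context where errexit is already ignored, and it calls the same failing function f from each location in the list.

```shell
#!/usr/bin/env bash

demo_output=$(bash -c '
set -e

# The first command fails, so with errexit in effect the echo is never reached.
f() {
  false
  echo "ran past false in: $1"
}

# Test following if
if f "if condition"; then :; fi

# Command list following while
while f "while condition"; do break; done

# Preceding || and &&
f "left of ||" || true
f "left of &&" && true

# A failing command that is not last in a pipeline (pipefail is off here)
false | true
echo "continued after: false | true"

echo "script reached the end"
')

echo "$demo_output"
```

Each call to f prints its "ran past false" line, which shows that the failing false inside the function was ignored in all four locations, and the script still reaches its last line.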

The Solution

Note

I originally had a solution here that was flat out wrong. I originally stated:

The solution to this is to simply never invoke functions from within if statement conditionals. Instead the function must be executed before the if statement and the return code must be captured in a variable.

Here is the same if statement above modified to use this approach:

# Execute command and capture exit code, regardless of whether the command succeeded or not
f && exit_code=$? || exit_code=$?

# Use the exit code variable as the condition for the if statement.
if [ "$exit_code" = 0 ]; then
  echo "f was successful"
fi

This code does not work because I was using && when executing the function, so it was invoked without errexit set. Thanks to Dr. Björn Kahlert for pointing this out and providing me with a complete list of locations where errexit is disabled.

The real solution here is not quite as elegant as I had hoped. The solution is to define a function that disables errexit, runs a subshell, enables errexit inside the subshell, executes the provided function, captures the exit code of the subshell, and then turns errexit back on before returning. The code looks like this:

get_exit_code() {
  # We first disable errexit in the current shell
  set +e
  (
    # Then we set it again inside a subshell
    set -e;
    # ...and run the function
    "$@"
  )
  exit_code=$?
  # And finally turn errexit back on in the current shell
  set -e
}

exit_code=0
get_exit_code f

if [ "$exit_code" = 0 ]; then
  echo "f was successful"
else
  echo "f failed"
fi

While this is not very elegant it is easy to use in practice. Since the function is executed outside of the if statement it will always be executed with errexit set as expected. Output from the original function can still be captured and used if desired.
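One way to capture the output as well is sketched below. It is a hypothetical variant of get_exit_code, not code from the original solution: the name get_output_and_exit_code and the output variable are assumptions, and a command substitution takes the place of the explicit subshell.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical variant of get_exit_code that also stores standard output.
get_output_and_exit_code() {
  # Disable errexit so a failing command substitution does not end the script
  set +e
  # Re-enable errexit inside the command substitution, so the function being
  # run stops at its first failing command
  output=$(set -e; "$@")
  exit_code=$?
  # Turn errexit back on in the current shell
  set -e
}

f() {
  echo "before false"
  false
  echo "after false"
}

output=""
exit_code=0
get_output_and_exit_code f

echo "output: ${output}"
echo "exit code: ${exit_code}"
```

Because f stops at false, only the first line of its output is captured and the recorded exit code is 1. As with get_exit_code, the helper must be called as a plain command and not from inside an if condition or next to && or ||.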

Update 8/6/2022

Olivier wrote in and shared his own solution to this problem:

#!/usr/bin/env bash

set -e

g(){
  return 42
}

f() {
  g || return
  echo "after false"
}

if f; then
  echo "f was successful"
fi

This works about the same as my solution above with two significant differences:

  • It doesn’t run code in a subshell so a separate process is not spawned. It may therefore be slightly faster.

  • It doesn’t capture the exit code to a variable. The if statement will only be able to indicate success or failure unless you capture and compare the exit code yourself inside the if condition with something like if (f; [ $? -eq 42 ]); then.