Python marshal module fuzzing

The marshal module provides a serialization mechanism for Python values. In other words, the module contains functions for writing and reading Python objects in a binary format. Unfortunately, the format is undocumented, and Python maintainers may change it in backward-incompatible ways between Python versions. The marshal module is used internally by other Python components, for example, for reading and writing .pyc files, which contain pseudo-compiled Python code. But Python also has a public API to access this serialization mechanism.
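
As a quick illustration of that public API, the sketch below round-trips a value through marshal.dumps() and marshal.loads(). The sample value is made up for this example; it is not the data structure used for fuzzing later in the post.

```python
import marshal

# An arbitrary example value built from types that marshal supports.
value = ("text", 42, 3.14, [1, 2, 3], {"key": b"bytes"})

# Serialize to the binary marshal format, then read it back.
data = marshal.dumps(value)
restored = marshal.loads(data)

print(type(data).__name__, len(data))  # the serialized form is a bytes object
print(restored == value)               # the round trip preserves the value
```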

This post shows how the marshal module can be quickly tested with a simple dumb fuzzer, and why the module shouldn’t be used with untrusted data.

The marshal module is implemented in C, so the simplest goal of fuzzing here is just to look for typical issues in C code like buffer overflows, use-after-free, null-pointer dereferences, etc. AddressSanitizer (ASan) is a great memory checker which can help with identifying such issues. AddressSanitizer instruments code during compilation. The tool replaces the malloc and free functions, and adds checks for memory corruption issues. Then, at runtime, it tries to detect memory corruptions, and reports them immediately with lots of useful information. AddressSanitizer has been part of GCC since version 4.8, and GCC can be used to build Python.

Building Python with AddressSanitizer

Python code (CPython) can be cloned with the following command:

If you run ./configure --help, you can see that it has a --with-address-sanitizer option, which is supposed to enable AddressSanitizer. But for some reason it didn't work for me, so I just used the following commands to build Python:

Let me quickly explain what those options mean:

  1. CFLAGS, LDFLAGS and CPPFLAGS are standard environment variables that specify options for the C/C++ compiler, preprocessor and linker.
  2. -fsanitize=address enables AddressSanitizer (it has to be passed to both the compiler and the linker).
  3. -g makes GCC produce debugging information.
  4. -O0 turns off compiler optimizations (but slows down execution).
  5. -fno-omit-frame-pointer is for nicer stack traces.
  6. ASAN_OPTIONS is an environment variable that contains runtime parameters for AddressSanitizer.
  7. ASAN_OPTIONS="detect_leaks=0" turns off the memory leak checker, which is part of AddressSanitizer.
  8. --prefix specifies the directory where the build puts output binaries, libs, etc.
  9. --disable-ipv6 disables IPv6 (nothing surprising).
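
To make the list more concrete, here is a small Python sketch that assembles the options above into shell command lines and prints them without running anything. This is only one plausible combination, not necessarily the exact commands used for this post: the install directory /tmp/python-asan, the split of flags between the variables, and the as_shell_line() helper are all assumptions made for illustration.

```python
import shlex

# Hypothetical install location; any writable directory would do.
prefix = "/tmp/python-asan"

# Compiler flags taken from the list above.
sanitizer_flags = "-fsanitize=address -g -O0 -fno-omit-frame-pointer"

# Assumption: the same flags go to CFLAGS and CPPFLAGS, and the linker
# only needs -fsanitize=address.
env = {
    "CFLAGS": sanitizer_flags,
    "CPPFLAGS": sanitizer_flags,
    "LDFLAGS": "-fsanitize=address",
    "ASAN_OPTIONS": "detect_leaks=0",
}

configure = ["./configure", f"--prefix={prefix}", "--disable-ipv6"]


def as_shell_line(variables, argv):
    """Render VAR=value assignments followed by a command, shell-quoted."""
    assignments = [f"{name}={shlex.quote(val)}" for name, val in variables.items()]
    return " ".join(assignments + [shlex.quote(arg) for arg in argv])


configure_line = as_shell_line(env, configure)
make_line = as_shell_line({"ASAN_OPTIONS": env["ASAN_OPTIONS"]}, ["make"])

print(configure_line)
print(make_line)
```

ASAN_OPTIONS is kept for the make step as well because the build runs the freshly compiled interpreter, which is already instrumented at that point.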

If the build runs smoothly, you can run python3.6 --version as a smoke test.

Fuzzing Python marshal module

There are a lot of fuzzers. You can choose a simple dumb fuzzer like zzuf, or use something more intelligent like American Fuzzy Lop (AFL). Or you can always reinvent the wheel: here is a simple dumb fuzzer for the marshal module written in Python:

https://github.com/artem-smotrakov/python-marshal-fuzzer

In general, this fuzzer is very similar to zzuf. Here are a few words about how it works.

The DumbByteArrayFuzzer class is a simple dumb fuzzer for a byte array. It takes a byte array and randomly modifies it according to its initial settings:

  1. data is the original byte array to fuzz.
  2. The seed parameter specifies a seed for the pseudo-random generator.
  3. The min_ratio and max_ratio parameters specify the minimum and maximum fraction of the byte array to be fuzzed.
  4. DumbByteArrayFuzzer generates reproducible test cases, and the start_test parameter specifies the test case to start from.
  5. ignored_bytes specifies symbols that should be ignored while fuzzing.
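
The parameters above can be sketched as a small class. This is an illustrative stand-in, not the DumbByteArrayFuzzer from the repository: the class name SketchByteFuzzer, the default values, the way the seed and the test number are combined, and the reading of ignored_bytes as byte values that are left untouched are all assumptions.

```python
import random


class SketchByteFuzzer:
    """Illustrative stand-in, not the DumbByteArrayFuzzer from the repository."""

    def __init__(self, data, seed=1, min_ratio=0.01, max_ratio=0.05,
                 start_test=0, ignored_bytes=()):
        self.data = bytes(data)
        self.seed = seed
        self.min_ratio = min_ratio
        self.max_ratio = max_ratio
        self.test = start_test
        # Assumption: ignored_bytes lists byte values that must not be changed.
        self.ignored_bytes = frozenset(ignored_bytes)

    def next(self):
        """Return the next fuzzed copy of the original data."""
        # Derive the generator state from (seed, test number) so that any
        # test case can be regenerated without replaying the earlier ones.
        rng = random.Random(self.seed * 1_000_003 + self.test)
        self.test += 1
        fuzzed = bytearray(self.data)
        if not fuzzed:
            return b""
        ratio = rng.uniform(self.min_ratio, self.max_ratio)
        count = max(1, int(len(fuzzed) * ratio))
        for _ in range(count):
            pos = rng.randrange(len(fuzzed))
            if fuzzed[pos] in self.ignored_bytes:
                continue
            fuzzed[pos] = rng.randrange(256)
        return bytes(fuzzed)


original = bytes(range(32))
first = SketchByteFuzzer(original, seed=7).next()
again = SketchByteFuzzer(original, seed=7).next()
print(first == again)               # same seed and test number, same bytes
print(len(first) == len(original))  # bytes are overwritten, never added or removed
```

Deriving the generator state from the test number is what would make start_test useful in such a design: a crash found at test case N can be reproduced by starting directly at N.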

First, fuzzer.py parses command line options.

Next, it defines a value object, which is then serialized by the marshal.dumps() function.

At the end of fuzzer.py, it creates an instance of DumbByteArrayFuzzer, and starts the main fuzzing loop.

In the loop, it calls the next() method to generate fuzzed data, which is passed to marshal.loads().

According to the documentation, the following exceptions are expected: EOFError, ValueError, TypeError. The fuzzer just ignores them.
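
Handling those expected exceptions might look like the sketch below. The helper name try_loads() and the sample inputs are assumptions made for illustration; this is not the code from fuzzer.py.

```python
import marshal

# Exceptions that marshal.loads() is documented to raise on invalid input.
EXPECTED = (EOFError, ValueError, TypeError)


def try_loads(data):
    """Return the name of the expected exception raised, or None on success."""
    try:
        marshal.loads(data)
    except EXPECTED as exc:
        return type(exc).__name__
    return None


valid = marshal.dumps(("abc", 1, 2.0))
print(try_loads(valid))  # well-formed input loads without an exception
print(try_loads(b""))    # empty input: there is nothing to read
```

A segmentation fault is not a Python exception, so it cannot be caught this way. It terminates the process, and that is exactly the signal the fuzzer is looking for.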

The fuzzer can be run with default parameters using a command like the following (no checks for memory leaks):

Segmentation fault in the marshal module

After some time, AddressSanitizer reported the following problem:

Here is the original data structure which was used for fuzzing:

The fuzzer modified it as follows:

  1. First, it changed the type of the int2 item to TYPE_SET.
  2. As a result, the int3 item became the length of the set.
  3. Then, it changed the float3 item to a TYPE_REF which points to the tuple1 item.

In other words, it is now a recursive tuple. Here is what happened when marshal.loads() tried to deserialize this fuzzed data:

int2 item is now a set of length 3.

First, it adds the int4 item to the set.

Next, it adds tuple2 item:

  1. When an object is added to a set, the set calculates a hash of this object.
  2. When Python calculates a hash of a tuple, it calculates hashes of all items in this tuple.
  3. While calculating a hash of tuple2, it calculates a hash of tuple1 because float3 is now a TYPE_REF item which points to tuple1.
  4. But tuple1 is not complete yet. The length of tuple1 is 4, but only string1 has been added to it so far.
  5. The tuplehash() function reads the length of a tuple, and then calls the PyObject_Hash() function for each item of the tuple.
  6. But it doesn’t check whether the tuple is complete, that is, whether all elements have been added to it.
  7. As a result, a null-pointer dereference happens when tuplehash() processes the second item of tuple1.

See https://hg.python.org/cpython/file/tip/Objects/tupleobject.c#l347 for details:

For tuple1, Py_SIZE(v) returns 4, but tuple1 contains only one element, string1. A null-pointer dereference happens in PyObject_Hash() while reading the second element.
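
The first two steps of that chain can be observed from pure Python. In the sketch below, the Traced class is a hypothetical helper that records every hash request; adding a nested tuple to a set triggers hashing of every item, including the items of the inner tuple. An incomplete tuple cannot be built from Python code, so the crash itself is not reproduced here.

```python
class Traced:
    """Hypothetical helper that records when its hash is requested."""

    calls = []

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        Traced.calls.append(self.name)
        return hash(self.name)

    def __eq__(self, other):
        return isinstance(other, Traced) and self.name == other.name


inner = (Traced("a"), Traced("b"))
outer = (Traced("c"), inner)

s = set()
s.add(outer)         # hashing outer hashes c, then inner, which hashes a and b
print(Traced.calls)  # ['c', 'a', 'b']
```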

Even though it doesn’t seem to be a serious security issue, the problem was first reported to the Python Security Response Team. They said they don’t consider crashes due to malicious marshal data to be security bugs. And the documentation for the marshal module has a note about it:

Warning: The marshal module is not intended to be secure against erroneous or maliciously constructed data. Never unmarshal data received from an untrusted or unauthenticated source.

Then the problem was reported to the Python maintainers, but they decided not to fix it, probably because of performance.

Conclusion

As the documentation for the marshal module states, it should not be used for unmarshaling data received from an untrusted party because the module is not intended to be secure against malicious data. Furthermore, some issues (like the one above) are not going to be fixed even if they are known.

The interesting thing is that at the time of posting this article I found 59601 usages of the marshal.loads() function on GitHub. I hope they know what they are doing.

Originally published at https://blog.gypsyengineer.com on September 21, 2016.
