Tracing Perl memory leaks with Devel::MAT (part 1)
The Devel::MAT ecosystem allows developers of perl programs to inspect and analyse memory-related problems such as memory leaks, unexpected memory consumption, or odd state.
So says the first paragraph of the user guide, on CPAN at Devel::MAT::UserGuide. The user guide then goes on to give a brief overview of how to use the tool and some initial explorations. But this brief explanation can be hard to grasp at first without a real problem in a real piece of code to serve as a concrete example and really demonstrate how it all works. Recently we've started using these tools to find and fix a few memory-related bugs, so here's a somewhat more concrete introduction and overview with some real details in it.
The Initial Problem
We have several long-running asynchronous processes. One problem we experience occasionally is the presence of unintentional reference cycles in callbacks, leading to a gradual increase in memory use over time. Once the process reaches a predefined limit it is restarted, but if that single process is handling multiple client connections, there is sometimes no convenient time to do a graceful restart.
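As an illustration (this is not code from our application; the class and method names are invented for the example), such a cycle typically arises when an object stores a callback and the callback closes over the object itself, so neither can ever be freed:

package Worker;
use strict;
use warnings;
use Scalar::Util qw(weaken);

sub new { return bless { on_done => undef }, shift }

sub start {
  my ($self) = @_;

  # LEAK: the anonymous sub captures $self, while $self holds the sub,
  # forming a reference cycle that keeps the object alive forever.
  $self->{on_done} = sub { $self->finish };

  # One common fix is to capture a weakened copy instead:
  #   weaken(my $weak = $self);
  #   $self->{on_done} = sub { $weak->finish if $weak };
}

sub finish { }

1;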
There are modules for testing reference-count handling in isolation; however, sometimes issues only become apparent as part of the larger application. This is where Devel::MAT comes in, since it can be used to view the memory usage across the entire process.
As the user guide explains:
The basic workflow consists of two main stages: first a heap dump file is generated from the perl process being debugged, at or around the time that the problem becomes apparent, and secondly this file is loaded by an analysis tool in order to inspect the contents.
These two stages are described here separately. It is important to note that they don't have to be done at the same time, on the same perl process, or even on the same sort of machine. It is fully possible to capture the heap from a process on, say, a small server, then copy the file to a development workstation or laptop and analyse it there at a later time.
In our case, this is exactly what happened. Not only were the two parts of the process performed on two different machines at two different times, but even by two different people.
Capturing Memory State
We used a hook that allows a memory capture from a running process via a signal handler - we'll describe this part of the process in more detail in a separate post. The specifics of how to perform this task will vary a lot depending on what kind of program is being debugged, and how the problem tends to manifest itself. Suffice it to say for now that this part of the process happened, and we got our heap dump in a file called 2017-02-22-124523-dump-4056.pmat on our development workstation, so we can start to analyse it.
(It is traditional to name these files with the filename extension .pmat, after the name of the main analysis tool, pmat, which stands for the Perl Memory Analysis Tool.)
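As a taste of what such a hook can look like (a minimal sketch, not the exact code we used; the choice of SIGUSR2 and the filename pattern are assumptions for the example), Devel::MAT::Dumper provides a dump function that writes the heap of the running process to a file:

use POSIX qw(strftime);
use Devel::MAT::Dumper;

# On SIGUSR2, write a heap dump named after the current time and PID,
# e.g. 2017-02-22-124523-dump-4056.pmat
$SIG{USR2} = sub {
  my $file = strftime("%Y-%m-%d-%H%M%S", localtime) . "-dump-$$.pmat";
  Devel::MAT::Dumper::dump($file);
  warn "Wrote heap dump to $file\n";
};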
Loading The Toolchain
The analysis of this heap dump file is performed by the pmat program, which acts as an interactive command-based shell around the internal tools and plugins of the Devel::MAT toolchain. The first thing to do is start up that shell by loading the file into it. This part can be a little slow, especially on large files, so the shell helpfully gives us some progress indication while it loads, eventually printing an initial summary of the loaded file and the pmat> prompt, indicating that it is ready for us to start typing commands.
$ pmat 2017-02-22-124523-dump-4056.pmat
Loading file 25925580 of 54130727 (47.89%)
$ pmat 2017-02-22-124523-dump-4056.pmat
Perl memory dumpfile from perl 5.14.2
Heap contains 463354 objects
pmat>
Firstly, we can observe that we have a capture from a perl 5.14.2 process, which matches what's running on the production servers, so that's a good start. Secondly, the entire process contained barely half a million SVs - a number that isn't all that large when you consider that it accounts for every piece of code and every piece of data in memory at the time. It's typical for large, long-running perl programs to reach several million SVs, so this particular program isn't that big in terms of raw object count.
Initial Analysis
First off, a brief aside on terminology: the Devel::MAT tooling uses a number of "perl internals" terms and phrases throughout, and some familiarity with these is useful. In this case, the word "SV" refers to an individual atomic chunk of memory representing a value. Every string or number is stored in an SV, but so too is a hash or array structure, which refers to other SVs as its elements. Any function, either named or anonymous, is represented by another kind of SV, as are the elements of the symbol table which perl uses to keep track of all the globally-named symbols - loaded modules, and the functions and other variables defined by them. Finally, there are several kinds of internal data structure which we will discover later on. These are all different kinds of "SV".
Despite the small SV count, the running program had consumed a lot of system-level memory, so something is amiss here. A fairly typical course of analysis for this kind of issue is to start with a list of the largest individual SVs. This is obtained by the largest command, written for exactly this purpose:
pmat> largest
SCALAR(PV) at 0x6a47708: 1.6 GiB
SCALAR(PV) at 0x1a59488: 4.0 MiB
HASH(0) at 0xfb4770: 1.5 MiB
SCALAR(PV) at 0x71b6468: 707.3 KiB
SCALAR(PV) at 0x71be2f0: 609.6 KiB
others: 46.2 MiB
pmat>
Here we have obtained a list of the five largest SVs in the heap by individual size. Each one is printed with a brief description of its overall type, its memory address, and its size.
Already this is an interesting result. The largest SV in the entire heap (the first to be listed) consumes a massive 1.6 gibibytes of memory. The next one down is a comparatively-tiny 4 mebibytes, and the rest of the heap in total sums to barely 50 mebibytes. We can inspect this SV in more detail using the show command on its address:
pmat> show 0x6a47708
SCALAR(PV) at 0x6a47708 with refcount 1
size 1.6 GiB (1746980608 bytes)
PV=
PVLEN 0
This SV is a non-referential scalar value (the SCALAR part), which stores a string value (the PV part - more perl internals terminology for, basically, a string or "pointer value"). While the total byte size of this SV is 1.6 gibibytes, the SV currently contains no bytes at all of string content. This may sound unusual at first - why would perl need an SV this big just to store an empty string? However, several usage patterns can cause this. Most likely, this is a string value that at one point needed to store that much data, but has since been emptied of its contents, perhaps by the substr or s/// operators, which will consume parts of the string buffer but won't on their own reallocate the memory any smaller.
Whatever the history, it's safe to say at this point that this topmost SV is responsible for all of the memory growth. But how can we find out what it is?
Identifying An SV
Simply having this candidate SV isn't itself very useful, because at the moment we have no clue as to where in the program it lives or what it's doing. For this we need to apply some more analysis, this time in the form of the identify tool. This tool walks reference chains backwards, recursively finding what refers to this SV, then what refers to those referrers, until it reaches a well-defined root: an anchor point in the program that can be identified by name.
(In real analysis this tool would be applied all in one go, but it will perhaps help to illustrate how it works by applying it in several small stages, showing how the reference graph is built up over time.)
We start by asking to identify just the immediate referrers of this SV; that is, those at a depth of 1:
pmat> identify 0x6a47708 -depth 1
SCALAR(PV) at 0x6a47708 is:
└─value {buffer} of HASH(5)=Mojo::IOLoop::Stream at 0x704b9e8, which is:
└─not found at this depth
OK, so our string SV is the element under the key buffer of some hash (the HASH part); that hash has five keys in total and is blessed into the class Mojo::IOLoop::Stream. Already this seems to confirm our earlier thoughts. Onwards to find out what this hash is:
pmat> identify 0x6a47708 -depth 2
SCALAR(PV) at 0x6a47708 is:
└─value {buffer} of HASH(5)=Mojo::IOLoop::Stream at 0x704b9e8, which is:
└─(via RV) value {stream} of HASH(1) at 0x706aea0, which is:
└─not found at this depth
More terminology here - we've found that the hash is referred to by another hash (this new one having only one key) via an RV, or "reference value". Remember that in Perl a hash cannot directly contain another hash, only a reference to one; in the heap that reference is represented by a value of RV type. Because these occur so often in perl heaps, the identify tool will "elide" the value; that is, skip over the RV itself to find what referred to it, as the RV itself is almost never interesting in its own right. This helps to cut down on some of the noise in the output.
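In plain Perl terms (the variable names here are invented for the example), the chain found so far corresponds to a structure like this:

my $huge_string = "";                          # the big SCALAR(PV) we found
my %stream      = ( buffer => $huge_string );  # the HASH(5)=Mojo::IOLoop::Stream
my %holder      = ( stream => \%stream );      # the "\" creates the RV that
                                               # identify elides in its output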
Now let's run the tool to a larger depth for real - 10 is usually a good bet to start with - and try to find out what our giant SV actually is:
pmat> identify 0x6a47708 -depth 10
SCALAR(PV) at 0x6a47708 is:
└─value {buffer} of HASH(5)=Mojo::IOLoop::Stream at 0x704b9e8, which is:
└─(via RV) value {stream} of HASH(1) at 0x706aea0, which is:
└─(via RV) value {72e32ab0dd36560e49a464b5249860d1} of HASH(14) at 0x4772288, which is:
└─(via RV) value {out} of HASH(8)=Mojo::IOLoop at 0x1aa0240, which is:
├─(via RV) the lexical $loop at depth 1 of CODE(PP,C) at 0x4778088, which is:
│ └─(via RV) value {QUIT} of HASH(70) at 0x1012968, which is:
│ └─the symbol '%main::SIG'
├─(via RV) the lexical $loop at depth 1 of CODE(PP) at 0x1fb3f30, which is:
│ └─the symbol '&Mojo::Server::Prefork::_spawn'
├─(via RV) the lexical $loop at depth 1 of CODE(PP) at 0x1a8c940, which is:
│ └─the symbol '&Mojo::IOLoop::singleton'
├─(via RV) the lexical $self at depth 1 of CODE(PP) at 0x1a8ce50, which is:
│ └─the symbol '&Mojo::IOLoop::start'
└─(via RV) value {ioloop} of HASH(20)=Mojo::Server::Prefork at 0x1fcbb00, which is:
├─(via RV) the lexical $prefork at depth 1 of CODE(PP) at 0x11eb648, which is:
│ └─the symbol '&Mojo::Server::Hypnotoad::run'
├─(via RV) the lexical $self at depth 1 of CODE(PP) at 0x1273308, which is:
│ └─the symbol '&Mojo::Server::Prefork::run'
├─(via RV) the lexical $self at depth 1 of CODE(PP) at 0x1fa99a8, which is:
│ └─the symbol '&Mojo::Server::Prefork::_manage'
└─(via RV) value {prefork} of HASH(1)=Mojo::Server::Hypnotoad at 0xfb4998, which is:
└─not found
This tells us that we have a stream attached to our main event loop, and when we examine the Mojo::IOLoop::Stream code to see what {buffer} is, we see that it is the write buffer:
sub write {
  my ($self, $chunk, $cb) = @_;

  # IO::Socket::SSL will corrupt data with the wrong internal representation
  utf8::downgrade $chunk;
  $self->{buffer} .= $chunk;
  if ($cb) { $self->once(drain => $cb) }
  elsif (!length $self->{buffer}) { return $self }

  $self->reactor->watch($self->{handle}, !$self->{paused}, 1)
    if $self->{handle};

  return $self;
}
This is progress, but we still don't know what this stream relates to. We can trace this further using additional tools - for example, Devel::SizeMe provides a total_size function that reports the actual memory taken up by a scalar. This can be used to log a stack trace whenever the write buffer grows beyond a reasonable expected size.
In the next part, we'll cover a different scenario, where there are many smaller scalars that are not being released.
Note: This post has been ported from https://tech.binary.com/ (our old tech blog). The author of this post is PEVANS (https://metacpan.org/author/PEVANS).