Tracing Perl memory leaks with Devel::MAT (part 2)

In the previous article, we took a first look at how to use Devel::MAT to help find the reason why a perl process kept using more and more memory. The cause last time was found to be a single string scalar that grew to a huge 1.6 GiB, consuming far more memory than the entire rest of the program put together. But of course, not all memory leaks will be quite so simple to identify, so today let's take a look at another case which will require a little more investigation to hunt down.

File Capture

This problem started off manifesting itself very similarly to the previous one; a long slow growth of memory requirements of the process at the system level, so naturally the same initial investigation took place. We capture a heap dump file from it and load it in to the main pmat shell as before:

$ pmat 2017-09-26-164515-dump-16026.pmat 
Perl memory dumpfile from perl 5.14.2
Heap contains 3225764 objects
pmat> 

At somewhat over 3 million objects, it also takes a lot longer to load the file and consumes a lot more memory on the analysis machine while doing so than the previous one. Already we can observe it's likely to be quite a different case to last time.

Initial Analysis

We start with the usual largest command as before, because that's a relatively easy start and sometimes finds something easily.

pmat> largest
HASH(0) at 0x9e5770: 3.5 MiB
ARRAY(457057,!REAL) at 0x49952e0: 3.5 MiB
ARRAY(457025) at 0x72a7140: 3.5 MiB
SCALAR(PV) at 0x1b83c30: 616.8 KiB
SCALAR(PV) at 0x1b83978: 614.9 KiB
others: 381.7 MiB

Unlike last time, nothing immediately obvious springs out of this. The largest SV in the entire heap is a hash, at a not too large 3.5 MiB. We'll start our investigation here anyway and see what turns up.

pmat> identify 0x9e5770
HASH(0) at 0x9e5770 is:
└─the shared string table HV

That's fairly typical for any large Perl program. The shared string table contains the string buffer of every hash key used in any hash on the heap, so that all keys are in fact shared between them (this saves a lot of memory in large programs where you may have thousands of object instances all using the same key names). The fact that it's the largest SV on the heap is nothing to worry about here.

(Curiously enough, it appears to contain zero keys despite being its size, but this merely indicates the heap file was captured prior to version 0.30, which fixed a bug related to this).

Onward to the next candidate:

pmat> identify 0x49952e0
ARRAY(457057,!REAL) at 0x49952e0 is:
└─the backrefs list of STASH(12) at 0x49b9300, which is:
  └─the symbol '%Binary::WebSocketAPI::Plugins::Helpers::'

Now this is curious. This array is a backrefs list to the stash of the package called Binary::WebSocketAPI::Plugins::Helpers and it contains a huge number of elements - over 450 thousand of them.

Backrefs

This is some more Perl internals terminology here. A "backrefs list" is a list relating to weak references; either those formed in regular Perl code using the Scalar::Util::weaken() function, or those built in Perl's various internal data structures. The list is attached to an object which has at least one weak reference pointing at it. Its job is to keep track of where those weak references are, so that if the referred-to SV is destroyed, those references can be reset to undef. Since the SV addresses stored in this backrefs list do not themselves contribute to the reference count of the SVs they refer to (otherwise this would partly defeat the purpose of the weakened reference in the first place), this array is marked as "not real", with the !REAL flag.

As various operations can cause weak references to be constructed pointing at entire packages, we do expect a package to have a backrefs list, but a list containing over 450 thousand of them is highly suspicious and almost certainly related to the memory leak issue at hand here. We can't yet tell exactly what the problem is but already we gain one clue.

A Pattern Emerges

One interesting thing to note already is that the element count in this backrefs list almost exactly matches the element count in the next array down in the largest list:

ARRAY(457057,!REAL) at 0x49952e0: 3.5 MiB
ARRAY(457025) at 0x72a7140: 3.5 MiB

This coïncidence of numbers suggests that they're almost certainly related, so lets take a look at that one too:

pmat> identify 0x72a7140
ARRAY(457025) at 0x72a7140 is:
└─(via RV) value {error} of HASH(3) at 0x490ea40, which is:
  └─(via RV) value {events} of HASH(8)=Mojo::Redis2 at 0x490e0b0, which is:
    └─(via RV) value {ws_redis_master} of HASH(4) at 0x3ebef40, which is:
...

This array looks to be a fairly standard user-accessible array value, which is accessible as the error key of a hash, which itself is the events key of a blessed hash (i.e. an object instance) of type Mojo::Redis2. (At this point the output does continue to identify where that is reachable from, but it's less interesting to our analysis at this point so I've truncated it.)

Let's now take an item from this array and inspect it to see what it actually is. It may be that the first few items in the array get put there before the leaking pattern emerges (e.g. during application startup), so rather than take the very first one we'll skip a few down, perhaps the 50th will be a more typical case. The elems command can be used to fetch that; asking to list a count of 1 element starting at index 50:

pmat> elems 0x72a7140 50 --count 1
  [50] REF() at 0x76b2f48
  ... (456974 more)

This element is a reference. We'll have to show it to see where it refers to. (I should probably have the show command do this by default on simple REF SVs in the next version, but for now we need two commands to do this):

pmat> show 0x76b2f48
REF() at 0x76b2f48 with refcount 1
  size 24 bytes
  RV=CODE(PP,C) at 0x7637630

pmat> show 0x7637630
CODE(PP,C) at 0x7637630 with refcount 1
  size 128 bytes
  name=&Binary::WebSocketAPI::Plugins::Helpers::__ANON__
  stash=STASH(12) at 0x49b9300
  glob=GLOB(*) at 0x49b96d8
  location=.../lib/Binary/WebSocketAPI/Plugins/Helpers.pm line 130
  scope=CODE(PP,C) at 0x4aba6a0
  padlist=PADLIST(2) at 0x7637678
  padnames=PADNAMES(19) at 0x76375e8
  pad[0]=PAD(19) at 0x76adcb0
Referenced globs:
  GLOB(@I) at 0x9e58f0, 

So, this array of 450 thousand SVs appears to contain CODE references. The SV type is CODE(PP,C), which identifies it as a CODE reference whose body is pure perl (PP), and is a cloned closure (C) - the name ending in __ANON__ also suggests that it's an anonymous closure. It was created in the package Binary::WebSocketAPI::Plugins::Helpers, and we get the file and line number of the source code where it appears.

This confirms our theory earlier relating to the backrefs list - each of these closures will refer weakly to the package, so it must appear in the backrefs list of that package. As a further check we can just try identifying this CODE reference, and including weak references in the output:

pmat> identify 0x7637630 -weak -depth 3
CODE(PP,C) at 0x7637630 is:
├─(via RV) element [50] of ARRAY(457025) at 0x72a7140, which is:
│ └─(via RV) value {error} of HASH(3) at 0x490ea40, which is:
│   └─(via RV) value {events} of HASH(8)=Mojo::Redis2 at 0x490e0b0, which is:
│     └─not found at this depth
└─[weak] element [81] of ARRAY(457057,!REAL) at 0x49952e0, which is:
  └─the backrefs list of STASH(12) at 0x49b9300, which is:
    └─the symbol '%Binary::WebSocketAPI::Plugins::Helpers::'

As we suspected - the CODE SV is weakly referenced by the backrefs list which we identified earlier.

We do now have a fairly firm idea that the long accumulation of so many of these CODE references is the cause of our memory leak, so we can now begin to take a look at how to fix it.

Fixing The Problem

Now that we know the file and line number as well we can go on to take a look in the source code where the closure is generated and see if we can get an idea from there as to what the closure is and how to fix the issue:

128                 $redis->on(
129                     error => sub {
130                         my ($self, $err) = @_;
131                         warn("Redis $redis_name error: $err");
132                         $app->stat->{redis_errors}++;
133                     });

This code does indeed appear to be adding an anonymous closure into the error list of a Redis-related instance, as our earlier identification would indicate. (Line 130 itself is actually the first line of code inside the closure, rather than the location of the sub itself, simply because that's the only location in the source code that is reliably available to the Devel::MAT::Dumper code at the time it dumps the information out.)

A wider look at the surrounding source code discovers that this error handler is being added to a shared singleton instance of Mojo::Redis2 every time it is obtained for use by an incoming connection, and that no code exists to actually remove it again. As can be expected, this long accumulation of error handlers with nothing removing them lead to the list containing 450 thousand of the things, which inflated the process's memory requirements and lead to our being aware of the problem in the first place.

Actually fixing the problem involved a fairly simple removal of these few lines of code, because it turned out not to be required at the individual-connection level; any error handling in the Redis instance is better established just once when the singleton is created. With this code gone, the long accretion of all these CODE references disappeared, and the process has been better-behaved ever since.

Postscript - Some Screenshots

It just remains for me to add that the text blocks above were copied from running the pmat tool, and I decided to keep them in plain text form so that the text itself is indexable by search engines and the like. While useful, that does remove the pretty formatting output that the tool creates which, I think at least, helps with reading them.

I captured screenshots of a couple of them running in the terminal, so here they are again this time in all their pretty coloured form:

STORIES-2017-09-26-fig1
STORIES-2017-09-26-fig2
Note: This post has been ported from https://tech.binary.com/ (our old tech blog). Author of this post is https://metacpan.org/author/PEVANS