Dealing with memory leaks

Background

This post describes my experience playing around with Scala and some of its concepts like tail recursion and futures. It also offers an opportunity to see how a memory leak can happen and how to analyse it.

I created a simple web crawler that takes a seed link (e.g. http://www.bbc.co.uk) and then, recursively (using tail recursion), collects all the links from the initial webpage and the successive webpages. As the algorithm uses a breadth-first approach, all the links belonging to the same level are collected before moving to the next level. The number of levels to crawl is established by a parameter called ‘depthLimit’.

Every link is saved to a MongoDB collection along with its child links. Reading and parsing of HTML pages is done with the Java library jsoup.

Briefly, for each of the links in the same level:

  1. all its child links are retrieved
  2. parent link and child links are saved into MongoDB
  3. child links are added to a Collection containing the list of all next level links

When all the links in the current level are processed, the application moves one level further and processes the contents of the Collection containing the next level links.
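The following sketch illustrates that level-by-level loop with tail recursion. It is only a minimal illustration: the helpers collectChildLinks and save are hypothetical stand-ins for the jsoup read and the MongoDB write, not the actual code in the repository.

import scala.annotation.tailrec

object CrawlerSketch {

  // Hypothetical stand-ins for the jsoup read and the MongoDB write
  def collectChildLinks(link: String): Set[String] = Set.empty
  def save(link: String, children: Set[String]): Unit = ()

  // Breadth-first, level-by-level loop expressed with tail recursion
  @tailrec
  def crawl(currentLevel: Set[String], depth: Int, depthLimit: Int): Unit =
    if (depth < depthLimit && currentLevel.nonEmpty) {
      val nextLevel = currentLevel.flatMap { link =>
        val children = collectChildLinks(link) // 1. retrieve child links
        save(link, children)                   // 2. persist parent and children
        children                               // 3. accumulate next level links
      }
      crawl(nextLevel, depth + 1, depthLimit)
    }
}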

Blocking implementation

The first approach (https://github.com/fjab76/scala-katas/tree/crawler_blocking) processes the links one by one, carrying out the three operations above sequentially for each link. Given that those operations are blocking, the overall process is very slow.

What is more, this process has a big flaw: as the program moves deeper into new levels, the number of links grows exponentially, and the Collection containing the next level links grows so much that the process runs out of memory (the heap size is about 1 GB). However, this limitation is not so evident when running the “blocking” version of the application, as it takes several hours for the system to run out of memory.

Non-blocking implementation

A more refined approach (https://github.com/fjab76/scala-katas/tree/crawler_non-blocking) makes use of Future to encapsulate blocking calls like reading webpages and writing to MongoDB. The Futures are executed by a separate ForkJoinPool given by

scala.concurrent.ExecutionContext.Implicits.global
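As a rough sketch (assuming a helper along these lines, rather than the exact code in the repository), a blocking jsoup read can be wrapped in a Future so that it runs on that pool:

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.collection.JavaConverters._
import org.jsoup.Jsoup

// The blocking jsoup call runs on the global ForkJoinPool instead of the caller's thread
def collectChildLinks(link: String): Future[Set[String]] = Future {
  Jsoup.connect(link).get()
    .select("a[href]").asScala
    .map(_.attr("abs:href"))
    .toSet
}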

Additionally, access to the Collection containing the next level links needs to be synchronised. For that, the chosen implementation is a synchronised Set.
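Given the java.util.Collections$SynchronizedSet that shows up later in the heap dump, the synchronised Set was presumably built with the JDK wrapper, roughly like this:

import java.util.Collections
import java.util.{HashSet => JHashSet}

// Thread-safe Set holding the next level links: a HashSet wrapped by Collections.synchronizedSet
val nextLevelLinks: java.util.Set[String] = Collections.synchronizedSet(new JHashSet[String]())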

The Future encapsulating the call to read child links is composed with the Futures encapsulating the operations to persist the child links to MongoDB and to add them to the next level links Collection. These last two operations run as separate Futures in a ‘fire and forget’ fashion (their results are not checked), as the process continues irrespective of their outcome.
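Combining the two previous sketches, the composition could look roughly like this, with saveToMongo as a hypothetical placeholder for the MongoDB write:

// Hypothetical placeholder for the MongoDB write
def saveToMongo(parent: String, children: Set[String]): Unit = ()

// The Future reading the child links completes first; persistence and accumulation
// then run as separate 'fire and forget' Futures whose results are never checked
def process(parentLink: String): Future[Set[String]] =
  collectChildLinks(parentLink).map { children =>
    Future(saveToMongo(parentLink, children))      // fire and forget
    Future(children.foreach(nextLevelLinks.add))   // fire and forget
    children
  }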

Setting seed=http://www.bbc.co.uk and depthLevel=6, the process runs for a little more than 1.5 hours before running out of memory.

22:19:17.730 [run-main-0] INFO  Collector - waiting for current level to complete: 0
22:19:18.336 [run-main-0] INFO  Collector - waiting for current level to complete: 1
22:19:20.275 [run-main-0] INFO  Collector - waiting for current level to complete: 2
22:20:35.082 [run-main-0] INFO  Collector - waiting for current level to complete: 3
22:43:54.195 [run-main-0] INFO  Collector - waiting for current level to complete: 4
java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOfRange(Arrays.java:3664)
 at java.lang.String.<init>(String.java:201)
 at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:567)
 at java.nio.CharBuffer.toString(CharBuffer.java:1241)
 at org.jsoup.helper.DataUtil.parseByteData(DataUtil.java:121)
 at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:603)
 at org.jsoup.helper.HttpConnection.get(HttpConnection.java:195)
 at fjab.slinkc.Collector.fjab$slinkc$Collector$collectChildLinks$1(Collector.scala:42)
 at fjab.slinkc.Collector$anonfun$collect$1$1$anonfun$1.apply(Collector.scala:72)
 at fjab.slinkc.Collector$anonfun$collect$1$1$anonfun$1.apply(Collector.scala:72)
 at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
 at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
 at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

The following Java VisualVM graph clearly shows the existence of a memory leak: heap usage grows continuously without the GC being able to free memory.

JVisualVM

Running jmap to create a heapdump

jmap -dump:format=b,file=./heapdump.$(date +%H%M%S).hprof 5797

and analysing the result with Eclipse Memory Analyzer, we obtain a report with the biggest objects by retained size

Biggest object by retained size

The Leak Suspects report contains this information:

One instance of "java.util.Collections$SynchronizedSet" loaded by "<system class loader>" occupies 618,218,160 (69.39%) bytes. The instance is referenced by fjab.slinkc.Collector$anonfun$collect$1$1 @ 0x78eb35fc0 , loaded by "sbt.classpath.ClasspathUtilities$anon$1 @ 0x78124b258". The memory is accumulated in one instance of "java.util.HashMap$Node[]" loaded by "<system class loader>".

Keywords
java.util.Collections$SynchronizedSet
java.util.HashMap$Node[]
sbt.classpath.ClasspathUtilities$anon$1 @ 0x78124b258

Shortest path to accumulation point

So one instance of java.util.Collections$SynchronizedSet occupies 618 MB. More concretely, the memory is accumulated in one instance of java.util.HashMap$Node[].

Quick reminder: the Set wrapped by the SynchronizedSet used in the program is a HashSet, which is internally backed by a HashMap.

Eclipse Memory Analyzer can also list the instances of HashMap$Node, as shown in the following graph

Instances

This graph shows the existence of more than 2 million nodes, each of them corresponding to one link.

After this analysis, the interpretation of the results is clear: the Set accumulating the next level links grows too much, consuming all available memory. Since the last log line displayed was

22:43:54.195 [run-main-0] INFO  Collector - waiting for current level to complete: 4

this means that the application runs out of memory while adding the links of level 5 to the Set.

The heap dump can also be analysed with Java VisualVM, although the results are more difficult to interpret. HashMap$Node appears in third place by number of instances

jvisualvm

Clicking on the instances to get more details confirms that they correspond to the links collected

hashmap node instances

However, when computing retained sizes, Java VisualVM itself ran out of memory, so it could not offer final evidence that the instances of HashMap$Node were hogging the system’s memory.

Finally, as a complement to the analysis carried out earlier on, a thread dump can be obtained by running

jstack 5797 > threaddump.$(date +%H%M%S).tdump

The resulting dump can be analysed with Thread Dump Analyzer, which offers information about the status of each thread. The image below shows a list of 12 threads from the ForkJoinPool running the different tasks enclosed by the Futures

TDA

Non-blocking with BlockingQueue

Instead of a SynchronizedSet, we could have used a BlockingQueue: next level links are added to the tail of the queue while threads consume elements from the head, as sketched below. This way, there would be no need for a latch to wait for all the links of the current level to be processed before moving to the next level. However, the system would still run out of memory, since for every element consumed from the queue, tens or hundreds of new elements are added to it.
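A minimal sketch of that variant, assuming an unbounded LinkedBlockingQueue and an illustrative collectChildLinks helper:

import java.util.concurrent.LinkedBlockingQueue

// Unbounded queue shared by the worker threads
val queue = new LinkedBlockingQueue[String]()
queue.put("http://www.bbc.co.uk")

// Each worker takes a link from the head and appends its children to the tail;
// no latch between levels is needed, but the queue still grows without bound
def worker(collectChildLinks: String => Set[String]): Unit =
  while (true) {
    val link = queue.take()                    // blocks until an element is available
    collectChildLinks(link).foreach(queue.put) // each consumed link typically adds many more
  }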

In any case, keeping this data in memory is not feasible, and a different approach would be needed, such as persisting the collection of next level links (also in MongoDB, or in some other data store).
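Purely as a conceptual sketch (all names hypothetical), the frontier of next level links could live behind an interface backed by a data store rather than in the heap, and be read back in bounded batches so that memory use stays roughly constant:

trait Frontier {
  def append(links: Set[String]): Unit   // e.g. insert into a MongoDB collection
  def nextBatch(size: Int): Seq[String]  // e.g. fetch and remove a bounded batch of links
  def isEmpty: Boolean
}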
