This post describes my experience playing around with Scala and some of its concepts like tail recursion and futures. It also offers an opportunity to see how a memory leak can happen and how to analyse it.
I created a simple web crawler that takes a seed link (e.g. http://www.bbc.co.uk) and then, recursively (using tail recursion), collects all the links from the initial webpage and the successive webpages. As the algorithm uses a breadth-first approach, all the links belonging to the same level are collected before moving to the next level. The number of levels to crawl is established by a parameter called ‘depthLimit‘.
Every link is saved on a MongoDB collection along with its child links. Reading and parsing of HTML pages is done by the Java library jsoup.
Briefly, for each of the links in the same level:
- all its child links are retrieved
- parent link and child links are saved into MongoDB
- child links are added to a Collection containing the list of all next level links
When all the links in the current level are processed, the application moves one level further and processes the contents of the Collection containing the next level links.
The first approach (https://github.com/fjab76/scala-katas/tree/crawler_blocking) processes the links sequentially and carries out all previous three operations sequentially for each link. Given that those operations are blocking, the overall process is very slow.
What is more, this process has a big flaw, namely, as the program moves deeper into new levels, the number of links grows exponentially and the Collection containing the list of all next level links grows so much that the process runs out of memory (the heap size is about 1Gb). However, this limitation is not so evident when running the “blocking” version of the application as it takes several hours for the system to run out of memory.
A more refined approach (https://github.com/fjab76/scala-katas/tree/crawler_non-blocking) makes use of Future to encapsulate blocking calls like reading webpages and writing to MongoDB. The Futures are executed by a separate ForkJoinPool given by
Additionally, the access to the Collection containing the list of all next level links needs to be synchronised. For that, the chosen implementation is a synchronised Set.
The Future encapsulating the call to read child links is composed with the Futures encapsulating the operation to persist child links on MongoDB and the operation to add child links to the next level links Collection. These two last operations run on separate Futures in a ‘fire and forget‘ fashion (the result of the operations is not checked) as the process goes on irrespective of the result of said operations.
Setting seed=http://www.bbc.co.uk and depthLevel=6, the process runs for a little more of 1.5h before running out of memory.
22:19:17.730 [run-main-0] INFO Collector - waiting for current level to complete: 0 22:19:18.336 [run-main-0] INFO Collector - waiting for current level to complete: 1 22:19:20.275 [run-main-0] INFO Collector - waiting for current level to complete: 2 22:20:35.082 [run-main-0] INFO Collector - waiting for current level to complete: 3 22:43:54.195 [run-main-0] INFO Collector - waiting for current level to complete: 4
java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOfRange(Arrays.java:3664) at java.lang.String.<init>(String.java:201) at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:567) at java.nio.CharBuffer.toString(CharBuffer.java:1241) at org.jsoup.helper.DataUtil.parseByteData(DataUtil.java:121) at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:603) at org.jsoup.helper.HttpConnection.get(HttpConnection.java:195) at fjab.slinkc.Collector.fjab$slinkc$Collector$$collectChildLinks$1(Collector.scala:42) at fjab.slinkc.Collector$$anonfun$collect$1$1$$anonfun$1.apply(Collector.scala:72) at fjab.slinkc.Collector$$anonfun$collect$1$1$$anonfun$1.apply(Collector.scala:72) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Next Java VisualVM graph clearly shows the existence of a Java leak: heap memory grows continuously without the GC being able to free memory.
Running jmap to create a heapdump
jmap -dump:format=b,file=./heapdump.$(date +%H%M%S).hprof 5797
and analysing the result with Eclipse Memory Analyzer, we obtain a report with the biggest objects by retained size
The Leak Suspects report contains this information
One instance of "java.util.Collections$SynchronizedSet" loaded by "<system class loader>" occupies 618,218,160 (69.39%) bytes. The instance is referenced by fjab.slinkc.Collector$$anonfun$collect$1$1 @ 0x78eb35fc0 , loaded by "sbt.classpath.ClasspathUtilities$$anon$1 @ 0x78124b258". The memory is accumulated in one instance of "java.util.HashMap$Node" loaded by "<system class loader>". Keywords java.util.Collections$SynchronizedSet java.util.HashMap$Node sbt.classpath.ClasspathUtilities$$anon$1 @ 0x78124b258
So one instance of java.util.Collections$SynchronizedSet occupies 618Mb. More concretely, it is one instance of java.util.HashMap$Node.
Quick reminder: the implementation of SynchronizedSet used in the program is HashSet, which is internally implemented with a HashMap.
Eclipse Memory Analyzer can also list the objects of HashMap$Node as shown by the following graph
This graph shows the existence of more than 2 million nodes, each of them corresponding to one link.
After this analysis, the interpretation of the results is clear: the Set accumulating the next level links grows too much, acquiring all available memory. Since the last log displayed was
22:43:54.195 [run-main-0] INFO Collector - waiting for current level to complete: 4
this means that the application runs out of memory while adding links of level 5 in the Set.
The heapdump can also be analyzed with Java VisualVM, although the results are more difficult to interpret. Hashmap$Node appears in 3rd place by number of instances
When clicking on instances to get more details, it is clear that the instances correspond to the links collected
However, when clicking on compute retained sizes, Java VisualVM ran out of memory so it could not offer final evidence that it was the instances of HashMap$Node that were hogging the system’s memory.
Finally, as a complement to the analysis carried out earlier on, a thread dump can be obtained by running
jstack 5797 > threaddump.$(date +%H%M%S).tdump
The resulting dump can be analysed with Thread Dump Analyzer, that offers information about the status of each thread. The image below shows a list of 12 threads from the ForkJoinPool running the different tasks enclosed by the Futures
Non-blocking with BlockingQueue
Instead of using a SynchronizedSet, we could have used a BlockingQueue to add the next level links to the tail of the queue as the threads consume the elements from the head of the queue. This way, we would not need to use a latch to wait for all the links of the current level to be processed before moving to the next level. However, the system would still run out of memory as for every element consumed from the queue, tens or hundreds of new elements are to be added to said queue.
In any case, keeping this data in memory is not feasible and a different approach would be needed like persisting the collection of next level links (also in MongoDB or in some other data store).
3155total visits,2visits today