Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2017/09/26 13:43:00 UTC
[jira] [Commented] (NUTCH-2407) Memory leak causing Nutch Server to run out of memory
[ https://issues.apache.org/jira/browse/NUTCH-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16180780#comment-16180780 ]
Sebastian Nagel commented on NUTCH-2407:
----------------------------------------
Thanks, [~Vyacheslav]! Most objects are related to HashMap-s, and they're probably held by the "leaked" Configuration and PluginRepository objects:
{noformat}
start first second
    1    10     11  org.apache.hadoop.conf.Configuration
    0     1      2  org.apache.nutch.plugin.PluginRepository
{noformat}
Every configuration is assigned a random UUID (as a "work-around" since NUTCH-844):
{code}
/*
 * Configuration.hashCode() doesn't return values that correspond to a unique
 * set of parameters. This is a workaround so that we can track instances of
 * Configuration created by Nutch.
 */
private static void setUUID(Configuration conf) {
  UUID uuid = UUID.randomUUID();
  conf.set(UUID_KEY, uuid.toString());
}
{code}
The UUID is used as a unique hash key in the PluginRepository to make sure that a different plugin instance is returned for each different configuration.
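A "real" hash key would have to depend on the configuration's content rather than on a random UUID, so that two configurations with the same parameters map to the same cached plugin instance. A minimal sketch of that idea, using a plain Map<String,String> as a stand-in for Hadoop's Configuration (the class name {{ConfHash}} and method are hypothetical, not part of Nutch):

```java
import java.util.Map;
import java.util.TreeMap;

public class ConfHash {

  // Compute a deterministic hash over the configuration's key/value pairs.
  // Copying the entries into a TreeMap gives a stable, sorted iteration
  // order, so two configurations with identical entries always produce
  // the same hash regardless of insertion order.
  static int contentHash(Map<String, String> conf) {
    int h = 17;
    for (Map.Entry<String, String> e : new TreeMap<>(conf).entrySet()) {
      h = 31 * h + e.getKey().hashCode();
      h = 31 * h + e.getValue().hashCode();
    }
    return h;
  }
}
```

With such a key, repeated jobs running on an unchanged configuration would hit the same PluginRepository cache entry instead of creating a new one per random UUID.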
Implementing a real hash key method for configurations should fix the problem as long as the configuration isn't changed. If there are changing "custom configurations", then a method {{PluginRepository.delete(Configuration conf)}} also needs to be implemented and called from {{org.apache.nutch.service.ConfManager.delete(...)}}.
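The proposed delete method amounts to evicting the cached repository instance when its configuration is removed, so the cache no longer pins it in memory. A rough sketch under that assumption (the {{PluginRegistry}} class and its string key are hypothetical stand-ins for the real PluginRepository cache keyed by configuration):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PluginRegistry {

  // Cache of per-configuration instances, keyed by the configuration's id.
  private final Map<String, Object> cache = new ConcurrentHashMap<>();

  // Return the cached instance for this configuration, creating it on
  // first access (mirrors PluginRepository.get(conf)).
  Object getOrCreate(String confId) {
    return cache.computeIfAbsent(confId, id -> new Object());
  }

  // Mirror of the proposed PluginRepository.delete(Configuration conf):
  // drop the cached instance when the configuration is deleted, so it
  // becomes garbage-collectable instead of leaking.
  Object delete(String confId) {
    return cache.remove(confId);
  }

  int size() {
    return cache.size();
  }
}
```

ConfManager.delete(...) would then call this eviction in addition to removing the configuration itself, which is the cleanup step currently missing.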
> Memory leak causing Nutch Server to run out of memory
> -----------------------------------------------------
>
> Key: NUTCH-2407
> URL: https://issues.apache.org/jira/browse/NUTCH-2407
> Project: Nutch
> Issue Type: Bug
> Components: nutch server
> Affects Versions: 2.3.1
> Environment: Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
> Reporter: Vyacheslav Pascarel
> Attachments: first.txt, second.txt, started.txt
>
>
> My application is trying to perform continuous crawling using Nutch REST services. The application injects a seed URL and then repeats the GENERATE/FETCH/PARSE/UPDATEDB sequence a requested number of times (each step in the sequence is executed upon successful completion of the previous step, then the whole sequence is repeated again). Here is a brief description of the job:
> * Number of GENERATE/FETCH/PARSE/UPDATEDB cycles per run: 50
> * 'topN' parameter value of GENERATE step in each cycle: 10
> * Seed URL: http://www.cnn.com
> * Regex URL filters for all jobs:
> ** *"-^.\{1000,\}$"* - exclude very long URLs
> ** *"+."* - include the rest
> To monitor Nutch server I use Java VisualVM that comes with Java SDK. After each run (50 cycles of GENERATE/FETCH/PARSE/UPDATEDB) I perform garbage collection using the mentioned tool and check memory usage. My observation is that Nutch Server leaks ~25MB per run.
> NOTES: I added custom HTTP DELETE services to clean the job history in NutchServerPoolExecutor and remove all custom configurations from RAMConfManager after each run. So the observed ~25MB memory leak remains even after job history/configuration cleanup.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)