Posted to dev@nutch.apache.org by "Enrico Triolo (JIRA)" <ji...@apache.org> on 2006/08/21 13:59:13 UTC

[jira] Created: (NUTCH-356) Plugin repository cache can lead to memory leak

Plugin repository cache can lead to memory leak
-----------------------------------------------

                 Key: NUTCH-356
                 URL: http://issues.apache.org/jira/browse/NUTCH-356
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.8
            Reporter: Enrico Triolo
         Attachments: NutchTest.java, patch.txt

While trying to solve a problem I reported a while ago (see NUTCH-314), I found out that the problem was actually related to the plugin cache used in the PluginRepository class.
As I said in NUTCH-314, I think I am somewhat 'forcing' the way Nutch is meant to work, since I need to frequently submit new URLs and append their contents to the index; I don't (and can't) have a urls.txt file with all the URLs I'm going to fetch, so I recreate it each time a new URL is submitted.
Thus, I think in most cases you won't have problems using Nutch as-is, since the problem I found occurs only if Nutch is used in a way similar to mine.
To simplify testing I'm attaching a class that performs something similar to what I need. It fetches and indexes some sample URLs; to avoid complaints from webmasters I left the sample URL list empty, so you should modify the source code and add some URLs.
Then you only have to run it and watch its memory consumption with top. In my experience I get an OutOfMemoryError after a couple of minutes, but this clearly depends on your heap settings and on the plugins you are using (I'm using 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic').
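
Besides top, heap usage can also be logged from inside the JVM between fetch/index rounds. The following is a minimal sketch using only the standard Runtime API; the class name HeapWatch and the idea of calling it once per round are illustrative assumptions, not part of the attached NutchTest.java.

```java
// Hypothetical helper; not part of Nutch or of the attached test class.
final class HeapWatch {
    private HeapWatch() {}

    /** Bytes of heap currently in use (allocated minus free), as reported by the JVM. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Formats one log line for the given round number. */
    static String report(int round) {
        long usedMb = usedHeapBytes() / (1024 * 1024);
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        return "round " + round + ": used " + usedMb + " MB of max " + maxMb + " MB";
    }

    public static void main(String[] args) {
        for (int round = 1; round <= 3; round++) {
            // A real test would run one fetch/index cycle here.
            System.out.println(report(round));
        }
    }
}
```

A value that keeps rising from round to round, even after garbage collection has had a chance to run, is the kind of pattern that suggests a leak.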

The problem is bound to the PluginRepository 'singleton' instance, since it never gets released. It seems that some class maintains a reference to it, and this class is never released since it is cached somewhere in the configuration.

So I modified PluginRepository's 'get' method so that it never uses the cache and always returns a new instance (you can find the patch attached). This way memory consumption stays stable and I no longer get an OOM.
Clearly this is not the solution, since I guess it has many performance implications, but for the moment it works.
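
The difference between a cached and an uncached 'get' can be sketched as follows. This is a simplified model, not the actual Nutch source or the attached patch: Conf and Repo are hypothetical stand-ins for Configuration and PluginRepository, and storing the instance in an object map on the configuration is an assumption based on the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a configuration that can carry cached objects.
final class Conf {
    private final Map<String, Object> objects = new HashMap<>();

    Object getObject(String key) { return objects.get(key); }

    void setObject(String key, Object value) { objects.put(key, value); }
}

// Hypothetical stand-in for the plugin repository.
final class Repo {
    private static final String CACHE_KEY = Repo.class.getName();

    private Repo() {}

    /** Cached variant: one instance per Conf, strongly referenced by that Conf. */
    static synchronized Repo getCached(Conf conf) {
        Repo repo = (Repo) conf.getObject(CACHE_KEY);
        if (repo == null) {
            repo = new Repo();
            conf.setObject(CACHE_KEY, repo);
        }
        return repo;
    }

    /** Uncached variant (the idea behind the workaround): a fresh instance on every call. */
    static Repo getUncached(Conf conf) {
        return new Repo();
    }
}

final class CacheDemo {
    public static void main(String[] args) {
        Conf conf = new Conf();
        System.out.println(Repo.getCached(conf) == Repo.getCached(conf));     // true
        System.out.println(Repo.getUncached(conf) == Repo.getUncached(conf)); // false
    }
}
```

In this model the cached instance lives as long as the configuration that holds it, while each uncached instance becomes collectable once its callers drop it, at the cost of rebuilding the repository on every call.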


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12429548 ] 
            
Chris A. Mattmann commented on NUTCH-356:
-----------------------------------------

-1 for closing this issue.

If there is a demonstrable memory leak in the plugin system, then I think it should be remedied. I haven't run your test code, Enrico, nor experienced your problem before, but it would seem that this issue is worth investigating.


[jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12429534 ] 
            
Stefan Groschupf commented on NUTCH-356:
----------------------------------------

Hi Enrico, 
there will be as many PluginRepositories as Configuration objects.
So if you create many Configuration objects, you will have a memory problem.
There is no way around having a singleton PluginRepository. However, you can reset the PluginRepository by removing the cached object from the Configuration object.
In any case, not caching the PluginRepository is a bad idea. Think about writing your own plugin to solve your problem; that should be a cleaner solution.
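
Both points (one repository per configuration, and resetting by removing the cached object) can be illustrated with a small model. This is only a sketch under assumptions: SketchConf and SketchRepo are hypothetical stand-ins for Configuration and PluginRepository, the object-map storage is assumed, and reset is an illustrative helper rather than an existing Nutch method.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a configuration that can carry cached objects.
final class SketchConf {
    private final Map<String, Object> objects = new HashMap<>();

    Object getObject(String key) { return objects.get(key); }

    void setObject(String key, Object value) { objects.put(key, value); }

    void removeObject(String key) { objects.remove(key); }
}

// Hypothetical stand-in for the plugin repository.
final class SketchRepo {
    private static final String CACHE_KEY = SketchRepo.class.getName();

    private SketchRepo() {}

    /** Returns the instance cached on this configuration, creating it on first use. */
    static synchronized SketchRepo get(SketchConf conf) {
        SketchRepo repo = (SketchRepo) conf.getObject(CACHE_KEY);
        if (repo == null) {
            repo = new SketchRepo();
            conf.setObject(CACHE_KEY, repo);
        }
        return repo;
    }

    /** Drops the cached instance so that the next get() builds a new one. */
    static synchronized void reset(SketchConf conf) {
        conf.removeObject(CACHE_KEY);
    }
}

final class ResetDemo {
    public static void main(String[] args) {
        // Every configuration kept alive also keeps its own repository alive.
        List<SketchConf> confs = new ArrayList<>();
        Set<SketchRepo> repos = new HashSet<>();
        for (int i = 0; i < 5; i++) {
            SketchConf conf = new SketchConf();
            confs.add(conf);
            repos.add(SketchRepo.get(conf));
        }
        System.out.println(repos.size()); // 5

        // After a reset, the same configuration yields a new repository.
        SketchConf first = confs.get(0);
        SketchRepo before = SketchRepo.get(first);
        SketchRepo.reset(first);
        System.out.println(before != SketchRepo.get(first)); // true
    }
}
```

In this model a reset only helps if nothing else still holds a reference to the old instance.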

Would you agree to close this issue, since we will not be able to commit your changes?
Stefan  


[jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak

Posted by "Enrico Triolo (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12429546 ] 
            
Enrico Triolo commented on NUTCH-356:
-------------------------------------

Thanks Stefan for your reply. The patch I submitted wasn't meant to be committed to the trunk; it was only a proof of concept to demonstrate that a potential leak really exists. I am aware that the cache shouldn't be removed, but since I'm not an expert at all, I was only reporting a possible problem, not a solution.

I understand that there are as many PluginRepositories as Configurations, but if you look at the source code of the test class I attached, you'll see there is only one Configuration instance involved. Nevertheless, I keep getting OOM...

Furthermore, I can't understand your suggestion of writing a plugin to solve my problem. Maybe I didn't explain it clearly: while at first I thought the cause was the LanguageIdentifier, I found out that it is not the plugin itself but rather the plugin management system. I couldn't inspect the code in depth, but using a profiler I saw that many objects don't get released. Don't you think this alone is an issue?

Anyway, if you think this is not an issue I can close it.
Enrico


[jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12431548 ] 
            
Enis Soztutar commented on NUTCH-356:
-------------------------------------

I observed strange behaviour when one of the plug-ins could not be included. For example, the plugin system fails to load plugins when there is a circular dependency among them or the name of a plug-in is misspelled in the configuration.
