Posted to solr-user@lucene.apache.org by Jim Blomo <ji...@pbworks.com> on 2010/07/03 04:06:03 UTC

Re: general debugging techniques?

Just to confirm I'm not doing something insane, this is my general setup:

- index approx 1MM documents including HTML, pictures, office files, etc.
- files are not local to the Solr process
- use upload/extract to extract text from them through Tika (rough request sketch below this list)
- use commit=1 on each POST (reasons below)
- use optimize=1 every 150 documents or so (reasons below)
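
For concreteness, each upload is roughly a request like the following
(host, port, core path and the id literal are placeholders rather than my
exact script, and I'm showing the commit flag as commit=true):

    # extract + index one file, committing on every POST
    curl "http://localhost:8983/solr/update/extract?literal.id=doc-0001&commit=true" \
         -F "myfile=@/path/to/somefile.pdf"

    # every ~150 documents, merge all segments down to one
    curl http://localhost:8983/solr/update \
         -H "Content-Type: text/xml; charset=utf-8" \
         --data-binary "<optimize/>"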

Through many manual restarts and modifications to the upload script,
I've gotten about halfway (numDocs: 467372, disk usage 1.6G).  The
biggest problem is that serious failures can't be recovered from
without restarting Tomcat, and at the client level they can't be
distinguished from non-serious problems (e.g. Tika exceptions thrown
by bad documents).

On Wed, Jun 9, 2010 at 10:13 AM, Jim Blomo <ji...@pbworks.com> wrote:
> In any case I bumped up the heap to 3G as suggested, which has helped
> stability.  I have found that in practice I need to commit every
> extraction because a crash or error will wipe out all extractions
> after the last commit.

I've also found that I need to optimize very regularly because I kept
getting "too many file handles" errors (though they usually came up as
the more cryptic "directory, but cannot be listed: list() returned
null" error).

What I am running into now is

SEVERE: Exception invoking periodic operation:
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.lang.String.substring(String.java:1940)
[full backtrace below]

After a restart and optimize this goes away for a while (~100
documents), but then it comes back and every request after the error
fails.  Even if I can't prevent this error, is there a way I can
recover from it more gracefully?  Perhaps an option to Solr or Tomcat
to restart itself when it hits that error?
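
For example, would something along these lines be reasonable?  This is an
untested sketch on my part; the setenv.sh approach and the restart script
path are just placeholders:

    # Tomcat bin/setenv.sh (hypothetical)
    CATALINA_OPTS="$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError"
    CATALINA_OPTS="$CATALINA_OPTS -XX:OnOutOfMemoryError=/usr/local/bin/restart-tomcat.sh"
    export CATALINA_OPTS

Or is there a better pattern for recovering Solr/Tomcat after this error?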

Jim

SEVERE: Exception invoking periodic operation:
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.lang.String.substring(String.java:1940)
        at java.lang.String.substring(String.java:1905)
        at java.io.File.getName(File.java:401)
        at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:229)
        at java.io.File.isDirectory(File.java:754)
        at org.apache.catalina.startup.HostConfig.checkResources(HostConfig.java:1000)
        at org.apache.catalina.startup.HostConfig.check(HostConfig.java:1214)
        at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:293)
        at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:120)
        at org.apache.catalina.core.ContainerBase.backgroundProcess(ContainerBase.java:1306)
        at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1570)
        at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1579)
        at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1559)
        at java.lang.Thread.run(Thread.java:619)
Jul 3, 2010 1:32:20 AM
org.apache.solr.update.processor.LogUpdateProcessor finish

Re: general debugging techniques?

Posted by Lance Norskog <go...@gmail.com>.
Ah! I did not notice the 'too many open files' part. This means that
your mergeFactor setting is too high for the number of open files your
operating system allows. The default mergeFactor is 10 (which can
translate into thousands of open file descriptors). You should lower
this number.
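
For reference, the setting lives in solrconfig.xml. Something along these
lines - the value 4 is only an illustration, tune it for your hardware:

    <!-- solrconfig.xml -->
    <indexDefaults>
      <mergeFactor>4</mergeFactor>
    </indexDefaults>
    <mainIndex>
      <mergeFactor>4</mergeFactor>
    </mainIndex>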

On Tue, Jul 6, 2010 at 1:14 PM, Jim Blomo <ji...@pbworks.com> wrote:
> On Sat, Jul 3, 2010 at 1:10 PM, Lance Norskog <go...@gmail.com> wrote:
>> You don't need to optimize, only commit.
>
> OK, thanks for the tip, Lance.  I thought the "too many open files"
> problem was because I wasn't optimizing/merging frequently enough.  My
> understanding of your suggestion is that commit also does merging, and
> since I am only building the index, not querying or updating it, I
> don't need to optimize.
>
>> This means that the JVM spends 98% of its time doing garbage
>> collection. This means there is not enough memory.
>
> I'll increase the memory to 4G, decrease the documentCache to 5 and try again.
>
>> I made a mistake - the bug in Lucene is not about PDFs - it happens
>> with every field in every document you index in any way, so doing this
>> in Tika outside Solr does not help. The only trick I can think of is
>> to alternate between indexing large and small documents. This way the
>> bug does not need memory for two giant documents in a row.
>
> I've checked out and built solr from branch_3x with the
> tika-0.8-SNAPSHOT patch.  (Earlier I was having trouble with Tika
> crashing too frequently.)  I've confirmed that LUCENE-2387 is fixed in
> this branch so hopefully I won't run into that this time.
>
>> Also, do not query the indexer at all. If you must, don't do sorted or
>> faceting requests. These eat up a lot of memory that is only freed
>> with the next commit (index reload).
>
> Good to know, though I have not been querying the index and definitely
> haven't ventured into faceted requests yet.
>
> The advice is much appreciated,
>
> Jim
>



-- 
Lance Norskog
goksron@gmail.com

Re: general debugging techniques?

Posted by Jim Blomo <ji...@pbworks.com>.
On Sat, Jul 3, 2010 at 1:10 PM, Lance Norskog <go...@gmail.com> wrote:
> You don't need to optimize, only commit.

OK, thanks for the tip, Lance.  I thought the "too many open files"
problem was because I wasn't optimizing/merging frequently enough.  My
understanding of your suggestion is that commit also does merging, and
since I am only building the index, not querying or updating it, I
don't need to optimize.

> This means that the JVM spends 98% of its time doing garbage
> collection. This means there is not enough memory.

I'll increase the memory to 4G, decrease the documentCache to 5 and try again.
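
Concretely that means something like the following on my side (still a
sketch, since I haven't finished testing it):

    # Tomcat bin/setenv.sh: raise the heap to 4G
    CATALINA_OPTS="$CATALINA_OPTS -Xmx4g"
    export CATALINA_OPTS

and in solrconfig.xml, shrinking the document cache:

    <documentCache class="solr.LRUCache" size="5"
                   initialSize="5" autowarmCount="0"/>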

> I made a mistake - the bug in Lucene is not about PDFs - it happens
> with every field in every document you index in any way, so doing this
> in Tika outside Solr does not help. The only trick I can think of is
> to alternate between indexing large and small documents. This way the
> bug does not need memory for two giant documents in a row.

I've checked out and built solr from branch_3x with the
tika-0.8-SNAPSHOT patch.  (Earlier I was having trouble with Tika
crashing too frequently.)  I've confirmed that LUCENE-2387 is fixed in
this branch so hopefully I won't run into that this time.
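
In case it helps anyone else, the build was roughly the following; the
branch URL and ant target are from memory, so double-check them:

    svn checkout https://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x
    cd branch_3x/solr
    ant dist       # should produce the solr war under dist/

and then I dropped the resulting war into Tomcat's webapps directory.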

> Also, do not query the indexer at all. If you must, don't do sorted or
> faceting requests. These eat up a lot of memory that is only freed
> with the next commit (index reload).

Good to know, though I have not been querying the index and definitely
haven't ventured into faceted requests yet.

The advice is much appreciated,

Jim

Re: general debugging techniques?

Posted by Lance Norskog <go...@gmail.com>.
You don't need to optimize, only commit.

The "GC overhead limit exceeded" error means that the JVM spends 98% of
its time doing garbage collection; in other words, there is not enough memory.

I made a mistake - the bug in Lucene is not about PDFs - it happens
with every field in every document you index in any way, so doing this
in Tika outside Solr does not help. The only trick I can think of is
to alternate between indexing large and small documents, so that the
bug never needs memory for two giant documents in a row.
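
One rough way to set up such an ordering (purely illustrative - adapt the
paths to however your uploader reads its file list):

    # Two orderings of the corpus, largest-first and smallest-first,
    # interleaved line by line; every file appears twice, so keep only
    # its first occurrence.
    ls -S  /path/to/docs > large_first.txt
    ls -rS /path/to/docs > small_first.txt
    paste -d '\n' large_first.txt small_first.txt | awk '!seen[$0]++' > upload_order.txt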

Also, do not query the indexer at all. If you must, don't do sorted or
faceting requests. These eat up a lot of memory that is only freed
with the next commit (index reload).

On Sat, Jul 3, 2010 at 8:19 AM, Dennis Gearon <ge...@sbcglobal.net> wrote:
> I'll be watching this one as I hope to be loading lots of docs soon.
> Dennis Gearon
>
> Signature Warning
> ----------------
> EARTH has a Right To Life,
>  otherwise we all die.
>
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
>
>
> --- On Fri, 7/2/10, Jim Blomo <ji...@pbworks.com> wrote:
>
>> [...]



-- 
Lance Norskog
goksron@gmail.com

Re: general debugging techniques?

Posted by Dennis Gearon <ge...@sbcglobal.net>.
I'll be watching this one as I hope to be loading lots of docs soon.
Dennis Gearon

Signature Warning
----------------
EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 7/2/10, Jim Blomo <ji...@pbworks.com> wrote:

> [...]