Posted to java-user@lucene.apache.org by Chantal Ackermann <ch...@biomax.de> on 2001/11/28 13:22:51 UTC

OutOfMemoryError

hi to all,

please help! I think I mixed my brain up already with this stuff...

I'm trying to index about 29 text files, the biggest of which is ~700MB and 
the smallest ~300MB. I once managed to build the whole index, with a merge 
factor of 10 and maxMergeDocs=10000. That took more than 35 hours I think 
(I don't know exactly) and it didn't use much RAM (though it could have). 
Unfortunately I had a call to optimize at the end, and during optimization an 
IOException (File too big) occurred while merging.

Since I run the program on a multi-processor machine, I have now changed the 
code to index each file in its own thread, all writing to one single 
IndexWriter. The merge factor is still 10, maxMergeDocs is 1,000,000, and I 
set the maximum heap size to 1MB.

I tried using RAMDirectory (as mentioned on the mailing list) and just 
calling IndexWriter.addDocument(). At the moment it doesn't seem to make any 
difference: after a while _all_ the threads exit, one after another (not all 
at once!), with an OutOfMemoryError. The priority of all of them is at the 
minimum.
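
Roughly, the setup looks like this - just a minimal sketch against the 
Lucene 1.x-era API, with the index path, field name and one-entry-per-line 
parsing invented for illustration:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class ThreadedIndexer {
        public static void main(String[] args) throws Exception {
            // one shared writer for all input files (path is just an example)
            final IndexWriter writer =
                new IndexWriter("/tmp/big-index", new StandardAnalyzer(), true);
            writer.mergeFactor = 10;
            writer.maxMergeDocs = 1000000;

            Thread[] threads = new Thread[args.length];
            for (int i = 0; i < args.length; i++) {
                final String file = args[i];
                threads[i] = new Thread() {
                    public void run() {
                        try {
                            BufferedReader in =
                                new BufferedReader(new FileReader(file));
                            String line;
                            // pretend each line is one entry of ~1KB
                            while ((line = in.readLine()) != null) {
                                Document doc = new Document();
                                doc.add(Field.Text("contents", line));
                                // not certain addDocument() may be called
                                // concurrently in this version, so serialize
                                synchronized (writer) {
                                    writer.addDocument(doc);
                                }
                            }
                            in.close();
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                };
                threads[i].start();
            }
            for (int i = 0; i < threads.length; i++) threads[i].join();
            writer.close(); // optimize() left out - that's where it blew up
        }
    }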

Even if the multithreading doesn't increase performance, I would be glad 
just to get it running again.

I would be even happier if someone could give me a hint about the best way 
to index this amount of data. (The average size of an entry that gets parsed 
into a Document is about 1KB.)

thanx for any help!
chantal

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: OutOfMemoryError

Posted by Winton Davies <wd...@overture.com>.
Watch out for mergeFactor values > 100 -- you'll run out of file 
descriptors.  Also, does mergeFactor have any effect with a RAMDirectory?

  Winton

>I've loaded a large (but not as large as yours) index with mergeFactor
>set to 1000.  Was substantially faster than with default setting.
>Making it higher didn't seem to make things much faster but did cause
>it to use more memory. In addition I loaded the data in chunks in
>separate processes and optimized the index after each chunk, again
>in a separate process.  All done straight to disk, no messing about
>with RAMDirectories.
>
>Didn't play with maxMergeDocs and am not sure what you mean by
>"maximum heap size" but 1MB doesn't sound very large.
>
>
>
>--
>Ian.
>ian.lea@blackwell.co.uk
>
>
>Chantal Ackermann wrote:
>>
>> hi to all,
>>
>> please help! I think I mixed my brain up already with this stuff...
>>
>> I'm trying to index about 29 textfiles where the biggest one is ~700Mb and
>> the smallest ~300Mb. I achieved once to run the whole index, with a merge
>> factor = 10 and maxMergeDocs=10000. This took more than 35 hours I think
>> (don't know exactly) and it didn't use much RAM (though it could have).
>> unfortunately I had a call to optimize at the end and while optimization an
>> IOException (File too big) occurred (while merging).
>>
>> As I run the program on a multi-processor machine I now changed the code to
>> index each file in a single thread and write to one single IndexWriter. the
>> merge factor is still at 10. maxMergeDocs is at 1.000.000. I set the maximum
>> heap size to 1MB.
>>
>> I tried to use RAMDirectory (as mentioned in the mailing list) and just use
>> IndexWriter.addDocument(). At the moment it seems not to make any difference.
>> after a while _all_ the threads exit one after another (not all at once!)
>> with an OutOfMemoryError. the priority of all of them is at the minimum.
>>
>> even if the multithreading doesn't increase performance I would be glad if I
>> could just once get it running again.
>>
>> I would be even happier if someone could give me a hint what would be the
>> best way to index this amount of data. (the average size of an entry that
>> gets parsed for a Document is about 1Kb.)
>>
>> thanx for any help!
>> chantal
>
>--
>To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
>For additional commands, e-mail: <ma...@jakarta.apache.org>


Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: OutOfMemoryError

Posted by Ian Lea <ia...@blackwell.co.uk>.
I've loaded a large (though not as large as yours) index with mergeFactor
set to 1000.  It was substantially faster than with the default setting.
Making it higher didn't seem to make things much faster but did cause
it to use more memory.  In addition I loaded the data in chunks in
separate processes and optimized the index after each chunk, again
in a separate process.  All done straight to disk, no messing about
with RAMDirectories.
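
For what it's worth, the "optimize in a separate process" step is tiny - a
rough sketch against the Lucene 1.x API (the class name and index path
argument are made up):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    // run in its own process once a chunk has been loaded:
    //   java OptimizeIndex /path/to/index
    public class OptimizeIndex {
        public static void main(String[] args) throws Exception {
            // 'false' = open the existing index rather than create a new one
            IndexWriter writer =
                new IndexWriter(args[0], new StandardAnalyzer(), false);
            writer.optimize();
            writer.close();
        }
    }

The chunk loaders open the same index the same way (passing 'true' only for
the very first chunk) and just call addDocument() in a loop.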

Didn't play with maxMergeDocs, and I'm not sure what you mean by
"maximum heap size", but 1MB doesn't sound very large.



--
Ian.
ian.lea@blackwell.co.uk


Chantal Ackermann wrote:
> 
> hi to all,
> 
> please help! I think I mixed my brain up already with this stuff...
> 
> I'm trying to index about 29 textfiles where the biggest one is ~700Mb and
> the smallest ~300Mb. I achieved once to run the whole index, with a merge
> factor = 10 and maxMergeDocs=10000. This took more than 35 hours I think
> (don't know exactly) and it didn't use much RAM (though it could have).
> unfortunately I had a call to optimize at the end and while optimization an
> IOException (File too big) occurred (while merging).
> 
> As I run the program on a multi-processor machine I now changed the code to
> index each file in a single thread and write to one single IndexWriter. the
> merge factor is still at 10. maxMergeDocs is at 1.000.000. I set the maximum
> heap size to 1MB.
> 
> I tried to use RAMDirectory (as mentioned in the mailing list) and just use
> IndexWriter.addDocument(). At the moment it seems not to make any difference.
> after a while _all_ the threads exit one after another (not all at once!)
> with an OutOfMemoryError. the priority of all of them is at the minimum.
> 
> even if the multithreading doesn't increase performance I would be glad if I
> could just once get it running again.
> 
> I would be even happier if someone could give me a hint what would be the
> best way to index this amount of data. (the average size of an entry that
> gets parsed for a Document is about 1Kb.)
> 
> thanx for any help!
> chantal

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


GCJ and Lucene ?

Posted by Winton Davies <wd...@overture.com>.
Hi,

Another maybe quick question:

Has anyone tried using GCJ with Lucene ?

http://www.gnu.org/software/gcc/java/

As far as I can tell, this compiles Java directly to native code. 
I think it is restricted to the 1.1 class libraries, which might be a 
gotcha (does Lucene use any 1.2 classes?)

  Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: Parallelising a query...

Posted by Winton Davies <wd...@overture.com>.
Hi again....

  Another dumb question :) (actually I'm too busy to look at the code :) )

   In the index, is the data structure of termDocs (is that the right 
term) sorted by anything, or is it just insertion order?  I could 
see how one might want to sort by the doc with the highest term 
frequency, but I can also see why
it might not help.

  e.g.   Token1 -> doc1 (2) [occurrences] -> doc2 (6) -> doc3 (3)

  or is it like this ?

         Token1 -> doc2 (6) -> doc3 (3) -> doc1 (2) ?

 
  I have an idea for an optimization I want to make, but I'm not sure 
exactly whether it warrants investigation.
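
For anyone who wants to poke at it without reading the source, the posting
list for a term can be walked like this - a sketch against the Lucene 1.x
API with an invented index path and term. As far as I can tell the entries
come back in document-number order, i.e. insertion order, not frequency
order:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    public class DumpPostings {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open("/path/to/index");
            // walk the posting list for one term
            TermDocs termDocs = reader.termDocs(new Term("contents", "token1"));
            while (termDocs.next()) {
                // doc() is the internal document number,
                // freq() the number of occurrences in that document
                System.out.println("doc=" + termDocs.doc()
                                   + " freq=" + termDocs.freq());
            }
            termDocs.close();
            reader.close();
        }
    }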

  Winton


Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Parallelising a query...

Posted by Winton Davies <wd...@overture.com>.
Hi,
 
   Let's say I want to retrieve all relevant listings for a query (just 
suppose)...

   I have 4 million documents... I could:
 
   Split these into 4 x 1 million document indexes and then send a 
query to 4 Lucene processes?  At the end I would have to sort the 
results by relevance.

   Question for Doug or any other Search Engine guru -- would this 
reduce the time to find these results by 75% ?
 
   I know it is probably a hard question to answer (i.e. all the 
documents that match might be in just one process...) but I'm really 
getting at the average length of the inverted indexes that have to be 
joined being reduced by 75%, so the join should take only 25% of 
the time...

  Any thoughts on this idiocy?  Why do I ask?  Well, let's say I 
can't fit a 4 million document RAMDirectory index into 1GB of heap 
space, but I could if I split it up :) ?
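
The memory half of this, at least, is easy to try with MultiSearcher - a
rough sketch against the Lucene 1.x API (index paths, field name and query
handling are invented). As far as I know MultiSearcher queries the
sub-indexes one after another inside a single JVM, so it answers the "can I
split it up" question but not, by itself, the "is it 4x faster" one:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searchable;

    public class SplitSearch {
        public static void main(String[] args) throws Exception {
            // one sub-index per quarter of the collection (paths made up)
            Searchable[] parts = {
                new IndexSearcher("/indexes/part1"),
                new IndexSearcher("/indexes/part2"),
                new IndexSearcher("/indexes/part3"),
                new IndexSearcher("/indexes/part4")
            };
            MultiSearcher searcher = new MultiSearcher(parts);

            Query query =
                QueryParser.parse(args[0], "contents", new StandardAnalyzer());
            Hits hits = searcher.search(query); // merged and ranked by score
            for (int i = 0; i < 10 && i < hits.length(); i++)
                System.out.println(hits.score(i) + "  doc " + hits.id(i));
            searcher.close();
        }
    }

Truly parallelising it would mean running each IndexSearcher in its own
thread or process and merging the top hits by score yourself.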

   Cheers,
    Winton
 

 
 
 

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


RE: javac -O ?

Posted by Winton Davies <wd...@overture.com>.
Hi Matt,

  Thanks!! :-)
 
  Pointers for JRockit look interesting.  I'm having some degree of 
success with JVM 1.3.1_.10 from Sun -- I think there is a trick to 
getting the tenuring stuff to work right; it's just a matter of finding 
the right balance of Eden/Old/Survivor space...

  Cheers,
    Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


RE: javac -O ?

Posted by Matt Tucker <ma...@jivesoftware.com>.
Winton,

I'm not sure that javac -O actually does anything. From the 1.3 tool
documentation: 

"Note: the -O option does nothing in the current implementation of javac
and oldjavac."

In fact, the JDK 1.4 tool documentation doesn't even mention the -O
option (even though "javac -help" still lists the option).

In regards to your earlier email about JVM optimization -- you may want
to check out the Jrockit JVM if you have a chance. I haven't used it
yet, but the features sound interesting for server-side Java
performance. http://www.jrockit.com

Regards,
Matt

> -----Original Message-----
> From: Winton Davies [mailto:wdavies@overture.com] 
> Sent: Wednesday, November 28, 2001 4:51 PM
> To: Lucene Users List
> Subject: javac -O ?
> 
> 
> Hi,
> 
>   Is the nightly build compiled Optimized ? if not, has anyone ever 
> tried compiling Optimized, and using that ? Does it help improve 
> performance ? It would seem to me that given the compute intensive 
> nature of querying, that even slightly improved compilations would 
> speed things up ?
> 
>   Cheers,
>    Winton
> 
> Winton Davies
> Lead Engineer, Overture (NSDQ: OVER)
> 1820 Gateway Drive, Suite 360
> San Mateo, CA 94404
> work: (650) 403-2259
> cell: (650) 867-1598
> http://www.overture.com/
> 
> --
> To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <ma...@jakarta.apache.org>
> 


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


javac -O ?

Posted by Winton Davies <wd...@overture.com>.
Hi,

  Is the nightly build compiled optimized?  If not, has anyone ever 
tried compiling with optimization and using that?  Does it help improve 
performance?  It would seem to me that, given the compute-intensive 
nature of querying, even slightly improved compilation would 
speed things up.

  Cheers,
   Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: OutOfMemoryError

Posted by Ian Lea <ia...@blackwell.co.uk>.
Doug sent the message below to the list on 3-Nov in response to
a query about file size limits.  There may have been more
related stuff on the thread as well.


--
Ian.


++++++++++++++++
>   *** Anyway, is there anyway to control how big the indexes 
> grow ? ****

The easiest thing is to set IndexWriter.maxMergeDocs.  Since you hit 2GB at
8M docs, set this to 7M.  That will keep Lucene from trying to merge an
index that won't fit in your filesystem.  (It will actually effectively
round this down to the next lower power of Index.mergeFactor.  So with the
default mergeFactor=10, maxMergeDocs=7M will generate a series of 1M
document indexes, since merging 10 of these would exceed the max.)

Slightly more complex: you could further minimize the number of segments
by, when you've added seven million documents, optimizing the index and
starting a new index.  Then use MultiSearcher to search.

Even more complex and optimal: write a version of FSDirectory that, when a
file exceeds 2GB, creates a subdirectory and represents the file as a series
of files.  (I've done this before, and found that, on at least the version
of Solaris that I was using, the files had to be a few 100k less than 2GB
for programs like 'cp' and 'ftp' to operate correctly on them.)

Doug
--------------------
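
The "start a new index every N million documents" variant only needs a few
lines around the add loop - a rough sketch (Lucene 1.x API; the roll-over
threshold, paths and field name are invented for illustration):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class RollingIndexer {
        static final int DOCS_PER_INDEX = 7000000; // stay under the file limit

        private IndexWriter writer;
        private int indexCount = 0;
        private int docsInCurrent = 0;

        public void add(String text) throws Exception {
            if (writer == null || docsInCurrent >= DOCS_PER_INDEX) {
                close(); // optimize and close the finished part, if any
                writer = new IndexWriter("/indexes/part" + (indexCount++),
                                         new StandardAnalyzer(), true);
                docsInCurrent = 0;
            }
            Document doc = new Document();
            doc.add(Field.Text("contents", text));
            writer.addDocument(doc);
            docsInCurrent++;
        }

        public void close() throws Exception {
            if (writer != null) {
                writer.optimize();
                writer.close();
                writer = null;
            }
        }
    }

Each /indexes/partN then stays a manageable size, and the parts can be
searched together with a MultiSearcher as mentioned above.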



Chantal Ackermann wrote:
> 
> hi Ian, hi Winton, hi all,
> 
> sorry I meant heap size of 100Mb. I'm  starting java with -Xmx100m. I'm not
> setting -Xms.
> 
> For what I know now, I had a bug in my own code. still I don't understand
> where these OutOfMemoryErrors came from. I will try to index again in one
> thread without RAMDirectory just to check if the program is sane.
> 
> The problem that the files get too big while merging remains. I wonder why
> there is no way to tell Lucene not to create files that are
> bigger than the system limit. How am I supposed to know after how many
> documents this limit is reached? Lucene creates the files - I just know
> the average size of a piece of text that is the input for a Document. Or am I
> missing something?!
> 
> chantal

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: OutOfMemoryError

Posted by "Steven J. Owens" <pu...@darksleep.com>.
I wrote:
> >      Java often has misleading error messages.  For example, on
> > solaris machines the default ulimit used to be 24 - that's 24 open
> > file handles!  Yeesh. This will cause an OutOfMemoryError.  So don't

Jeff Trent replied:
> Wow.  I did not know that!
> 
> I also don't see an option to increase that limit from java -X.  Do you know
> how to increase that limit?

     That's "used to be" - I think it's larger on newer machines.  I
don't think there's a java command-line option to set this; it's a
system limit.  The Solaris command to check it is "ulimit".  To set it
for a given login process (assuming sufficient privileges) use "ulimit
number" (i.e. "ulimit 128").  "ulimit -a" prints out all limits.  (On
ksh or bash the open-file limit is usually the -n resource specifically,
e.g. "ulimit -n" to show it and "ulimit -n 256" to raise it, within the
hard limit.)

Steven J. Owens
puff@darksleep.com



--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: OutOfMemoryError

Posted by Jeff Trent <jt...@structsoft.com>.
Wow.  I did not know that!

I also don't see an option to increase that limit from java -X.  Do you know
how to increase that limit?

----- Original Message -----
From: "Steven J. Owens" <pu...@darksleep.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>;
<ch...@biomax.de>
Sent: Thursday, November 29, 2001 11:46 AM
Subject: Re: OutOfMemoryError


> Chantal,
> > For what I know now, I had a bug in my own code. still I don't understand
> > where these OutOfMemoryErrors came from. I will try to index again in one
> > thread without RAMDirectory just to check if the program is sane.
>
>      Java often has misleading error messages.  For example, on
> solaris machines the default ulimit used to be 24 - that's 24 open
> file handles!  Yeesh. This will cause an OutOfMemoryError.  So don't
> assume it's actually a memory problem, particularly if a memory
> problem doesn't particularly make sense.  Just a thought.
>
> Steven J. Owens
> puff@darksleep.com
>
> --
> To unsubscribe, e-mail: <ma...@jakarta.apache.org>
> For additional commands, e-mail: <ma...@jakarta.apache.org>
>
>


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: OutOfMemoryError

Posted by "Steven J. Owens" <pu...@darksleep.com>.
Chantal,
> For what I know now, I had a bug in my own code. still I don't understand 
> where these OutOfMemoryErrors came from. I will try to index again in one 
> thread without RAMDirectory just to check if the program is sane.

     Java often has misleading error messages.  For example, on
solaris machines the default ulimit used to be 24 - that's 24 open
file handles!  Yeesh. This will cause an OutOfMemoryError.  So don't
assume it's actually a memory problem, particularly if a memory
problem doesn't particularly make sense.  Just a thought.

Steven J. Owens
puff@darksleep.com

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: OutOfMemoryError

Posted by Chantal Ackermann <ch...@biomax.de>.
hi Ian, hi Winton, hi all,

Sorry, I meant a heap size of 100MB. I'm starting java with -Xmx100m; I'm not 
setting -Xms.

From what I know now, I had a bug in my own code. Still, I don't understand 
where these OutOfMemoryErrors came from. I will try to index again in one 
thread, without RAMDirectory, just to check that the program is sane.

The problem that the files get too big while merging remains. I wonder why 
there is no way to tell Lucene not to create files that are 
bigger than the system limit. How am I supposed to know after how many 
documents this limit is reached? Lucene creates the files - I just know 
the average size of a piece of text that is the input for a Document. Or am I 
missing something?!

chantal

On Wednesday, 28 November 2001 at 20:14, you wrote:
> Were you using -mx and -ms (setting heap size ?)
>
>   Cheers,
>    Winton
>
> >As I run the program on a multi-processor machine I now changed the code
> > to index each file in a single thread and write to one single
> > IndexWriter. the merge factor is still at 10. maxMergeDocs is at
> > 1.000.000. I set the maximum heap size to 1MB.
> >

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: OutOfMemoryError

Posted by Winton Davies <wd...@overture.com>.
Were you using -mx and -ms (i.e. setting the heap size)?
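
(For concreteness, that means starting the JVM along these lines - the sizes
are only placeholders, and as far as I know the old -ms/-mx spellings are
still accepted by Sun's VMs as synonyms for -Xms/-Xmx:

    java -Xms64m -Xmx256m YourIndexer file1.txt file2.txt ...

where YourIndexer stands in for whatever your main class is.)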

  Cheers,
   Winton

>hi to all,
>
>please help! I think I mixed my brain up already with this stuff...
>
>I'm trying to index about 29 textfiles where the biggest one is ~700Mb and
>the smallest ~300Mb. I achieved once to run the whole index, with a merge
>factor = 10 and maxMergeDocs=10000. This took more than 35 hours I think
>(don't know exactly) and it didn't use much RAM (though it could have).
>unfortunately I had a call to optimize at the end and while optimization an
>IOException (File too big) occurred (while merging).
>
>As I run the program on a multi-processor machine I now changed the code to
>index each file in a single thread and write to one single IndexWriter. the
>merge factor is still at 10. maxMergeDocs is at 1.000.000. I set the maximum
>heap size to 1MB.
>
>I tried to use RAMDirectory (as mentioned in the mailing list) and just use
>IndexWriter.addDocument(). At the moment it seems not to make any difference.
>after a while _all_ the threads exit one after another (not all at once!)
>with an OutOfMemoryError. the priority of all of them is at the minimum.
>
>even if the multithreading doesn't increase performance I would be glad if I
>could just once get it running again.
>
>I would be even happier if someone could give me a hint what would be the
>best way to index this amount of data. (the average size of an entry that
>gets parsed for a Document is about 1Kb.)
>
>thanx for any help!
>chantal
>
>--
>To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
>For additional commands, e-mail: <ma...@jakarta.apache.org>


Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>