Posted to java-user@lucene.apache.org by Erick Erickson <er...@gmail.com> on 2007/09/01 21:06:55 UTC

Re: OutOfMemoryError tokenizing a boring text file

I can't answer why the same token repeated over and over
still takes up memory, but I've indexed far more than
20 MB of data in a single document field, on the
order of 150 MB. Of course, I allocated 1 GB or so to the
JVM, so you might try that....

Best
Erick
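
(As an aside: the heap ceiling is raised with the java launcher's standard -Xmx flag. The jar version and the class name below are placeholders for illustration, not something taken from this thread.)

   java -Xmx1g -cp lucene-core-2.2.0.jar:. MyIndexer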

On 8/31/07, Per Lindberg <pe...@implior.com> wrote:
>
> I'm creating a tokenized "content" Field from a plain text file
> using an InputStreamReader and new Field("content", in);
>
> The text file is large, 20 MB, and contains zillions of lines,
> each with the same 100-character token.
>
> That causes an OutOfMemoryError.
>
> Given that all tokens are the *same*,
> why should this cause an OutOfMemoryError?
> Shouldn't StandardAnalyzer just chug along
> and just note "ho hum, this token is the same"?
> That shouldn't take too much memory.
>
> Or have I missed something?

Re: Re: OutOfMemoryError tokenizing a boring text file

Posted by Per Lindberg <pe...@implior.com>.
> From: Chris Hostetter [mailto:hossman_lucene@fucit.org]

> Per: perhaps you could open a Jira issue and attach a unit test
> demonstrating the problem?  maybe something with an artificial Reader
> that just churns out a repeating sequence of characters forever?

Isolating the problem is exactly what I did. It took some time,
but the memory leak turned out to be somewhere else, not in
Lucene. (Memory leaks are slippery beasts!)

Just wanted to let y'all know.

Thanks and cheers,
Per






Re: Re: OutOfMemoryError tokenizing a boring text file

Posted by Chris Hostetter <ho...@fucit.org>.
: Setting writer.setMaxFieldLength(5000) (default is 10000)
: seems to eliminate the risk of an OutOfMemoryError,

that's because it now gives up after parsing 5000 tokens.

: To me, it appears that simply calling
:    new Field("content", new InputStreamReader(in, "ISO-8859-1"))
: on a plain text file causes Lucene to buffer it *all*.

Looking at this purely from an outside-in perspective: how could that
be true?  If it were, then why would calling setMaxFieldLength(5000)
solve your problem -- limiting the number of tokens wouldn't matter if the
problem occurred because Lucene was buffering the entire reader.


It definitely seems like there is some room for improvement here ... it
sounds almost like maybe there is a [HAND WAVEY AIR QUOTES] memory/object
leakish [/HAND WAVEY AIR QUOTES] situation where even after a Token is
read off the TokenStream the Token isn't being GCed.

Per: perhaps you could open a Jira issue and attach a unit test 
demonstrating the problem?  maybe something with an artificial Reader that 
just churns out a repeating sequence of characters forever?
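
A minimal sketch of such an artificial Reader could look like the following; the class name RepeatingReader is made up for illustration, and read() deliberately never reports end-of-stream:

   import java.io.Reader;

   /** Endlessly repeats the same token, separated by newlines. */
   public class RepeatingReader extends Reader {

       private final char[] token;
       private int pos = 0;

       public RepeatingReader(String token) {
           this.token = (token + "\n").toCharArray();
       }

       /** Copies characters of the repeating token into buf; never returns -1. */
       public int read(char[] buf, int off, int len) {
           for (int i = 0; i < len; i++) {
               buf[off + i] = token[pos];
               pos = (pos + 1) % token.length;
           }
           return len;
       }

       public void close() {
           // nothing to release
       }
   }

Feeding new Field("content", new RepeatingReader("thesameboringtoken")) to an IndexWriter should then reproduce the symptom without needing a 20 MB file on disk.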




-Hoss



Re: OutOfMemoryError tokenizing a boring text file

Posted by Per Lindberg <pe...@implior.com>.
Aha, that's interesting. However...

Setting writer.setMaxFieldLength(5000) (default is 10000)
seems to eliminate the risk of an OutOfMemoryError,
even with a JVM with only 64 MB max memory.
(I have tried larger values for JVM max memory, too).
 
(The name is, IMHO, slightly misleading; I would have
called it setMaxFieldTerms or something like that.)

Still, 64 bits x 10000 = 78 KB. I can't see why that should
eat up 64 MB, unless the 100-char tokens are also multiplied.

(The 20 MB text file contains roughly 200 000 copies of the
same 100-char string.)

To me, it appears that simply calling

   new Field("content", new InputStreamReader(in, "ISO-8859-1"))

on a plain text file causes Lucene to buffer it *all*.
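
For context, the setup being described boils down to roughly the following against the Lucene 2.x API of the time; the class name, index path and file name are made up for illustration:

   import java.io.FileInputStream;
   import java.io.InputStreamReader;
   import org.apache.lucene.analysis.standard.StandardAnalyzer;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.Field;
   import org.apache.lucene.index.IndexWriter;

   public class IndexBoringFile {
       public static void main(String[] args) throws Exception {
           // Create (or overwrite) an index in a local directory.
           IndexWriter writer =
               new IndexWriter("/tmp/boring-index", new StandardAnalyzer(), true);
           writer.setMaxFieldLength(5000);   // default is 10000 tokens per field

           InputStreamReader in = new InputStreamReader(
               new FileInputStream("boring.txt"), "ISO-8859-1");

           Document doc = new Document();
           doc.add(new Field("content", in));   // tokenized from the Reader, not stored
           writer.addDocument(doc);             // the Reader is consumed during this call
           writer.close();
       }
   }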


> -----Original Message-----
> From: Karl Wettin [mailto:karl.wettin@gmail.com]
> Sent: September 1, 2007 22:00
> To: java-user@lucene.apache.org
> Subject: Re: OutOfMemoryError tokenizing a boring text file
> 
> I believe the problem is that the text value is not the only data
> associated with a token; there is, for instance, the position offset.
> Depending on your JVM, each instance reference consumes 64 bits or so,
> so even if the text value is flyweighted by String.intern() there is
> a cost. I doubt that a document is flushed to the segment before a
> field's token stream has been exhausted.
> 
> -- 
> karl


Re: OutOfMemoryError tokenizing a boring text file

Posted by Karl Wettin <ka...@gmail.com>.
I believe the problem is that the text value is not the only data
associated with a token; there is, for instance, the position offset.
Depending on your JVM, each instance reference consumes 64 bits or so,
so even if the text value is flyweighted by String.intern() there is
a cost. I doubt that a document is flushed to the segment before a
field's token stream has been exhausted.
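
(A rough back-of-the-envelope illustration, with a guessed rather than measured per-token cost: 200 000 tokens x ~64 bytes per Token object, covering the object header, term reference, start/end offsets, position increment and type, comes to roughly 12-13 MB before counting the buffered text and the in-memory indexing structures, which can already press hard on a 64 MB heap.)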

-- 
karl


On Sep 1, 2007, at 21:50, Askar Zaidi wrote:

> I have indexed around 100 MB of data with 512 MB given to the JVM heap,
> so that gives you an idea. If every token is the same word in one file,
> shouldn't the tokenizer recognize that?
>
> Try using Luke. That helps solve lots of issues.
>
> -
> AZ


Re: OutOfMemoryError tokenizing a boring text file

Posted by Askar Zaidi <as...@gmail.com>.
I have indexed around 100 MB of data with 512 MB given to the JVM heap, so that
gives you an idea. If every token is the same word in one file, shouldn't the
tokenizer recognize that?

Try using Luke. That helps solve lots of issues.

-
AZ

On 9/1/07, Erick Erickson <er...@gmail.com> wrote:
>
> I can't answer why the same token repeated over and over
> still takes up memory, but I've indexed far more than
> 20 MB of data in a single document field, on the
> order of 150 MB. Of course, I allocated 1 GB or so to the
> JVM, so you might try that....
>
> Best
> Erick