Posted to java-user@lucene.apache.org by david m <dm...@gmail.com> on 2007/05/03 21:55:16 UTC

For indexing: how to estimate needed memory?

Our application includes an indexing server that writes to multiple
indexes in parallel (each thread writes to a single index). In order
to avoid an OutOfMemoryError, each request to index a document is
checked to see if the JVM has enough memory available to index the
document.

I know that IndexWriter.ramSizeInBytes() can be used to determine how
much memory was consumed at the conclusion of indexing a document, but
is there a way to know (or estimate) the peak memory consumed while
indexing a document?

For example, in a test set I have a 22 MB document where nearly every
"word" is unique. It has text like this:

'DestAddrType' bin: 00 0D
AttributeCustomerID 'Resources'
AttributeDNIS '7730'
AttributeUserData [295] 00 0E 00 00..
'DNIS_DATA' '323,000,TM,SDM1K5,AAR,,,'
'ENV_FLAG' 'P'
'T_APP_CODE' 'TM'
TelephoneLine' '8'
'C_CALL_DATE' '01/19/06'
'C_START_TIME' '145650'
'C_END_TIME' '145710'
 AttributeCallType 2

 and so on...

 We are indexing a handful of fields for document meta-data - but they
 are tiny compared to the body of the document. Eight of those fields are
 stored (like a messageid, posteddate, typecode).

 The body is indexed into a single field. Our Analyzer splits tokens
 based on Character.isLetterOrDigit() and when in uppercase, indexes a
 lowercase version of the term.
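
 Roughly, the splitting rule is this (a simplified sketch in plain Java, not
 our actual Analyzer code - the class and method names here are made up):

   import java.util.ArrayList;
   import java.util.List;

   // Sketch: split on anything that isn't a letter or digit; when a token
   // contains uppercase, also emit a lowercased copy of it.
   public class SplitterSketch {
       public static List<String> tokenize(String text) {
           List<String> tokens = new ArrayList<String>();
           StringBuilder current = new StringBuilder();
           for (int i = 0; i < text.length(); i++) {
               char c = text.charAt(i);
               if (Character.isLetterOrDigit(c)) {
                   current.append(c);
               } else if (current.length() > 0) {
                   emit(tokens, current.toString());
                   current.setLength(0);
               }
           }
           if (current.length() > 0) {
               emit(tokens, current.toString());
           }
           return tokens;
       }

       private static void emit(List<String> tokens, String token) {
           tokens.add(token);
           String lower = token.toLowerCase();
           if (!lower.equals(token)) {
               tokens.add(lower);   // lowercase copy of an uppercase term
           }
       }
   }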

 After indexing that single document ramSizeInBytes() returns 15.7 MB.
 That seems ok to me.

 But for this particular document I found (via trial and error) that
 at -Xmx165m Lucene throws an OutOfMemoryError.

 At -Xmx170m it indexes successfully.

 Just before calling addDoc() I see maximum available memory of: 160.5 MB

 The 160.5 MB is from this calc:

  Runtime rt = Runtime.getRuntime();
  long maxAvail = rt.maxMemory() - (rt.totalMemory() - rt.freeMemory());

 So it would appear that for this particular document, to avoid an
 OutOfMemoryError I'd need to be certain of having available memory
 approx 7x the doc size.

 I could require 7x the doc size available memory for each doc (on
 the assumption my test document is at the extreme), but for more
 typical documents I'd be over-reserving memory with a result of
 reduced throughput (as docs were forced to wait for sufficient
 available memory that they likely don't need).
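
 For reference, the per-document check we do today is roughly this
 (simplified; everything except the Runtime calls is our own code, and
 the factor is a configuration parameter):

   // How much heap can still be obtained before hitting -Xmx.
   static long availableBytes() {
       Runtime rt = Runtime.getRuntime();
       return rt.maxMemory() - (rt.totalMemory() - rt.freeMemory());
   }

   // Admit a document for indexing only if factor * its size is available.
   static boolean mayIndex(long docSizeBytes, double factor) {
       return availableBytes() >= (long) (docSizeBytes * factor);
       // if false, the request waits (and we may flush) before retrying
   }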

 Instead I'm wondering if there is a better way for the index server to
 know (or guesstimate) what the memory requirement will be for each
 document? - so that it doesn't start indexing in parallel more
 documents than available memory can support.

 Thanks,
 david.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: For indexing: how to estimate needed memory?

Posted by Erick Erickson <er...@gmail.com>.
A minor clarification. What I'm measuring is the size of the index
before I index a doc, and the size after (this involves "expert" usage
of my IndexWriter subclass....). Using those two figures, I get
a "bloat" factor. That is, how much the memory was consumed by
this document in the index. Say it's 7x the size of the original doc.

So when a doc comes in to be indexed, I multiply the incoming doc
size x7 x2 and flush if necessary.
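
In code form the decision is roughly this (a sketch; flushWriter() stands
in for however you flush your writer's RAM buffer, it's not a Lucene call):

  // worstRatio = largest (index growth / raw doc size) seen so far, e.g. 7.0
  void maybeFlushBefore(long incomingDocSize, double worstRatio) {
      Runtime rt = Runtime.getRuntime();
      long available = rt.maxMemory() - (rt.totalMemory() - rt.freeMemory());
      long needed = (long) (incomingDocSize * worstRatio * 2);   // the x7 x2
      if (available < needed) {
          flushWriter();   // hypothetical helper: flush the in-RAM segments
      }
  }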

I absolutely agree that this is a crude measure and doesn't take
into account memory consumption as that document is
indexed.

And I *strongly* suspect that the ridiculous factors I'm getting
(the index grows, at times, by 7x the size of the original doc)
are an artifact of some sort of "allocate some extra memory for
structure X" rather than a true measure of how much space that
document uses. Especially since I'm indexing 11G of raw
data into a 4G index, while storing lots of raw data. I'm imagining that
I index, say, a 20K document, and my index grows by 140K, all
because some internal structure, say, doubled in size. I have no
evidence for this, except that my index isn't 77G. Which suggests
that a better measure would be the total raw byte sizes of
all the documents compared to the total size of the index in
ram. Between flushes. Hmmmmm......

I guess in my situation, I'm indexing only as a batch process. Currently,
we build an index, deploy it, and then don't change it except once in a
great while (months/years). So I can afford crude measurements and
figure out what to fix up when it explodes, since it has no chance
of affecting production systems. Which is a great luxury.....

I agree with all your points; it sounds like you've been around this
block once or twice. As I said, my main motivation is to have a
way to avoid experimenting with the various index factors, merge, etc.
on the *next* project. And because I got interested <G>. No doubt
I'll run into a situation where efficiency really does count, but so far
overnight is fast enough for building my indexes......

After noodling on this for a while, I'm probably going to throw it all
out and just flush when I have less than 100M free. Memory is
cheap and that would accomplish my real goal which is just to
have a class I can use to index the next project that doesn't
require me to fiddle with various factors and gives "good enough"
performance. If the indexing speed is painful, I'll revisit this. But
I suspect that squeezing the use of the last 90M in this case
won't buy me much. Now that I'm thinking of it, it would be
easy to measure how much using the last 100M buys by
just comparing the times for building the index with and without
an extra 100M allocated to the JVM. *Then* figure out whether the
speed gain was worth the complexity.....
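
I.e. the whole policy collapses to something like this (sketch, same
hypothetical flushWriter() helper as above):

  // flush whenever free headroom drops below ~100M
  long headroom = 100L * 1024 * 1024;
  Runtime rt = Runtime.getRuntime();
  long available = rt.maxMemory() - (rt.totalMemory() - rt.freeMemory());
  if (available < headroom) {
      flushWriter();
  }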

Writing things like this is a nice way to figure out what I should
*really* be concerned with.

Best
Erick

On 5/4/07, david m <dm...@gmail.com> wrote:
>
> > First, are you sure your free memory calculation is OK? Why not
> > just use freeMemory?
> I _think_ my calculation is ok - my reasoning:
>
> Runtime.maxMemory() - Amount of memory that can be given to the JVM -
> based on -Xmx<value>
>
> Runtime.totalMemory() - Amount of memory currently owned by the JVM
>
> Runtime.freeMemory() - Amount of unused memory currently owned by the JVM
>
> The amount of memory currently in use (inUse) = totalMemory() -
> freeMemory()
>
> The amount we can still get (before hitting -Xmx<value>) = maxMemory() -
> inUse
> And in the absence of -Xms, nothing says we will actually be given that much.
>
> > Perhaps also calling the gc if the avail isn't
> > enough. Although I confess I don't know the innards of the
> > interplay of getting the various memory amounts.....
> I do call the gc - but sparingly. If I've done a flush to reclaim
> memory in hopes of having enough memory for a pending document, then
> I'll call the gc before checking if I now have enough memory
> available. However, I too know little of the gc workings. On the
> assumption that the JRE is smarter at knowing how & when to
> execute gc than I am, I operate on the premise that it is not a good
> practice to routinely be calling the gc.
>
> > The approach I've been using is to gather some data as I'm indexing
> > to decide whether to flush the indexwriter or not. That is, record the
> > size that ramSizeInBytes() returns before I start to index
> > a document, record the amount after, and keep the worst
> > ratio around. This got easier when I subclassed IndexWriter and
> > overrode the add methods.
> I agree this gives a conservative measure of the worst case memory
> consumption by an indexed document. But it measures memory _after_
> indexing. My observation is that the peak memory usage occurs _during_
> indexing - so that if the process is low on memory, that is when the
> problem (OutOfMemoryError) will hit. In my mind it is the peak usage
> that really matters.
>
> If there were a way to record and retrieve peak usage for each
> document, we would be able to see if there is a relationship between
> the peak during indexing and ramSizeInBytes() after indexing. If there
> were a (somewhat) predictable relationship, then I think we'd have a
> more accurate value to decide on a factor to use for avoiding
> OutOfMemoryErrors.
>
> > Then I'm requiring that I have 2X the worst case I've seen for the
> > incoming document, and flushing (perhaps gc-ing) if I don't have
> > enough.
> Based on the data I've collected, we've been using 1.5x - 2.0x of
> document size as our value (and made it a configuration parameter).
>
> > And I think that this is "good enough". What it allows (as does your
> > approach) is letting the usual case of files much smaller than 20M
> > accumulate and flush reasonably efficiently, and not penalizing
> > my speed by, say, always keeping 250M free or some such.
> Agreed... To me it is a balancing act of avoiding OutOfMemoryErrors
> without unnecessarily throttling throughput in order to keep that 250M
> (or whatever) of memory available for what we think is the unusual
> document - and one that arrives for indexing while available memory
> is relatively low. If it arrives when the indexer isn't busy with other
> documents, then likely not a problem anyway.
>
> >
> > Keep me posted if you come up with anything really cool!
> Ditto.
>
> Thanks, david.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: For indexing: how to estimate needed memory?

Posted by david m <dm...@gmail.com>.
> First, are you sure your free memory calculation is OK? Why not
> just use freeMemory?
I _think_ my calculation is ok - my reasoning:

Runtime.maxMemory() - Amount of memory that can be given to the JVM -
based on -Xmx<value>

Runtime.totalMemory() - Amount of memory currently owned by the JVM

Runtime.freeMemory() - Amount of unused memory currently owned by the JVM

The amount of memory currently in use (inUse) = totalMemory() - freeMemory()

The amount we can still get (before hitting -Xmx<value>) = maxMemory() - inUse
And in the absence of -Xms, nothing says we will actually be given that much.

> Perhaps also calling the gc if the avail isn't
> enough. Although I confess I don't know the innards of the
> interplay of getting the various memory amounts.....
I do call the gc - but sparingly. If I've done a flush to reclaim
memory in hopes of having enough memory for a pending document, then
I'll call the gc before checking if I now have enough memory
available. However, I too know little of the gc workings. On the
assumption that the JRE is smarter at knowing how & when to
execute gc than I am, I operate on the premise that it is not a good
practice to routinely be calling the gc.

> The approach I've been using is to gather some data as I'm indexing
> to decide whether to flush the indexwriter or not. That is, record the
> size that ramSizeInBytes() returns before I start to index
> a document, record the amount after, and keep the worst
> ratio around. This got easier when I subclassed IndexWriter and
> overrode the add methods.
I agree this gives a conservative measure of the worst case memory
consumption by an indexed document. But it measures memory _after_
indexing. My observation is that the peak memory usage occurs _during_
indexing - so that if the process is low on memory, that is when the
problem (OutOfMemoryError) will hit. In my mind it is the peak usage
that really matters.

If there were a way to record and retrieve peak usage for each
document, we would be able to see if there is a relationship between
the peak during indexing and ramSizeInBytes() after indexing. If there
were a (somewhat) predictable relationship, then I think we'd have a
more accurate value to decide on a factor to use for avoiding
OutOfMemoryErrors.

> Then I'm requiring that I have 2X the worst case I've seen for the
> incoming document, and flushing (perhaps gc-ing) if I don't have
> enough.
Based on the data I've collected, we've been using 1.5x - 2.0x of
document size as our value (and made it a configuration parameter).

> And I think that this is "good enough". What it allows (as does your
> approach) is letting the usual case of files much smaller than 20M
> accumulate and flush reasonably efficiently, and not penalizing
> my speed by, say, always keeping 250M free or some such.
Agreed... To me it is a balancing act of avoiding OutOfMemoryErrors
without unnecessarily throttling throughput in order to keep that 250M
(or whatever) of memory available for what we think is the unusual
document - and one that arrives for indexing while available memory
is relatively low. If it arrives when the indexer isn't busy with other
documents, then likely not a problem anyway.

>
> Keep me posted if you come up with anything really cool!
Ditto.

Thanks, david.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: For indexing: how to estimate needed memory?

Posted by Erick Erickson <er...@gmail.com>.
Coincidentally, I'm hacking at this very problem....

First, are you sure your free memory calculation is OK? Why not
just use freeMemory? Perhaps also calling the gc if the avail isn't
enough. Although I confess I don't know the innards of the
interplay of getting the various memory amounts.....

The approach I've been using is to gather some data as I'm indexing
to decide whether to flush the indexwriter or not. That is, record the
size that ramSizeInBytes() returns before I start to index
a document, record the amount after, and keep the worst
ratio around. This got easier when I subclassed IndexWriter and
overrode the add methods.

But it does require that you call into your writer before you start
adding fields to a document to record the start size......
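
Roughly like this (a sketch against the IndexWriter API of that era; the
rawSize parameter and the ratio bookkeeping are mine, not Lucene's):

  import java.io.IOException;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;

  public class MeasuringIndexWriter extends IndexWriter {
      private double worstRatio = 1.0;

      public MeasuringIndexWriter(Directory dir, Analyzer a, boolean create)
              throws IOException {
          super(dir, a, create);
      }

      // rawSize = byte size of the original document text
      public void addDocument(Document doc, long rawSize) throws IOException {
          long before = ramSizeInBytes();
          super.addDocument(doc);
          long after = ramSizeInBytes();
          double ratio = (double) (after - before) / rawSize;
          if (ratio > worstRatio) {
              worstRatio = ratio;   // keep the worst ratio seen so far
          }
      }

      public double getWorstRatio() {
          return worstRatio;
      }
  }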

Then I'm requiring that I have 2X the worst case I've seen for the
incoming document, and flushing (perhaps gc-ing) if I don't have
enough.

Mostly, this is to keep from having to experiment with each different
data set that we get to find the right MERGE & other factors to use;
I'm not actually entirely sure that this is giving me any measurable
performance gains.

And I think that this is "good enough". What it allows (as does your
approach) is letting the usual case of files much smaller than 20M
accumulate and flush reasonably efficiently, and not penalizing
my speed by, say, always keeping 250M free or some such. Again,
the critical thing is that you have to call into here *before* you index.

Curiously, I also got ratios of around 7X, so there's a lot going on.

Keep me posted if you come up with anything really cool!

Best
Erick


On 5/3/07, david m <dm...@gmail.com> wrote:
>
> Our application includes an indexing server that writes to multiple
> indexes in parallel (each thread writes to a single index). In order
> to avoid an OutOfMemoryError, each request to index a document is
> checked to see if the JVM has enough memory available to index the
> document.
>
> I know that IndexWriter.ramSizeInBytes() can be used to determine how
> much memory was consumed at the conclusion of indexing a document, but
> is there a way to know (or estimate) the peak memory consumed while
> indexing a document?
>
> For example, in a test set I have a 22 MB document where nearly every
> "word" is unique. It has text like this:
>
> 'DestAddrType' bin: 00 0D
> AttributeCustomerID 'Resources'
> AttributeDNIS '7730'
> AttributeUserData [295] 00 0E 00 00..
> 'DNIS_DATA' '323,000,TM,SDM1K5,AAR,,,'
> 'ENV_FLAG' 'P'
> 'T_APP_CODE' 'TM'
> TelephoneLine' '8'
> 'C_CALL_DATE' '01/19/06'
> 'C_START_TIME' '145650'
> 'C_END_TIME' '145710'
> AttributeCallType 2
>
> and so on...
>
> We are indexing a handful of fields for document meta-data - but they
> are tiny compared to the body of the document. Eight of those fields are
> stored (like a messageid, posteddate, typecode).
>
> The body is indexed into a single field. Our Analyzer splits tokens
> based on Character.isLetterOrDigit() and when in uppercase, indexes a
> lowercase version of the term.
>
> After indexing that single document ramSizeInBytes() returns 15.7 MB.
> That seems ok to me.
>
> But for this particular document I found (via trial and error) that
> at -Xmx165m Lucene throws an OutOfMemoryError.
>
> At -Xmx170m it indexes successfully.
>
> Just before calling addDoc() I see maximum available memory of: 160.5 MB
>
> The 160.5 MB is from this calc:
>
>   Runtime rt = Runtime.getRuntime();
>   long maxAvail = rt.maxMemory() - (rt.totalMemory() - rt.freeMemory());
>
> So it would appear that for this particular document, to avoid an
> OutOfMemoryError I'd need to be certain of having available memory
> approx 7x the doc size.
>
> I could require 7x the doc size available memory for each doc (on
> the assumption my test document is at the extreme), but for more
> typical documents I'd be over-reserving memory with a result of
> reduced throughput (as docs were forced to wait for sufficient
> available memory that they likely don't need).
>
> Instead I'm wondering if there is a better way for the index server to
> know (or guesstimate) what the memory requirement will be for each
> document? - so that it doesn't start indexing in parallel more
> documents than available memory can support.
>
> Thanks,
> david.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Language detection library

Posted by Bob Carpenter <ca...@alias-i.com>.
>> Does anyone know of a good language detection library that can detect what
>> language a document (text) is?

Language detection is easy.  It's just a simple
text classification problem.

One way you can do this is using Lucene
itself.  Create a so-called pseudo-document
for each language consisting of lots of text
(1 MB or more, ideally).  Then build a Lucene
index using a character n-gram tokenizer.
Eg. "John Smith" tokenizes to "Jo", "oh",
"hn", "n ", " S", "Sm", "mi", "it", "th"
with 2-grams.

You'll have to make sure to index beyond the
first 1000 tokens or whatever Lucene is set to
by default.

To do language ID, just treat the text whose
language is to be identified as the basis of a query.
Parse it using the same character n-gram
tokenizer.  The highest-scoring result is
the answer and if two score high, you know
there may be some ambiguity.  You can't trust
Lucene's normalized scoring for rejection,
though.

Make sure the tokenizer includes spaces as
well as non-space characters (though all
spaces may be normalized to a single whitespace).
Using more orders (1-grams, 2-grams, 3-grams,
etc.) gives more accuracy; the IDF weighting
is quite sensible here and will work out the
details for the counts for you.
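
In case it helps, the n-gram step itself is trivial (plain Java;
wrap it in whatever Tokenizer implementation you like):

  import java.util.ArrayList;
  import java.util.List;

  public class CharNGrams {
      // nGrams("John Smith", 2) -> [Jo, oh, hn, "n ", " S", Sm, mi, it, th]
      public static List<String> nGrams(String text, int n) {
          List<String> grams = new ArrayList<String>();
          for (int i = 0; i + n <= text.length(); i++) {
              grams.add(text.substring(i, i + n));
          }
          return grams;
      }
  }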

For a more sophisticated approach, check out
LingPipe's language ID tutorial, which is
based on probabilistic character language models.
Think of it as similar to the Lucene model but
with different term weighting.

    http://www.alias-i.com/lingpipe/demos/tutorial/langid/read-me.html

Here's accuracy vs. input length on a set of 15
languages from the Leipzig Corpus collection (just
one of the many evals in the tutorial):

#chars  accuracy
1	22.59%
2	34.82%
4	58.55%
8	81.17%
16	92.45%
32	97.33%
64	98.99%
128	99.67%

The end of the tutorial has references to other
popular language ID packages online (e.g. TextCat,
which is Gertjan van Noord's Perl package).  And it
also has references to the technical background
on TF/IDF classification with n-grams and
character language models.

- Bob Carpenter
   Alias-i

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Language detection library

Posted by karl wettin <ka...@gmail.com>.
On 4 May 2007, at 02:20, Chris Lu wrote:

> I suppose if a document is indexed as English or French,
> then when users search the document,
> we need to parse the query as English or French also?

If you do some language-specific token analysis such as stemming, yes.

Detecting the language of such small texts is sort of tricky though.
You might want to introduce more dimensions in the classifier: user
location, user locale, etc. Perhaps you want to store stemmed data
in language-specific fields. It might also be a good idea to place an
initial query, re-classify to one of the top n scoring languages,
and then replace the query.

The easiest way out is to simply ask the user what language they want  
to search in. And that seems to be the most common approach.


>
> -- 
> Chris Lu
> -------------------------
> Instant Scalable Full-Text Search On Any Database/Application
> site: http://www.dbsight.net
> demo: http://search.dbsight.com
> Lucene Database Search in 3 minutes:
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>
>
> On 5/3/07, karl wettin <ka...@gmail.com> wrote:
>>
>> On 3 May 2007, at 22:06, Mordo, Aviran (EXP N-NANNATEK) wrote:
>>
>> > Does anyone know of a good language detection library that can detect what
>> > language a document (text) is?
>>
>> I posted this some time back:
>>
>> https://issues.apache.org/jira/browse/LUCENE-826
>>
>> A bit proof-of-concept-ish, but it does the job well if you ask
>> me. Uses Weka (GPL) and requires at least 150 characters to be  
>> trusted.
>>
>>
>> --
>> karl
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Language detection library

Posted by Chris Lu <ch...@gmail.com>.
I suppose if a document is indexed as English or French,
then when users search the document,
we need to parse the query as English or French also?

-- 
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes


On 5/3/07, karl wettin <ka...@gmail.com> wrote:
>
> On 3 May 2007, at 22:06, Mordo, Aviran (EXP N-NANNATEK) wrote:
>
> > Does anyone know of a good language detection library that can detect what
> > language a document (text) is?
>
> I posted this some time back:
>
> https://issues.apache.org/jira/browse/LUCENE-826
>
> A bit proof-of-concept-ish, but it does the job well if you ask
> me. Uses Weka (GPL) and requires at least 150 characters to be trusted.
>
>
> --
> karl
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Language detection library

Posted by karl wettin <ka...@gmail.com>.
On 3 May 2007, at 22:06, Mordo, Aviran (EXP N-NANNATEK) wrote:

> Does anyone know of a good language detection library that can detect what
> language a document (text) is?

I posted this some time back:

https://issues.apache.org/jira/browse/LUCENE-826

A bit proof-of-concept-ish, but it does the job well if you ask
me. Uses Weka (GPL) and requires at least 150 characters to be trusted.


-- 
karl

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Language detection library

Posted by "Mordo, Aviran (EXP N-NANNATEK)" <av...@lmco.com>.
Does anyone know of a good language detection library that can detect what
language a document (text) is?

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org