Posted to java-user@lucene.apache.org by Artem Vasiliev <ar...@gmail.com> on 2006/04/01 12:13:02 UTC

Re[2]: OutOfMemory with search(Query, Sort)

Hello Yonik,

Thanks, that explains my issue and it's definitely a hit - I tried to
sort by filePath field which can be 100 bytes on average, meaning 400M
RAM for the cache + extra I/O to load them from a 3G index. I wish this
caching were configurable as lazy or could be switched off; do you know if
that's possible?

>> I've tried to utilize Lucene's sorting function
YS> [...]
>> But on a large index (4 million docs) I
>> get a long delay with CPU at 100% and then an OutOfMemoryError even when
>> there's only 1 document in the result set!

YS> The first time you sort on a field, a FieldCache entry is populated,
YS> enabling random access to that field value.  A single int field for a
YS> 4M index == int[4000000] == 16MB memory.
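
The arithmetic behind both figures can be checked with a minimal sketch.
The class and method names are made up for illustration, and the estimate
counts raw payload only: it ignores JVM object headers, and since Java
strings of this era use two bytes per char, real heap use for the strings
would likely be higher.

```java
// Back-of-envelope FieldCache sizing (hypothetical helper, not a Lucene class).
class FieldCacheEstimate {
    // One 4-byte int per document, as in int[maxDoc].
    static long intCacheBytes(long maxDoc) {
        return maxDoc * Integer.BYTES;
    }

    // Raw payload if every document carries its own value of avgBytes bytes.
    static long uniqueStringPayloadBytes(long maxDoc, long avgBytes) {
        return maxDoc * avgBytes;
    }

    public static void main(String[] args) {
        long maxDoc = 4_000_000L;
        System.out.println(intCacheBytes(maxDoc));                 // 16000000
        System.out.println(uniqueStringPayloadBytes(maxDoc, 100)); // 400000000
    }
}
```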

-- 
Best regards,
 Artem                            mailto:artvas@gmail.com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re[4]: OutOfMemory with search(Query, Sort)

Posted by Chris Hostetter <ho...@fucit.org>.
: Thanks for your answer, you're right, file paths are pretty much
: unique. Anyway I don't want this total-field-cache-loading situation to occur
: under any circumstances - it's too expensive. My app usually crawls while
: user searches are performed. A crawl involves additions and deletions so
: the IndexSearcher gets closed relatively frequently. Seems like Lucene
: would reload the whole field cache for each new IndexSearcher, which
: would be a big hit anyway. So I'll try the FieldCache overriding solution
: proposed by you and Yonik and maybe commit it to Lucene as a patch.

it will, but you can structure your app so that you don't re-open your
IndexSearcher for every query -- just do it on a periodic basis (ie: "if
N time elapsed since last open, and index version has increased, reopen
searcher; sleep S time")
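
The "reopen only when enough time has passed and the version has grown"
policy could look roughly like the sketch below. It is deliberately
library-free: SearcherHolder, the version supplier and the opener are
hypothetical stand-ins for however your app reads the index version and
opens an IndexSearcher, and closing the old searcher safely (while queries
may still be using it) is left out.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical holder that hands out a shared searcher and reopens it lazily.
class SearcherHolder<S> {
    private final LongSupplier indexVersion; // assumption: grows on every index change
    private final Supplier<S> opener;        // assumption: opens a fresh searcher
    private final long minIntervalMillis;    // the "N time" between reopens
    private S current;
    private long openedVersion;
    private long openedAtMillis;

    SearcherHolder(LongSupplier indexVersion, Supplier<S> opener,
                   long minIntervalMillis, long nowMillis) {
        this.indexVersion = indexVersion;
        this.opener = opener;
        this.minIntervalMillis = minIntervalMillis;
        // Read the version before opening so a concurrent change triggers a later reopen.
        this.openedVersion = indexVersion.getAsLong();
        this.current = opener.get();
        this.openedAtMillis = nowMillis;
    }

    // Returns the shared searcher, reopening it only if both conditions hold.
    synchronized S get(long nowMillis) {
        if (nowMillis - openedAtMillis >= minIntervalMillis) {
            long version = indexVersion.getAsLong();
            if (version > openedVersion) {
                current = opener.get();
                openedVersion = version;
                openedAtMillis = nowMillis;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        long[] version = {1};
        SearcherHolder<String> holder =
            new SearcherHolder<>(() -> version[0], () -> "searcher@v" + version[0], 1000, 0);
        version[0] = 2;
        System.out.println(holder.get(500));  // searcher@v1 (too soon to reopen)
        System.out.println(holder.get(1500)); // searcher@v2
    }
}
```

Passing the clock in as a parameter is only there to keep the sketch easy
to exercise; a real app would read System.currentTimeMillis() itself.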

: Btw, do I understand correctly that the concrete FieldCache class isn't pluggable
: in Lucene at the moment?

it's definitely pluggable, just assign the implementation you want to use
to FieldCache.DEFAULT and all code will use it, but anything that calls
FieldCache.DEFAULT.getStringIndex(...) may expect that the string array
will be populated, so instead of replacing the FieldCache instance
completely, you can implement your own SortField that uses
FieldCache.getCustom to build a cache that only contains the index ints.




-Hoss




Re[4]: OutOfMemory with search(Query, Sort)

Posted by Artem Vasiliev <ar...@gmail.com>.
Hello Hoss,

Thanks for your answer, you're right, file paths are pretty much
unique. Anyway I don't want this total-field-cache-loading situation to occur
under any circumstances - it's too expensive. My app usually crawls while
user searches are performed. A crawl involves additions and deletions so
the IndexSearcher gets closed relatively frequently. Seems like Lucene
would reload the whole field cache for each new IndexSearcher, which
would be a big hit anyway. So I'll try the FieldCache overriding solution
proposed by you and Yonik and maybe commit it to Lucene as a patch.

Btw, do I understand correctly that the concrete FieldCache class isn't pluggable
in Lucene at the moment?

: >> sort by filePath field which can be 100 bytes on average, meaning 400M
: >> RAM for the cache
CH> :
CH> : Well, it's probably not quite that bad...

CH> yeah, but in his case he's dealing with filepaths -- i'm guessing that
CH> each document represents a file, and no two files will have the same path.

CH> some benefit may be gained in splitting the filepath field up into a
CH> dirpath field and a filename field, and then sorting on "dirpath,
CH> filename" .. this should reduce the size quite a bit if the number of

-- 
Best regards,
 Artem

http://sharehound.sourceforge.net sharehound, the open source filesystems indexer




Re: Re[2]: OutOfMemory with search(Query, Sort)

Posted by Chris Hostetter <ho...@fucit.org>.
: > sort by filePath field which can be 100 bytes on average, meaning 400M
: > RAM for the cache
:
: Well, it's probably not quite that bad...
:
: For string sorting, a FieldCache.StringIndex is used.
: It contains a sorted String[num_unique_terms_in_field], and an int[maxDoc]
: So if 10 documents share a large string field value, that value will
: only be in the fieldCache once.

yeah, but in his case he's dealing with filepaths -- i'm guessing that
each document represents a file, and no two files will have the same path.

some benefit may be gained in splitting the filepath field up into a
dirpath field and a filename field, and then sorting on "dirpath,
filename" .. this should reduce the size quite a bit if the number of
unique files is significantly greater than the number of unique
directories -- of course how much it helps also depends greatly on whether
your filenames are really long compared to your directory paths.
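
To get a feel for the possible saving, here is a minimal sketch that
compares the characters held for distinct full paths against distinct
dirpath plus distinct filename values. The class name and sample paths
are made up for illustration, and '/' is assumed as the separator.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical estimator, not part of Lucene.
class PathSplitEstimate {
    // Total characters across the distinct values of one field.
    static long uniqueChars(List<String> values) {
        Set<String> seen = new HashSet<>();
        long total = 0;
        for (String v : values) {
            if (seen.add(v)) total += v.length();
        }
        return total;
    }

    // Total characters across distinct dirpath values plus distinct filename values.
    static long splitUniqueChars(List<String> paths) {
        Set<String> dirs = new HashSet<>();
        Set<String> names = new HashSet<>();
        long total = 0;
        for (String p : paths) {
            int slash = p.lastIndexOf('/');
            String dir = p.substring(0, slash + 1); // empty when there is no '/'
            String name = p.substring(slash + 1);
            if (dirs.add(dir)) total += dir.length();
            if (names.add(name)) total += name.length();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
            "/data/docs/a.txt", "/data/docs/b.txt", "/data/docs/c.txt");
        System.out.println(uniqueChars(paths));      // 48
        System.out.println(splitUniqueChars(paths)); // 26
    }
}
```

Keep in mind that sorting on two fields means two int[maxDoc] arrays
instead of one, so the string saving has to outweigh that.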

In general, Yonik's suggestion is really the best way to go.  as i
understand it the only reason the StringIndex FieldCache maintains the
list of strings permanently is for use in a MultiSearcher, so if you
aren't worried about that i think it would work very nicely.

it would also make a really great PATCH, especially if IndexSearcher got a
new option that let you tell it to use this new scaled down
Sort/FieldCache option for strings because you weren't using it in a
MultiSearcher.


: If you are just using an IndexSearcher (no multisearchers), then the
: String[] isn't strictly needed... only the ordering (the int[]) is
: needed from the StringIndex.  One option is to create your own
: FieldCache that doesn't create/store that String[].


-Hoss




Re: Re[4]: OutOfMemory with search(Query, Sort)

Posted by Yonik Seeley <ys...@gmail.com>.
On 4/5/06, Artem Vasiliev <ar...@gmail.com> wrote:
> The int[] array here contains references into the String[], and to populate
> it all the field values still need to be loaded and compared/sorted

Terms are stored and iterated in sorted order, so no sorting needs to be done.
It's still the case that all the terms for that field need to be
iterated over though.
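
A rough sketch of that single pass, in plain Java: the SortedMap stands
in for Lucene's sorted term enumeration (term -> ids of the documents
containing it), and OrdOnlyCache is a made-up name. Ordinals are handed
out in iteration order and the term strings are never kept.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical ordinal-only cache builder, not a Lucene class.
class OrdOnlyCache {
    // Walks terms in sorted order once; no comparison or sort step is needed.
    static int[] buildOrds(SortedMap<String, int[]> postings, int maxDoc) {
        int[] ord = new int[maxDoc]; // 0 means "no term for this document"
        int next = 1;
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                ord[doc] = next;
            }
            next++; // the term string itself is not retained
        }
        return ord;
    }

    public static void main(String[] args) {
        SortedMap<String, int[]> postings = new TreeMap<>();
        postings.put("/b", new int[] {0});
        postings.put("/a", new int[] {2});
        postings.put("/c", new int[] {1, 3});
        System.out.println(Arrays.toString(buildOrds(postings, 5))); // [2, 3, 1, 3, 0]
    }
}
```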

Another approach might be to store term vectors and retrieve the term
only from documents matching a particular query.  It might be slower
per query, but wouldn't have the overhead of populating the int[]

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server



Re[4]: OutOfMemory with search(Query, Sort)

Posted by Artem Vasiliev <ar...@gmail.com>.
>> I tried to
>> sort by filePath field which can be 100 bytes on average, meaning 400M
>> RAM for the cache

YS> For string sorting, a FieldCache.StringIndex is used.
YS> It contains a sorted String[num_unique_terms_in_field], and an int[maxDoc]
YS> So if 10 documents share a large string field value, that value will
YS> only be in the fieldCache once.

YS> If you are just using an IndexSearcher (no multisearchers), then the
YS> String[] isn't strictly needed... only the ordering (the int[]) is
YS> needed from the StringIndex.  One option is to create your own
YS> FieldCache that doesn't create/store that String[].

The int[] array here contains references into the String[], and to populate
it all the field values still need to be loaded and compared/sorted,
which is what I want to avoid. I guess my option is not to use
FieldCache at all.

-- 
Best regards,
 Artem

http://sharehound.sourceforge.net sharehound, the open source filesystems indexer




Re: Re[2]: OutOfMemory with search(Query, Sort)

Posted by Yonik Seeley <ys...@gmail.com>.
On 4/1/06, Artem Vasiliev <ar...@gmail.com> wrote:
> I tried to
> sort by filePath field which can be 100 bytes on average, meaning 400M
> RAM for the cache

Well, it's probably not quite that bad...

For string sorting, a FieldCache.StringIndex is used.
It contains a sorted String[num_unique_terms_in_field], and an int[maxDoc]
So if 10 documents share a large string field value, that value will
only be in the fieldCache once.

If you are just using an IndexSearcher (no multisearchers), then the
String[] isn't strictly needed... only the ordering (the int[]) is
needed from the StringIndex.  One option is to create your own
FieldCache that doesn't create/store that String[].
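
To illustrate why the ordering alone is enough within a single index,
here is a minimal sketch that sorts matching doc ids using nothing but a
per-document ordinal array. OrdSort and the sample arrays are made up;
in Lucene the comparison would live inside a custom sort comparator.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper, not a Lucene class.
class OrdSort {
    // Orders hits by their precomputed ordinals; no String is ever compared.
    static Integer[] sortHits(int[] hits, int[] order) {
        Integer[] sorted = new Integer[hits.length];
        for (int i = 0; i < hits.length; i++) {
            sorted[i] = hits[i];
        }
        Arrays.sort(sorted, Comparator.comparingInt((Integer doc) -> order[doc]));
        return sorted;
    }

    public static void main(String[] args) {
        int[] order = {2, 3, 1, 3, 0}; // order[doc] = rank of that doc's field value
        int[] hits = {0, 1, 2};        // doc ids matching some query
        System.out.println(Arrays.toString(sortHits(hits, order))); // [2, 0, 1]
    }
}
```

The ordinals only mean something inside the index they were built from,
which is presumably why the strings are needed once results from several
indexes have to be merged.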

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
