Posted to java-user@lucene.apache.org by ma...@thomson.com on 2008/02/08 16:51:44 UTC

large term vectors

Hi,

 

I have a large index which is around 275GB. As I search different parts
of the index, the memory footprint grows with large byte arrays being
cached; they never seem to be unloaded or GC'ed. Is there any way to
control this behavior so that I can periodically unload the cached
information?

 

The nature of the data being indexed doesn't allow me to reduce the
number of terms per field, although I might be able to reduce the
overall number of fields (I have some which aren't currently searched
on).

 

I've just begun investigating and profiling the problem, so I don't have
a lot of details at this time. Any support would be extremely welcome.

 

Thanks,

 

Marc Dumontier
Manager, Software Development
Thomson Scientific (Canada)
1 Yonge Street, Suite 1801
Toronto, Ontario M5E 1W7

 

Direct +1 416 214 3448
Mobile +1 416 454 3147

 


Re: large term vectors

Posted by Karl Wettin <ka...@gmail.com>.
http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/document/Field.Index.html#NO_NORMS

?
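
Norms are what back those byte arrays: one byte per document per indexed
field, used for index-time boosts and length normalization. If you can
live without those, fields indexed without norms avoid the cache
entirely. A rough, untested sketch against the 2.3 API linked above; the
path and field names are made up:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class NoNormsExample {
        public static void main(String[] args) throws Exception {
            IndexWriter writer =
                new IndexWriter("/tmp/no-norms-index", new StandardAnalyzer(), true);

            Document doc = new Document();
            // NO_NORMS indexes the value untokenized and writes no norms,
            // so SegmentReader never loads a byte[maxDoc] for this field.
            doc.add(new Field("id", "doc-42", Field.Store.YES, Field.Index.NO_NORMS));

            // For a tokenized field, keep TOKENIZED and switch norms off
            // (setOmitNorms should be there in recent 2.x, but check your version).
            Field body = new Field("body", "full text here ...",
                                   Field.Store.NO, Field.Index.TOKENIZED);
            body.setOmitNorms(true);
            doc.add(body);

            writer.addDocument(doc);
            writer.close();
        }
    }

Note that you would have to rebuild the index: once segments contain
norms for a field, they tend to stick around through merges.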


On 11 Feb 2008, at 15:55, <ma...@thomson.com> wrote:

> Hi Grant,
>
> Lucene 2.2.0
>
> I'm not actually explicitly storing term vectors. It seems the huge
> number of byte arrays is actually coming from SegmentReader.norms.
> Maybe that cache grows constantly; I read somewhere that it's loaded
> on demand. I'm not using any field or document boosting... is there
> some way to optimize around this?
>
> Marc
>
>
> [rest of quoted thread snipped]


RE: large term vectors

Posted by ma...@thomson.com.
Hi Grant,

Lucene 2.2.0

I'm not actually explicitly storing term vectors. It seems the huge
number of byte arrays is actually coming from SegmentReader.norms. Maybe
that cache grows constantly; I read somewhere that it's loaded on demand.
I'm not using any field or document boosting... is there some way to
optimize around this?
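
If it helps to size the problem: as far as I can tell, each SegmentReader
caches norms as one byte per document per indexed field, loaded the first
time that field is searched and held until the reader is closed. As
rough, purely illustrative arithmetic: 10 million documents x 20 indexed
fields would pin about 200 MB per fully warmed index.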

Marc


-----Original Message-----
From: Grant Ingersoll [mailto:gsingers@apache.org] 
Sent: Monday, February 11, 2008 7:46 AM
To: java-user@lucene.apache.org
Subject: Re: large term vectors

Hi Marc,

Can you give more info about what your field properties are?  Your  
subject line implies you are storing term vectors, is that the case?

Also, what version of Lucene are you using?

Cheers,
Grant

On Feb 8, 2008, at 10:51 AM, <ma...@thomson.com> wrote:

> [original message snipped]


Re: large term vectors

Posted by Grant Ingersoll <gs...@apache.org>.
Hi Marc,

Can you give more info about what your field properties are?  Your  
subject line implies you are storing term vectors, is that the case?

Also, what version of Lucene are you using?

Cheers,
Grant

On Feb 8, 2008, at 10:51 AM, <ma...@thomson.com> wrote:

> [original message snipped]

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







Re: large term vectors

Posted by Cedric Ho <ce...@gmail.com>.
I guess it would be quite different for different apps.

For me, I do index updates on a single machine: each incoming document
is indexed into one chunk, assigned by a rule that ensures even
distribution. Then I copy all the updated indexes to other machines for
searching, and each machine reopens its updated index.

For searching you can look at RemoteSearchable + ParallelMultiSearcher
(rough sketch below). But if you need redundancy, failover, etc., you
will probably need to build that yourself.
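
The search side ends up shaped something like this. An untested sketch,
with made-up paths and field names; if the chunks live on other machines,
RemoteSearchable stubs would slot in where the IndexSearchers are:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ParallelMultiSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searchable;

    public class SearchChunks {
        public static void main(String[] args) throws Exception {
            // one searcher per index chunk
            Searchable[] chunks = new Searchable[] {
                new IndexSearcher("/indexes/chunk-00"),
                new IndexSearcher("/indexes/chunk-01"),
                new IndexSearcher("/indexes/chunk-02"),
            };

            // queries each chunk in its own thread and merges the hits
            ParallelMultiSearcher searcher = new ParallelMultiSearcher(chunks);

            Query q = new QueryParser("body", new StandardAnalyzer()).parse("lucene");
            Hits hits = searcher.search(q);
            for (int i = 0; i < Math.min(10, hits.length()); i++) {
                System.out.println(hits.doc(i).get("id"));
            }
            searcher.close();
        }
    }

The nice part is that the chunks look like one logical index to the
caller, so picking up an updated chunk only means closing that chunk's
searcher and building a fresh ParallelMultiSearcher over the new set.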

Cedric


On Feb 11, 2008 11:14 AM, Briggs <ac...@gmail.com> wrote:
> So, I have a question about 'splitting indexes'. I see people mention
> this all over, but how have people been handling it? I'm going to
> start a new thread (there probably was one back in the day, but I'm
> going to fire it up again). But how did you do it?
>
> [rest of quoted thread snipped]


Re: large term vectors

Posted by Briggs <ac...@gmail.com>.
So, I have a question about 'splitting indexes'. I see people mention
this all over, but how have people been handling it? I'm going to start
a new thread (there probably was one back in the day, but I'm going to
fire it up again). But how did you do it?

On Feb 10, 2008 9:18 PM, Cedric Ho <ce...@gmail.com> wrote:
> Is it a single index? My index is also in the 200G range, but I never
> managed to get a single index of size > 20G and still get acceptable
> performance (in both searching and updating), so I split my indexes
> into chunks of < 10G.
>
> I am curious as to how you manage such a single large index.
>
> Cedric
>
> [original message snipped]



-- 
"Conscious decisions by conscious minds are what make reality real"



RE: large term vectors

Posted by ma...@thomson.com.
No, it's split into about 100 individual indexes. But I'm running my
64-bit JVM with around 10GB max memory in order to avoid running out of
memory while running all my unit tests (I have some other indexes
running as part of this application as well).

Upon further investigation, it seems to have something to do with the
norms (SegmentReader.norms).

Marc


-----Original Message-----
From: Cedric Ho [mailto:cedric.ho@gmail.com] 
Sent: Sunday, February 10, 2008 9:19 PM
To: java-user@lucene.apache.org
Subject: Re: large term vectors

Is it a single index? My index is also in the 200G range, but I never
managed to get a single index of size > 20G and still get acceptable
performance (in both searching and updating), so I split my indexes
into chunks of < 10G.

I am curious as to how you manage such a single large index.

Cedric



On Feb 8, 2008 11:51 PM,  <ma...@thomson.com> wrote:
> [original message snipped]



Re: large term vectors

Posted by Cedric Ho <ce...@gmail.com>.
Is it a single index? My index is also in the 200G range, but I never
managed to get a single index of size > 20G and still get acceptable
performance (in both searching and updating), so I split my indexes
into chunks of < 10G.

I am curious as to how you manage such a single large index.

Cedric



On Feb 8, 2008 11:51 PM,  <ma...@thomson.com> wrote:
> [original message snipped]

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org