
Posted to java-user@lucene.apache.org by Gregory Tarr <Gr...@detica.com> on 2009/07/15 12:48:41 UTC

Index doubling in size when adding extra terms

I have added a new field to each document in my index containing
substrings of another field to speed up initial-wildcard searches.

Each document has a field "text" which might contain "the quick brown
fox jumped over the lazy dogs"
The new field, "text_substrings", would then contain "the quick uick ick
brown rown own fox jumped umped mped ped over ver the lazy azy dogs ogs"

This allows me to convert initial wildcard queries "*own" into a term
query "own".
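
A minimal sketch of the scheme described above, in plain Java with no
Lucene dependency. The class and method names are hypothetical, and the
minimum suffix length of 3 is only inferred from the example (it is not
stated in the post). Whitespace tokenization is also an assumption; a real
analyzer may lowercase or split differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: builds the "text_substrings" value and rewrites
// leading-wildcard patterns into plain terms.
final class SuffixField {
    // Assumption: suffixes shorter than this are not indexed.
    static final int MIN_SUFFIX_LENGTH = 3;

    // For each whitespace-separated word, emit the word itself followed by
    // every proper suffix that is at least MIN_SUFFIX_LENGTH characters long.
    static String expand(String text) {
        List<String> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            out.add(word);
            for (int start = 1; word.length() - start >= MIN_SUFFIX_LENGTH; start++) {
                out.add(word.substring(start));
            }
        }
        return String.join(" ", out);
    }

    // Returns the term to look up in the suffix field, or empty if the
    // pattern is not a pure leading-wildcard pattern this scheme can answer.
    static Optional<String> toSuffixTerm(String pattern) {
        if (pattern == null || !pattern.startsWith("*")) {
            return Optional.empty();
        }
        String rest = pattern.substring(1);
        boolean hasOtherWildcard = rest.contains("*") || rest.contains("?");
        if (rest.length() < MIN_SUFFIX_LENGTH || hasOtherWildcard) {
            return Optional.empty();
        }
        return Optional.of(rest);
    }

    public static void main(String[] args) {
        System.out.println(expand("the quick brown fox jumped over the lazy dogs"));
        System.out.println(toSuffixTerm("*own").orElse("(no rewrite)"));
    }
}
```

With these assumptions, a pattern such as "*gs" cannot be rewritten,
because two-character suffixes never reach the field.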

However, adding this field has exactly doubled the size of the index.
Given that the term list is a small fraction of the index (?), I find
this strange. I think it might be storing the documents twice.

Is there any way to stop this from happening?

Thanks

Greg Tarr




This message should be regarded as confidential. If you have received this email in error please notify the sender and destroy it immediately.
Statements of intent shall only become binding when confirmed in hard copy by an authorised signatory.  The contents of this email may relate to dealings with other companies within the Detica Limited group of companies.

Detica Limited is registered in England under No: 1337451.

Registered offices: Surrey Research Park, Guildford, Surrey, GU2 7YP, England.

Re: Index doubling in size when adding extra terms

Posted by Michael McCandless <lu...@mikemccandless.com>.

It looks like your "text_substrings" field will have many more unique
terms than the original text, right?  And, since it's indexed (I
assume), the docIDs will in fact be stored twice (once in the postings
for your orig text and once in the postings for text_substrings).  So
I think it's expected that the postings (*.frq/.prx/.tis/.tii) would
at least double in size.
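
As a rough illustration of that point, the sketch below counts tokens and
unique terms for the two example field values from the original post. It
is plain Java rather than Lucene code, the class name is hypothetical, and
whitespace tokenization is an assumption. The counts only hint at postings
growth; they are not a measure of on-disk size.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: compares how many tokens and distinct terms each
// field value would contribute under simple whitespace tokenization.
final class TermCounts {
    static String[] tokens(String fieldValue) {
        String trimmed = fieldValue.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Every token occurrence adds position data to the postings.
    static int tokenCount(String fieldValue) {
        return tokens(fieldValue).length;
    }

    // Every distinct term adds an entry to the term dictionary.
    static int uniqueTermCount(String fieldValue) {
        Set<String> unique = new HashSet<>(Arrays.asList(tokens(fieldValue)));
        return unique.size();
    }

    public static void main(String[] args) {
        String text = "the quick brown fox jumped over the lazy dogs";
        String substrings = "the quick uick ick brown rown own fox jumped "
                + "umped mped ped over ver the lazy azy dogs ogs";
        System.out.println("text: " + tokenCount(text) + " tokens, "
                + uniqueTermCount(text) + " unique");
        System.out.println("text_substrings: " + tokenCount(substrings)
                + " tokens, " + uniqueTermCount(substrings) + " unique");
    }
}
```

For the example sentence this gives 9 tokens (8 unique) for "text" and
19 tokens (18 unique) for "text_substrings", so the extra field alone
carries roughly twice as many term occurrences as the original.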

Mike


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Index doubling in size when adding extra terms

Posted by Uwe Schindler <uw...@thetaphi.de>.

Field.Store.NO used for the text_substrings field? :-)

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de



