Posted to java-user@lucene.apache.org by jm <jm...@gmail.com> on 2007/03/14 11:03:21 UTC

ways to minimize index size?

Hi,

I want to make my index as small as possible. I noticed
field.setOmitNorms(true), and I read on the list that the difference is
1 byte per field per doc, not huge but hey... Is the only effect that the
score is different? I hardly care about the score, so that would be OK.
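
To put the one-byte figure in perspective, here is a back-of-the-envelope sketch. It assumes exactly one norm byte per field per document, as quoted above. The class name NormsEstimate and the sample counts are made up for illustration and are not part of Lucene.

```java
// Rough estimate of the space taken by norms, assuming one byte per
// field per document (the figure quoted from the list). NormsEstimate is
// a hypothetical helper, not a Lucene class.
final class NormsEstimate {

    // Bytes used by norms for numDocs documents with fieldsWithNorms fields each.
    static long normsBytes(long numDocs, int fieldsWithNorms) {
        return numDocs * fieldsWithNorms;
    }

    public static void main(String[] args) {
        long docs = 10_000_000L;  // made-up document count
        int fields = 5;           // made-up number of fields that keep norms
        long bytes = normsBytes(docs, fields);
        System.out.println(bytes + " bytes, about " + (bytes / (1024 * 1024)) + " MB");
    }
}
```

Whether a saving of that order matters depends on how large the rest of the index is.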

And can I add documents without norms to an index that already contains
documents with norms?

Is there any other way to minimize the size of the index? All but one of
my fields are Field.Store.NO, Field.Index.TOKENIZED and
Field.TermVector.NO; the remaining one is Field.Store.YES,
Field.Index.UN_TOKENIZED and Field.TermVector.NO. I tried compressing that
one and the size dropped by around 1% (it's a small field), but I guess
compression means worse performance, so I am not sure about applying it.

thanks

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: ways to minimize index size?

Posted by Erick Erickson <er...@gmail.com>.
OK, I caused more confusion than rendered help by my stemming
statement. The only reason I mentioned it was to illustrate that
performance is not linearly related to size.

It took some effort to put stemming into the index, see
PorterStemmer etc. This is NOT the default. So I took it out
to see what the effect would be.

Why removing stemming made the index smaller: we also
have the requirement that phrases (i.e. words in double quotes)
do NOT match the stemmed version. Thus if we index
"running watching", the following searches have the
indicated results:
run - hits
watch - hits
running - hits
"run watch" - does NOT hit
"running watching" - hits

So I indexed the following terms...

run
running$
watch
watching$

with the two forms of run indexed in the same position (0)
and the two forms of watch in the same position (1).
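
The scheme above can be modelled in plain Java without any Lucene classes. This is only a sketch: naiveStem is a toy stand-in for a real stemmer such as PorterStemmer, and the class name, the "term@position" output format and the helper names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the scheme described above: every word is indexed
// twice at the same position, once stemmed and once as the exact form
// with a '$' suffix. All names here are hypothetical.
final class DualFormTokens {

    // Toy stemmer: strips a trailing "ing" and then a doubled final letter.
    static String naiveStem(String word) {
        String s = word;
        if (s.length() > 5 && s.endsWith("ing")) {
            s = s.substring(0, s.length() - 3);
            int n = s.length();
            if (n >= 2 && s.charAt(n - 1) == s.charAt(n - 2)) {
                s = s.substring(0, n - 1);
            }
        }
        return s;
    }

    // Returns "term@position" entries for each whitespace-separated word.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        String[] words = text.trim().split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            out.add(naiveStem(words[pos]) + "@" + pos);  // stemmed form
            out.add(words[pos] + "$@" + pos);            // exact form, same position
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("running watching"));
    }
}
```

Each word contributes two terms instead of one, which is presumably why the index with stemming came out larger than the one without it. On the query side, the natural reading is that unquoted words are stemmed while words inside a quoted phrase get the '$' suffix, but that part is an assumption.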

I agree that if we didn't have the exact-phrase-match requirement
the stemmed version of the index should be smaller....


Sorry for the confusion
Erick


Re: ways to minimize index size?

Posted by jm <jm...@gmail.com>.
hi Erick,


Well, typically my application will start with some hundreds of
indexes...and then grow at a rate of several per day, forever. At
some point I know I can do some merging etc. if needed.

Size depends on the customer and could be up to 1G per index. That
is why I would like to minimize them. I am not worried about search
performance.

I don't understand how not stemming can reduce the size of an index...I
would think it happens the other way around: doesn't stemming make the
words shorter? (I don't stem, so I never looked into it)

thanks


Re: ways to minimize index size?

Posted by Erick Erickson <er...@gmail.com>.
Store as little as possible, index as little as possible <G>.....

How big is your index, and how much do you expect it to grow?
I ask this because it's probably not worth your time to try to
reduce the index size below some threshold... I found that
reducing my index from 8G to 4G (through not stemming) gave
me about a 10% performance improvement, so at some point
it's just not worth the effort. Also, if you posted the index size,
it would give folks a chance to say "there's not much you can
gain by reducing things more". As it is, I don't have a clue
whether your index is 100M or 100T. The former is in the
"don't waste your time" class, and the latter is...er...
different....

I wouldn't bother compressing for 1%....

Question for "the guys" so I can check an assumption....
Is there any difference between these two?
Field(Name, Value, Store, index)
Field(Name, Value, Store, index, Field.TermVector.NO)


Best
Erick


Re: ways to minimize index size?

Posted by Sebastin <se...@gmail.com>.
Hi Erick, do you have any idea about this?


-- 
View this message in context: http://www.nabble.com/ways-to-minimize-index-size--tf3401213.html#a11214251
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



Re: ways to minimize index size?

Posted by Sebastin <se...@gmail.com>.
Steve,
I used your idea and it works great for me; thanks once again. But when I
use Index.NO_NORMS the index size increases, whereas when I use
Index.TOKENIZED the size is reduced.

I used the code you gave:

BigInteger _bi = new java.math.BigInteger("9198408365809", 10);
System.out.println(_bi.toString(36));

Other radixes increase the size.

The modifications I made to my code are below:

String outgoingNumber = "9198408365809";
String incomingNumber = "9840861114";
String dateSc = "070601";
String imsiNumber = "444021365987";
String callType = "1";

String outgoingRoute = "DJZ01";
String incomingRoute = "BSC01";

// Re-encode each decimal string in base 36 to shorten it.
BigInteger _on = new java.math.BigInteger(outgoingNumber, 10);
String compOutgoingNumber = _on.toString(36);

BigInteger _in = new java.math.BigInteger(incomingNumber, 10);
String compIncomingNumber = _in.toString(36);

BigInteger _ds = new java.math.BigInteger(dateSc, 10);
String compDateSc = _ds.toString(36);

BigInteger _im = new java.math.BigInteger(imsiNumber, 10);
String compImsiNumber = _im.toString(36);

// Search field (indexed, not stored)
String contents = compOutgoingNumber + " " + compIncomingNumber + " "
        + compDateSc + " " + compImsiNumber + " " + callType;

// Display field (stored, not indexed)
String records = compOutgoingNumber + " " + compIncomingNumber + " "
        + compDateSc + " " + outgoingRoute + " " + incomingRoute;

File indexDir = new File("/home/Mediation/Index");
IndexWriter indexWriter = new IndexWriter(indexDir, new StandardAnalyzer(), true);
Document doc = new Document();
doc.add(new Field("contents", contents, Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("records", records, Field.Store.YES, Field.Index.NO));
indexWriter.addDocument(doc);

Please help me achieve that.
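
One thing to watch with this approach: a value such as the date "070601" starts with a zero, and BigInteger discards leading zeros, so decoding the base-36 form gives back "70601". A minimal sketch of one way around that, assuming each field has a known fixed width, is below. FixedWidthRadix and its methods are hypothetical helpers, not part of the posted code or of Lucene.

```java
import java.math.BigInteger;

// Sketch of radix-36 encoding that can restore leading zeros on decode.
// FixedWidthRadix and its methods are hypothetical helpers.
final class FixedWidthRadix {

    // Decimal digit string -> base-36 string (leading zeros are dropped).
    static String encode(String decimal) {
        return new BigInteger(decimal, 10).toString(36);
    }

    // Base-36 string -> decimal string, left-padded with '0' to width digits.
    static String decode(String base36, int width) {
        StringBuilder sb = new StringBuilder(new BigInteger(base36, 36).toString(10));
        while (sb.length() < width) {
            sb.insert(0, '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String dateSc = "070601";
        String term = encode(dateSc);
        System.out.println(dateSc + " -> " + term + " -> " + decode(term, dateSc.length()));
    }
}
```

If the field widths vary from record to record, the width would have to be stored alongside the value, which eats into the saving.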




Re: ways to minimize index size?

Posted by Sebastin <se...@gmail.com>.
Hi Steve,
Thanks a lot for your reply. It now compresses up to 50% of the original
size. Is there any possibility of compressing up to 80% using this code?



Re: ways to minimize index size?

Posted by Steve Liles <st...@knowledgeview.co.uk>.
Compression aside you could index the "contents" as terms in separate 
fields instead of tokenized text, and disable storing of norms:

String outgoingNumber="9198408365809";
String incomingNumber="9840861114";

_doc.add(new Field("outgoingNumber", outgoingNumber, Store.NO, 
Index.NO_NORMS));
_doc.add(new Field("incomingNumber", incomingNumber, Store.NO, 
Index.NO_NORMS));

According to the docs, "Index.NO_NORMS" will save you one byte per
indexed field for every document in the index.

Or you could index all of the data as separate terms in the same 
"contents" field if you wanted (make the first param "contents" for all 
of the terms), which is more comparable to what you are currently doing.
(Another advantage is that the Analyzer will not be used for fields 
which are untokenized, and indexing should be faster.)

...

One way to compress numerical data (possibly not the best - I'm no
expert) is to change the base of the number that is indexed / stored in
the index.

java.lang.Long and java.math.BigInteger have methods for converting from 
one radix to another. Taking your "outgoingNumber" as an example:

//compression
BigInteger _bi = new java.math.BigInteger("9198408365809", 10);
System.out.println(_bi.toString(36));

 > 39douufap

//decompression
BigInteger _bi2 = new java.math.BigInteger("39douufap", 36);
System.out.println(_bi2.toString(10));

 > 9198408365809

Converting to a higher radix will give you better compression but you'll
have to do it yourself as the jdk classes only work up to base 36
<http://en.wikipedia.org/wiki/Base_36>.
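
As a rough sketch of what "doing it yourself" could look like, here is a hand-rolled base-62 encoder and decoder built on BigInteger arithmetic. The class name Base62 and the alphabet order (digits, then lowercase, then uppercase) are assumptions of this example, not an established convention of the JDK or Lucene.

```java
import java.math.BigInteger;

// Hand-rolled base-62 encoding, since BigInteger.toString(radix) stops at
// radix 36. Base62 and its alphabet order are assumptions of this sketch.
final class Base62 {
    private static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final BigInteger BASE = BigInteger.valueOf(62);

    // Non-negative decimal digit string -> base-62 string.
    static String encode(String decimal) {
        BigInteger n = new BigInteger(decimal, 10);
        if (n.signum() < 0) throw new IllegalArgumentException("negative value");
        if (n.signum() == 0) return "0";
        StringBuilder sb = new StringBuilder();
        while (n.signum() > 0) {
            BigInteger[] qr = n.divideAndRemainder(BASE);
            sb.append(ALPHABET.charAt(qr[1].intValue()));  // least significant digit first
            n = qr[0];
        }
        return sb.reverse().toString();
    }

    // Base-62 string -> decimal string.
    static String decode(String encoded) {
        BigInteger n = BigInteger.ZERO;
        for (int i = 0; i < encoded.length(); i++) {
            int digit = ALPHABET.indexOf(encoded.charAt(i));
            if (digit < 0) throw new IllegalArgumentException("bad digit: " + encoded.charAt(i));
            n = n.multiply(BASE).add(BigInteger.valueOf(digit));
        }
        return n.toString(10);
    }

    public static void main(String[] args) {
        String number = "9198408365809";
        System.out.println("base 10: " + number.length() + " chars");
        System.out.println("base 36: " + new BigInteger(number).toString(36).length() + " chars");
        System.out.println("base 62: " + encode(number).length() + " chars");
    }
}
```

Two caveats. For a 13-digit number the gain over base 36 is a single character, so the extra code may not pay for itself. And a base-62 alphabet is case-sensitive, so the terms would need to go into untokenized fields: StandardAnalyzer lowercases tokens, which would corrupt them.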

It's worth compressing your unstored "contents" field as well as your 
stored "records" field, as the unique terms in the "contents" field will 
effectively be stored.

Also don't forget to convert the terms when you search too, otherwise 
you won't find anything ;)
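
To illustrate the point about converting terms at search time, here is a toy model in which a HashSet stands in for the index's term dictionary. No Lucene API is used, and RadixLookup and its method names are hypothetical.

```java
import java.math.BigInteger;
import java.util.HashSet;
import java.util.Set;

// Toy illustration of converting query terms the same way as indexed
// terms. The Set stands in for the index's term dictionary.
final class RadixLookup {
    private final Set<String> terms = new HashSet<>();

    // The single conversion used at both index time and search time.
    static String toTerm(String decimal) {
        return new BigInteger(decimal, 10).toString(36);
    }

    void index(String decimal) {
        terms.add(toTerm(decimal));
    }

    // Looks up the raw string without conversion (what a naive search would do).
    boolean containsRaw(String term) {
        return terms.contains(term);
    }

    // Converts the user's decimal input first, then looks it up.
    boolean search(String decimal) {
        return terms.contains(toTerm(decimal));
    }

    public static void main(String[] args) {
        RadixLookup idx = new RadixLookup();
        idx.index("9198408365809");
        System.out.println(idx.containsRaw("9198408365809")); // unconverted query
        System.out.println(idx.search("9198408365809"));      // converted query
    }
}
```

Keeping the conversion in one shared method makes it harder for the indexing and searching code paths to drift apart.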

Steve.





Re: ways to minimize index size?

Posted by Sebastin <se...@gmail.com>.
When I use the StandardAnalyzer, the storage size increases. How can I
minimize the index size?

Sebastin wrote:
>
> String outgoingNumber="9198408365809";
> String incomingNumber="9840861114";
> String dateSc="070601";
> String imsiNumber="444021365987";
> String callType="1";
>
> // The remaining variables (callingPartyNumber, calledPartyNumber,
> // chargDur, incomingRoute, outgoingRoute, timeSc, indexDir) are
> // defined elsewhere and are not shown here.
>
> //Search Fields
> String contents=(outgoingNumber+" "+incomingNumber+" "+dateSc+" "
>     +imsiNumber+" "+callType);
>
> //Display Fields
> String records=(callingPartyNumber+" "+calledPartyNumber+" "+dateSc+" "
>     +chargDur+" "+incomingRoute+" "+outgoingRoute+" "+timeSc);
>
> IndexWriter indexWriter = new IndexWriter(indexDir,new StandardAnalyzer(),true);
>
> Document document = new Document();
> document.add(new Field("contents",contents,Field.Store.NO,Field.Index.TOKENIZED));
> document.add(new Field("records",records,Field.Store.YES,Field.Index.NO));
>
> indexWriter.setUseCompoundFile(true);
> indexWriter.addDocument(document);
>
> Please help me achieve the minimum size.
>
> Erick Erickson wrote:
>> 
>> Show us the code you use to index. Are you storing the fields?
>> omitting norms? Throwing out stop words?
>> 
>> Best
>> Erick
>> 
>> On 6/19/07, Sebastin <se...@gmail.com> wrote:
>>>
>>>
>>> Hi Does anyone give me an idea to reduce the Index size to down.now i am
>>> getting 42% compression in my index store.i want to reduce upto 70%.i
>>> use
>>> standardanalyzer to write the document.when i use SimpleAnalyzer it
>>> reduce
>>> upto 58% but i couldnt search the document.please help me to acheive.
>>>
>>> Thanks in advance
>>>
>>> Jeff-188 wrote:
>>> >
>>> >>I found that reducing my index from 8G to 4G (through not stemming)
>>> gave
>>> me
>>> > about a 10% performance improvement.
>>> >
>>> > How did you do this? I don't see this as an option.
>>> >
>>> > Jeff
>>> >
>>> >
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/ways-to-minimize-index-size--tf3401213.html#a11195406
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>> 
>> 
> 
> 

-- 
View this message in context: http://www.nabble.com/ways-to-minimize-index-size--tf3401213.html#a11207318
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: ways to minimize index size?

Posted by Sebastin <se...@gmail.com>.
                       
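import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

String outgoingNumber = "9198408365809";
String incomingNumber = "9840861114";
String dateSc = "070601";
String imsiNumber = "444021365987";
String callType = "1";

// Search fields: indexed but not stored
String contents = outgoingNumber + " " + incomingNumber + " " + dateSc + " "
        + imsiNumber + " " + callType;

// Display fields: stored but not indexed. callingPartyNumber,
// calledPartyNumber, chargDur, incomingRoute, outgoingRoute and timeSc
// are not shown in this snippet.
String records = callingPartyNumber + " " + calledPartyNumber + " " + dateSc
        + " " + chargDur + " " + incomingRoute + " " + outgoingRoute + " "
        + timeSc;

// indexDir is not shown in this snippet; true creates a new index
IndexWriter indexWriter = new IndexWriter(indexDir, new StandardAnalyzer(), true);
indexWriter.setUseCompoundFile(true);

Document document = new Document();
document.add(new Field("contents", contents, Field.Store.NO, Field.Index.TOKENIZED));
document.add(new Field("records", records, Field.Store.YES, Field.Index.NO));
indexWriter.addDocument(document);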

Please help me achieve the minimum index size.





Erick Erickson wrote:
> 
> Show us the code you use to index. Are you storing the fields?
> omitting norms? Throwing out stop words?
> 
> Best
> Erick
> 

-- 
View this message in context: http://www.nabble.com/ways-to-minimize-index-size--tf3401213.html#a11195897
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: ways to minimize index size?

Posted by Erick Erickson <er...@gmail.com>.
Show us the code you use to index. Are you storing the fields?
Omitting norms? Throwing out stop words?

Best
Erick

On 6/19/07, Sebastin <se...@gmail.com> wrote:
>
>
> Hi, can anyone give me an idea of how to reduce the index size? I am
> currently getting 42% compression in my index store and want to reach
> up to 70%. I use StandardAnalyzer to write the documents. When I use
> SimpleAnalyzer the reduction goes up to 58%, but then I cannot search
> the documents. Please help me achieve this.
>
> Thanks in advance
>

RE: ways to minimize index size?

Posted by Sebastin <se...@gmail.com>.
Hi, can anyone give me an idea of how to reduce the index size? I am
currently getting 42% compression in my index store and want to reach
up to 70%. I use StandardAnalyzer to write the documents. When I use
SimpleAnalyzer the reduction goes up to 58%, but then I cannot search
the documents. Please help me achieve this.

 Thanks in advance
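
A possible explanation for the SimpleAnalyzer result, offered as a sketch
rather than a diagnosis: as far as I know, SimpleAnalyzer splits text at
non-letter characters and lower-cases it, so a field made up only of digits
would yield no tokens at all. That would shrink the index and also make the
field unsearchable. The plain-Java class below only imitates that behaviour;
TokenizerSketch, letterTokens and whitespaceTokens are hypothetical names,
not Lucene classes, and the sample string is an assumed example of a
digits-only search field.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizerSketch {

    // Imitates a letter-only tokenizer: a token is a maximal run of letters,
    // lower-cased. Digits and punctuation act as separators and are dropped.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Imitates a whitespace tokenizer: every run of non-blank characters,
    // digits included, becomes one token.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // An assumed example of a search field that contains only numbers
        String contents = "9198408365809 9840861114 070601 444021365987 1";
        System.out.println("letter-only: " + letterTokens(contents));
        System.out.println("whitespace:  " + whitespaceTokens(contents));
    }
}
```

If this is what is happening, an analyzer that keeps digits (for example
one that splits on whitespace only) may be worth trying before comparing
index sizes again.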

Jeff-188 wrote:
> 
>> I found that reducing my index from 8G to 4G (through not stemming)
>> gave me about a 10% performance improvement.
> 
> How did you do this? I don't see this as an option.
> 
> Jeff
> 
> 

-- 
View this message in context: http://www.nabble.com/ways-to-minimize-index-size--tf3401213.html#a11195406
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

