
Posted to java-user@lucene.apache.org by Niraj Alok <ni...@emacmillan.com> on 2004/09/01 06:45:18 UTC

Re: indexing size

I was also thinking along the same lines.
Actually, the original code was written by someone else who has left, so
I now own it.

In almost all places it is Field.Text, and in a few places it is
Field.UnIndexed.
I looked at the javadocs and found that there is also Field.UnStored.

The problem is that I am not sure which one to change to what. It would be
really enlightening if you could point out the differences
between those three and what I would need to change in my search code.

If I make some of them Field.UnStored, I can see from the javadocs that they
will be indexed and tokenized but not stored. If a field is not stored, how
can I use it while searching? Basically, what is meant by indexed and stored,
indexed and not stored, and not indexed and stored?

Regards,
Niraj
----- Original Message -----
From: "petite_abeille" <pe...@mac.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Tuesday, August 31, 2004 8:57 PM
Subject: Re: indexing size

>
> On Aug 31, 2004, at 17:17, Otis Gospodnetic wrote:
>
> > You also have a large number of
> > fields, and it looks like a lot (all?) of them are stored and indexed.
> > That's what that large .fdt file indicated.  That file is > 206 MB in
> > size.
>
> Try using Field.UnStored() to avoid storing all those data in your
> indices as it's usually not necessary.
>
> PA.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>

Re: indexing size

Posted by Bernhard Messer <Be...@intrafind.de>.

Dmitry Serebrennikov wrote:

> Niraj Alok wrote:
>
>> Hi PA,
>>
>> Thanks for the detail ! Since we are using lucene to store the data 
>> also, I
>> guess I would not be able to use it.
>>  
>>
> By the way, I could be wrong, but I think the 35% figure you 
> referenced in your first e-mail does not actually include any 
> stored fields. The point of the 35% figure was, I think, to illustrate 
> that the index data structures Lucene uses for searching are efficient. 
> But Lucene does nothing special with stored content: no compression or 
> anything like that. So you end up with the full size of your stored 
> data plus roughly 35% of the size of the indexed data.

A patch will be available by the end of this week that allows you to 
store compressed binary values within a Lucene index. This means 
that you will be able to store and retrieve whole documents within 
Lucene in a very efficient way ;-)

regards
bernhard

>
>
> Cheers.
> Dmitry.
>

Re: indexing size

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.

Niraj Alok wrote:

>Hi PA,
>
>Thanks for the detail ! Since we are using lucene to store the data also, I
>guess I would not be able to use it.
>  
>
By the way, I could be wrong, but I think the 35% figure you referenced 
in your first e-mail does not actually include any stored fields. 
The point of the 35% figure was, I think, to illustrate that the index 
data structures Lucene uses for searching are efficient. But Lucene does 
nothing special with stored content: no compression or anything like 
that. So you end up with the full size of your stored data plus roughly 
35% of the size of the indexed data.
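
To put rough numbers on that, here is a back-of-the-envelope sketch in plain Java. It is only the arithmetic described above, not anything Lucene computes for you: the class and method names are made up for illustration, the 200 MB input is a hypothetical figure, and 35% is simply the number quoted in this thread, so treat the results as ballpark values.

```java
// Back-of-the-envelope estimate only; none of this is Lucene API.
class IndexSizeEstimate {

    // Stored content is kept at full size; the search structures add roughly
    // indexPercent percent of the indexed text size on top of it.
    static long estimateBytes(long storedBytes, long indexedTextBytes, int indexPercent) {
        return storedBytes + indexedTextBytes * indexPercent / 100;
    }

    public static void main(String[] args) {
        long text = 200L * 1024 * 1024; // hypothetical: 200 MB of source text

        // Every field both stored and indexed.
        System.out.println("stored + indexed: " + estimateBytes(text, text, 35) + " bytes");

        // Same text indexed only, nothing stored.
        System.out.println("indexed only:     " + estimateBytes(0, text, 35) + " bytes");
    }
}
```

The difference between the two printed values is the full size of the stored text, which is the part that ends up in the stored-fields file.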

Cheers.
Dmitry.


Re: indexing size

Posted by Niraj Alok <ni...@emacmillan.com>.

Hi PA,

Thanks for the details! Since we are also using Lucene to store the data, I
guess I will not be able to use this approach.

Regards,
Niraj
----- Original Message -----
From: "petite_abeille" <pe...@mac.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Wednesday, September 01, 2004 1:14 PM
Subject: Re: indexing size


> Hi Niraj,
>
> On Sep 01, 2004, at 06:45, Niraj Alok wrote:
>
> > If I make some of them Field.Unstored, I can see from the javadocs
> > that it
> > will be indexed and tokenized but not stored. If it is not stored, how
> > can I
> > use it while searching?
>
> The different type of fields don't impact how you do your search. This
> is always the same.
>
> Using Unstored fields simply means that you use Lucene as a pure index
> for search purpose only, not for storing any data.
>
> Specifically, the assumption is that your original data lives somewhere
> else, outside of Lucene. If this assumption is true, then you can index
> everything as Unstored with the addition of one Keyword per document.
> The Keyword field holds some sort of unique identifier which allows you
> to retrieve the original data if necessary (e.g. a primary key, an URI,
> what not).
>
> Here is an example of this approach:
>
> (1) For indexing, check the indexValuesWithID() method
>
> http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/
> SZIndex.java?view=markup
>
> Note the addition of a Field.Keyword for each document and the use of
> Field.UnStored for everything else
>
> (2) For fetching, check objectsWithSpecificationAndHitsInStore()
>
> http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/
> SZFinder.java?view=markup
>
> HTH.
>
> Cheers,
>
> PA.
>
>

Re: indexing size

Posted by petite_abeille <pe...@mac.com>.

Hi Niraj,

On Sep 01, 2004, at 06:45, Niraj Alok wrote:

> If I make some of them Field.Unstored, I can see from the javadocs  
> that it
> will be indexed and tokenized but not stored. If it is not stored, how  
> can I
> use it while searching?

The different types of fields don't affect how you search. Searching  
always works the same way.

Using UnStored fields simply means that you use Lucene as a pure index  
for search purposes only, not for storing any data.

Specifically, the assumption is that your original data lives somewhere  
else, outside of Lucene. If this assumption is true, then you can index  
everything as UnStored with the addition of one Keyword per document.  
The Keyword field holds some sort of unique identifier which allows you  
to retrieve the original data if necessary (e.g. a primary key, a URI,  
or whatnot).

Here is an example of this approach:

(1) For indexing, check the indexValuesWithID() method

http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/SZIndex.java?view=markup

Note the addition of a Field.Keyword for each document and the use of  
Field.UnStored for everything else.

(2) For fetching, check objectsWithSpecificationAndHitsInStore()

http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/SZFinder.java?view=markup
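
For readers who cannot follow the links, the idea can be sketched without Lucene at all. What follows is a toy model in plain Java, not the Lucene API and not the ZOE code: the class KeyOnlyIndex, its methods and the in-memory "store" map are all invented for illustration. The index remembers only which key each term belongs to, much like UnStored fields plus one Keyword, and the original text is fetched from the external store by that key.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model, not the Lucene API: an inverted index whose only retrievable
// value per document is its key.
class KeyOnlyIndex {

    // term -> keys of the documents containing that term
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Index the text under the given key. The text itself is not kept.
    void add(String key, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(key);
            }
        }
    }

    // A search returns keys only; the caller resolves them elsewhere.
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        // Hypothetical external store (a database, a file system, ...).
        Map<String, String> store = new HashMap<>();
        store.put("doc-1", "Lucene keeps the index small");
        store.put("doc-2", "The original data lives elsewhere");

        KeyOnlyIndex index = new KeyOnlyIndex();
        store.forEach(index::add);

        // Search the index, then fetch the original text by key.
        for (String key : index.search("data")) {
            System.out.println(key + " -> " + store.get(key));
        }
    }
}
```

The search code does not care whether the text was stored. Only the step that displays a hit changes, because it has to go to the external store with the key.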

HTH.

Cheers,

PA.


Re: indexing size

Posted by Stephane James Vaucher <va...@cirano.qc.ca>.

On Wed, 1 Sep 2004, Niraj Alok wrote
> I was also thinking on the same lines.
> Actually the original code was written by some one else who has left and so
> I have to own this.
>
> At almost all the places, it is Field.Text and at some few places its
> Field.UnIndexed.
> I looked at the javadocs and found that there is Field.UnStored also.
>
> The problem is I am not too sure which one to change to what. It would be
> really enlightening if you could point the differences
> between those three and what would I need to change in my search code.
>
> If I make some of them Field.Unstored, I can see from the javadocs that
> it will be indexed and tokenized but not stored. If it is not stored,
> how can I use it while searching? Basically what is meant by indexed and
> stored, indexed and not stored and not indexed and stored?

If all you need is to search a field, you do not need to store it. A field
that is not stored can still be tokenised and analysed by Lucene. It is then
kept only as a set of tokens in the index, not as a whole. You can thus use
this for fields that you never need to retrieve from the index.

For example:
the quick brown fox jumped over the lazy dog.

will be stored in Lucene only as tokens, not as a whole. Using a
whitespace analyser with a stopword list {the}, you will have these
tokens in Lucene:
quick
brown
fox
jumped
over
lazy
dog

You will NOT be able to retrieve the original text, but you will be able
to search it.
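
The same example can be written as a small plain-Java sketch. It only imitates what a whitespace analyser with a stopword list does and does not use any Lucene class. The class and method names are made up, and the trailing full stop is left out of the input because a pure whitespace split would keep it attached to the last word.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Imitation of "split on whitespace, drop stopwords"; not a Lucene analyser.
class StopwordTokens {

    static List<String> tokenize(String text, Set<String> stopwords) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            // Keep every non-empty word that is not on the stopword list.
            if (!word.isEmpty() && !stopwords.contains(word)) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopwords = Collections.singleton("the");
        // Prints [quick, brown, fox, jumped, over, lazy, dog]
        System.out.println(tokenize("the quick brown fox jumped over the lazy dog", stopwords));
    }
}
```

Both occurrences of "the" are gone from the result, so the sentence cannot be rebuilt from the tokens alone.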

HTH,
sv
