Posted to solr-user@lucene.apache.org by nick19701 <to...@yahoo.com> on 2007/03/26 21:57:04 UTC

which one will save hard disk space?

 <field name="signature" type="string" indexed="false" stored="true"
compressed="true"/>
 <field name="signature" type="string" indexed="true" stored="true"
compressed="true"/>

I don't need to search the "signature" field. But my intuition tells me that
if I index this field, I will use less hard disk space since a lot of docs
may have the same signature.

Am I right?

-- 
View this message in context: http://www.nabble.com/which-one-will-save-hard-disk-space--tf3469131.html#a9680164
Sent from the Solr - User mailing list archive at Nabble.com.


Re: which one will save hard disk space?

Posted by Chris Hostetter <ho...@fucit.org>.
: Now suppose I have a lot of docs with same signature and signature
: is a very long string. It seems to me indexing the signature will save me
: hard disk space.

That's true, and if you were using Lucene directly you could do this and
then use the StringIndex FieldCache to look up the value for each doc, but
Solr doesn't have any special optimization like that at the moment.
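
To make the idea concrete, here is a rough conceptual model of a per-field
string index: each distinct value is held once, and every doc keeps only a
small ordinal pointing at it. This is a sketch of the concept only, not
Lucene's actual code or API. The function names `build_string_index` and
`value_for_doc` are made up for illustration, and the real FieldCache is an
in-memory structure built from the indexed terms.

```python
def build_string_index(doc_values):
    """Hypothetical model: unique sorted terms plus one ordinal per doc."""
    # Each distinct value is kept exactly once.
    lookup = sorted(set(doc_values))
    ordinal = {term: i for i, term in enumerate(lookup)}
    # order[doc_id] is the position of that doc's value in `lookup`.
    order = [ordinal[value] for value in doc_values]
    return lookup, order


def value_for_doc(lookup, order, doc_id):
    """Recover the original string for a doc via its ordinal."""
    return lookup[order[doc_id]]


# Four docs, two distinct (long) signatures.
signatures = ["sig-A" * 50, "sig-B" * 50, "sig-A" * 50, "sig-A" * 50]
lookup, order = build_string_index(signatures)
```

With many docs sharing a long signature, `lookup` stays small while `order`
grows by one integer per doc, which is the saving the original question has
in mind.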

If you don't store it, none of the standard request handlers will retrieve
it when generating results, but you could write a custom request handler
to do that if you wished (it could even be done fairly programmatically: look
for any fields with type "string" which are indexed but not stored and
return them).



-Hoss


Re: which one will save hard disk space?

Posted by Mike Klaas <mi...@gmail.com>.
On 3/26/07, nick19701 <to...@yahoo.com> wrote:

> But here the "signature" field has field type "string". When you index it,
> you put the whole string somewhere and give it an id, for example, 323454.
>
> In a doc, you only need to reference this id 323454 if the doc happens to
> contain the same signature value.
>
> Now suppose I have a lot of docs with same signature and signature
> is a very long string. It seems to me indexing the signature will save me
> hard disk space.
>
> In short, what I mean is that if you index a "string" field, you can
> retrieve it without loss. So you don't need to store it separately. What do
> you think?

In theory that might be true, but Lucene is not implemented that way,
I'm afraid.  If this is the a priori situation, it is probably easier
to implement this outside of Lucene and "store" the id in your
external index.
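
One reading of this suggestion, sketched under assumptions: keep the long
signature strings in a separate table of your own and put only a small
integer id on each document. The class `SignatureTable` and the field name
`signature_id` are hypothetical; nothing here is a Solr or Lucene API.

```python
class SignatureTable:
    """Hypothetical external store mapping long signatures to small ids."""

    def __init__(self):
        self._id_by_sig = {}   # signature string -> integer id
        self._sig_by_id = []   # integer id -> signature string

    def intern(self, signature):
        """Return the id for a signature, assigning a new one if unseen."""
        sig_id = self._id_by_sig.get(signature)
        if sig_id is None:
            sig_id = len(self._sig_by_id)
            self._id_by_sig[signature] = sig_id
            self._sig_by_id.append(signature)
        return sig_id

    def resolve(self, sig_id):
        """Return the original signature for an id, without loss."""
        return self._sig_by_id[sig_id]


table = SignatureTable()
long_sig = "x" * 1000
other_sig = "y" * 1000

# Each doc carries only the small id; `signature_id` is an assumed field name.
docs = [
    {"id": "doc1", "signature_id": table.intern(long_sig)},
    {"id": "doc2", "signature_id": table.intern(other_sig)},
    {"id": "doc3", "signature_id": table.intern(long_sig)},
]
```

At display time you would call `table.resolve(doc["signature_id"])` to get the
full string back. How the table itself is persisted (a database, a flat file)
is left open here.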

-Mike

Re: which one will save hard disk space?

Posted by nick19701 <to...@yahoo.com>.

Mike Klaas wrote:
> 
> Storing and indexing are completely disjoint: indexing is a lossy
> operation, so if you want to be able to retrieve the original contents,
> they must be stored separately (i.e., the first option uses the least
> space).
> 
> -MIke
> 
> 

But here the "signature" field has field type "string". When you index it,
you put the whole string somewhere and give it an id, for example, 323454.

In a doc, you only need to reference this id 323454 if the doc happens to
contain the same signature value.

Now suppose I have a lot of docs with same signature and signature
is a very long string. It seems to me indexing the signature will save me
hard disk space.

In short, what I mean is that if you index a "string" field, you can
retrieve it without loss. So you don't need to store it separately. What do
you think?


Re: which one will save hard disk space?

Posted by Mike Klaas <mi...@gmail.com>.
On 3/26/07, nick19701 <to...@yahoo.com> wrote:
>
>  <field name="signature" type="string" indexed="false" stored="true"
> compressed="true"/>
>  <field name="signature" type="string" indexed="true" stored="true"
> compressed="true"/>
>
> I don't need to search the "signature" field. But my intuition tells me that
> if I index this field, I will use less hard disk space since a lot of docs
> may have the same signature.
>
> Am I right?

Storing and indexing are completely disjoint: indexing is a lossy
operation, so if you want to be able to retrieve the original contents,
they must be stored separately (i.e., the first option uses the least
space).
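
A toy illustration of why indexing is lossy in general, assuming an analyzed
text field: the analyzer here (lowercase, split on non-alphanumerics) and the
names `analyze` and `build_inverted_index` are invented for the sketch and
are far simpler than what Lucene really does (which also records positions,
for example).

```python
import re
from collections import defaultdict


def analyze(text):
    """Hypothetical analyzer: lowercase, keep only alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


# Two different originals that produce the same set of index terms.
docs = {1: "Hard-disk SPACE!", 2: "space, hard disk"}
index = build_inverted_index(docs)
```

Both docs end up under the same terms, so the original casing, punctuation
and (in this toy version) word order cannot be rebuilt from the index alone;
that is what the stored copy is for.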

-MIke