Posted to solr-user@lucene.apache.org by Kevin Osborn <os...@yahoo.com> on 2011/08/10 02:02:19 UTC

unique terms and multi-valued fields

Please verify my understanding. I have a field called "category" and it has a value "computers". If I use this same field and value for all of my documents, it is really only stored on disk once because "category:computers" is a unique term. Is this correct?

But what about multi-valued fields? Say I have a field called "category". For 100 documents, it has the values "computers" and "laptops". For 100 other documents, it has the values "computers" and "tablets". Is this stored as "category:computers", "category:laptops", and "category:tablets", meaning 3 unique terms? Or is it stored as "category:computers,laptops" and "category:computers,tablets"? I believe (and hope) it is the first case, but I am not sure.

Thanks.

Re: unique terms and multi-valued fields

Posted by Erick Erickson <er...@gmail.com>.
Here's a very useful page for looking at what "index size" means.
http://lucene.apache.org/java/3_0_2/fileformats.html#file-names
Note that the files having to do with stored data (e.g. *.fdt) have very
little impact on searching; they don't consume many valuable
resources.

The "stored=true"-related files *do* have an impact on replication, though,
and perhaps on assembling the results pages.

One bit of clarification about the indexed portion of the files. Each
term is stored once, but each term also has the doc IDs associated
with it. So even though the term is only there once, having it appear
in multiple documents will increase the size, because the document
associations have to be stored too.
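
To make the "stored once, but with doc IDs attached" point concrete, here is a
toy inverted index in Python. It is only a sketch of the idea, not how Lucene
lays out its files: build_inverted_index is a made-up helper, and a real index
compresses the doc ID lists and also keeps frequencies and positions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each (field, term) pair to the list of doc IDs containing it."""
    postings = defaultdict(list)
    for doc_id, fields in enumerate(docs):
        for field, values in fields.items():
            # A term is recorded at most once per document here.
            for term in sorted(set(values)):
                postings[(field, term)].append(doc_id)
    return dict(postings)

# The scenario from the question: 100 docs with computers+laptops,
# 100 other docs with computers+tablets.
docs = (
    [{"category": ["computers", "laptops"]}] * 100
    + [{"category": ["computers", "tablets"]}] * 100
)
index = build_inverted_index(docs)

for (field, term), doc_ids in sorted(index.items()):
    print(f"{field}:{term} -> {len(doc_ids)} doc IDs")
```

In this model there are only 3 unique terms, but "category:computers" carries
200 doc IDs, so the index still grows as more documents use the term.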

Best
Erick

Re: unique terms and multi-valued fields

Posted by Kevin Osborn <os...@yahoo.com>.
That makes sense. These are actually stored fields. I was mostly just trying to figure out how much my index size might grow. The fields I am dealing with are large and repetitive (but mixed).


Re: unique terms and multi-valued fields

Posted by Erick Erickson <er...@gmail.com>.
Well, it depends (tm).

If you're talking about *indexed* terms, then the value is stored only
once in both the cases you mentioned below. There's really very little
difference between a non-multi-valued field and a multi-valued field
in terms of how it's stored in the searchable portion of the index,
except for some position information.

So, having an XML doc with a single-valued field

<field name="category">computers laptops</field>

is almost identical (except for position info, such as positionIncrementGap) to

<field name="category">computers</field>
<field name="category">laptops</field>

multiValued refers to the *input*, not whether more than one word is
allowed in that field.
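
A rough way to see the position difference is the Python sketch below. It
assumes a tokenized field that splits on whitespace (an untokenized string
field would keep "computers laptops" as one term), and the gap of 100 is just
an illustrative value. index_positions is a hypothetical helper, not a Solr or
Lucene call, and the exact position numbering in a real index may differ.

```python
def index_positions(values, position_increment_gap=0):
    """Return (term, position) pairs for one field of one document.

    Each value is split on whitespace; the gap is added to the running
    position between consecutive values of a multi-valued field.
    """
    out = []
    position = 0
    for i, value in enumerate(values):
        if i > 0:
            position += position_increment_gap
        for token in value.split():
            out.append((token, position))
            position += 1
    return out

# One value containing two words.
single = index_positions(["computers laptops"])
# Two separate values, as with two <field> entries.
multi = index_positions(["computers", "laptops"], position_increment_gap=100)

print(single)
print(multi)
```

Both calls yield the same two terms; only the position recorded for "laptops"
changes, which matters for phrase and proximity queries but not for how many
unique terms there are.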


Now, about *stored* fields. If you store the data, verbatim copies are
kept in the storage-specific files in each segment, and the values will
be on disk for each document.

But you probably don't care much, because this data is only referenced when you
assemble a document for return to the client; it's irrelevant for searching.
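
For a back-of-the-envelope feel for the stored side, this Python sketch adds up
the raw bytes of every stored value, once per document. stored_bytes is a
made-up helper, and the figure ignores any compression or per-document
overhead a real index has, so treat it as an illustration rather than a
sizing formula.

```python
def stored_bytes(docs):
    """Sum the UTF-8 size of every stored value, counted once per document."""
    return sum(
        len(value.encode("utf-8"))
        for fields in docs
        for values in fields.values()
        for value in values
    )

# Same scenario as the question: two groups of 100 documents.
docs = (
    [{"category": ["computers", "laptops"]}] * 100
    + [{"category": ["computers", "tablets"]}] * 100
)

# For comparison: the raw size of the three unique terms on their own.
unique_term_bytes = sum(
    len(term.encode("utf-8")) for term in {"computers", "laptops", "tablets"}
)

print("stored values:", stored_bytes(docs), "bytes")
print("unique terms: ", unique_term_bytes, "bytes")
```

The stored total scales with the number of documents, while the unique-term
total does not, which is why large, repetitive stored fields can dominate the
on-disk size.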

Best
Erick
