Posted to solr-user@lucene.apache.org by Thijs Vonk <vo...@gmail.com> on 2008/04/27 13:50:17 UTC

unique values from a field in a result

What is the best way to get the unique terms from a field in a result?
I've been using SimpleFacet to do this. However, I don't need the 
counts, so it seems overkill to have to iterate over all the result 
documents per field to get the unique values for that field.
The fields contain database IDs that I use on the client side to get 
additional information from the database.

Is there a faster way to get the unique values from a field in a result?

Thijs

Re: unique values from a field in a result

Posted by Chris Hostetter <ho...@fucit.org>.

: My example is just simple, in real life the numbers are a lot bigger. However,
: the amount of unique products vs variations is such that it seems a lot of
: work to iterate over all variations in a DocSet just to get the few unique
: products.
: But, what I understand from your answer is that the best way to get the 3
: unique products is to iterate over the 1000 variations in the result DocSet?
: And if that is the case I'm happy with it.

in a nutshell: yes.  As Ian mentioned, you can use sampling to get 
approximate info, but if you only check 999 of the docs in the DocSet, 
it's always possible that the 1000th "variation" doc actually refers to a 
4th product.
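
A minimal sketch of that point in plain Python, not Solr code. The documents, the field names (variation_id, product_id) and the helper unique_products are made up for illustration. It compares a full pass over a 1000-doc result with a pass that skips only the last doc.

```python
# Hypothetical result set: 1000 variation docs, each carrying its product's pk.
# The first 999 refer to products 1-3; only the last doc refers to product 4.
docs = [{"variation_id": i, "product_id": (i % 3) + 1} for i in range(999)]
docs.append({"variation_id": 999, "product_id": 4})


def unique_products(doc_iter):
    """Collect the distinct product ids by visiting each doc once."""
    return {d["product_id"] for d in doc_iter}


exact = unique_products(docs)            # full pass over all 1000 docs
first_999 = unique_products(docs[:999])  # partial pass that skips one doc

print(sorted(exact))
print(sorted(first_999))
```

The partial pass has no way of knowing that product 4 exists, which is why an exact answer needs every matching doc to be visited.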

the other approach to take is redundant denormalization: index one doc 
per variation and one doc per product, then power your left nav with a 
search against the "product" docs, and power the main listing with a 
search against the "variation" docs.
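
A rough sketch of the two-doc-type idea, again in plain Python rather than against a real Solr index. The doc_type field, the text field and the search helper are assumptions made for illustration. It also assumes the product docs carry whatever fields the user's query matches on, which the thread does not spell out.

```python
# Hypothetical in-memory stand-in for an index holding two kinds of docs.
index = []
for pid in (1, 2, 3):
    # One doc per product ...
    index.append({"doc_type": "product", "product_id": pid, "text": "shoe"})
    # ... and one doc per variation of that product.
    for v in range(4):
        index.append({
            "doc_type": "variation",
            "product_id": pid,
            "variation_id": pid * 100 + v,
            "text": "shoe",
        })
# A product that should not match the example query.
index.append({"doc_type": "product", "product_id": 9, "text": "hat"})


def search(term, doc_type):
    """Return docs of one type whose text equals the search term."""
    return [d for d in index
            if d["doc_type"] == doc_type and d["text"] == term]


# The sidebar query touches only the few product docs.
sidebar = [d["product_id"] for d in search("shoe", "product")]
# The main listing query touches the variation docs.
listing = search("shoe", "variation")

print(sidebar)
print(len(listing))
```

The sidebar query looks at one doc per product instead of one per variation, at the cost of indexing the product data twice.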


-Hoss

Re: unique values from a field in a result

Posted by Ian Holsman <li...@holsman.net>.

Hi Thijs.

If you are not concerned with an *EXACT* number, there is a paper 
published in 1990 that discusses this problem.

http://dblab.kaist.ac.kr/Publication/pdf/ACM90_TODS_v15n2.pdf

From the paper (if I understand it correctly):

For 120,000,000 records you can sample 10,112,529 records (10%) when 
the variance is low and get an answer with 95% confidence.


Regards
Ian

Re: unique values from a field in a result

Posted by Thijs <vo...@gmail.com>.

It must be my English.
When I read your comment, I think you could compare it to the category 
example...

Maybe I can explain my situation better with an example:
The documents in the index contain variations of different products.
Say, for example, I have 10 different products. Every product is indexed 
1000 times (1000 different variations per product). The product is not 
unique, the variation is unique.
The first 10 results of a search only contain the best matching 
variations for all the products in the complete result. So let's say the 
result returns 1000 variations for 3 different products. What I need is 
some 'sidebar information' containing detailed information on all 3 
unique products in the complete result.

My example is just simple, in real life the numbers are a lot bigger. 
However, the amount of unique products vs variations is such that it 
seems a lot of work to iterate over all variations in a DocSet just to 
get the few unique products.
But what I understand from your answer is that the best way to get the 3 
unique products is to iterate over the 1000 variations in the result 
DocSet? And if that is the case I'm happy with it.

Thanks
Thijs



But to get some extra information I need all the unique values for one of 
the fields in the index (being the pk of the product).

Re: unique values from a field in a result

Posted by Chris Hostetter <ho...@fucit.org>.

: You are correct, I'm looking for the unique values for one field in a DocSet.
: The field is not multivalued, and it contains only 1 long value, the pk of a
: database table.
: But you said the counts are stored in the index, and I don't see that, because

there's something very confusing about your question ... if the value of 
the field is unique for every document (by "pk" you mean the primary key 
for these docs in your database, correct?) then why do you specifically 
need the "unique terms" ? ... aren't they by definition unique?

usually when people ask questions like this, they are interested in the 
"unique values" for something like a "category" field, where lots of 
documents are in the same category, and they want to know what the full 
list of categories is for all of the documents that match their query.

if you want the list of all "primary keys" for all the documents that 
match your query, why not just make sure that field has stored="true" in 
the schema.xml and get the values that way?
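
One possible client-side reading of that suggestion, sketched in Python. The host, the query, the row count and the field name product_id are hypothetical, and the response below is a hand-written sample in the shape of Solr's JSON response format, not real output. No request is actually sent.

```python
from urllib.parse import urlencode

# Ask only for the stored pk field of each matching doc.
# The URL, query and field name are illustrative assumptions.
params = {"q": "shoe", "fl": "product_id", "rows": 1000, "wt": "json"}
url = "http://localhost:8983/solr/select?" + urlencode(params)

# Hand-written sample standing in for a parsed JSON response.
sample_response = {
    "response": {
        "numFound": 4,
        "start": 0,
        "docs": [
            {"product_id": 7},
            {"product_id": 7},
            {"product_id": 12},
            {"product_id": 7},
        ],
    }
}


def distinct_values(response, field):
    """Return the distinct values of a field, in first-seen order."""
    values = (doc[field] for doc in response["response"]["docs"]
              if field in doc)
    return list(dict.fromkeys(values))


print(url)
print(distinct_values(sample_response, "product_id"))
```

This still visits every returned doc, now on the client, so it only changes where the pass over the result happens.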

I'm extra confused because of this comment...

: when I debug simplefacet. It always iterates over all the documents in the
: result docset (SimpleFacet.getFieldCacheCounts line 259).

it doesn't *seem* like faceting is necessary, but why do you think 
iterating over all the documents in your result set seems like a waste 
here?  if you want to know what *all* the values are for every document in 
your doc set, then regardless of whether the values are distinct for 
each doc, how else could Solr get all the values than by looking at each 
matching doc?



-Hoss

Re: unique values from a field in a result

Posted by Thijs Vonk <vo...@gmail.com>.

You are correct, I'm looking for the unique values for one field in a 
DocSet. The field is not multivalued, and it contains only 1 long value, 
the pk of a database table.
But you said the counts are stored in the index, and I don't see that, 
because when I debug SimpleFacet, it always iterates over all the 
documents in the result DocSet (SimpleFacet.getFieldCacheCounts line 259).

But if this is the only way, then OK.
Thanks

Thijs

Re: unique values from a field in a result

Posted by Ryan McKinley <ry...@gmail.com>.

On Apr 27, 2008, at 7:50 AM, Thijs Vonk wrote:
> What is the best way to get the unique terms from a field in a result?
> I've been using SimpleFacet to do this. However, I don't need the  
> counts, so it seems overkill to have to iterate over all the result  
> documents per field to get the unique values for that field.
> The fields contain database IDs that I use on the client side to  
> get additional information from the database.
>
> Is there a faster way to get the unique values from a field in a  
> result?
>

If you are looking for the unique terms for a field across all 
documents, faceting is the way to go.  The counts are stored in the 
index, so I don't think that is a substantial loss.
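
For reference, a small Python sketch of what such a facet request could look like. The base URL, the query and the field name product_id are hypothetical. The parameters are Solr's standard simple faceting parameters, and whether all of them are available depends on the Solr version in use. The sketch only builds the URL and sends nothing.

```python
from urllib.parse import urlencode


def facet_url(base, query, field):
    """Build a select URL that returns facet values for one field."""
    params = [
        ("q", query),
        ("rows", 0),              # skip the documents, only facets are wanted
        ("facet", "true"),
        ("facet.field", field),
        ("facet.mincount", 1),    # drop values that match no document
        ("facet.limit", -1),      # negative limit: do not cap the value list
    ]
    return base + "?" + urlencode(params)


# Hypothetical host, query and field name.
url = facet_url("http://localhost:8983/solr/select", "shoe", "product_id")
print(url)
```

The counts come back alongside each value and the client can ignore them if only the distinct values are needed.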

If you are looking for the unique terms within a document (I doubt you 
are, but it is not totally clear from your question), perhaps you could 
store the unique terms?

ryan