Posted to solr-user@lucene.apache.org by Ian Holsman <ha...@holsman.net> on 2011/05/31 17:02:44 UTC

how does Solr/Lucene index multi-value fields

Hi.

I want to store a list of documents (say, 30-60k of text each) in a single SolrDocument, to speed up post-retrieval querying.

To do this, I need to know whether Lucene calculates the TF/IDF score over the entire field, or whether it treats each value in the list as a separate field.

If I can't store them in a multi-valued field, I could create a schema where each document goes into its own field, but I'm not sure how to write a query that searches all of the fields.


Regards
Ian
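
On the "query that searches all of the fields" part: one common option is Solr's DisMax query parser, which takes a space-separated list of fields in its qf parameter. The following is a minimal sketch only. The field names (story_1_text and so on) and the helper build_multifield_query are hypothetical and are not taken from this thread; a real schema would define its own names.

```python
from urllib.parse import urlencode


def build_multifield_query(user_query, fields, rows=10):
    """Build a Solr query string that searches every field in `fields`.

    Uses the DisMax parser's `qf` (query fields) parameter, which is a
    space-separated list of field names.
    """
    params = {
        "q": user_query,
        "defType": "dismax",     # select the DisMax query parser
        "qf": " ".join(fields),  # fields to search
        "rows": rows,
    }
    return urlencode(params)


# Hypothetical per-story field names, for illustration only.
story_fields = ["story_1_text", "story_2_text", "story_3_text"]
query_string = build_multifield_query("solar power", story_fields)
print(query_string)
```

The resulting string would be appended to the select handler URL of whatever Solr instance is in use. Another option, not shown, is a copyField rule that gathers the per-story fields into one catch-all field, though that brings back scoring over the combined text.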

Re: how does Solr/Lucene index multi-value fields

Posted by Ian Holsman <ha...@holsman.net>.

Thanks Erick.

Sadly, I don't think that would work in my use case. I think I'll go back to storing them at the story level and hitting a DB to get related stories.

--I
On May 31, 2011, at 12:27 PM, Erick Erickson wrote:

> Hmmm, I may have misled you. Re-reading my text, it
> wasn't very well written...
> 
> TF/IDF calculations are, indeed, per-field. I was trying
> to say that there was no difference between storing all
> the data for an individual field as a single long string of text
> in a single-valued field or as several shorter strings in
> a multi-valued field.
> 
> Best
> Erick

Re: how does Solr/Lucene index multi-value fields

Posted by Erick Erickson <er...@gmail.com>.

Hmmm, I may have misled you. Re-reading my text, it
wasn't very well written...

TF/IDF calculations are, indeed, per-field. I was trying
to say that there was no difference between storing all
the data for an individual field as a single long string of text
in a single-valued field or as several shorter strings in
a multi-valued field.

Best
Erick
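
The point above can be illustrated with a toy calculation. This is not Lucene code. It is a minimal sketch that assumes whitespace tokenization and the classic tf = sqrt(frequency) formula, and the helper names term_freqs and toy_tf are made up for the example. It shows that counting a term across several values of one field gives the same result as counting it in one concatenated string.

```python
from collections import Counter
import math


def term_freqs(values):
    """Count terms across all values of a field, treated as one token stream."""
    counts = Counter()
    for value in values:
        counts.update(value.lower().split())
    return counts


def toy_tf(values, term):
    """Toy term-frequency factor: square root of the raw count in the field."""
    return math.sqrt(term_freqs(values)[term])


# The same text, once as three values and once as a single long string.
multi_valued = ["solr indexes text", "lucene scores text", "text is tokenized"]
single_valued = [" ".join(multi_valued)]

tf_multi = toy_tf(multi_valued, "text")
tf_single = toy_tf(single_valued, "text")
print(tf_multi == tf_single)
```

In a real index the values of a multi-valued field are separated by a position gap (positionIncrementGap in the Solr schema), which affects phrase and proximity matching across values but not these per-field term counts.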

Re: how does Solr/Lucene index multi-value fields

Posted by Jonathan Rochkind <ro...@jhu.edu>.

On 5/31/2011 12:16 PM, Ian Holsman wrote:
> we have a collection of related stories. when a user searches for
> something, we might not want to display the story that is
> most-relevant (according to SOLR), but according to other home-grown
> rules. by combining all the possibilities in one SolrDocument, we can
> avoid a DB-hit to get related stories.

Avoiding a DB hit may or may not actually be a good goal here. You may
find that hitting the DB to get related stories is _more performant_
than retrieving a very large stored field from Solr. (My sense is that
this can especially be a problem on a Solr index that has not been
optimized, but I'm not sure.)

Sorry, don't have an answer to your actual question, but if an attempted 
performance improvement is making other things harder... might want to 
be sure your presumed performance improvement really is a performance 
improvement.
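
A rough way to check this is to time both paths side by side. The sketch below is only an outline under assumptions: fetch_related_from_db and fetch_large_stored_field are hypothetical stand-ins that would need to be replaced with the real DB call and the real Solr request, and the sizes in them are invented.

```python
import statistics
import time


def median_seconds(fetch, repeats=20):
    """Call `fetch` repeatedly and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fetch()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Hypothetical stand-ins; swap in the real calls before drawing conclusions.
def fetch_related_from_db():
    return [f"story-{i}" for i in range(5)]


def fetch_large_stored_field():
    return "x" * (60_000 * 5)


db_time = median_seconds(fetch_related_from_db)
solr_time = median_seconds(fetch_large_stored_field)
print(f"DB path: {db_time:.6f}s, stored-field path: {solr_time:.6f}s")
```

The median is used rather than the mean so that a single slow call (a cold cache, for instance) does not dominate the comparison.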

Re: how does Solr/Lucene index multi-value fields

Posted by Ian Holsman <ha...@holsman.net>.

On May 31, 2011, at 12:11 PM, Erick Erickson wrote:

> Can you explain the use-case a bit more here? Especially the post-query
> processing and how you expect the multiple documents to help here.
> 

We have a collection of related stories. When a user searches for something, we might not want to display the story that is most relevant according to Solr, but rather the one chosen by other home-grown rules. By combining all the possibilities in one SolrDocument, we can avoid a DB hit to get related stories.

> But TF/IDF is calculated over all the values in the field. There's really no
> difference between a multi-valued field and storing all the data in a
> single field as far as relevance calculations are concerned.
> 

So it will suck regardless. I thought we had per-field relevance in the current trunk. :-(

Re: how does Solr/Lucene index multi-value fields

Posted by Erick Erickson <er...@gmail.com>.

Can you explain the use-case a bit more here? Especially the post-query
processing and how you expect the multiple documents to help here.

But TF/IDF is calculated over all the values in the field. There's really no
difference between a multi-valued field and storing all the data in a
single field as far as relevance calculations are concerned.

Best
Erick
