Posted to solr-user@lucene.apache.org by elisabeth benoit <el...@gmail.com> on 2015/10/12 14:39:11 UTC

catchall fields or multiple fields

Hello,

We're using Solr 4.10 and storing all data in a catchall field. It seems to
me that one good reason for using a catchall field is scoring with idf
(with idf, a word might not have the same score in all fields). We got
rid of idf and are now considering using multiple fields. I remember
reading somewhere that using a catchall field might speed up search
time. I was wondering if any of you have an opinion on (or experience
with) this subject.

Best regards,
Elisabeth

Re: catchall fields or multiple fields

Posted by elisabeth benoit <el...@gmail.com>.
Thanks for your suggestion, Jack. In fact, we're doing geographic search
(fields are country, state, county, town, hamlet, district, ...), so it's
difficult to split.

Best regards,
Elisabeth

Re: catchall fields or multiple fields

Posted by Jack Krupansky <ja...@gmail.com>.
Performing a sequence of queries can help too. For example, if users
commonly search for a product name, you could do an initial query on just
the product name field, which should be much faster than searching the text
of all product descriptions, and highlighting would be less problematic. If
that initial query comes up empty, you could move on to the next most
likely field, maybe product title (a short one-line description), and query
voluminous fields like detailed product descriptions, specifications, and
user comments/reviews only as a last resort.
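
The fallback sequence described above could be sketched roughly as follows.
This is only an illustration: `cascade_search`, `toy_search`, the field names
and the toy index are all hypothetical, and the `search` callable stands in
for whatever client code actually queries Solr.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical search callable: takes (field, query) and returns matching doc ids.
SearchFn = Callable[[str, str], List[str]]


def cascade_search(
    query: str,
    fields_by_priority: Sequence[str],
    search: SearchFn,
) -> Tuple[Optional[str], List[str]]:
    """Query one field at a time, most likely (and cheapest) field first.

    Returns (field_that_matched, hits), or (None, []) if every field came up empty.
    """
    for field in fields_by_priority:
        hits = search(field, query)
        if hits:
            # Stop at the first field with results; later, larger fields are never queried.
            return field, hits
    return None, []


# Toy stand-in for a real Solr call, for illustration only.
_TOY_INDEX = {
    "product_name": {"widget": ["doc1"]},
    "product_title": {"gadget": ["doc2"]},
    "description": {"widget": ["doc1", "doc3"], "sprocket": ["doc4"]},
}


def toy_search(field: str, query: str) -> List[str]:
    return _TOY_INDEX.get(field, {}).get(query, [])
```

Because the function returns the field that produced the hits, the caller
also learns which field matched without any highlighting pass.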

-- Jack Krupansky

Re: catchall fields or multiple fields

Posted by elisabeth benoit <el...@gmail.com>.
Thanks to you all for the informed advice.

Thanks, Trey, for your very detailed point of view. It is now very clear to
me how a search on multiple fields can grow slower than a search on a
catchall field.

Our current search model is problematic: we search on a catchall field, but
need to know which fields match, so we do highlighting on multiple fields
(not indexed, but stored). To improve performance, we want to get rid of
highlighting and use the Solr explain output. To get the explain output on
those fields, we need to search on those fields.

So I guess we have to test whether removing highlighting and adding
multi-field search will improve performance or not.
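
As a rough sketch of the two request shapes being compared, assuming the
standard `df`, `hl`, `hl.fl`, `defType`, `qf` and `debugQuery` request
parameters. The helper names and the field names passed in are hypothetical,
and which variant is faster still has to be measured.

```python
from typing import Dict, List


def highlight_params(query: str, catchall: str, stored_fields: List[str]) -> Dict[str, str]:
    """Current setup: search the catch-all, highlight the stored fields to see what matched."""
    return {
        "q": query,
        "df": catchall,
        "hl": "true",
        "hl.fl": ",".join(stored_fields),
    }


def explain_params(query: str, indexed_fields: List[str]) -> Dict[str, str]:
    """Candidate setup: search the fields directly and read matches from the explain output."""
    return {
        "q": query,
        "defType": "edismax",
        "qf": " ".join(indexed_fields),
        "debugQuery": "true",
    }


# Hypothetical geographic field names, for illustration only.
geo_fields = ["country", "state", "town"]
current = highlight_params("paris texas", "catchall", geo_fields)
candidate = explain_params("paris texas", geo_fields)
```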

Best regards,
Elisabeth



Re: catchall fields or multiple fields

Posted by Jack Krupansky <ja...@gmail.com>.
I think it may all depend on the nature of your application and how much
commonality there is between fields.

One interesting area is auto-suggest: you can certainly suggest from the
union of all fields, but you may want to give priority to suggestions from
preferred fields, for example actual product names or important keywords
rather than random English words that happen to occur in descriptions, all
of which would occur in a catchall.

-- Jack Krupansky

Re: catchall fields or multiple fields

Posted by Walter Underwood <wu...@wunderwood.org>.
Why get rid of idf? Most often, idf is a big help in relevance.

I’ve used different weights for different parts of the document, like weighting the title 8X the body.

I’ve used different weights for different analysis chains. If you have three fields, one lowercased, one stemmed, and one a phonetic representation, you can weight the lowercased field higher than the stemmed field, and the stemmed field higher than the phonetic one.
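
A minimal sketch of how such weights might be expressed as edismax request
parameters, using the `qf` parameter's `field^boost` syntax. The helper name,
the field names, and every boost other than the 8x title weight are made up
for illustration.

```python
from typing import Dict, Mapping


def edismax_params(query: str, field_boosts: Mapping[str, float]) -> Dict[str, str]:
    """Build request parameters for an edismax query with per-field boosts."""
    # qf takes a space-separated list of field^boost entries.
    qf = " ".join(f"{field}^{boost:g}" for field, boost in field_boosts.items())
    return {"q": query, "defType": "edismax", "qf": qf}


# Hypothetical field names: title weighted 8x the body.
title_vs_body = edismax_params("solar panel", {"title": 8, "body": 1})

# Hypothetical fields, one per analysis chain: lowercased > stemmed > phonetic.
by_analysis = edismax_params(
    "solar panel",
    {"text_lower": 4, "text_stemmed": 2, "text_phonetic": 1},
)
```

Since the boosts live in the request rather than in the index, they can be
tuned per query without re-indexing.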

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


Re: catchall fields or multiple fields

Posted by Trey Grainger <so...@gmail.com>.
Elisabeth,

Yes, it will almost always be more efficient to search within a catch-all
field than to search across multiple fields. Think of it this way: when you
search on a single field, you are doing a single keyword search against the
index per term. When you search across multiple fields, you are executing
the search for that term multiple times (once for each field) against the
index, and then doing the necessary intersections/unions/etc. of the
document sets.
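
To make the lookup counting concrete, here is a toy in-memory inverted index.
It is only a sketch of the idea, not how Lucene is implemented: the `text`
catch-all field, the sample documents and the function names are all
hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Toy inverted index keyed by (field, term); values are sets of doc ids.
Index = Dict[Tuple[str, str], Set[int]]


def build_index(docs: Dict[int, Dict[str, str]], catchall: str = "text") -> Index:
    """Index each field separately and also copy every term into a catch-all field."""
    index: Index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, value in fields.items():
            for term in value.lower().split():
                index[(field, term)].add(doc_id)
                index[(catchall, term)].add(doc_id)
    return index


def search(index: Index, terms: List[str], fields: List[str]) -> Tuple[Set[int], int]:
    """AND the terms together; a term matches if it occurs in any of the fields.

    Returns (matching doc ids, number of postings lookups performed).
    """
    lookups = 0
    result = None
    for term in terms:
        docs_for_term: Set[int] = set()
        for field in fields:
            # One lookup per (field, term) pair, then a union across fields.
            docs_for_term |= index.get((field, term), set())
            lookups += 1
        # Intersection across terms.
        result = docs_for_term if result is None else result & docs_for_term
    return (result or set()), lookups


docs = {
    1: {"town": "Paris", "country": "France"},
    2: {"town": "Paris", "state": "Texas", "country": "USA"},
}
index = build_index(docs)
multi_hits, multi_lookups = search(index, ["paris", "texas"], ["town", "state", "country"])
catch_hits, catch_lookups = search(index, ["paris", "texas"], ["text"])
```

Both searches find the same document, but the multi-field version performs
terms x fields lookups while the catch-all version performs one per term.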

As you add more fields to search across, the search keeps getting slower.
If you're only searching a few fields, it will probably not be noticeably
slower, but the more you add, the slower your response times will become.
The slowdown may be measured in milliseconds, in which case you may not
care, but it will be slower.

The idf point you mentioned can be both a pro and a con depending upon the
use case. For example, if you are searching news content that has a
"french_text" field and an "english_text" field, it would be suboptimal if
for the search "Barack Obama" you got only French documents at the top
because the US president's name is much more commonly found in English
documents (and so has a lower idf in the English field). When you're
searching fields with different types of content, however, you might find
examples where you'd actually want idf differences maintained and documents
differentiated based upon the underlying field.
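
The effect can be illustrated with a small calculation, assuming Lucene's
classic TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)). The document
counts below are made up for illustration only.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """idf as in Lucene's classic TF-IDF similarity: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Made-up counts, for illustration only.
num_docs = 1_000_000

# Per-field idf: "obama" is common in english_text and rare in french_text.
idf_english = classic_idf(doc_freq=50_000, num_docs=num_docs)
idf_french = classic_idf(doc_freq=500, num_docs=num_docs)

# Catch-all field: the term has a single document frequency, hence a single idf.
idf_catchall = classic_idf(doc_freq=50_500, num_docs=num_docs)
```

With separate fields, a match in the French field gets the larger idf and so
tends to outrank an equivalent English match; in the catch-all field both
matches share the same idf.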

One particularly nice thing about the multi-field approach is that it is
very easy to apply different boosts to the fields and to dynamically change
the boosts. You can similarly do this with payloads within a catch-all
field. You could even assign each term a payload corresponding to which
field the content came from, and then dynamically change the boosts
associated with those payloads at query time (caveat - custom code
required). See this blog post for an end-to-end payload scoring example,
https://lucidworks.com/blog/2014/06/13/end-to-end-payload-example-in-solr/.


Sharing my personal experience: at CareerBuilder, we use the catch-all
field with payloads (one per underlying field) whose weights we can change
dynamically at query time. We found that for most of our corpus sizes
(ranging between 2 and 100 million full-text jobs or resumes), it is more
efficient to search between 1 and 3 separate fields than to search the
catch-all field with payload scoring, but once we get to the 4th field, the
extra cost of payload scoring is overtaken by the additional time required
to search each additional field. These numbers (3 vs. 4 fields, etc.) are
all anecdotal, of course, as they depend upon a lot of environmental and
corpus factors unique to our use case.

The main point of this approach, however, is that there is no additional
cost per-field beyond the upfront cost to add and score payloads, so we
have been able to easily represent over a hundred of these payload-based
"virtual fields" with different weights within a catch-all field (all with
a fixed query-time cost).
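
A toy model of the "virtual field" idea, to show why the per-field cost stays
flat. This is plain Python rather than Lucene's payload API, and it is not
CareerBuilder's implementation; the data, field names and weights are
hypothetical.

```python
from typing import Dict, List, Mapping, Tuple

# Each catch-all posting: (doc id, payload naming the field the term came from).
Posting = Tuple[int, str]
CatchAll = Dict[str, List[Posting]]


def score(
    catchall: CatchAll,
    terms: List[str],
    payload_weights: Mapping[str, float],
    default_weight: float = 1.0,
) -> Dict[int, float]:
    """Sum, per document, the query-time weight of each matching term's payload."""
    scores: Dict[int, float] = {}
    for term in terms:
        # One postings lookup per term, however many virtual fields exist.
        for doc_id, source_field in catchall.get(term, []):
            weight = payload_weights.get(source_field, default_weight)
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    return scores


# Hypothetical job-search data; field names and weights are made up.
catchall: CatchAll = {
    "java": [(1, "job_title"), (2, "description")],
    "developer": [(1, "job_title"), (2, "job_title")],
}

title_heavy = score(catchall, ["java", "developer"], {"job_title": 5.0, "description": 1.0})
flat = score(catchall, ["java", "developer"], {})
```

Changing `payload_weights` changes the ranking at query time without touching
the index, and adding another virtual field adds no extra postings lookups.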

*In summary*: yes, you should expect a performance decline as you add more
and more fields to your query if you are searching across multiple fields.
You can overcome this by using a single catch-all field if you are okay
losing IDF per-field (you'll still have it globally across all fields). If
you want to use a catch-all field, but still want to boost content based
upon the field it originated within, you can accomplish this with payloads.

All the best,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Recommendations @ CareerBuilder


Re: catchall fields or multiple fields

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi,

Catch-all field: No need to worry about how to aggregate scores coming from different fields.
But you cannot use different analysers for different fields.

Multiple fields: You can play with edismax's parameters on the fly, without having to re-index.
It is also flexible: you can include or exclude fields from the search.

Ahmet


