Posted to solr-user@lucene.apache.org by Girish Redekar <gi...@aplopio.com> on 2009/11/21 08:13:16 UTC

Index time boosts, payloads, and long query strings

Hi,

I'm relatively new to Solr/Lucene, and am using Solr (and not Lucene
directly) primarily because I can use it without writing Java code (the rest
of my project is written in Python).

My application has the following requirements:
(a) ability to search over multiple fields, each with different weight
(b) If possible, I'd like to have the ability to add extra/diminished
weights to particular tokens within a field
(c) My query strings are long (50-100 words)
(d) My index is 500K+ documents

1) The way to do (a) is field boosting (right?). My question is: is all field
boosting done at query time, even if I give index-time boosts to fields? Is
there a performance advantage in boosting fields at index time vs. using
something like fieldname:querystring^boost at query time?
2) From what I've read, it seems that I can do (b) using payloads. However,
as this link
(http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/)
suggests, I will have to write a payload-aware query parser. I wanted to
confirm whether this is indeed the case, or whether there is an
out-of-the-box way to use payloads (I am using Solr 1.4).
3) For my project, the user fills in multiple text boxes for each query. I
combine these into a single query (with different treatment for the contents
of each text box). Consequently, my query looks something like
(fieldname1: queryterm1 queryterm2^2.0 queryterm3^3.0 +queryterm4)^1.0
Are there any guidelines for improving the performance of such a system?
(Sorry, this bit is vague.)
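
A minimal sketch of how a query like the one in (3) could be assembled in
Python from several text boxes. The helper names (escape, field_group,
build_query) and the field names are hypothetical, not part of Solr. The
sketch writes each box as field:(...) because, in the standard Lucene query
syntax, a field prefix applies only to the term or group directly after it.

```python
# Characters that have special meaning in the Lucene query syntax.
SPECIAL = set('+-!(){}[]^"~*?:\\&|')


def escape(term):
    """Backslash-escape query-syntax characters in a user-supplied term."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in term)


def field_group(field, terms, group_boost=None):
    """Build one clause such as title:(solr payloads^2.0 +lucene)^4.0.

    terms is a list of (word, boost, required) tuples; boost may be None.
    """
    parts = []
    for word, boost, required in terms:
        part = escape(word)
        if required:
            part = "+" + part          # term must match
        if boost is not None:
            part += "^%s" % boost      # per-term query-time boost
        parts.append(part)
    group = "%s:(%s)" % (field, " ".join(parts))
    if group_boost is not None:
        group += "^%s" % group_boost   # boost for the whole field clause
    return group


def build_query(boxes):
    """Join one clause per text box into a single query string."""
    return " ".join(field_group(*box) for box in boxes)


query = build_query([
    ("title", [("solr", None, False), ("payloads", 2.0, False),
               ("boost", 3.0, False), ("lucene", None, True)], 4.0),
    ("body", [("solr", None, False)], None),
])
print(query)  # title:(solr payloads^2.0 boost^3.0 +lucene)^4.0 body:(solr)
```

The resulting string still has to be URL-encoded before it is sent as the q
parameter.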

Any help with this would be great!

Girish Redekar
http://girishredekar.net

Re: Index time boosts, payloads, and long query strings

Posted by Erick Erickson <er...@gmail.com>.

Yep <G>....

On Mon, Nov 23, 2009 at 4:13 AM, Girish Redekar
<gi...@aplopio.com> wrote:

> Thanks Erick!
>
> After reading your answer, and re-reading the Solr wiki, I realized my
> folly. I used to think that index-time boosts when applied on a per-field
> basis are equivalent to query time boosts to that field.
>
> To ensure that my new understanding is correct , I'll state it in my words.
> Index time boosts will determine boost for a *document* if it is counted as
> a hit. Query time boosts give you control on boosting the occurrence of a
> query in a specific field.
>
> Please correct me if I'm wrong (again) :-)
>
> Girish Redekar
> http://girishredekar.net

Re: Index time boosts, payloads, and long query strings

Posted by Girish Redekar <gi...@aplopio.com>.

Thanks Erick!

After reading your answer, and re-reading the Solr wiki, I realized my
folly. I used to think that index-time boosts when applied on a per-field
basis are equivalent to query time boosts to that field.

To ensure that my new understanding is correct, I'll state it in my words.
Index time boosts will determine boost for a *document* if it is counted as
a hit. Query time boosts give you control on boosting the occurrence of a
query in a specific field.

Please correct me if I'm wrong (again) :-)

Girish Redekar
http://girishredekar.net


On Sun, Nov 22, 2009 at 8:25 PM, Erick Erickson <er...@gmail.com> wrote:

> I still think they are apples and oranges. If you boost *all* titles,
> you're effectively boosting none of them. Index time boosting
> expresses "this document's title is more important than other
> document titles." What I think you're after is "titles are more
> important than other parts of the document.
>
> For this latter, you're talking query-time boosting. Boosting only
> really makes sense if there are multiple clauses, something
> like title:important OR body:unimportant. If this is true, speed
> is irrelevant, you need correct behavior.
>
> Not that I think you'd notice either way. Modern computers
> can do a LOT of FLOPS/sec. Here's an experiment: time
> some queries (but beware of timing the very first ones, see
> the Wiki) with boosts and without boosts. I doubt you'll see
> enough difference to matter (but please do report back if you
> do, it'll further my education <G>).
>
> But, depending on your index structure, you may get this
> anyway. Generally, matches on shorter fields weigh more
> in the score calculations than on longer fields. If you have
> fields like title and body and you are querying on title:term OR
> body:term, documents with term in the title will tend toward
> higher scores.
>
> But before putting too much effort into this, do you have any
> evidence that the default behavior is unsatisfactory? Because
> unless and until you do, I think this is a distraction <G>...
>
> Best
> Erick

Re: Index time boosts, payloads, and long query strings

Posted by Erick Erickson <er...@gmail.com>.

I still think they are apples and oranges. If you boost *all* titles,
you're effectively boosting none of them. Index time boosting
expresses "this document's title is more important than other
document titles." What I think you're after is "titles are more
important than other parts of the document."

For the latter, you're talking about query-time boosting. Boosting only
really makes sense if there are multiple clauses, something
like title:important OR body:unimportant. If this is true, speed
is irrelevant; you need correct behavior.

Not that I think you'd notice either way. Modern computers
can do a LOT of FLOPS. Here's an experiment: time
some queries (but beware of timing the very first ones, see
the Wiki) with boosts and without boosts. I doubt you'll see
enough difference to matter (but please do report back if you
do, it'll further my education <G>).
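
A rough sketch of that experiment in Python. The Solr URL is an assumption
(the stock example setup), and the function names are made up for
illustration. The first few queries are run untimed as a warm-up. Use many
distinct queries; otherwise you mostly measure Solr's query result cache.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed location of the select handler; adjust for your installation.
SOLR_SELECT = "http://localhost:8983/solr/select"


def select_url(q, rows=10):
    """Build the request URL for one query string."""
    params = urllib.parse.urlencode({"q": q, "rows": rows, "wt": "json"})
    return SOLR_SELECT + "?" + params


def run_query(q):
    """Send one query to Solr and return the parsed JSON response."""
    with urllib.request.urlopen(select_url(q)) as resp:
        return json.load(resp)


def average_seconds(search, queries, warmup=3, repeats=5):
    """Average wall-clock time per query, skipping an untimed warm-up."""
    for q in queries[:warmup]:
        search(q)                      # warm caches; not timed
    start = time.perf_counter()
    for _ in range(repeats):
        for q in queries:
            search(q)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(queries))


# Example use against a running Solr instance:
#   plain = ["title:solr body:solr", "title:lucene body:lucene"]
#   boosted = ["title:solr^4.0 body:solr", "title:lucene^4.0 body:lucene"]
#   print(average_seconds(run_query, plain))
#   print(average_seconds(run_query, boosted))
```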

But, depending on your index structure, you may get this effect
anyway. Generally, matches on shorter fields weigh more
in the score calculations than matches on longer fields. If you have
fields like title and body and you are querying on title:term OR
body:term, documents with term in the title will tend toward
higher scores.
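
To put a number on the short-field effect: Lucene's default similarity uses
a length norm of 1/sqrt(number of terms in the field). The sketch below only
evaluates that formula for two made-up field lengths. It ignores tf, idf,
and the fact that Lucene stores the norm in a single byte, so real values
are coarser.

```python
import math


def length_norm(num_terms):
    """Length normalization factor: 1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(num_terms)


title_norm = length_norm(4)    # a 4-term title
body_norm = length_norm(400)   # a 400-term body

# All else being equal, the title match is weighted about 10x the body match.
print(title_norm, body_norm, title_norm / body_norm)
```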

But before putting too much effort into this, do you have any
evidence that the default behavior is unsatisfactory? Because
unless and until you do, I think this is a distraction <G>...

Best
Erick

On Sun, Nov 22, 2009 at 8:37 AM, Girish Redekar
<gi...@aplopio.com> wrote:

> Hi Erick -
>
> Maybe I mis-wrote.
>
> My question is: would "title:any_query^4.0" be faster/slower than applying
> index time boost to the field title. Basically, if I take *every* user
> query
> and search for it in title with boost (say, 4.0) - is it different than
> saying field title has boost 4.0?
>
> Cheers,
> Girish Redekar
> http://girishredekar.net

Re: Index time boosts, payloads, and long query strings

Posted by Girish Redekar <gi...@aplopio.com>.

Hi Erick -

Maybe I mis-wrote.

My question is: would "title:any_query^4.0" be faster or slower than applying
an index-time boost to the title field? Basically, if I take *every* user
query and search for it in title with a boost (say, 4.0), is that different
from saying the title field has boost 4.0?

Cheers,
Girish Redekar
http://girishredekar.net


On Sun, Nov 22, 2009 at 2:02 AM, Erick Erickson <er...@gmail.com> wrote:

> I'll take a whack at index .vs. query boosting. They are expressing very
> different concepts. Let's claim we're interested in boosting the title
> field....
>
> Index time boosting is expressing "this document's title is X more
> important than a normal document title". It doesn't matter *what* the
> title is, any query that matches on anything in this document's title
> will give this document a boost. I might use this to give preferential
> treatment to all encyclopedia entries or something.
>
> Query time boosting, like "title:solr^4.0" expresses "Any document with
> solr in its title is more important than documents without solr in the
> title". This really only makes sense if you have other clauses that might
> cause a document *without* solr in the title to match......
>
> Since they are doing different things, efficiency isn't really relevant.
>
> HTH
> Erick

Re: Index time boosts, payloads, and long query strings

Posted by Erick Erickson <er...@gmail.com>.

I'll take a whack at index vs. query boosting. They are expressing very
different concepts. Let's claim we're interested in boosting the title
field....

Index time boosting is expressing "this document's title is X times more
important than a normal document title". It doesn't matter *what* the title
is: any query that matches on anything in this document's title will give
this document a boost. I might use this to give preferential treatment to all
encyclopedia entries or something.

Query time boosting, like "title:solr^4.0", expresses "Any document with solr
in its title is more important than documents without solr in the title".
This really only makes sense if you have other clauses that might cause a
document *without* solr in the title to match......

Since they are doing different things, efficiency isn't really relevant.
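
A toy model of the difference, in Python. This is NOT Lucene's scoring
formula: it leaves out tf, idf, length norms, and query normalization, and
the documents and boost values are invented. It only shows that the index
time boost travels with the document, while the query time boost travels
with the clause.

```python
def score(doc, clauses):
    """Toy score: sum of query_boost * index_boost over matching clauses."""
    total = 0.0
    for field, term, query_boost in clauses:
        tokens = doc["fields"].get(field, "").lower().split()
        if term in tokens:
            # Index-time boost is stored with the document's field.
            index_boost = doc.get("index_boost", {}).get(field, 1.0)
            total += query_boost * index_boost
    return total


encyclopedia = {
    "fields": {"title": "solr overview", "body": "reference entry"},
    "index_boost": {"title": 4.0},   # set when the document was indexed
}
blog = {"fields": {"title": "solr tips", "body": "weekly notes"}}
forum = {"fields": {"title": "search help", "body": "solr question"}}

# Index-time boost: applies whatever term matched in this document's title.
print(score(encyclopedia, [("title", "solr", 1.0)]))      # 4.0
print(score(encyclopedia, [("title", "overview", 1.0)]))  # 4.0
print(score(blog, [("title", "solr", 1.0)]))              # 1.0

# Query-time boost: applies to the title clause, for every document.
boosted = [("title", "solr", 4.0), ("body", "solr", 1.0)]
print(score(blog, boosted))    # 4.0 (matched in title)
print(score(forum, boosted))   # 1.0 (matched in body only)
```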

HTH
Erick

On Sat, Nov 21, 2009 at 2:13 AM, Girish Redekar
<gi...@aplopio.com> wrote:

> Hi ,
>
> I'm relatively new to Solr/Lucene, and am using Solr (and not lucene
> directly) primarily because I can use it without writing java code (rest of
> my project is python coded).
>
> My application has the following requirements:
> (a) ability to search over multiple fields, each with different weight
> (b) If possible, I'd like to have the ability to add extra/diminished
> weights to particular tokens within a field
> (c) My query strings have large lengths (50-100 words)
> (d) My index is 500K+  documents
>
> 1) The way to (a) is field boosting (right?). My question is: Is all field
> boosting done at query time? Even if I give index time boosts to fields? Is
> there a performance advantage in boosting fields at index time vs at using
> something like fieldname:querystring^boost.
> 2) From what I've read, it seems that I can do (b) using payloads. However,
> as this link (
> http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/)
> suggests, I will have to write a payload aware Query Parser. Wanted to
> confirm if this is indeed the case - or is there a out-of-box way to
> implement payloads (am using Solr1.4)
> 3) For my project, the user fills multiple text boxes (for each query). I
> combine these into a single query (with different treatment for contents of
> each text box). Consequently, my query looks something like (fieldname1:
> queryterm1 queryterm2^2.0 queryterm3^3.0 +queryterm4)^1.0  Are there any
> guidelines for improving performance of such a system (sorry, this bit is
> vague)
>
> Any help with this will be great !
>
> Girish Redekar
> http://girishredekar.net
>