Posted to solr-user@lucene.apache.org by Marcus Herou <ma...@tailsweep.com> on 2009/04/23 22:01:58 UTC

Change boost of documents / single fields / external scoring ?

Hi.

Confusing subject, eh? Let me try to be a little clearer in a few sentences.

We have a Solr/Lucene index where each document is a Blog Entry. We have
just implemented the PageRank algorithm for Blogs and are about to add a
column to the index called score and perhaps adjust the document boost.

We have also decided that it is the blog itself, not the individual pages,
that is to be ranked, so all entries belonging to one blog will receive the
same score.

I have not found a way to apply a document score without actually
re-indexing all fields in the affected entries (which could very well be
100% of them at every PageRank recalculation). Reindexing would take so long
that it effectively renders the process useless: it currently takes a week
or so, and that will only grow (100M blog entries at present and rapidly
increasing).

Guess we have run into the issue where we have some "static" data which we
do not want to touch at all but we want to update certain "dynamic" fields.

Lucene is not a database, I know, but is there a way to implement external
search-time scoring or to update individual fields? Would it be possible to
do some kind of join (parallel searches over separate index types), or to
send the result to a separate sorting algorithm? Hmmm... perhaps a subclass
of Sort? Grasping at straws here, folks...

Hope one of the core experts can help us.

Cheers

//Marcus Herou



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Change boost of documents / single fields / external scoring ?

Posted by Marcus Herou <ma...@tailsweep.com>.
Could an ExternalFileField help me?
http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html


Re: Change boost of documents / single fields / external scoring ?

Posted by Marcus Herou <ma...@tailsweep.com>.
Yep it is / was, solved now.
Thanks Grant and Yonik.

Published the setup here: http://dev.tailsweep.com/solr-external-scoring/en/

//Marcus

On Fri, Apr 24, 2009 at 6:34 PM, Grant Ingersoll <gs...@apache.org>wrote:

> See also my comment on your PageRank question, which I take is the same
> issue.  Namely, you might have a look into the external file stuff.  I've
> used that to store oft-changing boost information that then factors into the
> score.
>
> -Grant

Re: Change boost of documents / single fields / external scoring ?

Posted by Grant Ingersoll <gs...@apache.org>.
See also my comment on your PageRank question, which I take to be the
same issue.  Namely, you might have a look into the external file
stuff.  I've used that to store oft-changing boost information that
then factors into the score.

-Grant
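
A minimal sketch of preparing such an external file, assuming the
"key=value" line format that ExternalFileField reads from a file named
external_<fieldname> in the index data directory. The class name, the field
name "pagerank" and the sample ids are hypothetical. Because every entry of a
blog shares the blog's score, the blog-level score is expanded into one line
per entry, sorted by entry id:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Solr: builds the lines of an external score file.
final class ExternalScoreFile {

    /**
     * Returns one "entryId=score" line per entry, sorted by entry id.
     * entryToBlog maps an entry id to its blog id; blogScores maps a blog id
     * to its PageRank. Entries whose blog has no score are skipped, so the
     * field's configured default value would apply to them.
     */
    static List<String> buildLines(Map<String, String> entryToBlog,
                                   Map<String, Float> blogScores) {
        List<String> lines = new ArrayList<>();
        // TreeMap iterates in key order, which gives a file sorted by entry id.
        for (Map.Entry<String, String> e : new TreeMap<>(entryToBlog).entrySet()) {
            Float score = blogScores.get(e.getValue());
            if (score != null) {
                lines.add(e.getKey() + "=" + score);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, String> entryToBlog = new HashMap<>();
        entryToBlog.put("entry1", "blogA");
        entryToBlog.put("entry2", "blogA");
        entryToBlog.put("entry3", "blogB");

        Map<String, Float> blogScores = new HashMap<>();
        blogScores.put("blogA", 1.5f);
        blogScores.put("blogB", 0.25f);

        // In practice these lines would be written to e.g. external_pagerank.
        for (String line : buildLines(entryToBlog, blogScores)) {
            System.out.println(line);
        }
    }
}
```

With this layout a PageRank recalculation only has to regenerate the one
file; the indexed documents themselves are left untouched. How and when Solr
picks up a changed file depends on the Solr version in use.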

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Change boost of documents / single fields / external scoring ?

Posted by Michael McCandless <lu...@mikemccandless.com>.
I think something like this (NOTE: not at all tested, and I have no
real experience with function queries):

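  // MyPageRankScores: your own ValueSource subclass (constructor args elided)
  ValueSource vals = new MyPageRankScores(...);
  ValueSourceQuery prQuery = new ValueSourceQuery(vals);
  Query realQuery = ...;  // the user's parsed query goes here
  // CustomScoreQuery combines the user query's score with the page-rank value
  Query q = new CustomScoreQuery(realQuery, prQuery);
  TopDocs hits = searcher.search(q, 10);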

MyPageRankScores is your class, subclassing ValueSource and implementing the
getValues method.

You could subclass CustomScoreQuery if you want to tweak just how the
"real" Query scores and your page-rank scores are combined.

Mike
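
To illustrate what such a combination amounts to, here is a Lucene-free
sketch: each hit's relevance score is multiplied by an externally supplied
PageRank value for its blog, and the hits are re-sorted. Multiplying the two
scores is, as far as I know, also what CustomScoreQuery does by default. All
class, method and id names below are hypothetical; this is not the Lucene API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of search-time rescoring with external scores.
final class ExternalScoring {

    /** A search hit: document id, owning blog id, and a score. */
    static final class Hit {
        final String docId;
        final String blogId;
        final float score;

        Hit(String docId, String blogId, float score) {
            this.docId = docId;
            this.blogId = blogId;
            this.score = score;
        }
    }

    /**
     * Returns new hits whose score is relevance * pageRank(blog), ordered by
     * descending combined score. Blogs without a known PageRank use defaultRank.
     */
    static List<Hit> rescore(List<Hit> hits, Map<String, Float> pageRank,
                             float defaultRank) {
        List<Hit> out = new ArrayList<>();
        for (Hit h : hits) {
            Float pr = pageRank.get(h.blogId);
            float rank = (pr != null) ? pr : defaultRank;
            out.add(new Hit(h.docId, h.blogId, h.score * rank));
        }
        out.sort((a, b) -> Float.compare(b.score, a.score));
        return out;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("d1", "blogA", 2.0f));
        hits.add(new Hit("d2", "blogB", 1.0f));

        Map<String, Float> pageRank = new HashMap<>();
        pageRank.put("blogA", 0.25f);
        pageRank.put("blogB", 4.0f);

        for (Hit h : rescore(hits, pageRank, 1.0f)) {
            System.out.println(h.docId + " " + h.score);
        }
    }
}
```

The PageRank map lives outside the index, so it can be swapped out after each
recalculation without touching the indexed documents. A different way of
combining the scores (for example a weighted sum) would only change the one
line that computes the product.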


Re: Change boost of documents / single fields / external scoring ?

Posted by Marcus Herou <ma...@tailsweep.com>.
Yes I am thinking of something like that.

Could you elaborate on how that would look, pseudocode-wise?

Kindly

//Marcus

On Fri, Apr 24, 2009 at 9:05 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Could function queries be used here?  EG you could implement a
> ValueSource that pulls in the external scores?
>
> Mike

Re: Change boost of documents / single fields / external scoring ?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Could function queries be used here?  EG you could implement a
ValueSource that pulls in the external scores?

Mike
