Posted to solr-dev@lucene.apache.org by jason rutherglen <ja...@yahoo.com> on 2006/05/19 01:40:13 UTC

Making RemoteSearchable like client for Solr

A solution for an index that requires 100s of millions of documents is to distribute the documents over multiple servers.  I thought I had the RemoteSearchable-like client for Solr pretty well done, however this normalized scoring with weights throws a bit of a kink: http://issues.apache.org/bugzilla/show_bug.cgi?id=31841  Could someone who understands this offer a hint as to how it would be implemented in Solr?  I am unfamiliar with the Weights and Similarities.  What has been implemented so far is a client that merges results from multiple Solr servers.  

BTW, still hacking on the UpdateableSearcher; starting to test out the code Yonik offered.  

Thanks.

Re: Making RemoteSearchable like client for Solr

Posted by jason rutherglen <ja...@yahoo.com>.
Yes that makes sense.  I will develop along those lines.  Thanks!

----- Original Message ----
From: Yonik Seeley <ys...@gmail.com>
To: solr-dev@lucene.apache.org
Cc: jason rutherglen <ja...@yahoo.com>
Sent: Thursday, May 18, 2006 7:28:12 PM
Subject: Re: Making RemoteSearchable like client for Solr

On 5/18/06, Chris Hostetter <ho...@fucit.org> wrote:
> which would return
> the results along with all the metadata needed for merging the results on the
> "local" side of the multisearch.

As far as idf goes, I don't think that's doable in one pass, though.  If you
want accurate idfs, you need to get docFreqs on the first pass, then
tell the subsearchers the global frequencies on the second pass.

-Yonik
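The two-pass scheme Yonik describes could be sketched as follows (a minimal sketch: the shard-request plumbing is assumed, and only the docFreq summing between the two passes is shown):

```java
import java.util.*;

// Sketch of the two-pass idf merge described above: pass 1 collects each
// shard's local docFreq per query term; the summed (global) frequencies
// would then be sent back to the subsearchers with the real query in pass 2.
class GlobalDfMerger {
    /** Sums per-shard docFreqs into a global docFreq for each term. */
    static Map<String, Integer> mergeDocFreqs(List<Map<String, Integer>> perShard) {
        Map<String, Integer> global = new HashMap<>();
        for (Map<String, Integer> shard : perShard)
            for (Map.Entry<String, Integer> e : shard.entrySet())
                global.merge(e.getKey(), e.getValue(), Integer::sum);
        return global;
    }
}
```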




Re: Making RemoteSearchable like client for Solr

Posted by Chris Hostetter <ho...@fucit.org>.
: Yes I would want to return the docFreq for each term, in the header or
: something of the /select XML result?

the response is totally customizable, RequestHandlers can add any
primitive data that they want (Strings, Integers, Floats, Dates, Lists,
Maps).  I would imagine you'd want to make a new Request Handler that
would be used on the "remote" side of the multisearch, which would return
the results along with all the metadata needed for merging the results on the
"local" side of the multisearch.



-Hoss
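One possible shape for that metadata, using Solr's XML response conventions (the docFreqs list and its field:term keys are illustrative, not the output of any existing handler):

```xml
<response>
  <result numFound="1042" start="0"> ... </result>
  <lst name="docFreqs">
    <int name="text:solr">42</int>
    <int name="text:lucene">317</int>
  </lst>
  <int name="numDocs">100000</int>
</response>
```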


Re: Making RemoteSearchable like client for Solr

Posted by jason rutherglen <ja...@yahoo.com>.
Yes, I would want to return the docFreq for each term, in the header or something of the /select XML result?

----- Original Message ----
From: Yonik Seeley <ys...@gmail.com>
To: solr-dev@lucene.apache.org
Sent: Thursday, May 18, 2006 7:08:41 PM
Subject: Re: Making RemoteSearchable like client for Solr

On 5/18/06, jason rutherglen <ja...@yahoo.com> wrote:
> > If you query for "x OR y", the doc score you get will be a combination
> of the doc score for x and the doc score for y.   After you have the
> document score for the complete query, you can't adjust the IDF for
> just one of the terms because you don't know the individual scores for
> x and y anymore.
>
> Can the /select call return the IDFs for each individual term in the XML result?

Not currently.
I think you would want to return docFreq(), the raw document frequency.


-Yonik




Re: Making RemoteSearchable like client for Solr

Posted by jason rutherglen <ja...@yahoo.com>.
> If you query for "x OR y", the doc score you get will be a combination
of the doc score for x and the doc score for y.   After you have the
document score for the complete query, you can't adjust the IDF for
just one of the terms because you don't know the individual scores for
x and y anymore.

Can the /select call return the IDFs for each individual term in the XML result?

----- Original Message ----
From: Yonik Seeley <ys...@gmail.com>
To: solr-dev@lucene.apache.org
Sent: Thursday, May 18, 2006 6:33:29 PM
Subject: Re: Making RemoteSearchable like client for Solr

On 5/18/06, jason rutherglen <ja...@yahoo.com> wrote:
> It uses Jakarta HTTP Client, and implements a PriorityQueue-like thing using the java.util.concurrent queues and a thread pool for merging results.

Are you able to contribute this code, or is it proprietary?

Have you implemented sorting by field also?  That would currently
require the additional constraint that the sort field be stored as
well as indexed (Lucene only requires it be indexed).

> Perhaps the global IDF is not a big deal?  The idea is to distribute the documents evenly over all the machines.  However, when a new server comes online, this may present a problem, as it would start at 0 documents.

Hmmm, yes, idf values could get out-of-whack when there are very few
documents on a server.

> I probably would not cache the global IDF, but would simply merge it each time.  I actually do not fully understand what the global IDF means, as I need to dig more deeply into this.

Inverse document frequency: it makes rarer terms count more.
Its two components are the number of docs in the collection and the
number of docs containing a specific term.
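For reference, Lucene's classic DefaultSimilarity combines exactly those two numbers (shown here as a standalone sketch of that formula):

```java
// Lucene's classic idf (DefaultSimilarity):
//   idf(t) = 1 + ln(numDocs / (docFreq(t) + 1))
// numDocs   = number of docs in the collection
// docFreq(t) = number of docs containing term t
class IdfDemo {
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }
}
```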

> > I don't think everything can be done in a single call since by the
> time you score docs against a query you have lost how you arrived at
> the composite score.
>
> I'm not sure what this means: "you have lost how you arrived at
> the composite score".  Could you explain?

If you query for "x OR y", the doc score you get will be a combination
of the doc score for x and the doc score for y.   After you have the
document score for the complete query, you can't adjust the IDF for
just one of the terms because you don't know the individual scores for
x and y anymore.

-Yonik
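A toy numeric sketch of Yonik's point (made-up numbers, not Lucene's actual scoring formula): once the per-term products are summed, re-weighting just one term's idf would require that term's individual contribution, which the composite score no longer exposes.

```java
// Toy illustration: for "x OR y", per-term contributions are summed into
// one composite score, so the individual term scores cannot be recovered
// from it later.  Real Lucene scoring has more factors (norms, coord, etc.).
class CompositeScoreDemo {
    static double score(double tfX, double idfX, double tfY, double idfY) {
        // The only thing returned to the merger is this sum.
        return tfX * idfX + tfY * idfY;
    }
}
```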




Re: Making RemoteSearchable like client for Solr

Posted by jason rutherglen <ja...@yahoo.com>.
It uses Jakarta HTTP Client, and implements a PriorityQueue-like thing using the java.util.concurrent queues and a thread pool for merging results.  Perhaps the global IDF is not a big deal?  The idea is to distribute the documents evenly over all the machines.  However, when a new server comes online, this may present a problem, as it would start at 0 documents.  The goal would be to allow scaling by simply adding hardware, with the software taking care of the rest.  
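The fetch-and-merge approach described above could look roughly like this (a sketch under assumptions: the Shard interface and ScoredDoc type stand in for the real Jakarta HTTP Client / XML layer, and only score-ordered merging is shown):

```java
import java.util.*;
import java.util.concurrent.*;

// One task per Solr server, results merged into a top-k priority queue.
class ParallelMerger {
    record ScoredDoc(String id, float score) {}
    interface Shard { List<ScoredDoc> search(String q) throws Exception; }

    static List<ScoredDoc> searchAll(List<Shard> shards, String q, int topK)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            // Fan the query out to every shard in parallel.
            List<Future<List<ScoredDoc>>> futures = new ArrayList<>();
            for (Shard s : shards)
                futures.add(pool.submit(() -> s.search(q)));

            // Min-heap of size topK keeps the k highest-scoring docs overall.
            PriorityQueue<ScoredDoc> heap =
                new PriorityQueue<>(Comparator.comparingDouble(ScoredDoc::score));
            for (Future<List<ScoredDoc>> f : futures)
                for (ScoredDoc d : f.get()) {
                    heap.offer(d);
                    if (heap.size() > topK) heap.poll();  // drop current lowest
                }

            List<ScoredDoc> out = new ArrayList<>(heap);
            out.sort(Comparator.comparingDouble(ScoredDoc::score).reversed());
            return out;
        } finally {
            pool.shutdown();
        }
    }
}
```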

I probably would not cache the global IDF, but would simply merge it each time.  I actually do not fully understand what the global IDF means, as I need to dig more deeply into this.  

> I don't think everything can be done in a single call since by the
time you score docs against a query you have lost how you arrived at
the composite score.

I'm not sure what this means: "you have lost how you arrived at
the composite score".  Could you explain?  

Anyways, thanks for doing Solr; it's quite cool, and it has been working quite well.  

Jason

----- Original Message ----
From: Yonik Seeley <ys...@gmail.com>
To: solr-dev@lucene.apache.org
Sent: Thursday, May 18, 2006 6:04:50 PM
Subject: Re: Making RemoteSearchable like client for Solr

On 5/18/06, jason rutherglen <ja...@yahoo.com> wrote:
> I used the XML, I think using HTTP is important.

Is this written in Java?  Using HTTPClient?  Anything you will be able to share?

> No caching on the client yet, that is a good idea, however my personal
> goal is to have an index that is updated every 30 seconds or less and
> so am not sure about caching on the client.  The caching can be
> handled by the Solr servers, that should be fine.  If it works
> correctly then the architecture is very simple requiring 2 layers.
> The first is a Solr layer, the second is the client layer essentially
> running many threads in parallel per request.  Seems like this would
> scale cheaply by adding more hardware on both layers.
>
> >  If you are using RMI you could
> either borrow from or subclass Lucene's MultiSearcher that implements
> this stuff.
>
> Yeah, this is the real issue: are there any general outlines of the best way to do this with Solr?  Perhaps a separate Solr call for the docFreqs?  Or could this be returned in the current /select call?  I'm still trying to figure this part out.

Using XML, there would definitely have to be some more API calls to
return idf related stuff.
I don't think everything can be done in a single call since by the
time you score docs against a query you have lost how you arrived at
the composite score.

It might be nice to be able to turn the distributed idf off,
though... people with large index segments and documents that are
randomly distributed probably won't see much of a difference in
scoring, but will see a performance increase.

We also need to be careful of caching scores at the local level... if
a different remote searcher changes, the scores cached on the other
become invalid because of the global idf (yuck).

-Yonik




Re: Making RemoteSearchable like client for Solr

Posted by jason rutherglen <ja...@yahoo.com>.
Sorry, I didn't answer the sharing bit: yes, it can be shared, I think; I need to ask T.C.

----- Original Message ----
From: Yonik Seeley <ys...@gmail.com>
To: solr-dev@lucene.apache.org
Sent: Thursday, May 18, 2006 6:04:50 PM
Subject: Re: Making RemoteSearchable like client for Solr

On 5/18/06, jason rutherglen <ja...@yahoo.com> wrote:
> I used the XML, I think using HTTP is important.

Is this written in Java?  Using HTTPClient?  Anything you will be able to share?

> No caching on the client yet, that is a good idea, however my personal
> goal is to have an index that is updated every 30 seconds or less and
> so am not sure about caching on the client.  The caching can be
> handled by the Solr servers, that should be fine.  If it works
> correctly then the architecture is very simple requiring 2 layers.
> The first is a Solr layer, the second is the client layer essentially
> running many threads in parallel per request.  Seems like this would
> scale cheaply by adding more hardware on both layers.
>
> >  If you are using RMI you could
> either borrow from or subclass Lucene's MultiSearcher that implements
> this stuff.
>
> Yeah, this is the real issue: are there any general outlines of the best way to do this with Solr?  Perhaps a separate Solr call for the docFreqs?  Or could this be returned in the current /select call?  I'm still trying to figure this part out.

Using XML, there would definitely have to be some more API calls to
return idf related stuff.
I don't think everything can be done in a single call since by the
time you score docs against a query you have lost how you arrived at
the composite score.

It might be nice to be able to turn the distributed idf off,
though... people with large index segments and documents that are
randomly distributed probably won't see much of a difference in
scoring, but will see a performance increase.

We also need to be careful of caching scores at the local level... if
a different remote searcher changes, the scores cached on the other
become invalid because of the global idf (yuck).

-Yonik




Re: Making RemoteSearchable like client for Solr

Posted by jason rutherglen <ja...@yahoo.com>.
I used XML; I think using HTTP is important.  No caching on the client yet; that is a good idea.  However, my personal goal is to have an index that is updated every 30 seconds or less, so I am not sure about caching on the client.  The caching can be handled by the Solr servers; that should be fine.  If it works correctly, then the architecture is very simple, requiring 2 layers: the first is a Solr layer, the second is the client layer, essentially running many threads in parallel per request.  Seems like this would scale cheaply by adding more hardware on both layers.  

>  If you are using RMI you could
either borrow from or subclass Lucene's MultiSearcher that implements
this stuff.

Yeah, this is the real issue: are there any general outlines of the best way to do this with Solr?  Perhaps a separate Solr call for the docFreqs?  Or could this be returned in the current /select call?  I'm still trying to figure this part out.  

----- Original Message ----
From: Yonik Seeley <ys...@gmail.com>
To: solr-dev@lucene.apache.org
Sent: Thursday, May 18, 2006 5:21:58 PM
Subject: Re: Making RemoteSearchable like client for Solr

On 5/18/06, jason rutherglen <ja...@yahoo.com> wrote:
> A solution for an index that requires 100s of millions of documents is to distribute the documents over multiple servers.  I thought I had the RemoteSearchable-like client for Solr pretty well done

Great!  Can you share what approach you followed?
Is caching done on the subsearchers, and not the supersearcher?
Are you using RMI, or XML/HTTP?

> , however this normalized scoring with weights throws a bit of a kink.

It certainly does... not easy stuff.  If you are using RMI you could
either borrow from or subclass Lucene's MultiSearcher that implements
this stuff.


-Yonik



