Posted to solr-dev@lucene.apache.org by Ken Krugler <kk...@transpac.com> on 2006/09/28 02:24:01 UTC

Re: Is solr scalable with respect to number of documents?

[Moving this to solr-dev from solr-user]

>On 9/27/06, Vish D. <vi...@gmail.com> wrote:
>>I just noticed that link on the first reply from Yonik about
>>FederatedSearch. I see that a lot of thought went in to it. I guess the
>>question to ask would be, any progress on it, Yonik? :)
>
>No code, but great progress at shooting holes in various strategies ;-)
>
>I'm currently thinking about doing federated search at a higher level,
>with slightly modified standard request handlers, and another
>top-level request handler that can combine requests.  The biggest
>downside: no custom query handlers.
>
>The other option: do federated search like a lucene MultiSearcher...
>(a federated version of the SolrIndexSearcher).  The downside is that
>existing interfaces would not be usable... we can't be shipping tons
>of BitDocSets across the network.  Things like highlighting, federated
>search, etc, would need to be pushed down into this interface.  New
>interfaces means lots of changes to request handler code.  Upside
>would be that custom request handlers would still work and be
>automatically parallelized.
>
>Anyone have any thoughts on this stuff?
>http://wiki.apache.org/solr/FederatedSearch

Quick impression - given the scope of what's being described on this 
page, it feels like a "boil the ocean" problem.

I've spent an afternoon looking at how we could use Solr as our 
distributed searchers for Nutch. Currently the Nutch search serving 
code isn't getting much love, so somehow leveraging Solr would seem 
like a win.

The three attributes of Solr that are most interesting to me in this 
context are:

1. Live update support.

2. More complex query processing.

3. Caching (though not as critical)

Things I can live with that I noticed being described as issues on 
the Federated Search page:

  * No sorting support - just simple merged ranking.
  * No IDF skew compensation - we can mix documents sufficiently.
  * No automatic doc->server mapping - we can calc our own stable hash for this.
  * No consistency via retry.
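As a rough illustration of the third point, a stable doc->server mapping only needs a hash that doesn't change across JVMs or restarts. The following is a sketch; StableHash, fnv1a, and serverFor are invented names, not Nutch or Solr APIs:

```java
public class StableHash {
    // FNV-1a over the document's unique key, so the mapping doesn't depend
    // on JVM- or version-specific String.hashCode() behavior.
    static long fnv1a(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Same key always lands on the same server, for a fixed server count.
    static int serverFor(String docKey, int numServers) {
        return (int) Math.floorMod(fnv1a(docKey), (long) numServers);
    }
}
```

Note this simple modulo scheme reshuffles most documents if numServers changes; a consistent-hashing variant would limit that, at the cost of more code.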

To that end, I did a quick exploration of how to use Hadoop RPC to 
"talk" to the guts of Solr. This assumes that:

1. Query processing happens at the search server level, versus at the 
master, as it is currently with Nutch.

2. There's a way to request summaries by document id via a subsequent 
(post-merge) call from the master.

<and a bunch of other issues that I haven't noted>.
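The two assumptions above amount to a two-phase flow: servers score locally, the master merges by score, then summaries are fetched only for the winners. A minimal sketch, with invented names (Hit, topN, docidsByShard), not Solr or Nutch APIs:

```java
import java.util.*;

public class TwoPhase {
    static class Hit {
        final int shard, docid;
        final float score;
        Hit(int shard, int docid, float score) {
            this.shard = shard; this.docid = docid; this.score = score;
        }
    }

    // Phase 1 merge: keep the n best hits across all search servers.
    static List<Hit> topN(List<List<Hit>> perServer, int n) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perServer) all.addAll(hits);
        all.sort((a, b) -> Float.compare(b.score, a.score)); // best first
        return new ArrayList<>(all.subList(0, Math.min(n, all.size())));
    }

    // Phase 2 prep: group winning docids by shard, so each server is asked
    // only for the summaries it owns in the post-merge call.
    static Map<Integer, List<Integer>> docidsByShard(List<Hit> merged) {
        Map<Integer, List<Integer>> out = new HashMap<>();
        for (Hit h : merged)
            out.computeIfAbsent(h.shard, k -> new ArrayList<>()).add(h.docid);
        return out;
    }
}
```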

The immediate problem I ran into is that the notion of Solr running 
inside of a servlet container currently penetrates deep into the 
bowels of the code. Even below the core level, calls are being made 
to extract query parameters from a URL.

So step 1, if I was going to try to do this in a clean manner, would 
be to define a servlet side/Solr core API layer. Then it would be 
relatively easy to at least do the first cut of hooking up the Solr 
core to a Nutch master via Hadoop RPC.

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

Re: Is solr scalable with respect to number of documents?

Posted by Doug Cutting <cu...@apache.org>.
Ken Krugler wrote:
> RMI is faster than Hadoop RPC [ ... ]

Have you benchmarked this?

Doug

Re: Is solr scalable with respect to number of documents?

Posted by Ken Krugler <kk...@transpac.com>.
>On 9/27/06, Ken Krugler <kk...@transpac.com> wrote:
>>I've spent an afternoon looking at how we could use Solr as our
>>distributed searchers for Nutch. Currently the Nutch search serving
>>code isn't getting much love, so somehow leveraging Solr would seem
>>like a win.
>>
>>The three attributes of Solr that are most interesting to me in this
>>context are:
>>
>>1. Live update support.
>>
>>2. More complex query processing.
>>
>>3. Caching (though not as critical)
>>
>>Things I can live with that I noticed being described as issues on
>>the Federated Search page:
>>
>>   * No sorting support - just simple merged ranking.
>
>I think it wouldn't be too much trouble to support all forms of
>sorting that Solr currently supports.  This can be done in the same
>manner as the current Lucene MultiSearcher.
>
>>   * No IDF skew compensation - we can mix documents sufficiently.
>
>Yeah, I wasn't going to tackle that on the first pass.  But it is
>doable (again, the Lucene MultiSearcher shows how).  I'd want to make
>it optional in any case, because the performance gains are often not
>worth it.

Note that Nutch doesn't try to solve this, because of concerns that 
the extra round-trips required to normalize IDFs across remote 
searchers would be too slow. RMI is faster than Hadoop RPC, so I 
guess it's less of an issue there.

>>   * No automatic doc->server mapping - we can calc our own stable 
>>hash for this.
>>   * No consistency via retry.
>
>>To that end, I did a quick exploration of how to use Hadoop RPC to
>>"talk" to the guts of Solr. This assumes that:
>
>I'm not into Nutch or Hadoop that much yet, so I'd be really
>interested what you find out there.
>
>>1. Query processing happens at the search server level, versus at the
>>master, as it is currently with Nutch.
>>
>>2. There's a way to request summaries by document id via a subsequent
>>(post-merge) call from the master.
>
>#2 is the biggie I think (if by "document id" you mean internal lucene docid).
>Not having internal document ids change between calls is the biggest problem.

Well, you have to handle potential summarizer problems in any case - 
for example if a remote searcher goes away, or gets so bogged down 
that it times out, but you've got a hit from that server which needs 
a summary. This is the case we ran into during load testing.

Though that wouldn't be as serious as getting a completely wrong 
summary, which could happen if the remote index was updated between 
when the search request happened and the summary was requested.

A munge count (an index version number that's bumped on each update) 
might be enough, and pretty simple.
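A sketch of that idea, with invented names (VersionedShard and its methods are not Solr or Nutch APIs): stamp each result set with the index version it was computed against, and have the summary call refuse to answer if the index has been reopened since, because internal docids are only stable within one version.

```java
public class VersionedShard {
    private long indexVersion = 0;

    long currentVersion() { return indexVersion; }

    // Bumped on every index reopen; old docids may now point at other docs.
    void reopenIndex() { indexVersion++; }

    String summary(int docid, long expectedVersion) {
        if (expectedVersion != indexVersion)
            throw new IllegalStateException("index changed since search; retry");
        return "summary for doc " + docid; // stand-in for the real summarizer
    }
}
```

On a version mismatch the master can simply re-run the search rather than risk returning a summary for the wrong document.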

>><and a bunch of other issues that I haven't noted>.
>>
>>The immediate problem I ran into is that the notion of Solr running
>>inside of a servlet container currently penetrates deep into the
>>bowels of the code. Even below the core level, calls are being made
>>to extract query parameters from a URL.
>
>That's wrapped up in SolrQueryParams, which has a non-servlet version though.
>The unit tests use this to run outside of a container.

That's part of it, but from what I remember there were other issues 
with servlet-esque objects getting passed down deep. I'll have to 
take another look, as my afternoon of poking was a few weeks ago.

Thanks,

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

Re: Is solr scalable with respect to number of documents?

Posted by Yonik Seeley <yo...@apache.org>.
On 9/27/06, Yonik Seeley <yo...@apache.org> wrote:
> On 9/27/06, Ken Krugler <kk...@transpac.com> wrote:
> > The immediate problem I ran into is that the notion of Solr running
> > inside of a servlet container currently penetrates deep into the
> > bowels of the code. Even below the core level, calls are being made
> > to extract query parameters from a URL.
>
> That's wrapped up in SolrQueryParams, which has a non-servlet version though.
> The unit tests use this to run outside of a container.

Make that SolrQueryRequest.  The local version is
http://incubator.apache.org/solr/docs/api/org/apache/solr/request/LocalSolrQueryRequest.html

-Yonik

Re: Is solr scalable with respect to number of documents?

Posted by Yonik Seeley <yo...@apache.org>.
On 9/27/06, Ken Krugler <kk...@transpac.com> wrote:
> I've spent an afternoon looking at how we could use Solr as our
> distributed searchers for Nutch. Currently the Nutch search serving
> code isn't getting much love, so somehow leveraging Solr would seem
> like a win.
>
> The three attributes of Solr that are most interesting to me in this
> context are:
>
> 1. Live update support.
>
> 2. More complex query processing.
>
> 3. Caching (though not as critical)
>
> Things I can live with that I noticed being described as issues on
> the Federated Search page:
>
>   * No sorting support - just simple merged ranking.

I think it wouldn't be too much trouble to support all forms of
sorting that Solr currently supports.  This can be done in the same
manner as the current Lucene MultiSearcher.
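A hedged sketch of that MultiSearcher-style merge: each shard returns its hits already in sort order, and the merger does a k-way merge on the sort key so the global order matches what a single index would produce. ShardHit and kWayMerge are illustrative names, not Lucene or Solr APIs.

```java
import java.util.*;

public class SortedMerge {
    static class ShardHit {
        final int shard, docid;
        final float sortKey; // score, or any sortable field value
        ShardHit(int shard, int docid, float sortKey) {
            this.shard = shard; this.docid = docid; this.sortKey = sortKey;
        }
    }

    // shards: each deque is one shard's hits, already sorted best-first.
    static List<ShardHit> kWayMerge(List<Deque<ShardHit>> shards, int n) {
        // Queue ordered by each shard's current best hit, descending.
        PriorityQueue<Deque<ShardHit>> pq = new PriorityQueue<>(
            (a, b) -> Float.compare(b.peek().sortKey, a.peek().sortKey));
        for (Deque<ShardHit> s : shards) if (!s.isEmpty()) pq.add(s);
        List<ShardHit> out = new ArrayList<>();
        while (!pq.isEmpty() && out.size() < n) {
            Deque<ShardHit> s = pq.poll();
            out.add(s.poll());           // take that shard's best remaining hit
            if (!s.isEmpty()) pq.add(s); // re-insert to recompute the ordering
        }
        return out;
    }
}
```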

>   * No IDF skew compensation - we can mix documents sufficiently.

Yeah, I wasn't going to tackle that on the first pass.  But it is
doable (again, the Lucene MultiSearcher shows how).  I'd want to make
it optional in any case, because the performance gains are often not
worth it.
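For the record, the optional fix would look roughly like this: one extra round-trip collects per-shard document frequencies, the master sums them, and every shard then scores with the same global idf instead of its local one. The names here are invented; the formula is classic Lucene's idf = 1 + ln(numDocs / (df + 1)).

```java
import java.util.*;

public class GlobalIdf {
    static double idf(long globalDf, long globalNumDocs) {
        return 1.0 + Math.log((double) globalNumDocs / (double) (globalDf + 1));
    }

    // perShardDf: for each shard, term -> document frequency on that shard.
    static Map<String, Double> globalIdfs(List<Map<String, Long>> perShardDf,
                                          long globalNumDocs) {
        Map<String, Long> df = new HashMap<>();
        for (Map<String, Long> shard : perShardDf)
            shard.forEach((term, d) -> df.merge(term, d, Long::sum));
        Map<String, Double> out = new HashMap<>();
        df.forEach((term, d) -> out.put(term, idf(d, globalNumDocs)));
        return out;
    }
}
```

The extra round-trip per query is exactly the cost that makes this worth leaving optional.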

>   * No automatic doc->server mapping - we can calc our own stable hash for this.
>   * No consistency via retry.

> To that end, I did a quick exploration of how to use Hadoop RPC to
> "talk" to the guts of Solr. This assumes that:

I'm not into Nutch or Hadoop that much yet, so I'd be really
interested what you find out there.

> 1. Query processing happens at the search server level, versus at the
> master, as it is currently with Nutch.
>
> 2. There's a way to request summaries by document id via a subsequent
> (post-merge) call from the master.

#2 is the biggie I think (if by "document id" you mean internal lucene docid).
Not having internal document ids change between calls is the biggest problem.

> <and a bunch of other issues that I haven't noted>.
>
> The immediate problem I ran into is that the notion of Solr running
> inside of a servlet container currently penetrates deep into the
> bowels of the code. Even below the core level, calls are being made
> to extract query parameters from a URL.

That's wrapped up in SolrQueryParams, which has a non-servlet version though.
The unit tests use this to run outside of a container.

-Yonik