
Posted to solr-user@lucene.apache.org by Dmitry Kan <dm...@gmail.com> on 2011/03/22 12:51:09 UTC

solr on the cloud

hey folks,

I have tried running sharded Solr with ZooKeeper on a single machine.
The Solr code is from current trunk, and it runs nicely. Can you please point
me to a page where I can check the status of SolrCloud development and the
available features, apart from http://wiki.apache.org/solr/SolrCloud ?

I am mainly interested in trying out Map-Reduce for distributed faceting.
Is it even possible with the trunk?

-- 
Regards,

Dmitry Kan

Re: solr on the cloud

Posted by Dmitry Kan <dm...@gmail.com>.

Hi Otis,

Ok, thanks.

No, the question about distributed faceting was more of a guess, as
faceting seems to be a good fit for MR. I probably need to follow the JIRA
tickets more closely, but I was initially wondering whether I had missed some
documentation on the topic, which apparently I had not.

On Fri, Mar 25, 2011 at 5:34 AM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:

> Hi,
>
>
> > I have tried running the sharded solr with zoo keeper on a single machine.
> > The SOLR code is from current trunk. It runs nicely. Can you please point me
> > to a page, where I can check the status of the solr on the cloud development
> > and available features, apart from http://wiki.apache.org/solr/SolrCloud ?
>
> I'm afraid that's the most comprehensive documentation so far.
>
> > Basically, of high interest is checking out the Map-Reduce for distributed
> > faceting, is it even possible with the trunk?
>
> Hm, MR for distributed faceting?  Maybe I missed this... can you point to a
> place that mentions this?
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>



-- 
Regards,

Dmitry Kan

Re: solr on the cloud

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Hi,


> I have tried running the sharded solr with zoo keeper on a single machine.
> The SOLR code is from current trunk. It runs nicely. Can you please point me
> to a page, where I can check the status of the solr on the cloud development
> and available features, apart from http://wiki.apache.org/solr/SolrCloud ?

I'm afraid that's the most comprehensive documentation so far.

> Basically, of high interest is checking out the Map-Reduce for distributed
> faceting, is it even possible with the trunk?

Hm, MR for distributed faceting?  Maybe I missed this... can you point to a 
place that mentions this?

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

Re: solr on the cloud

Posted by Upayavira <uv...@odoko.co.uk>.


On Fri, 25 Mar 2011 14:26 +0200, "Dmitry Kan" <dm...@gmail.com>
wrote:
> Hi, Upayavira
>
> Probably I'm confusing the terms here. When I say "distributed faceting" I'm
> more into SOLR on the cloud (e.g. HDFS + MR + cloud of commodity machines)
> rather than into traditional multicore/sharded SOLR on a single or multiple
> servers with non-distributed file systems (is that what you mean when you
> refer to "distribution of facet requests across hosts"?)

I mean the latter, I am afraid. I'm very interested in how the former
might be implemented, but as far as I understand it, Zookeeper does not
take you all the way there. It co-ordinates nodes (e.g. telling a slave
where its master is), but if you have to distribute an index over
multiple hosts, it will be sharded between multiple solr hosts, with
each of those hosts having a local index.

You are presumably talking about a scenario where you effectively have
one index, spanning multiple hosts (we have code to distribute queries
across multiple segments, why can't we do it across multiple hosts?).
I've heard of work to do this with Infinispan underneath, but not within
the core Lucene/Solr.

Upayavira

--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source
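
To illustrate the sharded setup described above: with the existing
distributed search support, a facet request is spread over several Solr
hosts by listing them in the "shards" request parameter, and the rest of
the query stays the same as for a single host. Below is a minimal sketch in
Python that only builds such a URL. The host names, ports and the
"category" field are made-up examples and do not come from this thread.

```python
from urllib.parse import urlencode

# Hypothetical shard addresses, for illustration only.
SHARDS = ["localhost:8983/solr", "localhost:7574/solr"]


def distributed_facet_url(base_url, shards, facet_field, query="*:*"):
    """Build a select URL that facets on one field across several shards."""
    params = {
        "q": query,
        "rows": 0,                   # only the facet counts are wanted here
        "facet": "true",
        "facet.field": facet_field,
        "shards": ",".join(shards),  # the only addition for a distributed request
    }
    return base_url + "/select?" + urlencode(params)


# "category" is an assumed field name.
url = distributed_facet_url("http://localhost:8983/solr", SHARDS, "category")
print(url)
```

Dropping the "shards" entry gives the ordinary single-host facet request,
which is the point made above about the query syntax being the same.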

Re: solr on the cloud

Posted by Dmitry Kan <dm...@gmail.com>.

Thanks, Jason, this looks very relevant!

On Fri, Mar 25, 2011 at 11:26 PM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> Dmitry,
>
> If you're planning on using HBase you can take a look at
> https://issues.apache.org/jira/browse/HBASE-3529  I think we may even
> have a reasonable solution for reading the index [randomly] out of
> HDFS.  Benchmarking'll be implemented next.  It's not production
> ready, suggestions are welcome.
>
> Jason
>



-- 
Regards,

Dmitry Kan

Re: solr on the cloud

Posted by Jason Rutherglen <ja...@gmail.com>.

Dmitry,

If you're planning on using HBase you can take a look at
https://issues.apache.org/jira/browse/HBASE-3529  I think we may even
have a reasonable solution for reading the index [randomly] out of
HDFS.  Benchmarking'll be implemented next.  It's not production
ready, suggestions are welcome.

Jason

On Fri, Mar 25, 2011 at 2:03 PM, Dmitry Kan <dm...@gmail.com> wrote:
> Hi Otis,
>
> Thanks for elaborating on this and the link (funny!).
>
> I have quite a big dataset that is growing all the time. The problems I am
> starting to face are pretty predictable:
> 1. Scalability: this includes indexing time (currently some days; hours or
> even minutes would be better, if that's possible) along with handling the
> rapid growth.
> 2. Robustness: the entire system (distributed, single server, or anything
> else) should be fault-tolerant, e.g. if one shard goes down, another takes
> over (master-slave scheme).
> 3. Some apps that we run on SOLR are pretty computationally demanding, like
> faceting over uni-, bi- and trigrams of hundreds of millions of documents
> (index size of half a TB), so a single server with a shard of data does not
> seem to be enough for realtime search.
>
> This is just a bit of background. I agree with you that Hadoop and the
> cloud probably best suit massive batch processes rather than realtime
> search. I'm not sure whether anyone out there has made SOLR shine through
> the cloud for realtime search over large datasets.
>
> By "SOLR on the cloud (e.g. HDFS + MR + cloud of
> commodity machines)" I mean what you've done for your customers using EC2.
> Any chance that guidelines or articles on setting up indices on HDFS are
> available in some open or paid area?
>
> To sum up, I didn't mean to create a buzz about cloud solutions in this
> thread; I was just wondering what is practically available or going on in
> SOLR development in this regard.
>
> Thanks,
>
> Dmitry
>
>

Re: solr on the cloud

Posted by Dmitry Kan <dm...@gmail.com>.

Hi Otis,

Thanks for elaborating on this and the link (funny!).

I have quite a big dataset that is growing all the time. The problems I am
starting to face are pretty predictable:
1. Scalability: this includes indexing time (currently some days; hours or
even minutes would be better, if that's possible) along with handling the
rapid growth.
2. Robustness: the entire system (distributed, single server, or anything
else) should be fault-tolerant, e.g. if one shard goes down, another takes
over (master-slave scheme).
3. Some apps that we run on SOLR are pretty computationally demanding, like
faceting over uni-, bi- and trigrams of hundreds of millions of documents
(index size of half a TB), so a single server with a shard of data does not
seem to be enough for realtime search.

This is just a bit of background. I agree with you that Hadoop and the
cloud probably best suit massive batch processes rather than realtime
search. I'm not sure whether anyone out there has made SOLR shine through
the cloud for realtime search over large datasets.

By "SOLR on the cloud (e.g. HDFS + MR + cloud of
commodity machines)" I mean what you've done for your customers using EC2.
Any chance that guidelines or articles on setting up indices on HDFS are
available in some open or paid area?

To sum up, I didn't mean to create a buzz about cloud solutions in this
thread; I was just wondering what is practically available or going on in
SOLR development in this regard.

Thanks,

Dmitry


On Fri, Mar 25, 2011 at 10:28 PM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:

> Hi Dan,
>
> This feels a bit like a buzzword soup.... with mushrooms. :)
>
> MR jobs, at least the ones in Hadoopland, are very batch oriented, so that
> wouldn't be very suitable for most search applications.  There are some
> technologies like Riak that combine MR and search.  Let me use this funny little
> link: http://lmgtfy.com/?q=riak%20mapreduce%20search
>
>
> Sure, you can put indices on HDFS (but don't expect searches to be fast).  Sure
> you can create indices using MapReduce, we've done that successfully for
> customers bringing long indexing jobs from many hours to minutes by using, yes,
> a cluster of machines (actually EC2 instances).
> But when you say "more into SOLR on the cloud (e.g. HDFS + MR +  cloud of
> commodity machines)", I can't actually picture what precisely you mean...
>
>
> Otis
> ---
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
>

Re: solr on the cloud

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Hi Dan,

This feels a bit like a buzzword soup.... with mushrooms. :)

MR jobs, at least the ones in Hadoopland, are very batch oriented, so that 
wouldn't be very suitable for most search applications.  There are some 
technologies like Riak that combine MR and search.  Let me use this funny little 
link: http://lmgtfy.com/?q=riak%20mapreduce%20search


Sure, you can put indices on HDFS (but don't expect searches to be fast).  Sure 
you can create indices using MapReduce, we've done that successfully for 
customers bringing long indexing jobs from many hours to minutes by using, yes, 
a cluster of machines (actually EC2 instances).
But when you say "more into SOLR on the cloud (e.g. HDFS + MR +  cloud of 
commodity machines)", I can't actually picture what precisely you mean...  


Otis
---
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Dmitry Kan <dm...@gmail.com>
> To: solr-user@lucene.apache.org
> Cc: Upayavira <uv...@odoko.co.uk>
> Sent: Fri, March 25, 2011 8:26:33 AM
> Subject: Re: solr on the cloud
> 
> Hi, Upayavira
> 
> Probably I'm confusing the terms here. When I say "distributed faceting" I'm
> more into SOLR on the cloud (e.g. HDFS + MR + cloud of commodity machines)
> rather than into traditional multicore/sharded SOLR on a single or multiple
> servers with non-distributed file systems (is that what you mean when you
> refer to "distribution of facet requests across hosts"?)
> 

Re: solr on the cloud

Posted by Dmitry Kan <dm...@gmail.com>.

Hi, Upayavira

Probably I'm confusing the terms here. When I say "distributed faceting" I'm
more into SOLR on the cloud (e.g. HDFS + MR + cloud of commodity machines)
rather than into traditional multicore/sharded SOLR on a single or multiple
servers with non-distributed file systems (is that what you mean when you
refer to "distribution of facet requests across hosts"?)

On Fri, Mar 25, 2011 at 1:57 PM, Upayavira <uv...@odoko.co.uk> wrote:

>
>
> On Fri, 25 Mar 2011 13:44 +0200, "Dmitry Kan" <dm...@gmail.com> wrote:
> > Hi Yonik,
> >
> > Oh, this is great. Is distributed faceting available in the trunk? What is
> > the basic server setup needed for trying this out, is it cloud with HDFS and
> > SOLR with zookeepers?
> > Any chance to see the related documentation? :)
>
> Distributed faceting has been available for a long time, and is
> available in the 1.4.1 release.
>
> The distribution of facet requests across hosts happens in the
> background. There's no real difference (in query syntax) between a
> standard facet query and a distributed one.
>
> i.e. you don't need SolrCloud nor Zookeeper for it. (they may provide
> other benefits, but you don't need them for distributed faceting).
>
> Upayavira
>
> > On Fri, Mar 25, 2011 at 1:35 PM, Yonik Seeley
> > <yo...@lucidimagination.com> wrote:
> >
> > > On Tue, Mar 22, 2011 at 7:51 AM, Dmitry Kan <dm...@gmail.com> wrote:
> > > > Basically, of high interest is checking out the Map-Reduce for distributed
> > > > faceting, is it even possible with the trunk?
> > >
> > > Solr already has distributed faceting, and it's much more performant
> > > than a map-reduce implementation would be.
> > >
> > > I've also seen a product use the term "map reduce" incorrectly... as in,
> > > we "map" the request to each shard, and then "reduce" the results to a
> > > single list (of course, that's not actually map-reduce at all ;-)
> > >
> > >
> > :) this sounds pretty strange to me as well. It was only my guess, that if
> > you have MR as computational model and a cloud beneath it, you could
> > naturally map facet fields to their counts inside single documents (no
> > matter, where they are, be it shards or "single" index) and pass them onto
> > reducers.
> >
> >
> > > -Yonik
> > > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> > > 25-26, San Francisco
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Dmitry Kan
> >
> ---
> Enterprise Search Consultant at Sourcesense UK,
> Making Sense of Open Source
>
>


-- 
Regards,

Dmitry Kan
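
The scatter and gather step mentioned in the quoted exchange (send the
request to each shard, then combine the per-shard results into a single
list) can be pictured with a small Python sketch. This is only a simplified
model under the assumption that every shard returns complete counts for the
facet field. It is not Solr's actual merge code, which may need extra
round trips to refine counts for terms that only some shards reported. The
shard data below is invented for the example.

```python
from collections import Counter


def merge_facet_counts(per_shard_counts, limit=10):
    """Sum per-shard term counts and return the top `limit` terms."""
    total = Counter()
    for counts in per_shard_counts:
        total.update(counts)  # adds the counts term by term
    # Sort by descending count, then by term so that ties are deterministic.
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


# Invented facet counts, as two shards might report them for one field.
shard_a = {"solr": 5, "lucene": 3}
shard_b = {"solr": 2, "hadoop": 4}

merged = merge_facet_counts([shard_a, shard_b])
print(merged)
```

The merge runs inside a single request and response cycle on the host that
received the query, which is one way to see why it is not map-reduce in the
Hadoop batch sense.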

Re: DIH relating multiple DataSources

Posted by Chris Hostetter <ho...@fucit.org>.

: Subject: DIH relating multiple DataSources
: In-Reply-To: <13...@webmail.messagingengine.com>
: References:
:     <AA...@mail.gmail.com><AANLkTinr=r
:     +-N3HFNRbT1Cx4gvkv-A=CgW5FeMuOxbXm@mail.gmail.com>
:  <AA...@mail.gmail.com>
:  <13...@webmail.messagingengine.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.




-Hoss

DIH relating multiple DataSources

Posted by je...@trend.com.tw.

Hi All,

I'm a newbie to Solr and am hoping to get some help.

I was able to get DIH to work with one datasource. What I'm trying to achieve is using two datasources to build my document. Below is my data-config:

<dataConfig>
    <dataSource name="localDB" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/ebook" user="ebook" password="masked" batchSize="1" />
    <dataSource name="remoteDB" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://tw-stntlab1:3306/test" user="root" password="masked" batchSize="1" />
    <document name="epub">
        <entity dataSource="localDB" rootEntity="true" name="epub" pk="ID" query="select * from epub">
            <field column="ID" name="id" />
            <field column="Name" name="url" />
            <field column="Author" name="content" />
            <entity dataSource="remoteDB" name="test" query="select TESTCOLUMN from jctest where ID='${epub.ID}'">
                <field column="TESTCOLUMN" name="title" />
            </entity>
        </entity>
    </document>
</dataConfig>

Is the above possible? I can't seem to get my "title" field populated from the second datasource, but the fields defined in my rootEntity using the first datasource work perfectly fine.

Thanks,
Jeff

Re: solr on the cloud

Posted by Upayavira <uv...@odoko.co.uk>.


On Fri, 25 Mar 2011 13:44 +0200, "Dmitry Kan" <dm...@gmail.com>
wrote:
> Hi Yonik,
> 
> Oh, this is great. Is distributed faceting available in the trunk? What
> basic server setup is needed to try it out: a cloud with HDFS, plus
> Solr with ZooKeeper?
> Any chance to see the related documentation? :)

Distributed faceting has been available for a long time, and is
available in the 1.4.1 release.

The distribution of facet requests across hosts happens in the
background. There's no real difference (in query syntax) between a
standard facet query and a distributed one.

I.e., you need neither SolrCloud nor ZooKeeper for it. (They may provide
other benefits, but you don't need them for distributed faceting.)
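To illustrate the point about query syntax, here is a minimal sketch that builds a standard facet request and a distributed one. It assumes the distributed request is made by adding Solr's `shards` parameter; the host names, ports, and the `category` field are hypothetical, and `build_facet_url` is just an illustrative helper, not part of any Solr client.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def build_facet_url(base_url, facet_field, shards=None):
    """Build a Solr select URL that facets on one field.

    Passing `shards` adds the comma-separated shard list; the rest of
    the query stays the same.
    """
    params = {
        "q": "*:*",
        "rows": 0,                  # only the facet counts are of interest
        "facet": "true",
        "facet.field": facet_field,
    }
    if shards:
        params["shards"] = ",".join(shards)
    return base_url + "/select?" + urlencode(params)


# Hypothetical hosts and field name, for illustration only.
single = build_facet_url("http://localhost:8983/solr", "category")
distributed = build_facet_url(
    "http://localhost:8983/solr",
    "category",
    shards=["localhost:8983/solr", "localhost:7574/solr"],
)

# Compare the parameter names of the two requests.
single_keys = set(parse_qs(urlsplit(single).query))
distributed_keys = set(parse_qs(urlsplit(distributed).query))
print(single)
print(distributed)
print(distributed_keys - single_keys)  # the only extra parameter is "shards"
```

The facet parameters are identical in both URLs; only the shard list differs, and neither SolrCloud nor ZooKeeper is involved in building the request.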

Upayavira

> On Fri, Mar 25, 2011 at 1:35 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:
> 
> > On Tue, Mar 22, 2011 at 7:51 AM, Dmitry Kan <dm...@gmail.com> wrote:
> > > Basically, of high interest is checking out the Map-Reduce for
> > > distributed faceting, is it even possible with the trunk?
> >
> > Solr already has distributed faceting, and it's much more performant
> > than a map-reduce implementation would be.
> >
> > I've also seen a product use the term "map reduce" incorrectly... as in,
> > we "map" the request to each shard, and then "reduce" the results to a
> > single list (of course, that's not actually map-reduce at all ;-)
> >
> >
> :) this sounds pretty strange to me as well. It was only my guess that,
> if you have MR as the computational model and a cloud beneath it, you
> could naturally map facet fields to their counts inside single documents
> (no matter where they are, be it shards or a "single" index) and pass
> them on to reducers.
> 
> 
> > -Yonik
> > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> > 25-26, San Francisco
> >
> 
> 
> 
> -- 
> Regards,
> 
> Dmitry Kan
> 
--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source

Re: solr on the cloud

Posted by Dmitry Kan <dm...@gmail.com>.

Hi Yonik,

Oh, this is great. Is distributed faceting available in the trunk? What
basic server setup is needed to try it out: a cloud with HDFS, plus Solr
with ZooKeeper?
Any chance to see the related documentation? :)

On Fri, Mar 25, 2011 at 1:35 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:

> On Tue, Mar 22, 2011 at 7:51 AM, Dmitry Kan <dm...@gmail.com> wrote:
> > Basically, of high interest is checking out the Map-Reduce for
> > distributed faceting, is it even possible with the trunk?
>
> Solr already has distributed faceting, and it's much more performant
> than a map-reduce implementation would be.
>
> I've also seen a product use the term "map reduce" incorrectly... as in,
> we "map" the request to each shard, and then "reduce" the results to a
> single list (of course, that's not actually map-reduce at all ;-)
>
>
:) this sounds pretty strange to me as well. It was only my guess that, if
you have MR as the computational model and a cloud beneath it, you could
naturally map facet fields to their counts inside single documents (no
matter where they are, be it shards or a "single" index) and pass them on
to reducers.
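
The idea could be sketched roughly as follows. This is only a toy, in-memory illustration of the map and reduce steps, not Hadoop code and not how Solr computes facets; the documents, the `category` field, and the function names are made up for the example.

```python
from collections import Counter


def map_document(doc, facet_field):
    """Map step: emit a (facet value, 1) pair for each value in one document."""
    values = doc.get(facet_field, [])
    if not isinstance(values, (list, tuple)):
        values = [values]  # treat a single-valued field like a one-item list
    for value in values:
        yield value, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted ones per facet value."""
    counts = Counter()
    for value, n in pairs:
        counts[value] += n
    return counts


# Hypothetical documents; it does not matter which shard each one lives on.
docs = [
    {"id": 1, "category": "books"},
    {"id": 2, "category": ["books", "music"]},
    {"id": 3, "category": "music"},
    {"id": 4},  # no value for the facet field, so nothing is emitted
]

pairs = (pair for doc in docs for pair in map_document(doc, "category"))
facet_counts = reduce_counts(pairs)
print(dict(facet_counts))
```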


> -Yonik
> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> 25-26, San Francisco
>



-- 
Regards,

Dmitry Kan

Re: solr on the cloud

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Tue, Mar 22, 2011 at 7:51 AM, Dmitry Kan <dm...@gmail.com> wrote:
> Basically, of high interest is checking out the Map-Reduce for distributed
> faceting, is it even possible with the trunk?

Solr already has distributed faceting, and it's much more performant
than a map-reduce implementation would be.

I've also seen a product use the term "map reduce" incorrectly... as in,
we "map" the request to each shard, and then "reduce" the results to a
single list (of course, that's not actually map-reduce at all ;-)
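
For contrast, the loose "map the request, reduce the results" usage amounts to a scatter-gather merge, which can be sketched as below. This is a simplified illustration, not Solr's actual merging code: the per-shard counts are hypothetical, `merge_shard_facets` is an invented name, and the sketch ignores whatever extra work a real implementation does to keep the top-N counts accurate.

```python
from collections import Counter


def merge_shard_facets(shard_results, limit=10):
    """Sum per-shard facet counts and return the top `limit` values.

    Each element of `shard_results` is a dict of facet value -> count,
    as returned (hypothetically) by one shard for the same request.
    """
    total = Counter()
    for counts in shard_results:
        total.update(counts)  # adds counts for values seen on several shards
    # Sort by descending count, then by value so that ties are stable.
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


# Hypothetical facet counts from two shards.
shard_results = [
    {"books": 3, "music": 1},
    {"books": 2, "films": 4},
]

merged = merge_shard_facets(shard_results, limit=2)
print(merged)
```

No key-based shuffle between mapper and reducer processes happens here; one coordinating node simply sums a handful of small responses.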

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco