Posted to solr-user@lucene.apache.org by Trung Pham <tr...@phamcom.com> on 2012/07/26 23:55:18 UTC

Map/Reduce directly against solr4 index.

Is it possible to run map reduce jobs directly on Solr4?

I'm asking because I want to use Solr4 as the primary storage engine,
and I want to be able to run near-real-time analytics against it as well,
rather than exporting Solr4 data out to a Hadoop cluster.

Re: Map/Reduce directly against solr4 index.

Posted by Trung Pham <tr...@phamcom.com>.
That is exactly what I want.

I want the distributed Hadoop task node to run on the same server that
holds the local shard of the distributed Solr index. That way there is no
need to move any data around... I think other people call this feature the
'data locality' of map/reduce.

I believe the HBase and Hadoop integration works exactly like this. The only
difference here is that we are substituting the distributed Solr indexes
for HDFS.

Since Solr4 can manage the sharded/distributed index files, it is doing the
exact work that HDFS does. In theory, this should be achievable.
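For illustration, the 'data locality' idea can be sketched as a toy scheduler (plain Python, with made-up shard and node names; real Hadoop expresses this through InputSplit location hints, not an API like this):

```python
# Toy sketch of locality-aware task placement over Solr shards.
# Illustrative only: all names are invented, and this is not the Hadoop API.

# shard -> node whose local disk holds that shard's index files
shard_locations = {
    "shard1": "node-a",
    "shard2": "node-b",
    "shard3": "node-c",
}

def schedule_tasks(locations):
    """Place each map task on the node that already holds its shard."""
    return {shard: node for shard, node in locations.items()}

def shards_read_remotely(assignments, locations):
    """Count tasks that would have to pull index data over the network."""
    return sum(1 for shard, node in assignments.items()
               if locations[shard] != node)

assignments = schedule_tasks(shard_locations)
# with locality-aware placement, no index files cross the network
```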

On Thu, Jul 26, 2012 at 7:51 PM, Lance Norskog <go...@gmail.com> wrote:

> No. This is just a Hadoop file input class. Distributed Hadoop has to
> get files from a distributed file service. It sounds like you want
> some kind of distributed file service that maps a TaskNode (??) on a
> given server to the files available on that server. There might be
> something that does this. HDFS works very hard at doing this; are you
> sure it is not good enough? I am endlessly amazed at the speed of
> these distributed apps.
>
> Have you done a proof of concept?
>
> On Thu, Jul 26, 2012 at 7:40 PM, Trung Pham <tr...@phamcom.com> wrote:
> > Can it read distributed lucene indexes in SolrCloud?
> > On Jul 26, 2012 7:11 PM, "Lance Norskog" <go...@gmail.com> wrote:
> >
> >> Mahout includes a file reader for Lucene indexes. It will read from
> >> HDFS or local disks.
> >>
> >> On Thu, Jul 26, 2012 at 6:57 PM, Darren Govoni <da...@ontrenet.com>
> >> wrote:
> >> > You raise an interesting possibility. A map/reduce solr handler over
> >> > solrcloud.......
> >> >
> >> > On Thu, 2012-07-26 at 18:52 -0700, Trung Pham wrote:
> >> >
> >> >> I think the performance should be close to Hadoop running on HDFS, if
> >> >> somehow Hadoop job can directly read the Solr Index file while
> executing
> >> >> the job on the local solr node.
> >> >>
> >> >> Kindna like how HBase and Cassadra integrate with Hadoop.
> >> >>
> >> >> Plus, we can run the map reduce job on a standby Solr4 cluster.
> >> >>
> >> >> This way, the documents in Solr will be our primary source of truth.
> >> And we
> >> >> have the ability to run near real time search queries and analytics
> on
> >> it.
> >> >> No need to export data around.
> >> >>
> >> >> Solr4 is becoming a very interesting solution to many web scale
> >> problems.
> >> >> Just missing the map/reduce component. :)
> >> >>
> >> >> On Thu, Jul 26, 2012 at 3:01 PM, Darren Govoni <da...@ontrenet.com>
> >> wrote:
> >> >>
> >> >> > Of course you can do it, but the question is whether this will
> produce
> >> >> > the performance results you expect.
> >> >> > I've seen talk about this in other forums, so you might find some
> >> prior
> >> >> > work here.
> >> >> >
> >> >> > Solr and HDFS serve somewhat different purposes. The key issue
> would
> >> be
> >> >> > if your map and reduce code
> >> >> > overloads the Solr endpoint. Even using SolrCloud, I believe all
> >> >> > requests will have to go through a single
> >> >> > URL (to be routed), so if you have thousands of map/reduce jobs all
> >> >> > running simultaneously, the question is whether
> >> >> > your Solr is architected to handle that amount of throughput.
> >> >> >
> >> >> >
> >> >> > On Thu, 2012-07-26 at 14:55 -0700, Trung Pham wrote:
> >> >> >
> >> >> > > Is it possible to run map reduce jobs directly on Solr4?
> >> >> > >
> >> >> > > I'm asking this because I want to use Solr4 as the primary
> storage
> >> >> > engine.
> >> >> > > And I want to be able to run near real time analytics against it
> as
> >> well.
> >> >> > > Rather than export solr4 data out to a hadoop cluster.
> >> >> >
> >> >> >
> >> >> >
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Lance Norskog
> >> goksron@gmail.com
> >>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: Map/Reduce directly against solr4 index.

Posted by Lance Norskog <go...@gmail.com>.
No. This is just a Hadoop file input class. Distributed Hadoop has to
get files from a distributed file service. It sounds like you want
some kind of distributed file service that maps a TaskNode (??) on a
given server to the files available on that server. There might be
something that does this. HDFS works very hard at doing this; are you
sure it is not good enough? I am endlessly amazed at the speed of
these distributed apps.

Have you done a proof of concept?




-- 
Lance Norskog
goksron@gmail.com

Re: Map/Reduce directly against solr4 index.

Posted by Trung Pham <tr...@phamcom.com>.
Can it read distributed lucene indexes in SolrCloud?

Re: Map/Reduce directly against solr4 index.

Posted by Lance Norskog <go...@gmail.com>.
Mahout includes a file reader for Lucene indexes. It will read from
HDFS or local disks.
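As a rough sketch of the pattern such a reader enables (plain Python, not Mahout or Hadoop; the shard contents are invented), each node's map step runs over documents read from its own local shard, and a reduce step merges the partial results:

```python
# Toy word count in the map/reduce style, over documents each "node" reads
# from its own local shard. Illustrative only -- not Mahout or Hadoop code.
from collections import Counter

# node -> documents held in that node's local index (invented data)
local_shards = {
    "node-a": ["solr index files", "hadoop reads files"],
    "node-b": ["hadoop job", "solr job"],
}

def map_phase(docs):
    # emit (word, 1) pairs from documents read locally on one node
    return [(word, 1) for doc in docs for word in doc.split()]

def reduce_phase(pairs):
    # merge the partial counts from every node
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

pairs = [p for docs in local_shards.values() for p in map_phase(docs)]
counts = reduce_phase(pairs)
```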

-- 
Lance Norskog
goksron@gmail.com

Re: Map/Reduce directly against solr4 index.

Posted by Darren Govoni <da...@ontrenet.com>.
You raise an interesting possibility: a map/reduce Solr handler over
SolrCloud...




Re: Map/Reduce directly against solr4 index.

Posted by Trung Pham <tr...@phamcom.com>.
I think the performance should be close to Hadoop running on HDFS, if
somehow the Hadoop job can directly read the Solr index files while
executing on the local Solr node.

Kinda like how HBase and Cassandra integrate with Hadoop.

Plus, we can run the map/reduce job on a standby Solr4 cluster.

This way, the documents in Solr will be our primary source of truth, and we
have the ability to run near-real-time search queries and analytics on it.
No need to export data around.

Solr4 is becoming a very interesting solution to many web-scale problems.
Just missing the map/reduce component. :)


Re: Map/Reduce directly against solr4 index.

Posted by Darren Govoni <da...@ontrenet.com>.
Of course you can do it, but the question is whether it will produce the
performance results you expect. I've seen talk about this in other forums,
so you might find some prior work.

Solr and HDFS serve somewhat different purposes. The key issue is whether
your map and reduce code would overload the Solr endpoint. Even using
SolrCloud, I believe all requests have to go through a single URL (to be
routed), so if you have thousands of map/reduce jobs all running
simultaneously, the question is whether your Solr deployment is architected
to handle that amount of throughput.
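The throughput concern above can be put in back-of-the-envelope terms with a toy load model (illustrative Python arithmetic, not a benchmark of Solr; the request counts are invented):

```python
# Toy load model: compare every mapper funneling through one routing
# endpoint vs. each mapper reading its own local shard. Illustrative only.

def requests_per_node(n_mappers, n_nodes, single_endpoint):
    """Return a node -> request-count map under each access pattern."""
    if single_endpoint:
        # all traffic goes through one routed URL on node 0
        return {0: n_mappers}
    # data-local reads spread the work across every shard-holding node
    per_node, extra = divmod(n_mappers, n_nodes)
    return {i: per_node + (1 if i < extra else 0) for i in range(n_nodes)}

routed = requests_per_node(1000, 10, single_endpoint=True)
local = requests_per_node(1000, 10, single_endpoint=False)
# one node absorbing 1000 requests vs. 100 per node when reads stay local
```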





Re: Map/Reduce directly against solr4 index.

Posted by Schmidt Jeff <ja...@gmail.com>.
It's not free (for production use, anyway), but you might consider DataStax
Enterprise: http://www.datastax.com/products/enterprise

It is a very nice consolidation of Cassandra, Solr, and Hadoop. No ETL
required.

Cheers,

Jeff
