Posted to solr-user@lucene.apache.org by Rohit Jain <ro...@esgyn.com> on 2017/06/12 16:24:00 UTC

Parallel API interface into SOLR

Hi folks,

We have a solution where we would like to connect to SOLR via an API, submit a query, and then pre-process the results before we return the results to our users.  However, in some cases, it is possible that the results being returned by SOLR, in a large distributed cluster deployment, are very large.  In these cases, we would like to set up parallel streams, so that each parallel SOLR worker feeds directly into one of our processes distributed across the cluster.  That way, we can pre-process those results in parallel, before we consolidate (and potentially reduce / aggregate) the results further for the user, who has a single client connection to our solution.  Sort of a MapReduce type scenario where our processes are the reducers.  We could consume the results as returned by these SOLR Worker processes, or perhaps have them shuffled based on a shard key, before our processes would receive them.

Any ideas on how this could be done?

Rohit Jain

RE: Parallel API interface into SOLR

Posted by Rohit Jain <ro...@esgyn.com>.
Thanks a lot Joel!  No wonder I could not find it :-).  I will try to see if this will work for us.

Rohit

-----Original Message-----
From: Joel Bernstein [mailto:joelsolr@gmail.com] 
Sent: Monday, June 12, 2017 1:01 PM
To: solr-user@lucene.apache.org
Subject: Re: Parallel API interface into SOLR

You can do what you're trying to do by using the SolrStream but it's
complex and not documented. Here is the basic code for having multiple
clients hitting the same shard:

*On client 1:*

SolrClientCache cache = new SolrClientCache();

StreamContext context = new StreamContext();
context.setSolrClientCache(cache);
context.numWorkers = 2;
context.workerID = 0;   // this client reads partition 0 of the 2 partitions

ModifiableSolrParams params = new ModifiableSolrParams();
params.set("qt", "/export");
params.set("partitionKeys", "field1,field2");
params.set("sort", "field1 asc, field2 asc");
params.set("q", "some query");

// shardEndpoint is a placeholder for the base URL of the shard replica,
// e.g. "http://host:8983/solr/collection1_shard1_replica1"
SolrStream solrStream = new SolrStream(shardEndpoint, params);
solrStream.setStreamContext(context);
solrStream.open();
try {
    while (true) {
        Tuple tup = solrStream.read();
        if (tup.EOF) {
            break;
        }
        // process the tuple here
    }
} finally {
    solrStream.close();
}

*On client 2:*

SolrClientCache cache = new SolrClientCache();

StreamContext context = new StreamContext();
context.setSolrClientCache(cache);
context.numWorkers = 2;
context.workerID = 1;   // this client reads partition 1 of the 2 partitions

ModifiableSolrParams params = new ModifiableSolrParams();
params.set("qt", "/export");
params.set("partitionKeys", "field1,field2");
params.set("sort", "field1 asc, field2 asc");
params.set("q", "some query");

// shardEndpoint is a placeholder for the base URL of the shard replica,
// e.g. "http://host:8983/solr/collection1_shard1_replica1"
SolrStream solrStream = new SolrStream(shardEndpoint, params);
solrStream.setStreamContext(context);
solrStream.open();
try {
    while (true) {
        Tuple tup = solrStream.read();
        if (tup.EOF) {
            break;
        }
        // process the tuple here
    }
} finally {
    solrStream.close();
}



In this scenario client1 and client2 each get a partition of the result
set. Notice that the context.workerID attribute is the only difference
between the two requests. You can split a result set across as many
workers as you want by setting the context.numWorkers attribute and
giving each client a distinct workerID from 0 to numWorkers - 1.









Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Jun 12, 2017 at 1:11 PM, Rohit Jain <ro...@esgyn.com> wrote:

> Erick,
>
> I think so, although I may have overlooked something.  The idea is that we
> would make a request to the API from a single client but expect multiple
> streams of results to be returned in parallel to multiple parallel
> processes that we have set up to receive those results from SOLR.  Do these
> interfaces provide that?  This has always been the issue with interfaces
> like JDBC / ODBC as well, since they don't provide a mechanism to consume
> the results in parallel streams.  There is no protocol set up to do that.
> I was just wondering if there was for SOLR and what would be an example of
> that.
>
> Rohit
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Monday, June 12, 2017 11:56 AM
> To: solr-user <so...@lucene.apache.org>
> Subject: Re: Parallel API interface into SOLR
>
> Have you looked at Streaming Aggregation/Streaming Expressions/Parallel
> SQL etc?
>
> Best,
> Erick
>
> On Mon, Jun 12, 2017 at 9:24 AM, Rohit Jain <ro...@esgyn.com> wrote:
> > Hi folks,
> >
> > We have a solution where we would like to connect to SOLR via an API,
> submit a query, and then pre-process the results before we return the
> results to our users.  However, in some cases, it is possible that the
> results being returned by SOLR, in a large distributed cluster deployment,
> are very large.  In these cases, we would like to set up parallel streams,
> so that each parallel SOLR worker feeds directly into one of our processes
> distributed across the cluster.  That way, we can pre-process those results
> in parallel, before we consolidate (and potentially reduce / aggregate) the
> results further for the user, who has a single client connection to our
> solution.  Sort of a MapReduce type scenario where our processes are the
> reducers.  We could consume the results as returned by these SOLR Worker
> processes, or perhaps have them shuffled based on a shard key, before our
> processes would receive them.
> >
> > Any ideas on how this could be done?
> >
> > Rohit Jain
>

Re: Parallel API interface into SOLR

Posted by Joel Bernstein <jo...@gmail.com>.
You can do what you're trying to do by using the SolrStream but it's
complex and not documented. Here is the basic code for having multiple
clients hitting the same shard:

*On client 1:*

SolrClientCache cache = new SolrClientCache();

StreamContext context = new StreamContext();
context.setSolrClientCache(cache);
context.numWorkers = 2;
context.workerID = 0;   // this client reads partition 0 of the 2 partitions

ModifiableSolrParams params = new ModifiableSolrParams();
params.set("qt", "/export");
params.set("partitionKeys", "field1,field2");
params.set("sort", "field1 asc, field2 asc");
params.set("q", "some query");

// shardEndpoint is a placeholder for the base URL of the shard replica,
// e.g. "http://host:8983/solr/collection1_shard1_replica1"
SolrStream solrStream = new SolrStream(shardEndpoint, params);
solrStream.setStreamContext(context);
solrStream.open();
try {
    while (true) {
        Tuple tup = solrStream.read();
        if (tup.EOF) {
            break;
        }
        // process the tuple here
    }
} finally {
    solrStream.close();
}

*On client 2:*

SolrClientCache cache = new SolrClientCache();

StreamContext context = new StreamContext();
context.setSolrClientCache(cache);
context.numWorkers = 2;
context.workerID = 1;   // this client reads partition 1 of the 2 partitions

ModifiableSolrParams params = new ModifiableSolrParams();
params.set("qt", "/export");
params.set("partitionKeys", "field1,field2");
params.set("sort", "field1 asc, field2 asc");
params.set("q", "some query");

// shardEndpoint is a placeholder for the base URL of the shard replica,
// e.g. "http://host:8983/solr/collection1_shard1_replica1"
SolrStream solrStream = new SolrStream(shardEndpoint, params);
solrStream.setStreamContext(context);
solrStream.open();
try {
    while (true) {
        Tuple tup = solrStream.read();
        if (tup.EOF) {
            break;
        }
        // process the tuple here
    }
} finally {
    solrStream.close();
}



In this scenario client1 and client2 each get a partition of the result
set. Notice that the context.workerID attribute is the only difference
between the two requests. You can split a result set across as many
workers as you want by setting the context.numWorkers attribute and
giving each client a distinct workerID from 0 to numWorkers - 1.
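
In Rohit's scenario each distributed pre-processing process would run one of
these loops with its own workerID. Purely to illustrate how the partitioning
scales, here is a rough sketch that opens all of the partitions from a single
JVM with a java.util.concurrent thread pool; the endpoint URL, query, and
field names are placeholders:

String shardEndpoint = "http://host:8983/solr/collection1_shard1_replica1"; // placeholder
int numWorkers = 4;
ExecutorService pool = Executors.newFixedThreadPool(numWorkers);
SolrClientCache cache = new SolrClientCache();

for (int workerID = 0; workerID < numWorkers; workerID++) {
    final int id = workerID;
    pool.submit(() -> {
        StreamContext context = new StreamContext();
        context.setSolrClientCache(cache);
        context.numWorkers = numWorkers;
        context.workerID = id;

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("qt", "/export");
        params.set("partitionKeys", "field1,field2");
        params.set("sort", "field1 asc, field2 asc");
        params.set("q", "some query");

        SolrStream stream = new SolrStream(shardEndpoint, params);
        stream.setStreamContext(context);
        try {
            stream.open();
            while (true) {
                Tuple tup = stream.read();
                if (tup.EOF) {
                    break;
                }
                // hand the tuple to this worker's local pre-processing step
            }
        } finally {
            stream.close();
        }
        return null;
    });
}
pool.shutdown();

Each task reads a disjoint hash partition of the same sorted result set, which
is the parallel consumption pattern asked about in the original question.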









Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Jun 12, 2017 at 1:11 PM, Rohit Jain <ro...@esgyn.com> wrote:

> Erick,
>
> I think so, although I may have overlooked something.  The idea is that we
> would make a request to the API from a single client but expect multiple
> streams of results to be returned in parallel to multiple parallel
> processes that we have set up to receive those results from SOLR.  Do these
> interfaces provide that?  This has always been the issue with interfaces
> like JDBC / ODBC as well, since they don't provide a mechanism to consume
> the results in parallel streams.  There is no protocol set up to do that.
> I was just wondering if there was for SOLR and what would be an example of
> that.
>
> Rohit
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Monday, June 12, 2017 11:56 AM
> To: solr-user <so...@lucene.apache.org>
> Subject: Re: Parallel API interface into SOLR
>
> Have you looked at Streaming Aggregation/Streaming Expressions/Parallel
> SQL etc?
>
> Best,
> Erick
>
> On Mon, Jun 12, 2017 at 9:24 AM, Rohit Jain <ro...@esgyn.com> wrote:
> > Hi folks,
> >
> > We have a solution where we would like to connect to SOLR via an API,
> submit a query, and then pre-process the results before we return the
> results to our users.  However, in some cases, it is possible that the
> results being returned by SOLR, in a large distributed cluster deployment,
> are very large.  In these cases, we would like to set up parallel streams,
> so that each parallel SOLR worker feeds directly into one of our processes
> distributed across the cluster.  That way, we can pre-process those results
> in parallel, before we consolidate (and potentially reduce / aggregate) the
> results further for the user, who has a single client connection to our
> solution.  Sort of a MapReduce type scenario where our processes are the
> reducers.  We could consume the results as returned by these SOLR Worker
> processes, or perhaps have them shuffled based on a shard key, before our
> processes would receive them.
> >
> > Any ideas on how this could be done?
> >
> > Rohit Jain
>

RE: Parallel API interface into SOLR

Posted by Rohit Jain <ro...@esgyn.com>.
Erick,

I think so, although I may have overlooked something.  The idea is that we would make a request to the API from a single client but expect multiple streams of results to be returned in parallel to multiple parallel processes that we have set up to receive those results from SOLR.  Do these interfaces provide that?  This has always been the issue with interfaces like JDBC / ODBC as well, since they don't provide a mechanism to consume the results in parallel streams.  There is no protocol set up to do that.  I was just wondering if there was for SOLR and what would be an example of that.

Rohit

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Monday, June 12, 2017 11:56 AM
To: solr-user <so...@lucene.apache.org>
Subject: Re: Parallel API interface into SOLR

Have you looked at Streaming Aggregation/Streaming Expressions/Parallel SQL etc?

Best,
Erick

On Mon, Jun 12, 2017 at 9:24 AM, Rohit Jain <ro...@esgyn.com> wrote:
> Hi folks,
>
> We have a solution where we would like to connect to SOLR via an API, submit a query, and then pre-process the results before we return the results to our users.  However, in some cases, it is possible that the results being returned by SOLR, in a large distributed cluster deployment, are very large.  In these cases, we would like to set up parallel streams, so that each parallel SOLR worker feeds directly into one of our processes distributed across the cluster.  That way, we can pre-process those results in parallel, before we consolidate (and potentially reduce / aggregate) the results further for the user, who has a single client connection to our solution.  Sort of a MapReduce type scenario where our processes are the reducers.  We could consume the results as returned by these SOLR Worker processes, or perhaps have them shuffled based on a shard key, before our processes would receive them.
>
> Any ideas on how this could be done?
>
> Rohit Jain

Re: Parallel API interface into SOLR

Posted by Erick Erickson <er...@gmail.com>.
Have you looked at Streaming Aggregation/Streaming Expressions/Parallel SQL etc?

Best,
Erick
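
For reference, the streaming-expression flavor of this is the parallel()
function, which pushes an inner expression out to N worker nodes and shuffles
tuples to them by partitionKeys, roughly the declarative form of what the
SolrStream code elsewhere in this thread does by hand. A sketch, in which the
data collection (mydata), the worker collection (workers), the zkHost, and the
field names are all placeholder assumptions:

parallel(workers,
         search(mydata,
                qt="/export",
                q="some query",
                fl="field1,field2",
                sort="field1 asc, field2 asc",
                partitionKeys="field1,field2"),
         workers="4",
         zkHost="localhost:9983",
         sort="field1 asc, field2 asc")

Sent to a collection's /stream handler, this runs the inner search on four
workers in parallel, each worker receiving its own hash partition of the
result set.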

On Mon, Jun 12, 2017 at 9:24 AM, Rohit Jain <ro...@esgyn.com> wrote:
> Hi folks,
>
> We have a solution where we would like to connect to SOLR via an API, submit a query, and then pre-process the results before we return the results to our users.  However, in some cases, it is possible that the results being returned by SOLR, in a large distributed cluster deployment, are very large.  In these cases, we would like to set up parallel streams, so that each parallel SOLR worker feeds directly into one of our processes distributed across the cluster.  That way, we can pre-process those results in parallel, before we consolidate (and potentially reduce / aggregate) the results further for the user, who has a single client connection to our solution.  Sort of a MapReduce type scenario where our processes are the reducers.  We could consume the results as returned by these SOLR Worker processes, or perhaps have them shuffled based on a shard key, before our processes would receive them.
>
> Any ideas on how this could be done?
>
> Rohit Jain