Posted to solr-user@lucene.apache.org by Joe Obernberger <jo...@gmail.com> on 2017/08/14 23:53:34 UTC

Classify stream expression questions

Hi All - I'm using the classify stream expression and the results 
returned are always limited to 1,000.  Where do I specify the number to 
return?  The stream expression that I'm using looks like:

classify(model(models,id="MODEL1014",cacheMillis=5000),
         search(COL,df="FULL_DOCUMENT",
                q="Collection:(COLLECT2000) AND DocTimestamp:[2017-08-14T04:00:00Z TO 2017-08-15T03:59:00Z]",
                fl="id,score",sort="id asc"),
         field="ClusterText")

When I read this (code snippet):

             stream.open();
             while (true) {
                 Tuple tuple = stream.read();
                 if (tuple.EOF) {          // the EOF tuple marks the end of the stream
                     break;
                 }
                 Double probability = (Double) tuple.fields.get("probability_d");
                 String docID = (String) tuple.fields.get("id");

I get back 1,000 results.  Another question is if there is a way to 
parallelize the classify call to other worker nodes?  Thank you!

-Joe
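
A minimal, self-contained sketch of how a stream like the one above is
typically built, opened, and read from SolrJ, for anyone following along.
This is only an illustration, not the code from the message above: the node
URL and the shortened query are placeholders, and it assumes the SolrStream
and Tuple classes from the solrj.io streaming package.

    import java.io.IOException;

    import org.apache.solr.client.solrj.io.Tuple;
    import org.apache.solr.client.solrj.io.stream.SolrStream;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class ClassifyStreamExample {
        public static void main(String[] args) throws IOException {
            // Placeholder node URL; point this at any node hosting the target collection.
            String baseUrl = "http://localhost:8983/solr/COL";

            // The streaming expression goes in "expr" and is sent to the /stream handler,
            // matching the parameters shown later in this thread.
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("qt", "/stream");
            params.set("expr",
                "classify(model(models,id=\"MODEL1014\",cacheMillis=5000),"
              + "search(COL,df=\"FULL_DOCUMENT\",q=\"Collection:(COLLECT2000)\","
              + "fl=\"id,score,ClusterText\",sort=\"id asc\"),field=\"ClusterText\")");

            SolrStream stream = new SolrStream(baseUrl, params);
            try {
                stream.open();
                while (true) {
                    Tuple tuple = stream.read();
                    if (tuple.EOF) {          // the EOF tuple marks the end of the stream
                        break;
                    }
                    Double probability = tuple.getDouble("probability_d");
                    String docID = tuple.getString("id");
                    System.out.println(docID + " -> " + probability);
                }
            } finally {
                stream.close();               // release the underlying HTTP resources
            }
        }
    }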


Re: Classify stream expression questions

Posted by Joe Obernberger <jo...@gmail.com>.
Thank you Joel - I'm using a ModifiableSolrParams object to build the 
parameters for Solr (hope this is what you want)

toString() returns:

expr=classify(model(models,id%3D"MODEL1014",cacheMillis%3D5000),search(COL,df%3D"FULL_DOCUMENT",q%3D"Collection:(COLLECT2000)+AND+DocTimestamp:[2017-08-14T04:00:00Z+TO+2017-08-16T03:59:00Z]",fl%3D"id,score",sort%3D"id+asc"),field%3D"ClusterText")&qt=/stream&explain=true&fl=id&sort=id+asc&rows=100

This collection has 100 shards with 3 replicas each, so I would expect
100*20 = 2000 results?  Although I'm classifying on ClusterText, for the
results I only need an ID.  At present, I can build a model and classify
a single document, or a set of documents, as they come into the system.
However, if I want to use a model as a search, then I'm asking Solr to
classify a lot of docs, but I actually only want to return docs that
have a probability of n or higher.
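
(One way to express that kind of cutoff, assuming the having() stream
decorator and its gt() comparison are available in the Solr release being
used, would be to wrap the classify expression. The 0.9 threshold is only a
placeholder, and ClusterText is added to fl per the note about the field
list later in the thread:

    having(classify(model(models,id="MODEL1014",cacheMillis=5000),
                    search(COL,df="FULL_DOCUMENT",
                           q="Collection:(COLLECT2000) AND DocTimestamp:[2017-08-14T04:00:00Z TO 2017-08-16T03:59:00Z]",
                           fl="id,score,ClusterText",sort="id asc"),
                    field="ClusterText"),
           gt(probability_d, 0.9))
)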

-Joe


On 8/14/2017 10:46 PM, Joel Bernstein wrote:
> My math was off again ... If you have 20 results from 50 shards that would
> produce the 1000 results.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, Aug 14, 2017 at 10:17 PM, Joel Bernstein <jo...@gmail.com> wrote:
>
>> Actually my math was off. You would need 200 shards to get to 1000 results.
>> How many shards do you have?
>>
>> The expression you provided also didn't include the ClusterText field in the
>> field list of the search. So perhaps it's missing other parameters.
>>
>> If you include all the parameters I may be able to spot the issue.
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Mon, Aug 14, 2017 at 10:10 PM, Joel Bernstein <jo...@gmail.com>
>> wrote:
>>
>>> It looks like you just need to set the rows parameter in the search
>>> expression. If you don't set rows, the default will be 20 I believe, which
>>> will pull the top 20 docs from each shard. If you have 5 shards, then the
>>> 1000 results would make sense.
>>>
>>> You can parallelize the whole expression by wrapping it in a parallel
>>> expression. You'll need to set the partitionKeys in the search expression
>>> to do this.
>>>
>>> If you have a large number of records to process I would recommend batch
>>> processing. This blog explains the parallel batch framework:
>>>
>>> http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html
>>>
>>> Joel Bernstein
>>> http://joelsolr.blogspot.com/
>>>
>>> On Mon, Aug 14, 2017 at 7:53 PM, Joe Obernberger <
>>> joseph.obernberger@gmail.com> wrote:
>>>
>>>> Hi All - I'm using the classify stream expression and the results
>>>> returned are always limited to 1,000.  Where do I specify the number to
>>>> return?  The stream expression that I'm using looks like:
>>>>
>>>> classify(model(models,id="MODEL1014",cacheMillis=5000),
>>>>          search(COL,df="FULL_DOCUMENT",
>>>>                 q="Collection:(COLLECT2000) AND DocTimestamp:[2017-08-14T04:00:00Z TO 2017-08-15T03:59:00Z]",
>>>>                 fl="id,score",sort="id asc"),
>>>>          field="ClusterText")
>>>>
>>>> When I read this (code snippet):
>>>>
>>>>               stream.open();
>>>>              while (true) {
>>>>                  Tuple tuple = stream.read();
>>>>                  if (tuple.EOF) {
>>>>                      break;
>>>>                  }
>>>>                  Double probability = (Double) tuple.fields.get("probability_d");
>>>>                  String docID = (String) tuple.fields.get("id");
>>>>
>>>> I get back 1,000 results.  Another question is if there is a way to
>>>> parallelize the classify call to other worker nodes?  Thank you!
>>>>
>>>> -Joe
>>>>
>>>>
>


Re: Classify stream expression questions

Posted by Joel Bernstein <jo...@gmail.com>.
My math was off again ... If you have 20 results from 50 shards that would
produce the 1000 results.

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Aug 14, 2017 at 10:17 PM, Joel Bernstein <jo...@gmail.com> wrote:

> Actually my math was off. You would need 200 shards to get to 1000 results.
> How many shards do you have?
>
> The expression you provided also didn't include the ClusterText field in the
> field list of the search. So perhaps it's missing other parameters.
>
> If you include all the parameters I may be able to spot the issue.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, Aug 14, 2017 at 10:10 PM, Joel Bernstein <jo...@gmail.com>
> wrote:
>
>> It looks like you just need to set the rows parameter in the search
>> expression. If you don't set rows, the default will be 20 I believe, which
>> will pull the top 20 docs from each shard. If you have 5 shards, then the
>> 1000 results would make sense.
>>
>> You can parallelize the whole expression by wrapping it in a parallel
>> expression. You'll need to set the partitionKeys in the search expression
>> to do this.
>>
>> If you have a large number of records to process I would recommend batch
>> processing. This blog explains the parallel batch framework:
>>
>> http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Mon, Aug 14, 2017 at 7:53 PM, Joe Obernberger <
>> joseph.obernberger@gmail.com> wrote:
>>
>>> Hi All - I'm using the classify stream expression and the results
>>> returned are always limited to 1,000.  Where do I specify the number to
>>> return?  The stream expression that I'm using looks like:
>>>
>>> classify(model(models,id="MODEL1014",cacheMillis=5000),
>>>          search(COL,df="FULL_DOCUMENT",
>>>                 q="Collection:(COLLECT2000) AND DocTimestamp:[2017-08-14T04:00:00Z TO 2017-08-15T03:59:00Z]",
>>>                 fl="id,score",sort="id asc"),
>>>          field="ClusterText")
>>>
>>> When I read this (code snippet):
>>>
>>>              stream.open();
>>>             while (true) {
>>>                 Tuple tuple = stream.read();
>>>                 if (tuple.EOF) {
>>>                     break;
>>>                 }
>>>                 Double probability = (Double) tuple.fields.get("probability_d");
>>>                 String docID = (String) tuple.fields.get("id");
>>>
>>> I get back 1,000 results.  Another question is if there is a way to
>>> parallelize the classify call to other worker nodes?  Thank you!
>>>
>>> -Joe
>>>
>>>
>>
>

Re: Classify stream expression questions

Posted by Joel Bernstein <jo...@gmail.com>.
Actually my math was off. You would need 200 shards to get to 1000 results.
How many shards do you have?

The expression you provided also didn't include the ClusterText field in the
field list of the search. So perhaps it's missing other parameters.
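
For example, since classify() reads the text it analyzes from the tuples
produced by the underlying stream, the inner search would presumably need
something like:

    search(COL,df="FULL_DOCUMENT",
           q="Collection:(COLLECT2000) AND DocTimestamp:[2017-08-14T04:00:00Z TO 2017-08-15T03:59:00Z]",
           fl="id,score,ClusterText",sort="id asc")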

If you include all the parameters I may be able to spot the issue.

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Aug 14, 2017 at 10:10 PM, Joel Bernstein <jo...@gmail.com> wrote:

> It looks like you just need to set the rows parameter in the search
> expression. If you don't set rows, the default will be 20 I believe, which
> will pull the top 20 docs from each shard. If you have 5 shards, then the
> 1000 results would make sense.
>
> You can parallelize the whole expression by wrapping it in a parallel
> expression. You'll need to set the partitionKeys in the search expression
> to do this.
>
> If you have a large number of records to process I would recommend batch
> processing. This blog explains the parallel batch framework:
>
> http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, Aug 14, 2017 at 7:53 PM, Joe Obernberger <
> joseph.obernberger@gmail.com> wrote:
>
>> Hi All - I'm using the classify stream expression and the results
>> returned are always limited to 1,000.  Where do I specify the number to
>> return?  The stream expression that I'm using looks like:
>>
>> classify(model(models,id="MODEL1014",cacheMillis=5000),
>>          search(COL,df="FULL_DOCUMENT",
>>                 q="Collection:(COLLECT2000) AND DocTimestamp:[2017-08-14T04:00:00Z TO 2017-08-15T03:59:00Z]",
>>                 fl="id,score",sort="id asc"),
>>          field="ClusterText")
>>
>> When I read this (code snippet):
>>
>>              stream.open();
>>             while (true) {
>>                 Tuple tuple = stream.read();
>>                 if (tuple.EOF) {
>>                     break;
>>                 }
>>                 Double probability = (Double) tuple.fields.get("probability_d");
>>                 String docID = (String) tuple.fields.get("id");
>>
>> I get back 1,000 results.  Another question is if there is a way to
>> parallelize the classify call to other worker nodes?  Thank you!
>>
>> -Joe
>>
>>
>

Re: Classify stream expression questions

Posted by Joel Bernstein <jo...@gmail.com>.
It looks like you just need to set the rows parameter in the search
expression. If you don't set rows, the default will be 20 I believe, which
will pull the top 20 docs from each shard. If you have 5 shards, then the
1000 results would make sense.
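
As a sketch based on the expression earlier in the thread (the rows value is
just a placeholder for how many documents to pull per shard, and ClusterText
is added to fl as discussed in the reply above):

    classify(model(models,id="MODEL1014",cacheMillis=5000),
             search(COL,df="FULL_DOCUMENT",
                    q="Collection:(COLLECT2000) AND DocTimestamp:[2017-08-14T04:00:00Z TO 2017-08-15T03:59:00Z]",
                    fl="id,score,ClusterText",sort="id asc",rows=10000),
             field="ClusterText")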

You can parallelize the whole expression by wrapping it in a parallel
expression. You'll need to set the partitionKeys in the search expression
to do this.
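
A rough sketch of what the parallel form could look like; the worker
collection (here just COL again), workers=4, and partitioning on id are all
placeholders, and it assumes the parallel() decorator with partitionKeys set
on the inner search:

    parallel(COL,
             classify(model(models,id="MODEL1014",cacheMillis=5000),
                      search(COL,df="FULL_DOCUMENT",
                             q="Collection:(COLLECT2000) AND DocTimestamp:[2017-08-14T04:00:00Z TO 2017-08-15T03:59:00Z]",
                             fl="id,score,ClusterText",sort="id asc",
                             partitionKeys="id",rows=10000),
                      field="ClusterText"),
             workers=4, sort="id asc")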

If you have a large number of records to process I would recommend batch
processing. This blog explains the parallel batch framework:

http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Aug 14, 2017 at 7:53 PM, Joe Obernberger <
joseph.obernberger@gmail.com> wrote:

> Hi All - I'm using the classify stream expression and the results returned
> are always limited to 1,000.  Where do I specify the number to return?  The
> stream expression that I'm using looks like:
>
> classify(model(models,id="MODEL1014",cacheMillis=5000),
>          search(COL,df="FULL_DOCUMENT",
>                 q="Collection:(COLLECT2000) AND DocTimestamp:[2017-08-14T04:00:00Z TO 2017-08-15T03:59:00Z]",
>                 fl="id,score",sort="id asc"),
>          field="ClusterText")
>
> When I read this (code snippet):
>
>              stream.open();
>             while (true) {
>                 Tuple tuple = stream.read();
>                 if (tuple.EOF) {
>                     break;
>                 }
>                 Double probability = (Double) tuple.fields.get("probability_d");
>                 String docID = (String) tuple.fields.get("id");
>
> I get back 1,000 results.  Another question is if there is a way to
> parallelize the classify call to other worker nodes?  Thank you!
>
> -Joe
>
>