Posted to solr-user@lucene.apache.org by Chetas Joshi <ch...@gmail.com> on 2016/11/05 00:48:37 UTC

Parallelize Cursor approach

Hi,

I am using the cursor approach to fetch results from Solr (5.5.0). Most of
my queries return millions of results. Is there a way I can read the pages
in parallel? Is there a way I can get all the cursors well in advance?

Let's say my query returns 2M documents and I have set rows=100,000.
Can I have multiple threads iterating over different pages like
Thread1 -> docs 1 to 100K
Thread2 -> docs 101K to 200K
......
......

For this to happen, can I get all the cursorMarks for a given query so that
I can leverage the following code in parallel?

cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
val rsp: QueryResponse = c.query(cursorQ)
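
For context, here is roughly the single-threaded loop I am trying to
parallelize (a Scala/SolrJ sketch; the ZooKeeper address, collection name and
"id" uniqueKey below are placeholders):

import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.CloudSolrClient
import org.apache.solr.client.solrj.response.QueryResponse
import org.apache.solr.common.params.CursorMarkParams

val c = new CloudSolrClient("zk1:2181,zk2:2181/solr")  // placeholder ZK ensemble
c.setDefaultCollection("myCollection")                 // placeholder collection

val cursorQ = new SolrQuery("*:*")
cursorQ.setRows(100000)
cursorQ.setSort("id", SolrQuery.ORDER.asc)  // cursors need a sort ending on the uniqueKey

var cursorMark = CursorMarkParams.CURSOR_MARK_START
var done = false
while (!done) {
  cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
  val rsp: QueryResponse = c.query(cursorQ)
  // the next mark only becomes known once this page has been fetched
  val next = rsp.getNextCursorMark
  // ... process rsp.getResults here ...
  done = (next == cursorMark)  // an unchanged mark means the result set is exhausted
  cursorMark = next
}
c.close()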

Thank you,
Chetas.

Re: Parallelize Cursor approach

Posted by Chetas Joshi <ch...@gmail.com>.
I got it when you said to form N queries. I just wanted to try the "get all
cursorMarks first" approach, but I realized it would be very inefficient, as
you said: a cursor mark is the serialized version of the last sort value
received, so Solr still has to walk the whole result set even with "fl" ->
null.

I wanted to try that approach because I need everything sorted. With N
separate queries I will have to merge-sort the N result sets myself, but that
should still be far better than the first approach I tried.
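
Roughly, I expect the merge to look something like this (a Scala sketch; the
per-partition results are simplified to iterators of ids, each already sorted
within its own partition):

import scala.collection.mutable

def mergeSorted(partitions: IndexedSeq[Iterator[String]]): Iterator[String] =
  new Iterator[String] {
    // min-heap keyed on the current head element of each partition
    private val ord = Ordering.by[(String, Int), String](_._1).reverse
    private val heap = mutable.PriorityQueue.empty[(String, Int)](ord)
    partitions.zipWithIndex.foreach { case (it, i) =>
      if (it.hasNext) heap.enqueue((it.next(), i))
    }

    def hasNext: Boolean = heap.nonEmpty

    def next(): String = {
      val (value, i) = heap.dequeue()
      if (partitions(i).hasNext) heap.enqueue((partitions(i).next(), i))  // refill
      value
    }
  }

Calling mergeSorted(Vector(resultsOfQuery1, resultsOfQuery2, ...)) would then
yield ids in global sort order while only buffering one element per partition.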

Thanks!

On Mon, Nov 14, 2016 at 3:58 PM, Erick Erickson <er...@gmail.com>
wrote:

> You're executing all the queries to parallelize before even starting.
> Seems very inefficient. My suggestion doesn't require this first step.
> Perhaps it was confusing because I mentioned "your own cursorMark".
> Really I meant bypass that entirely, just form N queries that were
> restricted to N disjoint subsets of the data and process them all in
> parallel, either with /export or /select.
>
> Best,
> Erick
>
> On Mon, Nov 14, 2016 at 3:53 PM, Chetas Joshi <ch...@gmail.com>
> wrote:
> > Thanks Joel for the explanation.
> >
> > Hi Erick,
> >
> > One of the ways I am trying to parallelize the cursor approach is by
> > iterating the result set twice.
> > (1) Once just to get all the cursor marks
> >
> > val q: SolrQuery = new solrj.SolrQuery()
> > q.set("q", query)
> > q.add("fq", query)
> > q.add("rows", batchSize.toString)
> > q.add("collection", collection)
> > q.add("fl", "null")
> > q.add("sort", "id asc")
> >
> > Here I am not asking for any field values ( "fl" -> null )
> >
> > (2) Once I get all the cursor marks, I can start parallel threads to get
> > the results in parallel.
> >
> > However, the first step in fact takes a lot of time. Even more than when
> I
> > would actually iterate through the results with "fl" -> field1, field2,
> > field3
> >
> > Why is this happening?
> >
> > Thanks!
> >
> >
> > On Thu, Nov 10, 2016 at 8:22 PM, Joel Bernstein <jo...@gmail.com>
> wrote:
> >
> >> Solr 5 was very early days for Streaming Expressions. Streaming
> Expressions
> >> and SQL use Java 8 so development switched to the 6.0 branch five months
> >> before the 6.0 release. So there was a very large jump in features and
> bug
> >> fixes from Solr 5 to Solr 6 in Streaming Expressions.
> >>
> >> Joel Bernstein
> >> http://joelsolr.blogspot.com/
> >>
> >> On Thu, Nov 10, 2016 at 11:14 PM, Joel Bernstein <jo...@gmail.com>
> >> wrote:
> >>
> >> > In Solr 5 the /export handler wasn't escaping json text fields, which
> >> > would produce json parse exceptions. This was fixed in Solr 6.0.
> >> >
> >> > Joel Bernstein
> >> > http://joelsolr.blogspot.com/
> >> >
> >> > On Tue, Nov 8, 2016 at 6:17 PM, Erick Erickson <
> erickerickson@gmail.com>
> >> > wrote:
> >> >
> >> >> Hmm, that should work fine. Let us know what the logs show if
> anything
> >> >> because this is weird.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Tue, Nov 8, 2016 at 1:00 PM, Chetas Joshi <chetas.joshi@gmail.com
> >
> >> >> wrote:
> >> >> > Hi Erick,
> >> >> >
> >> >> > This is how I use the streaming approach.
> >> >> >
> >> >> > Here is the solrconfig block.
> >> >> >
> >> >> > <requestHandler name="/export" class="solr.SearchHandler">
> >> >> >     <lst name="invariants">
> >> >> >         <str name="rq">{!xport}</str>
> >> >> >         <str name="wt">xsort</str>
> >> >> >         <str name="distrib">false</str>
> >> >> >     </lst>
> >> >> >     <arr name="components">
> >> >> >         <str>query</str>
> >> >> >     </arr>
> >> >> > </requestHandler>
> >> >> >
> >> >> > And here is the code in which SolrJ is being used.
> >> >> >
> >> >> > String zkHost = args[0];
> >> >> > String collection = args[1];
> >> >> >
> >> >> > Map props = new HashMap();
> >> >> > props.put("q", "*:*");
> >> >> > props.put("qt", "/export");
> >> >> > props.put("sort", "fieldA asc");
> >> >> > props.put("fl", "fieldA,fieldB,fieldC");
> >> >> >
> >> >> > CloudSolrStream cloudstream = new CloudSolrStream(zkHost,collect
> >> >> ion,props);
> >> >> >
> >> >> > And then I iterate through the cloud stream (TupleStream).
> >> >> > So I am using streaming expressions (SolrJ).
> >> >> >
> >> >> > I have not looked at the solr logs while I started getting the JSON
> >> >> parsing
> >> >> > exceptions. But I will let you know what I see the next time I run
> >> into
> >> >> the
> >> >> > same exceptions.
> >> >> >
> >> >> > Thanks
> >> >> >
> >> >> > On Sat, Nov 5, 2016 at 9:32 PM, Erick Erickson <
> >> erickerickson@gmail.com
> >> >> >
> >> >> > wrote:
> >> >> >
> >> >> >> Hmmm, export is supposed to handle 10s of million result sets. I
> know
> >> >> >> of a situation where the Streaming Aggregation functionality back
> >> >> >> ported to Solr 4.10 processes on that scale. So do you have any
> clue
> >> >> >> what exactly is failing? Is there anything in the Solr logs?
> >> >> >>
> >> >> >> _How_ are you using /export, through Streaming Aggregation
> (SolrJ) or
> >> >> >> just the raw xport handler? It might be worth trying to do this
> from
> >> >> >> SolrJ if you're not, it should be a very quick program to write,
> just
> >> >> >> to test we're talking 100 lines max.
> >> >> >>
> >> >> >> You could always roll your own cursor mark stuff by partitioning
> the
> >> >> >> data amongst N threads/processes if you have any reasonable
> >> >> >> expectation that you could form filter queries that partition the
> >> >> >> result set anywhere near evenly.
> >> >> >>
> >> >> >> For example, let's say you have a field with random numbers
> between 0
> >> >> >> and 100. You could spin off 10 cursorMark-aware processes each
> with
> >> >> >> its own fq clause like
> >> >> >>
> >> >> >> fq=partition_field:[0 TO 10}
> >> >> >> fq=[10 TO 20}
> >> >> >> ....
> >> >> >> fq=[90 TO 100]
> >> >> >>
> >> >> >> Note the use of inclusive/exclusive end points....
> >> >> >>
> >> >> >> Each one would be totally independent of all others with no
> >> >> >> overlapping documents. And since the fq's would presumably be
> cached
> >> >> >> you should be able to go as fast as you can drive your cluster. Of
> >> >> >> course you lose query-wide sorting and the like, if that's
> important
> >> >> >> you'd need to figure something out there.
> >> >> >>
> >> >> >> Do be aware of a potential issue. When regular doc fields are
> >> >> >> returned, for each document returned, a 16K block of data will be
> >> >> >> decompressed to get the stored field data. Streaming Aggregation
> >> >> >> (/xport) reads docValues entries which are held in MMapDirectory
> >> space
> >> >> >> so will be much, much faster. As of Solr 5.5. You can override the
> >> >> >> decompression stuff, see:
> >> >> >> https://issues.apache.org/jira/browse/SOLR-8220 for fields that
> are
> >> >> >> both stored and docvalues...
> >> >> >>
> >> >> >> Best,
> >> >> >> Erick
> >> >> >>
> >> >> >> On Sat, Nov 5, 2016 at 6:41 PM, Chetas Joshi <
> chetas.joshi@gmail.com
> >> >
> >> >> >> wrote:
> >> >> >> > Thanks Yonik for the explanation.
> >> >> >> >
> >> >> >> > Hi Erick,
> >> >> >> > I was using the /xport functionality. But it hasn't been stable
> >> (Solr
> >> >> >> > 5.5.0). I started running into run time Exceptions (JSON parsing
> >> >> >> > exceptions) while reading the stream of Tuples. This started
> >> >> happening as
> >> >> >> > the size of my collection increased 3 times and I started
> running
> >> >> queries
> >> >> >> > that return millions of documents (>10mm). I don't know if it is
> >> the
> >> >> >> query
> >> >> >> > result size or the actual data size (total number of docs in the
> >> >> >> > collection) that is causing the instability.
> >> >> >> >
> >> >> >> > org.noggit.JSONParser$ParseException: Expected ',' or '}':
> >> >> >> > char=5,position=110938 BEFORE='uuid":"0lG99s8vyaKB2I/
> >> >> >> > I","space":"uuid","timestamp":1 5' AFTER='DB6
> 474294954},{"uuid":"
> >> >> >> > 0lG99sHT8P5e'
> >> >> >> >
> >> >> >> > I won't be able to move to Solr 6.0 due to some constraints in
> our
> >> >> >> > production environment and hence moving back to the cursor
> >> approach.
> >> >> Do
> >> >> >> you
> >> >> >> > have any other suggestion for me?
> >> >> >> >
> >> >> >> > Thanks,
> >> >> >> > Chetas.
> >> >> >> >
> >> >> >> > On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson <
> >> >> erickerickson@gmail.com
> >> >> >> >
> >> >> >> > wrote:
> >> >> >> >
> >> >> >> >> Have you considered the /xport functionality?
> >> >> >> >>
> >> >> >> >> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley <
> yseeley@gmail.com>
> >> >> wrote:
> >> >> >> >> > No, you can't get cursor-marks ahead of time.
> >> >> >> >> > They are the serialized representation of the last sort
> values
> >> >> >> >> > encountered (hence not known ahead of time).
> >> >> >> >> >
> >> >> >> >> > -Yonik
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi <
> >> >> chetas.joshi@gmail.com>
> >> >> >> >> wrote:
> >> >> >> >> >> Hi,
> >> >> >> >> >>
> >> >> >> >> >> I am using the cursor approach to fetch results from Solr
> >> >> (5.5.0).
> >> >> >> Most
> >> >> >> >> of
> >> >> >> >> >> my queries return millions of results. Is there a way I can
> >> read
> >> >> the
> >> >> >> >> pages
> >> >> >> >> >> in parallel? Is there a way I can get all the cursors well
> in
> >> >> >> advance?
> >> >> >> >> >>
> >> >> >> >> >> Let's say my query returns 2M documents and I have set
> >> >> rows=100,000.
> >> >> >> >> >> Can I have multiple threads iterating over different pages
> like
> >> >> >> >> >> Thread1 -> docs 1 to 100K
> >> >> >> >> >> Thread2 -> docs 101K to 200K
> >> >> >> >> >> ......
> >> >> >> >> >> ......
> >> >> >> >> >>
> >> >> >> >> >> for this to happen, can I get all the cursorMarks for a
> given
> >> >> query
> >> >> >> so
> >> >> >> >> that
> >> >> >> >> >> I can leverage the following code in parallel
> >> >> >> >> >>
> >> >> >> >> >> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
> >> >> >> >> >> val rsp: QueryResponse = c.query(cursorQ)
> >> >> >> >> >>
> >> >> >> >> >> Thank you,
> >> >> >> >> >> Chetas.
> >> >> >> >>
> >> >> >>
> >> >>
> >> >
> >> >
> >>
>

Re: Parallelize Cursor approach

Posted by Erick Erickson <er...@gmail.com>.
You're effectively executing the entire query just to set up the
parallelization, before even starting, which seems very inefficient. My
suggestion doesn't require this first step. Perhaps it was confusing because
I mentioned "your own cursorMark"; really I meant to bypass cursorMark
entirely: just form N queries restricted to N disjoint subsets of the data
and process them all in parallel, either with /export or /select.
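
A rough sketch of the idea (Scala; partition_field, the ZooKeeper address,
collection name and thread count are all placeholders, and each slice could
just as well be walked with a cursor or consumed via /export):

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.CloudSolrClient
import org.apache.solr.common.SolrDocumentList

val pool = Executors.newFixedThreadPool(10)  // one thread per slice
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

val c = new CloudSolrClient("zk1:2181,zk2:2181/solr")
c.setDefaultCollection("myCollection")

// ten disjoint ranges over partition_field, exclusive upper bound except the last
val ranges = (0 until 100 by 10).map { lo =>
  val hi = lo + 10
  if (hi == 100) s"partition_field:[$lo TO $hi]" else s"partition_field:[$lo TO $hi}"
}

val futures: Seq[Future[SolrDocumentList]] = ranges.map { fq =>
  Future {
    val q = new SolrQuery("*:*")
    q.addFilterQuery(fq)  // restricts this query to one disjoint slice
    q.setRows(100000)     // or page/cursor through the slice instead of one big fetch
    c.query(q).getResults
  }
}

val perSlice = futures.map(f => Await.result(f, Duration.Inf))
c.close()
pool.shutdown()

Because the fq ranges are disjoint, no document shows up in more than one
slice, and the slices can be processed completely independently.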

Best,
Erick

On Mon, Nov 14, 2016 at 3:53 PM, Chetas Joshi <ch...@gmail.com> wrote:
> Thanks Joel for the explanation.
>
> Hi Erick,
>
> One of the ways I am trying to parallelize the cursor approach is by
> iterating the result set twice.
> (1) Once just to get all the cursor marks
>
> val q: SolrQuery = new solrj.SolrQuery()
> q.set("q", query)
> q.add("fq", query)
> q.add("rows", batchSize.toString)
> q.add("collection", collection)
> q.add("fl", "null")
> q.add("sort", "id asc")
>
> Here I am not asking for any field values ( "fl" -> null )
>
> (2) Once I get all the cursor marks, I can start parallel threads to get
> the results in parallel.
>
> However, the first step in fact takes a lot of time. Even more than when I
> would actually iterate through the results with "fl" -> field1, field2,
> field3
>
> Why is this happening?
>
> Thanks!
>
>
> On Thu, Nov 10, 2016 at 8:22 PM, Joel Bernstein <jo...@gmail.com> wrote:
>
>> Solr 5 was very early days for Streaming Expressions. Streaming Expressions
>> and SQL use Java 8 so development switched to the 6.0 branch five months
>> before the 6.0 release. So there was a very large jump in features and bug
>> fixes from Solr 5 to Solr 6 in Streaming Expressions.
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Thu, Nov 10, 2016 at 11:14 PM, Joel Bernstein <jo...@gmail.com>
>> wrote:
>>
>> > In Solr 5 the /export handler wasn't escaping json text fields, which
>> > would produce json parse exceptions. This was fixed in Solr 6.0.
>> >
>> > Joel Bernstein
>> > http://joelsolr.blogspot.com/
>> >
>> > On Tue, Nov 8, 2016 at 6:17 PM, Erick Erickson <er...@gmail.com>
>> > wrote:
>> >
>> >> Hmm, that should work fine. Let us know what the logs show if anything
>> >> because this is weird.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Tue, Nov 8, 2016 at 1:00 PM, Chetas Joshi <ch...@gmail.com>
>> >> wrote:
>> >> > Hi Erick,
>> >> >
>> >> > This is how I use the streaming approach.
>> >> >
>> >> > Here is the solrconfig block.
>> >> >
>> >> > <requestHandler name="/export" class="solr.SearchHandler">
>> >> >     <lst name="invariants">
>> >> >         <str name="rq">{!xport}</str>
>> >> >         <str name="wt">xsort</str>
>> >> >         <str name="distrib">false</str>
>> >> >     </lst>
>> >> >     <arr name="components">
>> >> >         <str>query</str>
>> >> >     </arr>
>> >> > </requestHandler>
>> >> >
>> >> > And here is the code in which SolrJ is being used.
>> >> >
>> >> > String zkHost = args[0];
>> >> > String collection = args[1];
>> >> >
>> >> > Map props = new HashMap();
>> >> > props.put("q", "*:*");
>> >> > props.put("qt", "/export");
>> >> > props.put("sort", "fieldA asc");
>> >> > props.put("fl", "fieldA,fieldB,fieldC");
>> >> >
>> >> > CloudSolrStream cloudstream = new CloudSolrStream(zkHost,collect
>> >> ion,props);
>> >> >
>> >> > And then I iterate through the cloud stream (TupleStream).
>> >> > So I am using streaming expressions (SolrJ).
>> >> >
>> >> > I have not looked at the solr logs while I started getting the JSON
>> >> parsing
>> >> > exceptions. But I will let you know what I see the next time I run
>> into
>> >> the
>> >> > same exceptions.
>> >> >
>> >> > Thanks
>> >> >
>> >> > On Sat, Nov 5, 2016 at 9:32 PM, Erick Erickson <
>> erickerickson@gmail.com
>> >> >
>> >> > wrote:
>> >> >
>> >> >> Hmmm, export is supposed to handle 10s of million result sets. I know
>> >> >> of a situation where the Streaming Aggregation functionality back
>> >> >> ported to Solr 4.10 processes on that scale. So do you have any clue
>> >> >> what exactly is failing? Is there anything in the Solr logs?
>> >> >>
>> >> >> _How_ are you using /export, through Streaming Aggregation (SolrJ) or
>> >> >> just the raw xport handler? It might be worth trying to do this from
>> >> >> SolrJ if you're not, it should be a very quick program to write, just
>> >> >> to test we're talking 100 lines max.
>> >> >>
>> >> >> You could always roll your own cursor mark stuff by partitioning the
>> >> >> data amongst N threads/processes if you have any reasonable
>> >> >> expectation that you could form filter queries that partition the
>> >> >> result set anywhere near evenly.
>> >> >>
>> >> >> For example, let's say you have a field with random numbers between 0
>> >> >> and 100. You could spin off 10 cursorMark-aware processes each with
>> >> >> its own fq clause like
>> >> >>
>> >> >> fq=partition_field:[0 TO 10}
>> >> >> fq=[10 TO 20}
>> >> >> ....
>> >> >> fq=[90 TO 100]
>> >> >>
>> >> >> Note the use of inclusive/exclusive end points....
>> >> >>
>> >> >> Each one would be totally independent of all others with no
>> >> >> overlapping documents. And since the fq's would presumably be cached
>> >> >> you should be able to go as fast as you can drive your cluster. Of
>> >> >> course you lose query-wide sorting and the like, if that's important
>> >> >> you'd need to figure something out there.
>> >> >>
>> >> >> Do be aware of a potential issue. When regular doc fields are
>> >> >> returned, for each document returned, a 16K block of data will be
>> >> >> decompressed to get the stored field data. Streaming Aggregation
>> >> >> (/xport) reads docValues entries which are held in MMapDirectory
>> space
>> >> >> so will be much, much faster. As of Solr 5.5. You can override the
>> >> >> decompression stuff, see:
>> >> >> https://issues.apache.org/jira/browse/SOLR-8220 for fields that are
>> >> >> both stored and docvalues...
>> >> >>
>> >> >> Best,
>> >> >> Erick
>> >> >>
>> >> >> On Sat, Nov 5, 2016 at 6:41 PM, Chetas Joshi <chetas.joshi@gmail.com
>> >
>> >> >> wrote:
>> >> >> > Thanks Yonik for the explanation.
>> >> >> >
>> >> >> > Hi Erick,
>> >> >> > I was using the /xport functionality. But it hasn't been stable
>> (Solr
>> >> >> > 5.5.0). I started running into run time Exceptions (JSON parsing
>> >> >> > exceptions) while reading the stream of Tuples. This started
>> >> happening as
>> >> >> > the size of my collection increased 3 times and I started running
>> >> queries
>> >> >> > that return millions of documents (>10mm). I don't know if it is
>> the
>> >> >> query
>> >> >> > result size or the actual data size (total number of docs in the
>> >> >> > collection) that is causing the instability.
>> >> >> >
>> >> >> > org.noggit.JSONParser$ParseException: Expected ',' or '}':
>> >> >> > char=5,position=110938 BEFORE='uuid":"0lG99s8vyaKB2I/
>> >> >> > I","space":"uuid","timestamp":1 5' AFTER='DB6 474294954},{"uuid":"
>> >> >> > 0lG99sHT8P5e'
>> >> >> >
>> >> >> > I won't be able to move to Solr 6.0 due to some constraints in our
>> >> >> > production environment and hence moving back to the cursor
>> approach.
>> >> Do
>> >> >> you
>> >> >> > have any other suggestion for me?
>> >> >> >
>> >> >> > Thanks,
>> >> >> > Chetas.
>> >> >> >
>> >> >> > On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson <
>> >> erickerickson@gmail.com
>> >> >> >
>> >> >> > wrote:
>> >> >> >
>> >> >> >> Have you considered the /xport functionality?
>> >> >> >>
>> >> >> >> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley <ys...@gmail.com>
>> >> wrote:
>> >> >> >> > No, you can't get cursor-marks ahead of time.
>> >> >> >> > They are the serialized representation of the last sort values
>> >> >> >> > encountered (hence not known ahead of time).
>> >> >> >> >
>> >> >> >> > -Yonik
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi <
>> >> chetas.joshi@gmail.com>
>> >> >> >> wrote:
>> >> >> >> >> Hi,
>> >> >> >> >>
>> >> >> >> >> I am using the cursor approach to fetch results from Solr
>> >> (5.5.0).
>> >> >> Most
>> >> >> >> of
>> >> >> >> >> my queries return millions of results. Is there a way I can
>> read
>> >> the
>> >> >> >> pages
>> >> >> >> >> in parallel? Is there a way I can get all the cursors well in
>> >> >> advance?
>> >> >> >> >>
>> >> >> >> >> Let's say my query returns 2M documents and I have set
>> >> rows=100,000.
>> >> >> >> >> Can I have multiple threads iterating over different pages like
>> >> >> >> >> Thread1 -> docs 1 to 100K
>> >> >> >> >> Thread2 -> docs 101K to 200K
>> >> >> >> >> ......
>> >> >> >> >> ......
>> >> >> >> >>
>> >> >> >> >> for this to happen, can I get all the cursorMarks for a given
>> >> query
>> >> >> so
>> >> >> >> that
>> >> >> >> >> I can leverage the following code in parallel
>> >> >> >> >>
>> >> >> >> >> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
>> >> >> >> >> val rsp: QueryResponse = c.query(cursorQ)
>> >> >> >> >>
>> >> >> >> >> Thank you,
>> >> >> >> >> Chetas.
>> >> >> >>
>> >> >>
>> >>
>> >
>> >
>>

Re: Parallelize Cursor approach

Posted by Chetas Joshi <ch...@gmail.com>.
Thanks Joel for the explanation.

Hi Erick,

One of the ways I am trying to parallelize the cursor approach is by
iterating the result set twice.
(1) Once just to get all the cursor marks

val q: SolrQuery = new solrj.SolrQuery()
q.set("q", query)
q.add("fq", query)
q.add("rows", batchSize.toString)
q.add("collection", collection)
q.add("fl", "null")
q.add("sort", "id asc")

Here I am not asking for any field values ( "fl" -> null )

(2) Once I get all the cursor marks, I can start parallel threads to get
the results in parallel.
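
Concretely, the first pass amounts to something like this sketch (reusing the
q built above and an assumed SolrJ client c; only the marks are kept):

import scala.collection.mutable.ArrayBuffer
import org.apache.solr.client.solrj.response.QueryResponse
import org.apache.solr.common.params.CursorMarkParams

val marks = ArrayBuffer(CursorMarkParams.CURSOR_MARK_START)
var done = false
while (!done) {
  q.set(CursorMarkParams.CURSOR_MARK_PARAM, marks.last)
  val rsp: QueryResponse = c.query(q)
  val next = rsp.getNextCursorMark
  if (next == marks.last) done = true  // the mark stops changing at the end
  else marks += next
}
// marks now holds the starting cursorMark of every page, in order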

However, the first step in fact takes a lot of time, even more than when I
actually iterate through the results with "fl" -> field1, field2, field3.

Why is this happening?

Thanks!


On Thu, Nov 10, 2016 at 8:22 PM, Joel Bernstein <jo...@gmail.com> wrote:

> Solr 5 was very early days for Streaming Expressions. Streaming Expressions
> and SQL use Java 8 so development switched to the 6.0 branch five months
> before the 6.0 release. So there was a very large jump in features and bug
> fixes from Solr 5 to Solr 6 in Streaming Expressions.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Nov 10, 2016 at 11:14 PM, Joel Bernstein <jo...@gmail.com>
> wrote:
>
> > In Solr 5 the /export handler wasn't escaping json text fields, which
> > would produce json parse exceptions. This was fixed in Solr 6.0.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Tue, Nov 8, 2016 at 6:17 PM, Erick Erickson <er...@gmail.com>
> > wrote:
> >
> >> Hmm, that should work fine. Let us know what the logs show if anything
> >> because this is weird.
> >>
> >> Best,
> >> Erick
> >>
> >> On Tue, Nov 8, 2016 at 1:00 PM, Chetas Joshi <ch...@gmail.com>
> >> wrote:
> >> > Hi Erick,
> >> >
> >> > This is how I use the streaming approach.
> >> >
> >> > Here is the solrconfig block.
> >> >
> >> > <requestHandler name="/export" class="solr.SearchHandler">
> >> >     <lst name="invariants">
> >> >         <str name="rq">{!xport}</str>
> >> >         <str name="wt">xsort</str>
> >> >         <str name="distrib">false</str>
> >> >     </lst>
> >> >     <arr name="components">
> >> >         <str>query</str>
> >> >     </arr>
> >> > </requestHandler>
> >> >
> >> > And here is the code in which SolrJ is being used.
> >> >
> >> > String zkHost = args[0];
> >> > String collection = args[1];
> >> >
> >> > Map props = new HashMap();
> >> > props.put("q", "*:*");
> >> > props.put("qt", "/export");
> >> > props.put("sort", "fieldA asc");
> >> > props.put("fl", "fieldA,fieldB,fieldC");
> >> >
> >> > CloudSolrStream cloudstream = new CloudSolrStream(zkHost,collect
> >> ion,props);
> >> >
> >> > And then I iterate through the cloud stream (TupleStream).
> >> > So I am using streaming expressions (SolrJ).
> >> >
> >> > I have not looked at the solr logs while I started getting the JSON
> >> parsing
> >> > exceptions. But I will let you know what I see the next time I run
> into
> >> the
> >> > same exceptions.
> >> >
> >> > Thanks
> >> >
> >> > On Sat, Nov 5, 2016 at 9:32 PM, Erick Erickson <
> erickerickson@gmail.com
> >> >
> >> > wrote:
> >> >
> >> >> Hmmm, export is supposed to handle 10s of million result sets. I know
> >> >> of a situation where the Streaming Aggregation functionality back
> >> >> ported to Solr 4.10 processes on that scale. So do you have any clue
> >> >> what exactly is failing? Is there anything in the Solr logs?
> >> >>
> >> >> _How_ are you using /export, through Streaming Aggregation (SolrJ) or
> >> >> just the raw xport handler? It might be worth trying to do this from
> >> >> SolrJ if you're not, it should be a very quick program to write, just
> >> >> to test we're talking 100 lines max.
> >> >>
> >> >> You could always roll your own cursor mark stuff by partitioning the
> >> >> data amongst N threads/processes if you have any reasonable
> >> >> expectation that you could form filter queries that partition the
> >> >> result set anywhere near evenly.
> >> >>
> >> >> For example, let's say you have a field with random numbers between 0
> >> >> and 100. You could spin off 10 cursorMark-aware processes each with
> >> >> its own fq clause like
> >> >>
> >> >> fq=partition_field:[0 TO 10}
> >> >> fq=[10 TO 20}
> >> >> ....
> >> >> fq=[90 TO 100]
> >> >>
> >> >> Note the use of inclusive/exclusive end points....
> >> >>
> >> >> Each one would be totally independent of all others with no
> >> >> overlapping documents. And since the fq's would presumably be cached
> >> >> you should be able to go as fast as you can drive your cluster. Of
> >> >> course you lose query-wide sorting and the like, if that's important
> >> >> you'd need to figure something out there.
> >> >>
> >> >> Do be aware of a potential issue. When regular doc fields are
> >> >> returned, for each document returned, a 16K block of data will be
> >> >> decompressed to get the stored field data. Streaming Aggregation
> >> >> (/xport) reads docValues entries which are held in MMapDirectory
> space
> >> >> so will be much, much faster. As of Solr 5.5. You can override the
> >> >> decompression stuff, see:
> >> >> https://issues.apache.org/jira/browse/SOLR-8220 for fields that are
> >> >> both stored and docvalues...
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Sat, Nov 5, 2016 at 6:41 PM, Chetas Joshi <chetas.joshi@gmail.com
> >
> >> >> wrote:
> >> >> > Thanks Yonik for the explanation.
> >> >> >
> >> >> > Hi Erick,
> >> >> > I was using the /xport functionality. But it hasn't been stable
> (Solr
> >> >> > 5.5.0). I started running into run time Exceptions (JSON parsing
> >> >> > exceptions) while reading the stream of Tuples. This started
> >> happening as
> >> >> > the size of my collection increased 3 times and I started running
> >> queries
> >> >> > that return millions of documents (>10mm). I don't know if it is
> the
> >> >> query
> >> >> > result size or the actual data size (total number of docs in the
> >> >> > collection) that is causing the instability.
> >> >> >
> >> >> > org.noggit.JSONParser$ParseException: Expected ',' or '}':
> >> >> > char=5,position=110938 BEFORE='uuid":"0lG99s8vyaKB2I/
> >> >> > I","space":"uuid","timestamp":1 5' AFTER='DB6 474294954},{"uuid":"
> >> >> > 0lG99sHT8P5e'
> >> >> >
> >> >> > I won't be able to move to Solr 6.0 due to some constraints in our
> >> >> > production environment and hence moving back to the cursor
> approach.
> >> Do
> >> >> you
> >> >> > have any other suggestion for me?
> >> >> >
> >> >> > Thanks,
> >> >> > Chetas.
> >> >> >
> >> >> > On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson <
> >> erickerickson@gmail.com
> >> >> >
> >> >> > wrote:
> >> >> >
> >> >> >> Have you considered the /xport functionality?
> >> >> >>
> >> >> >> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley <ys...@gmail.com>
> >> wrote:
> >> >> >> > No, you can't get cursor-marks ahead of time.
> >> >> >> > They are the serialized representation of the last sort values
> >> >> >> > encountered (hence not known ahead of time).
> >> >> >> >
> >> >> >> > -Yonik
> >> >> >> >
> >> >> >> >
> >> >> >> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi <
> >> chetas.joshi@gmail.com>
> >> >> >> wrote:
> >> >> >> >> Hi,
> >> >> >> >>
> >> >> >> >> I am using the cursor approach to fetch results from Solr
> >> (5.5.0).
> >> >> Most
> >> >> >> of
> >> >> >> >> my queries return millions of results. Is there a way I can
> read
> >> the
> >> >> >> pages
> >> >> >> >> in parallel? Is there a way I can get all the cursors well in
> >> >> advance?
> >> >> >> >>
> >> >> >> >> Let's say my query returns 2M documents and I have set
> >> rows=100,000.
> >> >> >> >> Can I have multiple threads iterating over different pages like
> >> >> >> >> Thread1 -> docs 1 to 100K
> >> >> >> >> Thread2 -> docs 101K to 200K
> >> >> >> >> ......
> >> >> >> >> ......
> >> >> >> >>
> >> >> >> >> for this to happen, can I get all the cursorMarks for a given
> >> query
> >> >> so
> >> >> >> that
> >> >> >> >> I can leverage the following code in parallel
> >> >> >> >>
> >> >> >> >> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
> >> >> >> >> val rsp: QueryResponse = c.query(cursorQ)
> >> >> >> >>
> >> >> >> >> Thank you,
> >> >> >> >> Chetas.
> >> >> >>
> >> >>
> >>
> >
> >
>

Re: Parallelize Cursor approach

Posted by Joel Bernstein <jo...@gmail.com>.
Solr 5 was very early days for Streaming Expressions. Streaming Expressions
and SQL use Java 8 so development switched to the 6.0 branch five months
before the 6.0 release. So there was a very large jump in features and bug
fixes from Solr 5 to Solr 6 in Streaming Expressions.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Nov 10, 2016 at 11:14 PM, Joel Bernstein <jo...@gmail.com> wrote:

> In Solr 5 the /export handler wasn't escaping json text fields, which
> would produce json parse exceptions. This was fixed in Solr 6.0.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, Nov 8, 2016 at 6:17 PM, Erick Erickson <er...@gmail.com>
> wrote:
>
>> Hmm, that should work fine. Let us know what the logs show if anything
>> because this is weird.
>>
>> Best,
>> Erick
>>
>> On Tue, Nov 8, 2016 at 1:00 PM, Chetas Joshi <ch...@gmail.com>
>> wrote:
>> > Hi Erick,
>> >
>> > This is how I use the streaming approach.
>> >
>> > Here is the solrconfig block.
>> >
>> > <requestHandler name="/export" class="solr.SearchHandler">
>> >     <lst name="invariants">
>> >         <str name="rq">{!xport}</str>
>> >         <str name="wt">xsort</str>
>> >         <str name="distrib">false</str>
>> >     </lst>
>> >     <arr name="components">
>> >         <str>query</str>
>> >     </arr>
>> > </requestHandler>
>> >
>> > And here is the code in which SolrJ is being used.
>> >
>> > String zkHost = args[0];
>> > String collection = args[1];
>> >
>> > Map props = new HashMap();
>> > props.put("q", "*:*");
>> > props.put("qt", "/export");
>> > props.put("sort", "fieldA asc");
>> > props.put("fl", "fieldA,fieldB,fieldC");
>> >
>> > CloudSolrStream cloudstream = new CloudSolrStream(zkHost,collect
>> ion,props);
>> >
>> > And then I iterate through the cloud stream (TupleStream).
>> > So I am using streaming expressions (SolrJ).
>> >
>> > I have not looked at the solr logs while I started getting the JSON
>> parsing
>> > exceptions. But I will let you know what I see the next time I run into
>> the
>> > same exceptions.
>> >
>> > Thanks
>> >
>> > On Sat, Nov 5, 2016 at 9:32 PM, Erick Erickson <erickerickson@gmail.com
>> >
>> > wrote:
>> >
>> >> Hmmm, export is supposed to handle 10s of million result sets. I know
>> >> of a situation where the Streaming Aggregation functionality back
>> >> ported to Solr 4.10 processes on that scale. So do you have any clue
>> >> what exactly is failing? Is there anything in the Solr logs?
>> >>
>> >> _How_ are you using /export, through Streaming Aggregation (SolrJ) or
>> >> just the raw xport handler? It might be worth trying to do this from
>> >> SolrJ if you're not, it should be a very quick program to write, just
>> >> to test we're talking 100 lines max.
>> >>
>> >> You could always roll your own cursor mark stuff by partitioning the
>> >> data amongst N threads/processes if you have any reasonable
>> >> expectation that you could form filter queries that partition the
>> >> result set anywhere near evenly.
>> >>
>> >> For example, let's say you have a field with random numbers between 0
>> >> and 100. You could spin off 10 cursorMark-aware processes each with
>> >> its own fq clause like
>> >>
>> >> fq=partition_field:[0 TO 10}
>> >> fq=[10 TO 20}
>> >> ....
>> >> fq=[90 TO 100]
>> >>
>> >> Note the use of inclusive/exclusive end points....
>> >>
>> >> Each one would be totally independent of all others with no
>> >> overlapping documents. And since the fq's would presumably be cached
>> >> you should be able to go as fast as you can drive your cluster. Of
>> >> course you lose query-wide sorting and the like, if that's important
>> >> you'd need to figure something out there.
>> >>
>> >> Do be aware of a potential issue. When regular doc fields are
>> >> returned, for each document returned, a 16K block of data will be
>> >> decompressed to get the stored field data. Streaming Aggregation
>> >> (/xport) reads docValues entries which are held in MMapDirectory space
>> >> so will be much, much faster. As of Solr 5.5. You can override the
>> >> decompression stuff, see:
>> >> https://issues.apache.org/jira/browse/SOLR-8220 for fields that are
>> >> both stored and docvalues...
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Sat, Nov 5, 2016 at 6:41 PM, Chetas Joshi <ch...@gmail.com>
>> >> wrote:
>> >> > Thanks Yonik for the explanation.
>> >> >
>> >> > Hi Erick,
>> >> > I was using the /xport functionality. But it hasn't been stable (Solr
>> >> > 5.5.0). I started running into run time Exceptions (JSON parsing
>> >> > exceptions) while reading the stream of Tuples. This started
>> happening as
>> >> > the size of my collection increased 3 times and I started running
>> queries
>> >> > that return millions of documents (>10mm). I don't know if it is the
>> >> query
>> >> > result size or the actual data size (total number of docs in the
>> >> > collection) that is causing the instability.
>> >> >
>> >> > org.noggit.JSONParser$ParseException: Expected ',' or '}':
>> >> > char=5,position=110938 BEFORE='uuid":"0lG99s8vyaKB2I/
>> >> > I","space":"uuid","timestamp":1 5' AFTER='DB6 474294954},{"uuid":"
>> >> > 0lG99sHT8P5e'
>> >> >
>> >> > I won't be able to move to Solr 6.0 due to some constraints in our
>> >> > production environment and hence moving back to the cursor approach.
>> Do
>> >> you
>> >> > have any other suggestion for me?
>> >> >
>> >> > Thanks,
>> >> > Chetas.
>> >> >
>> >> > On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson <
>> erickerickson@gmail.com
>> >> >
>> >> > wrote:
>> >> >
>> >> >> Have you considered the /xport functionality?
>> >> >>
>> >> >> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley <ys...@gmail.com>
>> wrote:
>> >> >> > No, you can't get cursor-marks ahead of time.
>> >> >> > They are the serialized representation of the last sort values
>> >> >> > encountered (hence not known ahead of time).
>> >> >> >
>> >> >> > -Yonik
>> >> >> >
>> >> >> >
>> >> >> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi <
>> chetas.joshi@gmail.com>
>> >> >> wrote:
>> >> >> >> Hi,
>> >> >> >>
>> >> >> >> I am using the cursor approach to fetch results from Solr
>> (5.5.0).
>> >> Most
>> >> >> of
>> >> >> >> my queries return millions of results. Is there a way I can read
>> the
>> >> >> pages
>> >> >> >> in parallel? Is there a way I can get all the cursors well in
>> >> advance?
>> >> >> >>
>> >> >> >> Let's say my query returns 2M documents and I have set
>> rows=100,000.
>> >> >> >> Can I have multiple threads iterating over different pages like
>> >> >> >> Thread1 -> docs 1 to 100K
>> >> >> >> Thread2 -> docs 101K to 200K
>> >> >> >> ......
>> >> >> >> ......
>> >> >> >>
>> >> >> >> for this to happen, can I get all the cursorMarks for a given
>> query
>> >> so
>> >> >> that
>> >> >> >> I can leverage the following code in parallel
>> >> >> >>
>> >> >> >> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
>> >> >> >> val rsp: QueryResponse = c.query(cursorQ)
>> >> >> >>
>> >> >> >> Thank you,
>> >> >> >> Chetas.
>> >> >>
>> >>
>>
>
>

Re: Parallelize Cursor approach

Posted by Joel Bernstein <jo...@gmail.com>.
In Solr 5 the /export handler wasn't escaping JSON text fields, which would
produce JSON parse exceptions. This was fixed in Solr 6.0.

Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Nov 8, 2016 at 6:17 PM, Erick Erickson <er...@gmail.com>
wrote:

> Hmm, that should work fine. Let us know what the logs show if anything
> because this is weird.
>
> Best,
> Erick
>
> On Tue, Nov 8, 2016 at 1:00 PM, Chetas Joshi <ch...@gmail.com>
> wrote:
> > Hi Erick,
> >
> > This is how I use the streaming approach.
> >
> > Here is the solrconfig block.
> >
> > <requestHandler name="/export" class="solr.SearchHandler">
> >     <lst name="invariants">
> >         <str name="rq">{!xport}</str>
> >         <str name="wt">xsort</str>
> >         <str name="distrib">false</str>
> >     </lst>
> >     <arr name="components">
> >         <str>query</str>
> >     </arr>
> > </requestHandler>
> >
> > And here is the code in which SolrJ is being used.
> >
> > String zkHost = args[0];
> > String collection = args[1];
> >
> > Map props = new HashMap();
> > props.put("q", "*:*");
> > props.put("qt", "/export");
> > props.put("sort", "fieldA asc");
> > props.put("fl", "fieldA,fieldB,fieldC");
> >
> > CloudSolrStream cloudstream = new CloudSolrStream(zkHost,
> collection,props);
> >
> > And then I iterate through the cloud stream (TupleStream).
> > So I am using streaming expressions (SolrJ).
> >
> > I have not looked at the solr logs while I started getting the JSON
> parsing
> > exceptions. But I will let you know what I see the next time I run into
> the
> > same exceptions.
> >
> > Thanks
> >
> > On Sat, Nov 5, 2016 at 9:32 PM, Erick Erickson <er...@gmail.com>
> > wrote:
> >
> >> Hmmm, export is supposed to handle 10s of million result sets. I know
> >> of a situation where the Streaming Aggregation functionality back
> >> ported to Solr 4.10 processes on that scale. So do you have any clue
> >> what exactly is failing? Is there anything in the Solr logs?
> >>
> >> _How_ are you using /export, through Streaming Aggregation (SolrJ) or
> >> just the raw xport handler? It might be worth trying to do this from
> >> SolrJ if you're not, it should be a very quick program to write, just
> >> to test we're talking 100 lines max.
> >>
> >> You could always roll your own cursor mark stuff by partitioning the
> >> data amongst N threads/processes if you have any reasonable
> >> expectation that you could form filter queries that partition the
> >> result set anywhere near evenly.
> >>
> >> For example, let's say you have a field with random numbers between 0
> >> and 100. You could spin off 10 cursorMark-aware processes each with
> >> its own fq clause like
> >>
> >> fq=partition_field:[0 TO 10}
> >> fq=[10 TO 20}
> >> ....
> >> fq=[90 TO 100]
> >>
> >> Note the use of inclusive/exclusive end points....
> >>
> >> Each one would be totally independent of all others with no
> >> overlapping documents. And since the fq's would presumably be cached
> >> you should be able to go as fast as you can drive your cluster. Of
> >> course you lose query-wide sorting and the like, if that's important
> >> you'd need to figure something out there.
> >>
> >> Do be aware of a potential issue. When regular doc fields are
> >> returned, for each document returned, a 16K block of data will be
> >> decompressed to get the stored field data. Streaming Aggregation
> >> (/xport) reads docValues entries which are held in MMapDirectory space
> >> so will be much, much faster. As of Solr 5.5. You can override the
> >> decompression stuff, see:
> >> https://issues.apache.org/jira/browse/SOLR-8220 for fields that are
> >> both stored and docvalues...
> >>
> >> Best,
> >> Erick
> >>
> >> On Sat, Nov 5, 2016 at 6:41 PM, Chetas Joshi <ch...@gmail.com>
> >> wrote:
> >> > Thanks Yonik for the explanation.
> >> >
> >> > Hi Erick,
> >> > I was using the /xport functionality. But it hasn't been stable (Solr
> >> > 5.5.0). I started running into run time Exceptions (JSON parsing
> >> > exceptions) while reading the stream of Tuples. This started
> happening as
> >> > the size of my collection increased 3 times and I started running
> queries
> >> > that return millions of documents (>10mm). I don't know if it is the
> >> query
> >> > result size or the actual data size (total number of docs in the
> >> > collection) that is causing the instability.
> >> >
> >> > org.noggit.JSONParser$ParseException: Expected ',' or '}':
> >> > char=5,position=110938 BEFORE='uuid":"0lG99s8vyaKB2I/
> >> > I","space":"uuid","timestamp":1 5' AFTER='DB6 474294954},{"uuid":"
> >> > 0lG99sHT8P5e'
> >> >
> >> > I won't be able to move to Solr 6.0 due to some constraints in our
> >> > production environment and hence moving back to the cursor approach.
> Do
> >> you
> >> > have any other suggestion for me?
> >> >
> >> > Thanks,
> >> > Chetas.
> >> >
> >> > On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson <
> erickerickson@gmail.com
> >> >
> >> > wrote:
> >> >
> >> >> Have you considered the /xport functionality?
> >> >>
> >> >> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley <ys...@gmail.com>
> wrote:
> >> >> > No, you can't get cursor-marks ahead of time.
> >> >> > They are the serialized representation of the last sort values
> >> >> > encountered (hence not known ahead of time).
> >> >> >
> >> >> > -Yonik
> >> >> >
> >> >> >
> >> >> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi <
> chetas.joshi@gmail.com>
> >> >> wrote:
> >> >> >> Hi,
> >> >> >>
> >> >> >> I am using the cursor approach to fetch results from Solr (5.5.0).
> >> Most
> >> >> of
> >> >> >> my queries return millions of results. Is there a way I can read
> the
> >> >> pages
> >> >> >> in parallel? Is there a way I can get all the cursors well in
> >> advance?
> >> >> >>
> >> >> >> Let's say my query returns 2M documents and I have set
> rows=100,000.
> >> >> >> Can I have multiple threads iterating over different pages like
> >> >> >> Thread1 -> docs 1 to 100K
> >> >> >> Thread2 -> docs 101K to 200K
> >> >> >> ......
> >> >> >> ......
> >> >> >>
> >> >> >> for this to happen, can I get all the cursorMarks for a given
> query
> >> so
> >> >> that
> >> >> >> I can leverage the following code in parallel
> >> >> >>
> >> >> >> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
> >> >> >> val rsp: QueryResponse = c.query(cursorQ)
> >> >> >>
> >> >> >> Thank you,
> >> >> >> Chetas.
> >> >>
> >>
>

Re: Parallelize Cursor approach

Posted by Erick Erickson <er...@gmail.com>.
Hmm, that should work fine. Let us know what the logs show, if anything,
because this is weird.

Best,
Erick

On Tue, Nov 8, 2016 at 1:00 PM, Chetas Joshi <ch...@gmail.com> wrote:
> Hi Erick,
>
> This is how I use the streaming approach.
>
> Here is the solrconfig block.
>
> <requestHandler name="/export" class="solr.SearchHandler">
>     <lst name="invariants">
>         <str name="rq">{!xport}</str>
>         <str name="wt">xsort</str>
>         <str name="distrib">false</str>
>     </lst>
>     <arr name="components">
>         <str>query</str>
>     </arr>
> </requestHandler>
>
> And here is the code in which SolrJ is being used.
>
> String zkHost = args[0];
> String collection = args[1];
>
> Map props = new HashMap();
> props.put("q", "*:*");
> props.put("qt", "/export");
> props.put("sort", "fieldA asc");
> props.put("fl", "fieldA,fieldB,fieldC");
>
> CloudSolrStream cloudstream = new CloudSolrStream(zkHost,collection,props);
>
> And then I iterate through the cloud stream (TupleStream).
> So I am using streaming expressions (SolrJ).
>
> I have not looked at the solr logs while I started getting the JSON parsing
> exceptions. But I will let you know what I see the next time I run into the
> same exceptions.
>
> Thanks
>
> On Sat, Nov 5, 2016 at 9:32 PM, Erick Erickson <er...@gmail.com>
> wrote:
>
>> Hmmm, export is supposed to handle 10s of million result sets. I know
>> of a situation where the Streaming Aggregation functionality back
>> ported to Solr 4.10 processes on that scale. So do you have any clue
>> what exactly is failing? Is there anything in the Solr logs?
>>
>> _How_ are you using /export, through Streaming Aggregation (SolrJ) or
>> just the raw xport handler? It might be worth trying to do this from
>> SolrJ if you're not, it should be a very quick program to write, just
>> to test we're talking 100 lines max.
>>
>> You could always roll your own cursor mark stuff by partitioning the
>> data amongst N threads/processes if you have any reasonable
>> expectation that you could form filter queries that partition the
>> result set anywhere near evenly.
>>
>> For example, let's say you have a field with random numbers between 0
>> and 100. You could spin off 10 cursorMark-aware processes each with
>> its own fq clause like
>>
>> fq=partition_field:[0 TO 10}
>> fq=[10 TO 20}
>> ....
>> fq=[90 TO 100]
>>
>> Note the use of inclusive/exclusive end points....
>>
>> Each one would be totally independent of all others with no
>> overlapping documents. And since the fq's would presumably be cached
>> you should be able to go as fast as you can drive your cluster. Of
>> course you lose query-wide sorting and the like, if that's important
>> you'd need to figure something out there.
>>
>> Do be aware of a potential issue. When regular doc fields are
>> returned, for each document returned, a 16K block of data will be
>> decompressed to get the stored field data. Streaming Aggregation
>> (/xport) reads docValues entries which are held in MMapDirectory space
>> so will be much, much faster. As of Solr 5.5. You can override the
>> decompression stuff, see:
>> https://issues.apache.org/jira/browse/SOLR-8220 for fields that are
>> both stored and docvalues...
>>
>> Best,
>> Erick
>>
>> On Sat, Nov 5, 2016 at 6:41 PM, Chetas Joshi <ch...@gmail.com>
>> wrote:
>> > Thanks Yonik for the explanation.
>> >
>> > Hi Erick,
>> > I was using the /xport functionality. But it hasn't been stable (Solr
>> > 5.5.0). I started running into run time Exceptions (JSON parsing
>> > exceptions) while reading the stream of Tuples. This started happening as
>> > the size of my collection increased 3 times and I started running queries
>> > that return millions of documents (>10mm). I don't know if it is the
>> query
>> > result size or the actual data size (total number of docs in the
>> > collection) that is causing the instability.
>> >
>> > org.noggit.JSONParser$ParseException: Expected ',' or '}':
>> > char=5,position=110938 BEFORE='uuid":"0lG99s8vyaKB2I/
>> > I","space":"uuid","timestamp":1 5' AFTER='DB6 474294954},{"uuid":"
>> > 0lG99sHT8P5e'
>> >
>> > I won't be able to move to Solr 6.0 due to some constraints in our
>> > production environment and hence moving back to the cursor approach. Do
>> you
>> > have any other suggestion for me?
>> >
>> > Thanks,
>> > Chetas.
>> >
>> > On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson <erickerickson@gmail.com
>> >
>> > wrote:
>> >
>> >> Have you considered the /xport functionality?
>> >>
>> >> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley <ys...@gmail.com> wrote:
>> >> > No, you can't get cursor-marks ahead of time.
>> >> > They are the serialized representation of the last sort values
>> >> > encountered (hence not known ahead of time).
>> >> >
>> >> > -Yonik
>> >> >
>> >> >
>> >> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi <ch...@gmail.com>
>> >> wrote:
>> >> >> Hi,
>> >> >>
>> >> >> I am using the cursor approach to fetch results from Solr (5.5.0).
>> Most
>> >> of
>> >> >> my queries return millions of results. Is there a way I can read the
>> >> pages
>> >> >> in parallel? Is there a way I can get all the cursors well in
>> advance?
>> >> >>
>> >> >> Let's say my query returns 2M documents and I have set rows=100,000.
>> >> >> Can I have multiple threads iterating over different pages like
>> >> >> Thread1 -> docs 1 to 100K
>> >> >> Thread2 -> docs 101K to 200K
>> >> >> ......
>> >> >> ......
>> >> >>
>> >> >> for this to happen, can I get all the cursorMarks for a given query
>> so
>> >> that
>> >> >> I can leverage the following code in parallel
>> >> >>
>> >> >> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
>> >> >> val rsp: QueryResponse = c.query(cursorQ)
>> >> >>
>> >> >> Thank you,
>> >> >> Chetas.
>> >>
>>

Re: Parallelize Cursor approach

Posted by Chetas Joshi <ch...@gmail.com>.
Hi Erick,

This is how I use the streaming approach.

Here is the solrconfig block.

<requestHandler name="/export" class="solr.SearchHandler">
    <lst name="invariants">
        <str name="rq">{!xport}</str>
        <str name="wt">xsort</str>
        <str name="distrib">false</str>
    </lst>
    <arr name="components">
        <str>query</str>
    </arr>
</requestHandler>

And here is the code in which SolrJ is being used.

String zkHost = args[0];
String collection = args[1];

Map props = new HashMap();
props.put("q", "*:*");
props.put("qt", "/export");
props.put("sort", "fieldA asc");
props.put("fl", "fieldA,fieldB,fieldC");

CloudSolrStream cloudstream = new CloudSolrStream(zkHost,collection,props);

And then I iterate through the cloud stream (TupleStream).
So I am using streaming expressions (SolrJ).
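
The read loop itself looks roughly like this (written here as a Scala sketch
against the same SolrJ classes; cloudstream is the CloudSolrStream built
above and the field names are the ones listed in fl):

import org.apache.solr.client.solrj.io.Tuple

cloudstream.open()
try {
  var tuple: Tuple = cloudstream.read()
  while (!tuple.EOF) {
    val fieldA = tuple.getString("fieldA")
    // ... process the tuple ...
    tuple = cloudstream.read()
  }
} finally {
  cloudstream.close()
}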

I have not looked at the Solr logs from when I started getting the JSON
parsing exceptions, but I will let you know what I see the next time I run
into them.

Thanks

On Sat, Nov 5, 2016 at 9:32 PM, Erick Erickson <er...@gmail.com>
wrote:

> Hmmm, export is supposed to handle 10s of million result sets. I know
> of a situation where the Streaming Aggregation functionality back
> ported to Solr 4.10 processes on that scale. So do you have any clue
> what exactly is failing? Is there anything in the Solr logs?
>
> _How_ are you using /export, through Streaming Aggregation (SolrJ) or
> just the raw xport handler? It might be worth trying to do this from
> SolrJ if you're not, it should be a very quick program to write, just
> to test we're talking 100 lines max.
>
> You could always roll your own cursor mark stuff by partitioning the
> data amongst N threads/processes if you have any reasonable
> expectation that you could form filter queries that partition the
> result set anywhere near evenly.
>
> For example, let's say you have a field with random numbers between 0
> and 100. You could spin off 10 cursorMark-aware processes each with
> its own fq clause like
>
> fq=partition_field:[0 TO 10}
> fq=[10 TO 20}
> ....
> fq=[90 TO 100]
>
> Note the use of inclusive/exclusive end points....
>
> Each one would be totally independent of all others with no
> overlapping documents. And since the fq's would presumably be cached
> you should be able to go as fast as you can drive your cluster. Of
> course you lose query-wide sorting and the like, if that's important
> you'd need to figure something out there.
>
> Do be aware of a potential issue. When regular doc fields are
> returned, for each document returned, a 16K block of data will be
> decompressed to get the stored field data. Streaming Aggregation
> (/xport) reads docValues entries which are held in MMapDirectory space
> so will be much, much faster. As of Solr 5.5. You can override the
> decompression stuff, see:
> https://issues.apache.org/jira/browse/SOLR-8220 for fields that are
> both stored and docvalues...
>
> Best,
> Erick
>
> On Sat, Nov 5, 2016 at 6:41 PM, Chetas Joshi <ch...@gmail.com>
> wrote:
> > Thanks Yonik for the explanation.
> >
> > Hi Erick,
> > I was using the /xport functionality. But it hasn't been stable (Solr
> > 5.5.0). I started running into run time Exceptions (JSON parsing
> > exceptions) while reading the stream of Tuples. This started happening as
> > the size of my collection increased 3 times and I started running queries
> > that return millions of documents (>10mm). I don't know if it is the
> query
> > result size or the actual data size (total number of docs in the
> > collection) that is causing the instability.
> >
> > org.noggit.JSONParser$ParseException: Expected ',' or '}':
> > char=5,position=110938 BEFORE='uuid":"0lG99s8vyaKB2I/
> > I","space":"uuid","timestamp":1 5' AFTER='DB6 474294954},{"uuid":"
> > 0lG99sHT8P5e'
> >
> > I won't be able to move to Solr 6.0 due to some constraints in our
> > production environment and hence moving back to the cursor approach. Do
> you
> > have any other suggestion for me?
> >
> > Thanks,
> > Chetas.
> >
> > On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson <erickerickson@gmail.com
> >
> > wrote:
> >
> >> Have you considered the /xport functionality?
> >>
> >> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley <ys...@gmail.com> wrote:
> >> > No, you can't get cursor-marks ahead of time.
> >> > They are the serialized representation of the last sort values
> >> > encountered (hence not known ahead of time).
> >> >
> >> > -Yonik
> >> >
> >> >
> >> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi <ch...@gmail.com>
> >> wrote:
> >> >> Hi,
> >> >>
> >> >> I am using the cursor approach to fetch results from Solr (5.5.0).
> Most
> >> of
> >> >> my queries return millions of results. Is there a way I can read the
> >> pages
> >> >> in parallel? Is there a way I can get all the cursors well in
> advance?
> >> >>
> >> >> Let's say my query returns 2M documents and I have set rows=100,000.
> >> >> Can I have multiple threads iterating over different pages like
> >> >> Thread1 -> docs 1 to 100K
> >> >> Thread2 -> docs 101K to 200K
> >> >> ......
> >> >> ......
> >> >>
> >> >> for this to happen, can I get all the cursorMarks for a given query
> so
> >> that
> >> >> I can leverage the following code in parallel
> >> >>
> >> >> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
> >> >> val rsp: QueryResponse = c.query(cursorQ)
> >> >>
> >> >> Thank you,
> >> >> Chetas.
> >>
>

Re: Parallelize Cursor approach

Posted by Erick Erickson <er...@gmail.com>.
Hmmm, /export is supposed to handle result sets in the tens of millions. I
know of a situation where the Streaming Aggregation functionality,
back-ported to Solr 4.10, processes result sets on that scale. So do you have
any clue what exactly is failing? Is there anything in the Solr logs?

_How_ are you using /export: through Streaming Aggregation (SolrJ) or just
the raw xport handler? If you're not using SolrJ, it might be worth trying
this from SolrJ just to test; it should be a very quick program to write,
100 lines max.

You could always roll your own cursor mark stuff by partitioning the
data amongst N threads/processes if you have any reasonable
expectation that you could form filter queries that partition the
result set anywhere near evenly.

For example, let's say you have a field with random numbers between 0
and 100. You could spin off 10 cursorMark-aware processes each with
its own fq clause like

fq=partition_field:[0 TO 10}
fq=partition_field:[10 TO 20}
....
fq=partition_field:[90 TO 100]

Note the use of inclusive/exclusive end points....

Each one would be totally independent of all the others, with no overlapping
documents. And since the fq's would presumably be cached, you should be able
to go as fast as you can drive your cluster. Of course you lose query-wide
sorting and the like; if that's important you'd need to figure something out
there.
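
A hypothetical per-partition worker along those lines (Scala/SolrJ sketch;
the fq range string, "id" uniqueKey, rows size and the client wiring are
placeholders):

import scala.collection.JavaConverters._
import org.apache.solr.client.solrj.{SolrClient, SolrQuery}
import org.apache.solr.client.solrj.response.QueryResponse
import org.apache.solr.common.SolrDocument
import org.apache.solr.common.params.CursorMarkParams

def walkPartition(c: SolrClient, fqRange: String)(handle: SolrDocument => Unit): Unit = {
  val q = new SolrQuery("*:*")
  q.addFilterQuery(fqRange)             // e.g. "partition_field:[0 TO 10}"
  q.setRows(100000)
  q.setSort("id", SolrQuery.ORDER.asc)  // uniqueKey sort, required for cursors

  var mark = CursorMarkParams.CURSOR_MARK_START
  var done = false
  while (!done) {
    q.set(CursorMarkParams.CURSOR_MARK_PARAM, mark)
    val rsp: QueryResponse = c.query(q)
    rsp.getResults.asScala.foreach(handle)
    val next = rsp.getNextCursorMark
    done = (next == mark)
    mark = next
  }
}

Each thread or process would call walkPartition with its own range, so the
cursors never interact across partitions.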

Do be aware of a potential issue. When regular doc fields are returned, a
16K block of data is decompressed for each document returned to get the
stored field data. Streaming Aggregation (/xport) reads docValues entries,
which are held in MMapDirectory space, so it will be much, much faster. As of
Solr 5.5 you can override the decompression behavior for fields that are both
stored and docValues; see https://issues.apache.org/jira/browse/SOLR-8220.

Best,
Erick

On Sat, Nov 5, 2016 at 6:41 PM, Chetas Joshi <ch...@gmail.com> wrote:
> Thanks Yonik for the explanation.
>
> Hi Erick,
> I was using the /xport functionality. But it hasn't been stable (Solr
> 5.5.0). I started running into run time Exceptions (JSON parsing
> exceptions) while reading the stream of Tuples. This started happening as
> the size of my collection increased 3 times and I started running queries
> that return millions of documents (>10mm). I don't know if it is the query
> result size or the actual data size (total number of docs in the
> collection) that is causing the instability.
>
> org.noggit.JSONParser$ParseException: Expected ',' or '}':
> char=5,position=110938 BEFORE='uuid":"0lG99s8vyaKB2I/
> I","space":"uuid","timestamp":1 5' AFTER='DB6 474294954},{"uuid":"
> 0lG99sHT8P5e'
>
> I won't be able to move to Solr 6.0 due to some constraints in our
> production environment and hence moving back to the cursor approach. Do you
> have any other suggestion for me?
>
> Thanks,
> Chetas.
>
> On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson <er...@gmail.com>
> wrote:
>
>> Have you considered the /xport functionality?
>>
>> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley <ys...@gmail.com> wrote:
>> > No, you can't get cursor-marks ahead of time.
>> > They are the serialized representation of the last sort values
>> > encountered (hence not known ahead of time).
>> >
>> > -Yonik
>> >
>> >
>> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi <ch...@gmail.com>
>> wrote:
>> >> Hi,
>> >>
>> >> I am using the cursor approach to fetch results from Solr (5.5.0). Most
>> of
>> >> my queries return millions of results. Is there a way I can read the
>> pages
>> >> in parallel? Is there a way I can get all the cursors well in advance?
>> >>
>> >> Let's say my query returns 2M documents and I have set rows=100,000.
>> >> Can I have multiple threads iterating over different pages like
>> >> Thread1 -> docs 1 to 100K
>> >> Thread2 -> docs 101K to 200K
>> >> ......
>> >> ......
>> >>
>> >> for this to happen, can I get all the cursorMarks for a given query so
>> that
>> >> I can leverage the following code in parallel
>> >>
>> >> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
>> >> val rsp: QueryResponse = c.query(cursorQ)
>> >>
>> >> Thank you,
>> >> Chetas.
>>

Re: Parallelize Cursor approach

Posted by Chetas Joshi <ch...@gmail.com>.
Thanks Yonik for the explanation.

Hi Erick,
I was using the /xport functionality, but it hasn't been stable (Solr
5.5.0). I started running into runtime exceptions (JSON parsing exceptions)
while reading the stream of Tuples. This started happening once the size of
my collection grew 3x and I started running queries that return millions of
documents (>10mm). I don't know whether it is the query result size or the
actual data size (total number of docs in the collection) that is causing the
instability.

org.noggit.JSONParser$ParseException: Expected ',' or '}':
char=5,position=110938 BEFORE='uuid":"0lG99s8vyaKB2I/
I","space":"uuid","timestamp":1 5' AFTER='DB6 474294954},{"uuid":"
0lG99sHT8P5e'

I won't be able to move to Solr 6.0 due to some constraints in our
production environment, and hence I am moving back to the cursor approach. Do
you have any other suggestions for me?

Thanks,
Chetas.

On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson <er...@gmail.com>
wrote:

> Have you considered the /xport functionality?
>
> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley <ys...@gmail.com> wrote:
> > No, you can't get cursor-marks ahead of time.
> > They are the serialized representation of the last sort values
> > encountered (hence not known ahead of time).
> >
> > -Yonik
> >
> >
> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi <ch...@gmail.com>
> wrote:
> >> Hi,
> >>
> >> I am using the cursor approach to fetch results from Solr (5.5.0). Most
> of
> >> my queries return millions of results. Is there a way I can read the
> pages
> >> in parallel? Is there a way I can get all the cursors well in advance?
> >>
> >> Let's say my query returns 2M documents and I have set rows=100,000.
> >> Can I have multiple threads iterating over different pages like
> >> Thread1 -> docs 1 to 100K
> >> Thread2 -> docs 101K to 200K
> >> ......
> >> ......
> >>
> >> for this to happen, can I get all the cursorMarks for a given query so
> that
> >> I can leverage the following code in parallel
> >>
> >> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
> >> val rsp: QueryResponse = c.query(cursorQ)
> >>
> >> Thank you,
> >> Chetas.
>

Re: Parallelize Cursor approach

Posted by Erick Erickson <er...@gmail.com>.
Have you considered the /xport functionality?

On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley <ys...@gmail.com> wrote:
> No, you can't get cursor-marks ahead of time.
> They are the serialized representation of the last sort values
> encountered (hence not known ahead of time).
>
> -Yonik
>
>
> On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi <ch...@gmail.com> wrote:
>> Hi,
>>
>> I am using the cursor approach to fetch results from Solr (5.5.0). Most of
>> my queries return millions of results. Is there a way I can read the pages
>> in parallel? Is there a way I can get all the cursors well in advance?
>>
>> Let's say my query returns 2M documents and I have set rows=100,000.
>> Can I have multiple threads iterating over different pages like
>> Thread1 -> docs 1 to 100K
>> Thread2 -> docs 101K to 200K
>> ......
>> ......
>>
>> for this to happen, can I get all the cursorMarks for a given query so that
>> I can leverage the following code in parallel
>>
>> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
>> val rsp: QueryResponse = c.query(cursorQ)
>>
>> Thank you,
>> Chetas.

Re: Parallelize Cursor approach

Posted by Yonik Seeley <ys...@gmail.com>.
No, you can't get cursor-marks ahead of time.
They are the serialized representation of the last sort values
encountered (hence not known ahead of time).

-Yonik


On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi <ch...@gmail.com> wrote:
> Hi,
>
> I am using the cursor approach to fetch results from Solr (5.5.0). Most of
> my queries return millions of results. Is there a way I can read the pages
> in parallel? Is there a way I can get all the cursors well in advance?
>
> Let's say my query returns 2M documents and I have set rows=100,000.
> Can I have multiple threads iterating over different pages like
> Thread1 -> docs 1 to 100K
> Thread2 -> docs 101K to 200K
> ......
> ......
>
> for this to happen, can I get all the cursorMarks for a given query so that
> I can leverage the following code in parallel
>
> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
> val rsp: QueryResponse = c.query(cursorQ)
>
> Thank you,
> Chetas.