Posted to solr-user@lucene.apache.org by Walter Underwood <wu...@wunderwood.org> on 2020/02/10 18:30:49 UTC

cursorMark and shards? (6.6.2)

I tried to get fancy and dump our content with one thread per shard, but it did distributed search anyway. I specified the shard using the “shards” param and set distrib=false.
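
The per-thread request looked roughly like this (a sketch; the host, collection, and shard names are placeholders, not the real ones):

    import requests

    params = {
        "q": "*:*",
        "sort": "id asc",
        "rows": 1000,
        "cursorMark": "*",
        "shards": "shard1",    # one shard per thread
        "distrib": "false",    # intended to keep the query on that one shard
        "wt": "json",
    }
    resp = requests.get("http://host:8983/solr/collection/select", params=params)
    print(resp.json()["responseHeader"]["QTime"])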

Is this a bug or expected behavior in 6.6.2? I did not see it mentioned in the docs.

It is working fine with a single thread and distributed search. Should have followed the old Kernighan and Plauger rule, “Make it right before you make it faster.”

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


Re: cursorMark and shards? (6.6.2)

Posted by Erick Erickson <er...@gmail.com>.
Curiouser and curiouser. So two possibilities are just the time it takes to assemble the packet and the time it takes to send it back. Three more experiments then.

1> change the returned doc to return a single docValues=true field. My claim: The response will be very close to the 400-600 ms range back at the client. This is kind of a sanity check, it should collect the field to return without visiting the stored data/decompressing it etc.

2> change the returned doc to return a single docValues=false stored=true field. That’ll exercise the whole fetch-from-disk-and-decompress cycle, because all the stored values for a doc need to be decompressed if you access even one. If that comes back in, say, under 1 second, then my guess is the slowness is either GC thrashing or your network. If it’s in the same 10s range, then I’d be looking at GC and the like.

3> change the returned rows to, say, 100 while still returning all the fields. If you see a pretty linear relationship between the number of docs and the response time, then at least we know where to dig. Rough sketches of all three are below.
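
The URL and field names here are placeholders, not anything from the actual schema:

    import requests

    BASE = "http://host:8983/solr/collection/select"   # placeholder URL
    common = {"q": "*:*", "sort": "id asc", "cursorMark": "*", "wt": "json"}

    # 1> a single docValues=true field
    r1 = requests.get(BASE, params={**common, "rows": 1000, "fl": "some_dv_field"})

    # 2> a single stored=true, docValues=false field
    r2 = requests.get(BASE, params={**common, "rows": 1000, "fl": "some_stored_field"})

    # 3> all the fields, but only 100 rows per page
    r3 = requests.get(BASE, params={**common, "rows": 100, "fl": "*"})

    for label, r in (("dv only", r1), ("stored only", r2), ("rows=100", r3)):
        print(label, "QTime:", r.json()["responseHeader"]["QTime"])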

But the 1% CPU utilization makes me suspect transmission. If it were read-from-disk-and-decompress I’d expect more CPU due to the decompression phase unless you have a very slow disk.

So apparently CursorMark was a red herring?

It is _vaguely_ possible that bumping the documentCache higher might help, but frankly I doubt it. That would help if you were having to re-access the stored data from disk for several different components in a single query, but I don’t think that pertains here. Then again, I’m surprised you’re seeing this at all, so what I think, as an author I read once had a character say, “ain’t worth a goldfish fart” in the face of evidence.

Best,
Erick

> On Feb 11, 2020, at 12:33 PM, Walter Underwood <wu...@wunderwood.org> wrote:
> 
> Good questions. Here is the QTime for rows=1000. Looks pretty reasonable. I’d blame the slowness on the VPN connection, but the median response time of 10,000 msec is measured at the server.
> 
> The client is in Python, using wt=json. Average document size in JSON is 5132 bytes. The system should not be IO bound, but I’ll check. The instances have 31 GB of memory, shards are 40 GB on SSD. I don’t think I set up JVM monitoring on this cluster, so I can’t see if the GC is thrashing.
> 
> 2020-02-11T08:51:33 INFO QTime=401
> 2020-02-11T08:51:34 INFO QTime=612
> 2020-02-11T08:51:34 INFO QTime=492
> 2020-02-11T08:51:35 INFO QTime=513
> 2020-02-11T08:51:36 INFO QTime=458
> 2020-02-11T08:51:36 INFO QTime=419
> 2020-02-11T08:51:46 INFO QTime=477
> 2020-02-11T08:51:47 INFO QTime=479
> 2020-02-11T08:51:47 INFO QTime=457
> 2020-02-11T08:51:50 INFO QTime=553
> 2020-02-11T08:51:50 INFO QTime=658
> 2020-02-11T08:51:52 INFO QTime=379
> 
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Feb 11, 2020, at 6:28 AM, Erick Erickson <er...@gmail.com> wrote:
>> 
>> Wow, that’s pretty horrible performance. 
>> 
>> Yeah, I was conflating a couple of things here. Now it’s clear.
>> 
>> If you specify rows=1, what do you get in response time? I’m wondering if
>> your time is spent just assembling the response rather than searching. You’d
>> have to have massive docs for that to be the case, kind of a shot in the dark.
>> The assembly step requires the docs be read off disk, decompressed and then
>> transmitted, but 10 seconds is ridiculous for that. I’m starting to wonder about
>> being I/O bound either disk wise or network, but I’m pretty sure you’ve already
>> thought about that.
>> 
>> You are transmitting things around your servers given your statement that you
>> are seeing the searches distributed, which is also a waste, but again I wouldn’t
>> expect it to be that bad.
>> 
>> Hmmm, quick thing to check: What are the QTime’s reported? Those are
>> exclusive of assembling the return packet. If they were a few milliseconds and
>> your response back at the client was 10s, that’d be a clue.
>> 
>> Best,
>> Erick
>> 
>>> On Feb 11, 2020, at 2:13 AM, Walter Underwood <wu...@wunderwood.org> wrote:
>>> 
>>> sort=“id asc”
>>> 
>>> wunder
>>> Walter Underwood
>>> wunder@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>>> On Feb 10, 2020, at 9:50 PM, Tim Casey <tc...@gmail.com> wrote:
>>>> 
>>>> Walter,
>>>> 
>>>> When you do the query, what is the sort of the results?
>>>> 
>>>> tim
>>>> 
>>>> On Mon, Feb 10, 2020 at 8:44 PM Walter Underwood <wu...@wunderwood.org>
>>>> wrote:
>>>> 
>>>>> I’ll back up a bit, since it is sort of an X/Y problem.
>>>>> 
>>>>> I have an index with four shards and 17 million documents. I want to dump
>>>>> all the docs in JSON, label each one with a classifier, then load them back
>>>>> in with the labels. This is a one-time (or rare) bootstrap of the
>>>>> classified data. This will unblock testing and relevance work while we get
>>>>> the classifier hooked into the indexing pipeline.
>>>>> 
>>>>> Because I’m dumping all the fields, we can’t rely on docValues.
>>>>> 
>>>>> It is OK if it takes a few hours.
>>>>> 
>>>>> Right now, it is running about 1.7 Mdoc/hour, so roughly 10 hours. That is
>>>>> 16 threads searching id:0* through id:f*, fetching 1000 rows each time,
>>>>> using cursorMark and distributed search. Median response time is 10 s. CPU
>>>>> usage is about 1%.
>>>>> 
>>>>> It is all pretty grubby and it seems like there could be a better way.
>>>>> 
>>>>> wunder
>>>>> Walter Underwood
>>>>> wunder@wunderwood.org
>>>>> http://observer.wunderwood.org/  (my blog)
>>>>> 
>>>>>> On Feb 10, 2020, at 3:39 PM, Erick Erickson <er...@gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>> Any field that’s unique per doc would do, but yeah, that’s usually an ID.
>>>>>> 
>>>>>> Hmmm, I don’t see why separate queries for 0-f are necessary if you’re
>>>>> firing
>>>>>> at individual replicas. Each replica should have multiple UUIDs that
>>>>> start with 0-f.
>>>>>> 
>>>>>> Unless I misunderstand and you’re just firing off, say, 16 threads at
>>>>> the entire
>>>>>> collection rather than individual shards which would work too. But for
>>>>> individual
>>>>>> shards I think you need to look for all possible IDs...
>>>>>> 
>>>>>> Erick
>>>>>> 
>>>>>>> On Feb 10, 2020, at 5:37 PM, Walter Underwood <wu...@wunderwood.org>
>>>>> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> On Feb 10, 2020, at 2:24 PM, Walter Underwood <wu...@wunderwood.org>
>>>>> wrote:
>>>>>>>> 
>>>>>>>> Not sure if range queries work on a UUID field, ...
>>>>>>> 
>>>>>>> A search for id:0* took 260 ms, so it looks like they work just fine.
>>>>> I’ll try separate queries for 0-f.
>>>>>>> 
>>>>>>> wunder
>>>>>>> Walter Underwood
>>>>>>> wunder@wunderwood.org
>>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>> 
>> 
> 


Re: cursorMark and shards? (6.6.2)

Posted by Walter Underwood <wu...@wunderwood.org>.
Good questions. Here is the QTime for rows=1000. Looks pretty reasonable. I’d blame the slowness on the VPN connection, but the median response time of 10,000 msec is measured at the server.

The client is in Python, using wt=json. Average document size in JSON is 5132 bytes. The system should not be IO bound, but I’ll check. The instances have 31 GB of memory, shards are 40 GB on SSD. I don’t think I set up JVM monitoring on this cluster, so I can’t see if the GC is thrashing.

2020-02-11T08:51:33 INFO QTime=401
2020-02-11T08:51:34 INFO QTime=612
2020-02-11T08:51:34 INFO QTime=492
2020-02-11T08:51:35 INFO QTime=513
2020-02-11T08:51:36 INFO QTime=458
2020-02-11T08:51:36 INFO QTime=419
2020-02-11T08:51:46 INFO QTime=477
2020-02-11T08:51:47 INFO QTime=479
2020-02-11T08:51:47 INFO QTime=457
2020-02-11T08:51:50 INFO QTime=553
2020-02-11T08:51:50 INFO QTime=658
2020-02-11T08:51:52 INFO QTime=379
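
For completeness, each thread runs the standard cursorMark loop, roughly like this (simplified; the host and collection names are placeholders):

    import json
    import requests

    SOLR = "http://host:8983/solr/collection/select"   # placeholder

    def dump(prefix, out_path):
        """Page through all docs whose id starts with `prefix`, writing JSON lines."""
        cursor = "*"
        with open(out_path, "w") as out:
            while True:
                resp = requests.get(SOLR, params={
                    "q": "id:%s*" % prefix,
                    "sort": "id asc",
                    "rows": 1000,
                    "cursorMark": cursor,
                    "wt": "json",
                }).json()
                for doc in resp["response"]["docs"]:
                    out.write(json.dumps(doc) + "\n")
                next_cursor = resp["nextCursorMark"]
                if next_cursor == cursor:   # cursor stops advancing when results are exhausted
                    break
                cursor = next_cursor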

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 11, 2020, at 6:28 AM, Erick Erickson <er...@gmail.com> wrote:
> 
> Wow, that’s pretty horrible performance. 
> 
> Yeah, I was conflating a couple of things here. Now it’s clear.
> 
> If you specify rows=1, what do you get in response time? I’m wondering if
> your time is spent just assembling the response rather than searching. You’d
> have to have massive docs for that to be the case, kind of a shot in the dark.
> The assembly step requires the docs be read off disk, decompressed and then
> transmitted, but 10 seconds is ridiculous for that. I’m starting to wonder about
> being I/O bound either disk wise or network, but I’m pretty sure you’ve already
> thought about that.
> 
> You are transmitting things around your servers given your statement that you
> are seeing the searches distributed, which is also a waste, but again I wouldn’t
> expect it to be that bad.
> 
> Hmmm, quick thing to check: What are the QTime’s reported? Those are
> exclusive of assembling the return packet. If they were a few milliseconds and
> your response back at the client was 10s, that’d be a clue.
> 
> Best,
> Erick
> 
>> On Feb 11, 2020, at 2:13 AM, Walter Underwood <wu...@wunderwood.org> wrote:
>> 
>> sort=“id asc”
>> 
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Feb 10, 2020, at 9:50 PM, Tim Casey <tc...@gmail.com> wrote:
>>> 
>>> Walter,
>>> 
>>> When you do the query, what is the sort of the results?
>>> 
>>> tim
>>> 
>>> On Mon, Feb 10, 2020 at 8:44 PM Walter Underwood <wu...@wunderwood.org>
>>> wrote:
>>> 
>>>> I’ll back up a bit, since it is sort of an X/Y problem.
>>>> 
>>>> I have an index with four shards and 17 million documents. I want to dump
>>>> all the docs in JSON, label each one with a classifier, then load them back
>>>> in with the labels. This is a one-time (or rare) bootstrap of the
>>>> classified data. This will unblock testing and relevance work while we get
>>>> the classifier hooked into the indexing pipeline.
>>>> 
>>>> Because I’m dumping all the fields, we can’t rely on docValues.
>>>> 
>>>> It is OK if it takes a few hours.
>>>> 
>>>> Right now, it is running about 1.7 Mdoc/hour, so roughly 10 hours. That is
>>>> 16 threads searching id:0* through id:f*, fetching 1000 rows each time,
>>>> using cursorMark and distributed search. Median response time is 10 s. CPU
>>>> usage is about 1%.
>>>> 
>>>> It is all pretty grubby and it seems like there could be a better way.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wunder@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>> 
>>>>> On Feb 10, 2020, at 3:39 PM, Erick Erickson <er...@gmail.com>
>>>> wrote:
>>>>> 
>>>>> Any field that’s unique per doc would do, but yeah, that’s usually an ID.
>>>>> 
>>>>> Hmmm, I don’t see why separate queries for 0-f are necessary if you’re
>>>> firing
>>>>> at individual replicas. Each replica should have multiple UUIDs that
>>>> start with 0-f.
>>>>> 
>>>>> Unless I misunderstand and you’re just firing off, say, 16 threads at
>>>> the entire
>>>>> collection rather than individual shards which would work too. But for
>>>> individual
>>>>> shards I think you need to look for all possible IDs...
>>>>> 
>>>>> Erick
>>>>> 
>>>>>> On Feb 10, 2020, at 5:37 PM, Walter Underwood <wu...@wunderwood.org>
>>>> wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Feb 10, 2020, at 2:24 PM, Walter Underwood <wu...@wunderwood.org>
>>>> wrote:
>>>>>>> 
>>>>>>> Not sure if range queries work on a UUID field, ...
>>>>>> 
>>>>>> A search for id:0* took 260 ms, so it looks like they work just fine.
>>>> I’ll try separate queries for 0-f.
>>>>>> 
>>>>>> wunder
>>>>>> Walter Underwood
>>>>>> wunder@wunderwood.org
>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>> 
>>>>> 
>>>> 
>>>> 
>> 
> 


Re: cursorMark and shards? (6.6.2)

Posted by Erick Erickson <er...@gmail.com>.
Wow, that’s pretty horrible performance. 

Yeah, I was conflating a couple of things here. Now it’s clear.

If you specify rows=1, what do you get in response time? I’m wondering if
your time is spent just assembling the response rather than searching. You’d
have to have massive docs for that to be the case, kind of a shot in the dark.
The assembly step requires the docs be read off disk, decompressed and then
transmitted, but 10 seconds is ridiculous for that. I’m starting to wonder about
being I/O bound either disk wise or network, but I’m pretty sure you’ve already
thought about that.

You are transmitting things around your servers given your statement that you
are seeing the searches distributed, which is also a waste, but again I wouldn’t
expect it to be that bad.

Hmmm, quick thing to check: What are the QTime’s reported? Those are
exclusive of assembling the return packet. If they were a few milliseconds and
your response back at the client was 10s, that’d be a clue.
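
A quick way to see that from the client is to time the round trip and compare it with the reported QTime, something like this (the URL is a placeholder):

    import time
    import requests

    url = "http://host:8983/solr/collection/select"    # placeholder
    params = {"q": "*:*", "sort": "id asc", "rows": 1000, "cursorMark": "*", "wt": "json"}

    start = time.time()
    body = requests.get(url, params=params).json()
    wall_ms = int((time.time() - start) * 1000)

    # QTime is the server-side search time, excluding response assembly and transfer
    print("QTime:", body["responseHeader"]["QTime"], "ms, wall clock:", wall_ms, "ms")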

Best,
Erick

> On Feb 11, 2020, at 2:13 AM, Walter Underwood <wu...@wunderwood.org> wrote:
> 
> sort=“id asc”
> 
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Feb 10, 2020, at 9:50 PM, Tim Casey <tc...@gmail.com> wrote:
>> 
>> Walter,
>> 
>> When you do the query, what is the sort of the results?
>> 
>> tim
>> 
>> On Mon, Feb 10, 2020 at 8:44 PM Walter Underwood <wu...@wunderwood.org>
>> wrote:
>> 
>>> I’ll back up a bit, since it is sort of an X/Y problem.
>>> 
>>> I have an index with four shards and 17 million documents. I want to dump
>>> all the docs in JSON, label each one with a classifier, then load them back
>>> in with the labels. This is a one-time (or rare) bootstrap of the
>>> classified data. This will unblock testing and relevance work while we get
>>> the classifier hooked into the indexing pipeline.
>>> 
>>> Because I’m dumping all the fields, we can’t rely on docValues.
>>> 
>>> It is OK if it takes a few hours.
>>> 
>>> Right now, it is running about 1.7 Mdoc/hour, so roughly 10 hours. That is
>>> 16 threads searching id:0* through id:f*, fetching 1000 rows each time,
>>> using cursorMark and distributed search. Median response time is 10 s. CPU
>>> usage is about 1%.
>>> 
>>> It is all pretty grubby and it seems like there could be a better way.
>>> 
>>> wunder
>>> Walter Underwood
>>> wunder@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>>> On Feb 10, 2020, at 3:39 PM, Erick Erickson <er...@gmail.com>
>>> wrote:
>>>> 
>>>> Any field that’s unique per doc would do, but yeah, that’s usually an ID.
>>>> 
>>>> Hmmm, I don’t see why separate queries for 0-f are necessary if you’re
>>> firing
>>>> at individual replicas. Each replica should have multiple UUIDs that
>>> start with 0-f.
>>>> 
>>>> Unless I misunderstand and you’re just firing off, say, 16 threads at
>>> the entire
>>>> collection rather than individual shards which would work too. But for
>>> individual
>>>> shards I think you need to look for all possible IDs...
>>>> 
>>>> Erick
>>>> 
>>>>> On Feb 10, 2020, at 5:37 PM, Walter Underwood <wu...@wunderwood.org>
>>> wrote:
>>>>> 
>>>>> 
>>>>>> On Feb 10, 2020, at 2:24 PM, Walter Underwood <wu...@wunderwood.org>
>>> wrote:
>>>>>> 
>>>>>> Not sure if range queries work on a UUID field, ...
>>>>> 
>>>>> A search for id:0* took 260 ms, so it looks like they work just fine.
>>> I’ll try separate queries for 0-f.
>>>>> 
>>>>> wunder
>>>>> Walter Underwood
>>>>> wunder@wunderwood.org
>>>>> http://observer.wunderwood.org/  (my blog)
>>>>> 
>>>> 
>>> 
>>> 
> 


Re: cursorMark and shards? (6.6.2)

Posted by Walter Underwood <wu...@wunderwood.org>.
sort=“id asc”

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 10, 2020, at 9:50 PM, Tim Casey <tc...@gmail.com> wrote:
> 
> Walter,
> 
> When you do the query, what is the sort of the results?
> 
> tim
> 
> On Mon, Feb 10, 2020 at 8:44 PM Walter Underwood <wu...@wunderwood.org>
> wrote:
> 
>> I’ll back up a bit, since it is sort of an X/Y problem.
>> 
>> I have an index with four shards and 17 million documents. I want to dump
>> all the docs in JSON, label each one with a classifier, then load them back
>> in with the labels. This is a one-time (or rare) bootstrap of the
>> classified data. This will unblock testing and relevance work while we get
>> the classifier hooked into the indexing pipeline.
>> 
>> Because I’m dumping all the fields, we can’t rely on docValues.
>> 
>> It is OK if it takes a few hours.
>> 
>> Right now, it is running about 1.7 Mdoc/hour, so roughly 10 hours. That is
>> 16 threads searching id:0* through id:f*, fetching 1000 rows each time,
>> using cursorMark and distributed search. Median response time is 10 s. CPU
>> usage is about 1%.
>> 
>> It is all pretty grubby and it seems like there could be a better way.
>> 
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Feb 10, 2020, at 3:39 PM, Erick Erickson <er...@gmail.com>
>> wrote:
>>> 
>>> Any field that’s unique per doc would do, but yeah, that’s usually an ID.
>>> 
>>> Hmmm, I don’t see why separate queries for 0-f are necessary if you’re
>> firing
>>> at individual replicas. Each replica should have multiple UUIDs that
>> start with 0-f.
>>> 
>>> Unless I misunderstand and you’re just firing off, say, 16 threads at
>> the entire
>>> collection rather than individual shards which would work too. But for
>> individual
>>> shards I think you need to look for all possible IDs...
>>> 
>>> Erick
>>> 
>>>> On Feb 10, 2020, at 5:37 PM, Walter Underwood <wu...@wunderwood.org>
>> wrote:
>>>> 
>>>> 
>>>>> On Feb 10, 2020, at 2:24 PM, Walter Underwood <wu...@wunderwood.org>
>> wrote:
>>>>> 
>>>>> Not sure if range queries work on a UUID field, ...
>>>> 
>>>> A search for id:0* took 260 ms, so it looks like they work just fine.
>> I’ll try separate queries for 0-f.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wunder@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>> 
>>> 
>> 
>> 


Re: cursorMark and shards? (6.6.2)

Posted by Tim Casey <tc...@gmail.com>.
Walter,

When you do the query, what is the sort of the results?

tim

On Mon, Feb 10, 2020 at 8:44 PM Walter Underwood <wu...@wunderwood.org>
wrote:

> I’ll back up a bit, since it is sort of an X/Y problem.
>
> I have an index with four shards and 17 million documents. I want to dump
> all the docs in JSON, label each one with a classifier, then load them back
> in with the labels. This is a one-time (or rare) bootstrap of the
> classified data. This will unblock testing and relevance work while we get
> the classifier hooked into the indexing pipeline.
>
> Because I’m dumping all the fields, we can’t rely on docValues.
>
> It is OK if it takes a few hours.
>
> Right now, it is running about 1.7 Mdoc/hour, so roughly 10 hours. That is
> 16 threads searching id:0* through id:f*, fetching 1000 rows each time,
> using cursorMark and distributed search. Median response time is 10 s. CPU
> usage is about 1%.
>
> It is all pretty grubby and it seems like there could be a better way.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Feb 10, 2020, at 3:39 PM, Erick Erickson <er...@gmail.com>
> wrote:
> >
> > Any field that’s unique per doc would do, but yeah, that’s usually an ID.
> >
> > Hmmm, I don’t see why separate queries for 0-f are necessary if you’re
> firing
> > at individual replicas. Each replica should have multiple UUIDs that
> start with 0-f.
> >
> > Unless I misunderstand and you’re just firing off, say, 16 threads at
> the entire
> > collection rather than individual shards which would work too. But for
> individual
> > shards I think you need to look for all possible IDs...
> >
> > Erick
> >
> >> On Feb 10, 2020, at 5:37 PM, Walter Underwood <wu...@wunderwood.org>
> wrote:
> >>
> >>
> >>> On Feb 10, 2020, at 2:24 PM, Walter Underwood <wu...@wunderwood.org>
> wrote:
> >>>
> >>> Not sure if range queries work on a UUID field, ...
> >>
> >> A search for id:0* took 260 ms, so it looks like they work just fine.
> I’ll try separate queries for 0-f.
> >>
> >> wunder
> >> Walter Underwood
> >> wunder@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >
>
>

Re: cursorMark and shards? (6.6.2)

Posted by Walter Underwood <wu...@wunderwood.org>.
I’ll back up a bit, since it is sort of an X/Y problem.

I have an index with four shards and 17 million documents. I want to dump all the docs in JSON, label each one with a classifier, then load them back in with the labels. This is a one-time (or rare) bootstrap of the classified data. This will unblock testing and relevance work while we get the classifier hooked into the indexing pipeline.

Because I’m dumping all the fields, we can’t rely on docValues.

It is OK if it takes a few hours.

Right now, it is running about 1.7 Mdoc/hour, so roughly 10 hours. That is 16 threads searching id:0* through id:f*, fetching 1000 rows each time, using cursorMark and distributed search. Median response time is 10 s. CPU usage is about 1%.

It is all pretty grubby and it seems like there could be a better way.
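
The 16 queries are just prefix queries over the leading hex digit of the UUID, roughly:

    # one query stream per leading hex digit of the UUID id field
    queries = ["id:%s*" % c for c in "0123456789abcdef"]
    # each thread pages through its query with sort=id asc, rows=1000, and cursorMark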

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 10, 2020, at 3:39 PM, Erick Erickson <er...@gmail.com> wrote:
> 
> Any field that’s unique per doc would do, but yeah, that’s usually an ID.
> 
> Hmmm, I don’t see why separate queries for 0-f are necessary if you’re firing
> at individual replicas. Each replica should have multiple UUIDs that start with 0-f.
> 
> Unless I misunderstand and you’re just firing off, say, 16 threads at the entire
> collection rather than individual shards which would work too. But for individual
> shards I think you need to look for all possible IDs...
> 
> Erick
> 
>> On Feb 10, 2020, at 5:37 PM, Walter Underwood <wu...@wunderwood.org> wrote:
>> 
>> 
>>> On Feb 10, 2020, at 2:24 PM, Walter Underwood <wu...@wunderwood.org> wrote:
>>> 
>>> Not sure if range queries work on a UUID field, ...
>> 
>> A search for id:0* took 260 ms, so it looks like they work just fine. I’ll try separate queries for 0-f. 
>> 
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
> 


Re: cursorMark and shards? (6.6.2)

Posted by Michael Gibney <mi...@michaelgibney.net>.
Possibly worth mentioning, although it might not be appropriate for
your use case: if the fields you're interested in are configured with
docValues, you could use streaming expressions (or directly handle
thread-per-shard connections to the /export handler) and get
everything in a single shot without paging of any kind. (I'm actually
working on something of this nature now; though not quite ready for
prime time, it's reliably exporting 68 million records to a 24G
compressed zip archive in 23 minutes -- 24 shards).
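
For reference, a bare-bones per-core pull from the /export handler might look something like this (the core and field names are placeholders; /export needs the sort field and everything in fl to have docValues):

    import requests

    # hit one core directly so the export stays on that shard; the core name is made up
    url = "http://host:8983/solr/collection_shard1_replica1/export"
    params = {
        "q": "*:*",
        "sort": "id asc",
        "fl": "id,some_dv_field",   # every field listed here must have docValues
        "wt": "json",
    }

    # /export streams the entire result set in one response, so read it incrementally
    with requests.get(url, params=params, stream=True) as resp:
        with open("shard1.json", "wb") as out:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                out.write(chunk)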

On Mon, Feb 10, 2020 at 6:39 PM Erick Erickson <er...@gmail.com> wrote:
>
> Any field that’s unique per doc would do, but yeah, that’s usually an ID.
>
> Hmmm, I don’t see why separate queries for 0-f are necessary if you’re firing
> at individual replicas. Each replica should have multiple UUIDs that start with 0-f.
>
> Unless I misunderstand and you’re just firing off, say, 16 threads at the entire
> collection rather than individual shards which would work too. But for individual
> shards I think you need to look for all possible IDs...
>
> Erick
>
> > On Feb 10, 2020, at 5:37 PM, Walter Underwood <wu...@wunderwood.org> wrote:
> >
> >
> >> On Feb 10, 2020, at 2:24 PM, Walter Underwood <wu...@wunderwood.org> wrote:
> >>
> >> Not sure if range queries work on a UUID field, ...
> >
> > A search for id:0* took 260 ms, so it looks like they work just fine. I’ll try separate queries for 0-f.
> >
> > wunder
> > Walter Underwood
> > wunder@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
>

Re: cursorMark and shards? (6.6.2)

Posted by Erick Erickson <er...@gmail.com>.
Any field that’s unique per doc would do, but yeah, that’s usually an ID.

Hmmm, I don’t see why separate queries for 0-f are necessary if you’re firing
at individual replicas. Each replica should have multiple UUIDs that start with 0-f.

Unless I misunderstand and you’re just firing off, say, 16 threads at the entire
collection rather than individual shards which would work too. But for individual
shards I think you need to look for all possible IDs...

Erick

> On Feb 10, 2020, at 5:37 PM, Walter Underwood <wu...@wunderwood.org> wrote:
> 
> 
>> On Feb 10, 2020, at 2:24 PM, Walter Underwood <wu...@wunderwood.org> wrote:
>> 
>> Not sure if range queries work on a UUID field, ...
> 
> A search for id:0* took 260 ms, so it looks like they work just fine. I’ll try separate queries for 0-f. 
> 
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 


Re: cursorMark and shards? (6.6.2)

Posted by Walter Underwood <wu...@wunderwood.org>.
> On Feb 10, 2020, at 2:24 PM, Walter Underwood <wu...@wunderwood.org> wrote:
> 
> Not sure if range queries work on a UUID field, ...

A search for id:0* took 260 ms, so it looks like they work just fine. I’ll try separate queries for 0-f. 

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


Re: cursorMark and shards? (6.6.2)

Posted by Walter Underwood <wu...@wunderwood.org>.
I’ll give that a shot.

Not sure if range queries work on a UUID field, but I have thought of segmenting the ID space and running parallel queries on those.

Right now it is sucking over 1.6 million docs per hour, so that is bearable. Making it 4X or 16 X faster would be nice, though.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 10, 2020, at 2:19 PM, Erick Erickson <er...@gmail.com> wrote:
> 
> Not sure whether cursormark respects distrib=false, although I can easily see there being “complications” here.
> 
> Hmmm, whenever I try to use distrib=false, I usually fire the query at the specific replica rather than use the shards parameter. IDK whether that’ll make any difference.
> 
> https://node:port/solr/collection1_shard1_replica1/query?distrib=false…….
> 
> You could also make it simpler...
> 
> q=id:{last_id_from_last_packet TO *]&rows=some_reasonable_number&sort=id asc&distrib=false
> 
> That doesn’t pay the penalty of having a huge start param.
> 
> Again, I’d use the specific replica rather than shards parameter, but just because I’ve never tried it with shards….
> 
> Best,
> Erick
> 
>> On Feb 10, 2020, at 1:30 PM, Walter Underwood <wu...@wunderwood.org> wrote:
>> 
>> I tried to get fancy and dump our content with one thread per shard, but it did distributed search anyway. I specified the shard using the “shards” param and set distrib=false.
>> 
>> Is this a bug or expected behavior in 6.6.2? I did not see it mentioned in the docs.
>> 
>> It is working fine with a single thread and distributed search. Should have followed the old Kernighan and Plauger rule, “Make it right before you make it faster.”
>> 
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
> 


Re: cursorMark and shards? (6.6.2)

Posted by Erick Erickson <er...@gmail.com>.
Not sure whether cursormark respects distrib=false, although I can easily see there being “complications” here.

Hmmm, whenever I try to use distrib=false, I usually fire the query at the specific replica rather than use the shards parameter. IDK whether that’ll make any difference.

https://node:port/solr/collection1_shard1_replica1/query?distrib=false…….

You could also make it simpler...

q=id:{last_id_from_last_packet TO *]&rows=some_reasonable_number&sort=id asc&distrib=false

That doesn’t pay the penalty of having a huge start param.
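
A minimal sketch of that loop against a single replica (the core name is a placeholder):

    import requests

    CORE = "http://node:8983/solr/collection1_shard1_replica1/select"   # placeholder

    last_id = None
    while True:
        # exclusive lower bound on id, as in the range query above
        q = "id:[* TO *]" if last_id is None else "id:{%s TO *]" % last_id
        resp = requests.get(CORE, params={
            "q": q,
            "sort": "id asc",
            "rows": 1000,
            "distrib": "false",
            "wt": "json",
        }).json()
        docs = resp["response"]["docs"]
        if not docs:
            break
        # ... process docs here ...
        last_id = docs[-1]["id"]   # carry the last id into the next request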

Again, I’d use the specific replica rather than shards parameter, but just because I’ve never tried it with shards….

Best,
Erick

> On Feb 10, 2020, at 1:30 PM, Walter Underwood <wu...@wunderwood.org> wrote:
> 
> I tried to get fancy and dump our content with one thread per shard, but it did distributed search anyway. I specified the shard using the “shards” param and set distrib=false.
> 
> Is this a bug or expected behavior in 6.6.2? I did not see it mentioned in the docs.
> 
> It is working fine with a single thread and distributed search. Should have followed the old Kernighan and Plauger rule, “Make it right before you make it faster.”
> 
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>