Posted to solr-user@lucene.apache.org by Markus Jelsma <ma...@openindex.io> on 2019/06/12 21:59:29 UTC

CursorMark, batch size/speed

Hello,

One of our collections hates CursorMark, it really does. When under very heavy load, the nodes can occasionally consume GBs of additional heap for no clear reason, immediately after downloading the entire corpus.

Although the additional heap consumption is a separate problem that I hope someone can shed some light on, there is another strange behaviour I would like to see explained.

Under little load and with a batch size of just a few hundred, the download speed creeps along at no more than 150 docs/s. But when I increase the batch size to absurd numbers such as 20k, the speed jumps to 2.5k docs/s, cutting the total time from days to just a few hours.
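
For reference, deep paging with cursorMark follows a standard loop; here is a
minimal SolrJ sketch, in which the URL, collection name ("logs"), sort field,
and batch size are illustrative assumptions rather than details from the post:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.CursorMarkParams;

    public class CursorExport {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/logs").build()) {
                SolrQuery q = new SolrQuery("*:*");
                q.setRows(20000); // the batch size discussed above
                // cursorMark requires a sort that includes the uniqueKey field
                q.setSort("id", SolrQuery.ORDER.asc);
                String cursorMark = CursorMarkParams.CURSOR_MARK_START; // "*"
                boolean done = false;
                while (!done) {
                    q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
                    QueryResponse rsp = client.query(q);
                    rsp.getResults().forEach(doc -> { /* process each doc */ });
                    String next = rsp.getNextCursorMark();
                    done = cursorMark.equals(next); // unchanged mark: export done
                    cursorMark = next;
                }
            }
        }
    }

Each iteration issues one request, so the batch size (rows) directly determines
how many requests a full export needs.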

We really only see the heap and speed differences with one big collection of millions of small documents. They are just query, click and view logs with additional metadata fields such as time, digests, ranks, dates, uids, view time, etc.

Can someone here shed some light on these vague subjects?

Many thanks,
Markus

Re: CursorMark, batch size/speed

Posted by Erick Erickson <er...@gmail.com>.
If there’s any chance of using Streaming for this rather than
re-querying the data using CursorMark, it would solve
a lot of these issues.
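
To make that suggestion concrete, here is a sketch of such a streaming
expression; the collection and field names are illustrative assumptions, and
note that the /export handler requires docValues on every field it returns or
sorts on:

    search(logs,
           q="*:*",
           fl="id,time,rank,uid",
           sort="id asc",
           qt="/export")

Sent as the expr parameter to the collection's /stream endpoint, this streams
the full sorted result set in a single pass instead of re-walking the results
for every page of a cursorMark export.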

> On Jun 12, 2019, at 3:26 PM, Mikhail Khludnev <mk...@apache.org> wrote:
> 
> Every cursorMark request goes through the full result set; the results
> before the cursor just bypass the scoring heap. So reducing the number of
> such requests should reduce the wall-clock time of exporting all results
> accordingly.


Re: CursorMark, batch size/speed

Posted by Mikhail Khludnev <mk...@apache.org>.
Every cursorMark request goes through the full result set; the results
before the cursor just bypass the scoring heap. So reducing the number of
such requests should reduce the wall-clock time of exporting all results
accordingly.
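
As a rough illustration, assuming a corpus of ten million documents: at
rows=200 a full export needs 10,000,000 / 200 = 50,000 cursor requests, each
walking the full result set again, while at rows=20,000 it needs only 500,
which is consistent with the order-of-magnitude speedup reported above.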

-- 
Sincerely yours
Mikhail Khludnev