Posted to user@ignite.apache.org by msuh <ms...@jobcase.com> on 2019/01/24 19:03:35 UTC

QueryCursor checkpoint

Hello,

Our production cluster will eventually hold many billions of entities
across many caches, and we have use cases where we need to run a ScanQuery
over an entire cache to update certain fields.

We expect failures in the middle of a single ScanQuery because of the sheer
size of the caches. Since we don't want to rerun the ScanQuery from the
start, we're wondering whether we can keep a checkpoint of how far we've
processed in the QueryCursor. The QueryCursor API doesn't seem to offer any
methods for that, but I may not be looking in the right place. Is there any
other efficient way to keep rough track of how far we've processed? If
QueryCursor doesn't expose anything, would the partition number be the best
option?

But from what I've seen, entities seem to shift between partitions
(because of rebalancing or something?), so I'm not sure that's even possible.


Re: QueryCursor checkpoint

Posted by Ilya Kasnacheev <il...@gmail.com>.

Hello!

You can indeed scan on a per-partition basis. Data will not move between
partitions unless the key is changed or the affinity function is modified.
Note that a partition is normally stored on a single node, but it might be
moved to another node in case of a rebalance.

You can probably read more here:
https://cwiki.apache.org/confluence/display/IGNITE/%28Partition+Map%29+Exchange+-+under+the+hood

Regards,
-- 
Ilya Kasnacheev
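
A minimal sketch of the per-partition checkpoint idea from the reply above,
kept free of Ignite dependencies so it runs on its own. The class name
PartitionCheckpointRunner and the in-memory set are hypothetical; in a real
job the completed set would have to be persisted somewhere durable, and the
scanPartition callback would presumably run a ScanQuery restricted with
setPartition(p), with the partition count taken from
ignite.affinity(cacheName).partitions(). A partition that fails halfway is
rescanned from its beginning, so this assumes the field update is safe to
apply twice.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.function.IntConsumer;

/** Runs a per-partition job and remembers which partitions have finished. */
final class PartitionCheckpointRunner {
    private final int partitionCount;

    // Completed partition ids. Assumption: held in memory only for this
    // sketch; a real job would store them durably between attempts.
    private final Set<Integer> completed = new TreeSet<>();

    PartitionCheckpointRunner(int partitionCount) {
        this.partitionCount = partitionCount;
    }

    /**
     * Processes every partition not yet marked complete. If scanPartition
     * throws, the exception propagates and the failed partition stays
     * unmarked, so the next call to run() retries it.
     */
    void run(IntConsumer scanPartition) {
        for (int p = 0; p < partitionCount; p++) {
            if (completed.contains(p)) {
                continue; // finished in an earlier attempt
            }
            scanPartition.accept(p);
            completed.add(p); // checkpoint only after the whole partition succeeded
        }
    }

    /** Returns a copy of the completed partition ids. */
    Set<Integer> completedPartitions() {
        return new TreeSet<>(completed);
    }
}
```

Calling run() again after a failure skips the partitions already recorded
and resumes at the one that failed, so the checkpoint granularity is one
partition rather than one entry.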



Re: QueryCursor checkpoint

Posted by Karun Chand <bu...@gmail.com>.

Hi,

You are right that the available options leave little flexibility for
approaching this problem.

How have you partitioned your data? You can probably keep a local variable
specific to each node or compute thread (or maybe use an Atomic Type -
https://apacheignite.readme.io/docs/atomic-types - although that may reduce
performance) that keeps track of the keys it has already processed.
Within your scan query filter, do a contains operation on this set of
already processed keys. The next time your scan query throws an exception,
you will already know which keys have been processed.

HTH,
RH
https://www.apacheignitetutorial.com/
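
A rough sketch of the processed-keys idea from the reply above, again in
plain Java so it runs without Ignite. ProcessedKeyTracker is a hypothetical
name, and java.util.function.BiPredicate stands in for the filter type; in
Ignite the ScanQuery filter is an IgniteBiPredicate evaluated on the nodes
that hold the data, so this assumes the set is reachable from wherever the
filter runs. The set grows with the number of processed keys, which matters
at the scale of billions of entries.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiPredicate;

/** Tracks processed keys and builds a filter that skips them. */
final class ProcessedKeyTracker<K> {
    // Thread-safe set so several compute threads can record keys.
    private final Set<K> processed = ConcurrentHashMap.newKeySet();

    /** Records a key once its entry has been updated. */
    void markProcessed(K key) {
        processed.add(key);
    }

    /** Filter that accepts only entries whose key has not been recorded. */
    <V> BiPredicate<K, V> unprocessedFilter() {
        return (key, value) -> !processed.contains(key);
    }

    /** Number of keys recorded so far. */
    int processedCount() {
        return processed.size();
    }
}
```

After a failure, rerunning the scan with unprocessedFilter() visits only the
entries that were not recorded before the exception.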
