Posted to user@ignite.apache.org by "Sobolewski, Krzysztof" <Kr...@gs.com> on 2018/11/23 16:25:26 UTC

Continuous queries and duplicates

Hi,

I want to use a ContinuousQuery, and there is a slight issue with how it transitions from the initial query to the notification phase. It turns out that if entries are added to the cache while the continuous query is running, an entry may be reported twice - once by the initial query and once by the listener. This is confirmed by experiment, BTW :) The initial query in this case is an SqlQuery.

So my question is: is this intentional? Or is it a bug? Is there something I can do to mitigate this? Is this an issue of isolation level?

Thanks a lot for any pointers :)
-KS
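
A minimal sketch of the scenario being described, for reference. The cache
name, the Person type and its indexed age field are assumptions made for
illustration, not taken from this thread; the API calls (ContinuousQuery,
setInitialQuery, setLocalListener) are standard Ignite.

    import javax.cache.Cache;
    import javax.cache.event.CacheEntryEvent;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.query.ContinuousQuery;
    import org.apache.ignite.cache.query.QueryCursor;
    import org.apache.ignite.cache.query.SqlQuery;
    import org.apache.ignite.cache.query.annotations.QuerySqlField;
    import org.apache.ignite.configuration.CacheConfiguration;

    public class ContinuousQueryDuplicateDemo {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                IgniteCache<Long, Person> cache = ignite.getOrCreateCache(
                    new CacheConfiguration<Long, Person>("persons")
                        .setIndexedTypes(Long.class, Person.class));

                ContinuousQuery<Long, Person> qry = new ContinuousQuery<>();

                // Initial query: executed once when the continuous query is started.
                qry.setInitialQuery(
                    new SqlQuery<Long, Person>(Person.class, "age > ?").setArgs(30));

                // Local listener: receives updates that happen after registration.
                qry.setLocalListener(evts -> {
                    for (CacheEntryEvent<? extends Long, ? extends Person> e : evts)
                        System.out.println("listener: " + e.getKey());
                });

                try (QueryCursor<Cache.Entry<Long, Person>> cur = cache.query(qry)) {
                    // Entries added concurrently with this iteration can show up both
                    // here and in the listener above - the duplication discussed here.
                    for (Cache.Entry<Long, Person> e : cur)
                        System.out.println("initial : " + e.getKey());

                    // In real code the cursor stays open for as long as updates are
                    // needed; closing it cancels the continuous query.
                }
            }
        }
    }

    /** Illustrative value type (an assumption, not taken from this thread). */
    class Person {
        @QuerySqlField(index = true)
        public int age;

        public Person(int age) {
            this.age = age;
        }
    }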


Re: Continuous queries and duplicates

Posted by Piotr Romański <pi...@gmail.com>.
Ok, thanks.
Moved the discussion here:
http://apache-ignite-developers.2346864.n4.nabble.com/Continuous-queries-and-duplicates-td39444.html


Re: Continuous queries and duplicates

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

You could post this message to the developers list in a separate thread and
see whether it gets any discussion.

Regards,
-- 
Ilya Kasnacheev



Re: Continuous queries and duplicates

Posted by "piotr.romanski" <pi...@gmail.com>.
Hi all, I think Krzysztof raised a valid concern. In my opinion, manual
deduplication may in some use cases lead to memory problems on the client
side. In order to remove duplicate notifications received in the local
listener, we need to keep all initial query results in memory, or at least
their unique ids (a rough sketch of this bookkeeping is appended below).
Unfortunately, there is no way (is there?) to find a point in time after
which we can be sure that no duplicates will arrive. That means we would
need to keep that data indefinitely and consult it every time a new
notification arrives. With multiple continuous queries running in a single
JVM, this might eventually become a memory or performance problem. I can see
the following possible improvements to Ignite:

1. The deduplication between the initial query and incoming notifications
could be done entirely inside Ignite. As far as I know, there is already an
updateCounter and a partition id for every entry, so they could be used
internally.

2. Add a guarantee that notifications arriving in the local listener after
the query() method returns are not duplicates. This would require specific
synchronization inside Ignite. It would also mean that query() cannot return
before all potential duplicates have been processed by the local listener,
which looks wrong.

3. Notify users that, starting from a given notification, they will not
receive any more duplicates. This could be an additional boolean flag on
CacheQueryEntryEvent.

4. CacheQueryEntryEvent already exposes the partitionUpdateCounter.
Unfortunately, we don't have this information for initial query results. If
we did, a client could manually deduplicate notifications and discard the
initial query results for a given partition once newer notifications arrive.
It would also be convenient to expose the partition id, but for now we can
work it out using the affinity service. The assumption here is that
notifications are ordered by partitionUpdateCounter (is that true?).

Please correct me if I'm missing anything.

What do you think?
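
A rough sketch of the bookkeeping described above. The class and its method
names are hypothetical, not an Ignite API: it remembers everything the
initial query returned and suppresses a later notification that carries the
same key and value. It also illustrates the memory concern raised above:
keys that are never updated again stay in the map for as long as the
continuous query runs.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import javax.cache.event.CacheEntryEvent;

    /** Hypothetical client-side filter for initial-query duplicates. */
    class InitialResultFilter<K, V> {
        /** Snapshot of what the initial query returned; it shrinks but never expires. */
        private final Map<K, V> initial = new ConcurrentHashMap<>();

        /** Call for every entry returned by the initial query cursor. */
        void rememberInitial(K key, V val) {
            initial.put(key, val);
        }

        /**
         * Call for every event delivered to the local listener.
         * @return true if the event carries new information, false if it merely
         *         repeats an entry already seen in the initial query results.
         */
        boolean accept(CacheEntryEvent<? extends K, ? extends V> evt) {
            V seen = initial.remove(evt.getKey());

            // Heuristic: an identical value for a remembered key is treated as the
            // same update that the initial query already reported.
            return seen == null || !seen.equals(evt.getValue());
        }
    }

Usage would be to feed rememberInitial() while iterating the cursor returned
by cache.query(qry) and to forward only those listener events for which
accept() returns true. Note that this still assumes the initial results are
registered before the corresponding listener events are processed, which is
exactly the kind of synchronization question discussed in this thread.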




Re: Continuous queries and duplicates

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

I have discussed this issue, and it looks like you should expect duplicate
entries: you can see some entries in the ScanQuery results and then get them
again later in the ContinuousQuery callback.

So you have to be ready for this.

Regards,
-- 
Ilya Kasnacheev
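
One way of being ready for it (an illustration, not something proposed in
this thread) is to make the consuming side idempotent: treat both the
initial query results and the listener notifications as upserts into a local
key-to-value view, so that applying the same entry twice is harmless. A
minimal sketch:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /** Hypothetical local view that tolerates re-delivered entries. */
    class LocalView<K, V> {
        private final Map<K, V> view = new ConcurrentHashMap<>();

        /** Call from both the initial query cursor loop and the local listener. */
        void apply(K key, V val) {
            view.put(key, val); // idempotent: a duplicate just rewrites the same value
        }

        V get(K key) {
            return view.get(key);
        }
    }

This only stays correct if the initial query cannot deliver an older value
for a key after a newer notification has already been applied; as discussed
elsewhere in this thread, that ordering guarantee is part of the open
question.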



RE: Continuous queries and duplicates

Posted by "Sobolewski, Krzysztof" <Kr...@gs.com>.
I will take a look, thanks!

But, upon further investigation, it appears that there is no isolation whatsoever between the initial query and the listener in ContinuousQuery. The initial query can pick up entries that were added after it started and that have already been sent to the local listener. That way an entry is reported twice, hence the duplicates. This happens regardless of the type of the initial query (ScanQuery or SqlQuery). It reduces the usefulness of ContinuousQuery a lot, because we have to find a way to rule out these duplicates, which is difficult and can incur significant overhead.

So my follow-up question is this: is it behaving as designed, or is there some mechanism to prevent these duplicates from happening?
-KS



Re: Continuous queries and duplicates

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

There should be isolation in Apache Ignite 2.7 as an experimental feature.

Regards,
-- 
Ilya Kasnacheev



RE: Continuous queries and duplicates

Posted by "Sobolewski, Krzysztof" <Kr...@gs.com>.
Thanks. This is a little disappointing. ScanQuery would probably work, but it’s not as efficient (can’t use indexes etc.). Are there any plans to enable isolation on SqlQuery?
-KS


Re: Continuous queries and duplicates

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

SQL queries currently have no isolation, so it is not possible to avoid the
problem you described. You could try switching to ScanQuery and see if it
helps, or learn to deal with duplicates.

Regards,
-- 
Ilya Kasnacheev
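
For illustration, the suggested switch of the initial query, reusing the
hypothetical Person type and the "age > 30" predicate from the sketch near
the top of this thread (both are assumptions, not part of this discussion):

    import org.apache.ignite.cache.query.ContinuousQuery;
    import org.apache.ignite.cache.query.ScanQuery;
    import org.apache.ignite.lang.IgniteBiPredicate;

    class ScanInitialQueryExample {
        static ContinuousQuery<Long, Person> buildQuery() {
            ContinuousQuery<Long, Person> qry = new ContinuousQuery<>();

            // Same predicate as the SqlQuery "age > ?" with argument 30,
            // expressed as a scan filter instead.
            IgniteBiPredicate<Long, Person> olderThan30 = (key, p) -> p.age > 30;

            qry.setInitialQuery(new ScanQuery<>(olderThan30));

            // A local listener still has to be set before cache.query(qry) is called.
            return qry;
        }
    }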

