Posted to dev@accumulo.apache.org by z11373 <z1...@outlook.com> on 2015/10/07 21:55:22 UTC

Re: another question on summing combiner

Revisiting this topic: if I go with option #2, i.e. having a batch job fix
the stats table, I am no longer sure it will work. The stats table already
has the summing combiner enabled, so the batch job can't just write the
correct value; the combiner would add it to the existing value and the
result would be incorrect.
For example:

Current stats table contains:
foo     | 2
bar     | 3
test    | 1

The batch job scans the main table and updates the stats table. Let's say
the actual stats are foo=1, bar=4, test=1; the final stats table would then
become:
foo     | 3
bar     | 7
test    | 2
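
The doubling above can be reproduced with a tiny model of a summing
combiner. This is a plain-Python sketch, not the Accumulo API:
`SummingTable` and its `put` method are hypothetical names used only for
illustration, assuming the combiner adds every written value to whatever is
already stored for the key.

```python
from collections import defaultdict


class SummingTable:
    """Toy stand-in for a table with a summing combiner (hypothetical)."""

    def __init__(self, initial=None):
        self.values = defaultdict(int)
        for key, value in (initial or {}).items():
            self.values[key] = value

    def put(self, key, value):
        # A summing combiner adds the new value to the stored one
        # instead of overwriting it.
        self.values[key] += value


stats = SummingTable({"foo": 2, "bar": 3, "test": 1})
actual = {"foo": 1, "bar": 4, "test": 1}  # counts found by scanning the main table

# Naive batch job: write the actual counts straight into the stats table.
for key, count in actual.items():
    stats.put(key, count)

print(dict(stats.values))  # {'foo': 3, 'bar': 7, 'test': 2}
```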

It would be correct if we removed the summing combiner from the table, but
then another process (not the batch job) may update a particular key,
overwriting the correct value written by the batch job. We can't tolerate
taking the system offline; otherwise we could refresh the stats during that
downtime. Any ideas on how to solve this problem?

Unfortunately there is an inherent problem with how we use the summing
combiner: when the same key is added to the main table again, the write
behaves like an 'update' because the key already exists, but my current
logic still adds <key>|1 to the stats table. So if we have many 'updates',
some values in the stats table will be far off. Deleting is similar: it is
a no-op for the main table if the key doesn't exist, but the app logic
still adds <key>|-1 to the stats table. This is the reason why we're
thinking of having a batch job to 'fix' the stats table, but that also has
its own problem :-(
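
A minimal sketch of how this drift builds up, assuming the application
writes +1 or -1 to the stats table without first checking the main table.
The names `blind_add` and `blind_delete` are hypothetical, and the two
tables are modeled as a plain Python set and dict rather than real Accumulo
tables.

```python
main = set()   # keys present in the main table
stats = {}     # key -> summed count, as a summing combiner would hold it


def blind_add(key):
    main.add(key)                        # acts as an update if the key exists
    stats[key] = stats.get(key, 0) + 1   # +1 is written regardless


def blind_delete(key):
    main.discard(key)                    # no-op if the key is absent
    stats[key] = stats.get(key, 0) - 1   # -1 is written regardless


blind_add("foo")
blind_add("foo")      # really an update, but counted as a second insert
blind_delete("bar")   # bar never existed, but is still counted as a delete

print(sorted(main))   # ['foo']
print(stats)          # {'foo': 2, 'bar': -1}
```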


Thanks,
Z






--
View this message in context: http://apache-accumulo.1065345.n5.nabble.com/another-question-on-summing-combiner-tp15238p15351.html
Sent from the Developers mailing list archive at Nabble.com.

Re: another question on summing combiner

Posted by Dylan Hutchison <dh...@uw.edu>.
Hi Z,

The batch method you and Josh worked out at first will work.  I was
illustrating another method which uses conditional writes/deletes as an
alternative.  It's hard to say which performs better without knowing your
workload specifics.

Applying the conditional write method to your scenario, the client that
"retries" the delete-then-add operation would not write the "-1" to the
stats table during the delete phase, because the keys are already deleted
in the main table.  The conditional mutation is rejected, so the client
never makes the erroneous "retried" -1 write.  The client then resumes with
the insert phase, writing back to the main table and writing a "+1" to the
stats table as intended.

Cheers, Dylan
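
The retry described in this reply can be traced with a small model. This is
a sketch under stated assumptions, not Accumulo code: the starting state is
the one from the scenario in this thread right after the failed update
(only A left in the main table), `conditional_delete` and `conditional_add`
are hypothetical helpers, and the model is single-threaded, so it ignores
any gap between the main-table write and the stats write.

```python
main = {"A"}                              # B and D were already deleted
stats = {"A": 1, "B": 0, "C": 0, "D": 0}  # state after the failed request


def conditional_delete(key):
    """Delete only if the key exists; report whether the delete happened."""
    if key in main:
        main.remove(key)
        stats[key] = stats.get(key, 0) - 1
        return True
    return False  # condition rejected: no -1 is written


def conditional_add(key):
    """Insert only if the key is absent; report whether the insert happened."""
    if key not in main:
        main.add(key)
        stats[key] = stats.get(key, 0) + 1
        return True
    return False  # condition rejected: no +1 is written


# Retry of the whole "update B and D" request (delete, then insert).
for key in ("B", "D"):
    conditional_delete(key)
for key in ("B", "D"):
    conditional_add(key)

print(sorted(main))  # ['A', 'B', 'D']
print(stats)         # {'A': 1, 'B': 1, 'C': 0, 'D': 1}
```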

On Fri, Oct 23, 2015 at 1:13 PM, z11373 <z1...@outlook.com> wrote:


Re: another question on summing combiner

Posted by z11373 <z1...@outlook.com>.
Hi Dylan,
Right now we don't perform a check (read) before performing an update.
Below is a simple scenario.

The Main table is initially empty, then the client sends a request which
translates to inserting the data, i.e.
Main table:
A
B
C
D

Stats table:
A 1
B 1
C 1
D 1

Let's say the next request is to delete C.
Main table:
A
B
D

Stats table:
A 1
B 1
C 0 (1 + -1)
D 1

The next request is to update B and D (the request gets translated to
delete B and D, then insert B and D), but let's say it somehow fails
between the delete and insert operations, so the tables look like:
Main table:
A

Stats table:
A 1
B 0
C 0
D 0

The client is fault-tolerant and retries the entire request, so now the
tables look like:
Main table:
A
B
D

Stats table:
A 1
B 0 (-1 + 1)
C 0
D 0 (-1 + 1)


As you can see above, the end state of the Main table is correct, because
the retry performs the 'update', but the Stats table is not: B and D exist
in the Main table, yet their counts are 0.
The idea I mentioned last time was to have a batch job that scans the whole
Main table to get the 'truth' data and updates the Stats table accordingly.
But in order to update 'accordingly', it first has to read the current
value in the Stats table (due to the combiner), which affects performance.
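
The whole scenario can be replayed with a few lines of plain Python. This
is only a model, assuming every insert writes +1 and every delete writes -1
to a summed stats value without checking the Main table first; `insert` and
`delete` are hypothetical helpers, not Accumulo calls.

```python
main = set()
stats = {}


def insert(key):
    main.add(key)
    stats[key] = stats.get(key, 0) + 1


def delete(key):
    main.discard(key)
    stats[key] = stats.get(key, 0) - 1


# First request inserts A, B, C, D; second request deletes C.
for key in "ABCD":
    insert(key)
delete("C")

# Update of B and D = delete both, then insert both.
# The request fails after the deletes.
delete("B")
delete("D")

# The client retries the entire request.
for key in ("B", "D"):
    delete(key)
for key in ("B", "D"):
    insert(key)

print(sorted(main))  # ['A', 'B', 'D']
print(stats)         # {'A': 1, 'B': 0, 'C': 0, 'D': 0}
```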


Thanks,
Z






Re: another question on summing combiner

Posted by Dylan Hutchison <dh...@uw.edu>.
Hi Z,

It seems you have a fairly common use case: performing an update if and
only if a certain row does or does not exist.  Here's another option you
could try, and if it works (or doesn't work), please let us know!  Of
course, if you're comfortable with the batch solution, that is fine too.

*Adding an item x*
Conditionally write x to your main table, asserting that x does not exist.
+ If x is written (indicating that x did not previously exist in the main
table), then write a 1 to the stats table unconditionally.  ** Use of bloom
filters on the main table can speed this path up.  It sounds like this is
the more common path for you.
+ If x fails to write (indicating that x already existed in the main
table), then do not write to the stats table.

*Deleting an item x*
Conditionally delete x from your main table, asserting that x does exist.
+ If x is deleted, then write a -1 to your stats table unconditionally.
+ If the delete is rejected (indicating that x does not exist in the main
table), then do not write to your stats table. ** bloom filters may speed
this path up
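
The two procedures above can be sketched as follows. This is a plain-Python
model rather than Accumulo client code: `CountedStore`, `add`, and `delete`
are hypothetical names, the membership test stands in for the condition on
the main-table write, and the returned boolean stands in for the
accepted/rejected status of that conditional write.

```python
class CountedStore:
    """Toy model: main table as a set, stats table as summed counts."""

    def __init__(self):
        self.main = set()
        self.stats = {}

    def add(self, x):
        # Conditional write: succeeds only if x does not exist yet.
        if x in self.main:
            return False                           # rejected, stats untouched
        self.main.add(x)
        self.stats[x] = self.stats.get(x, 0) + 1   # unconditional +1
        return True

    def delete(self, x):
        # Conditional delete: succeeds only if x exists.
        if x not in self.main:
            return False                           # rejected, stats untouched
        self.main.remove(x)
        self.stats[x] = self.stats.get(x, 0) - 1   # unconditional -1
        return True


store = CountedStore()
results = [store.add("x"), store.add("x"), store.delete("y"), store.delete("x")]
print(results)      # [True, False, False, True]
print(store.stats)  # {'x': 0}
```

The stats table is only touched when the main table actually changed, which
is what keeps repeated adds and deletes of missing keys from skewing the
counts.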

Regards, Dylan


On Tue, Oct 20, 2015 at 7:33 AM, z11373 <z1...@outlook.com> wrote:


Re: another question on summing combiner

Posted by z11373 <z1...@outlook.com>.
Thanks Josh! I decided to leave the stats using the normal combiner for
now; the stats skew may not be that bad if it does happen.
In the future, I am thinking of having a batch job that updates the stats
correctly. It will be time-intensive, but that should be OK since it will
likely run only once a day.
Back to the previous example below.

Current stats table contains: 
foo     | 2 
bar     | 3 
test    | 1 
 
The batch job scans the main table and updates the stats table. Let's say
the actual stats are foo=1, bar=4, test=1. The job first reads the existing
stats values above and then calculates the difference for each key, so it
only writes these deltas to the stats table:
foo     | -1 
bar     | 1

After this operation, the values in the stats table will end up correct
:-)
foo     | 1 
bar     | 4 
test    | 1
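
A minimal sketch of the delta calculation, assuming the batch job can read
the current stats values and has the true counts from a scan of the main
table. The dictionaries `current`, `actual`, and `deltas` are hypothetical
stand-ins for those reads and writes, not Accumulo API objects.

```python
current = {"foo": 2, "bar": 3, "test": 1}  # values read from the stats table
actual = {"foo": 1, "bar": 4, "test": 1}   # counts from scanning the main table

# Write only the difference; the summing combiner adds it to the stored value.
deltas = {}
for key in sorted(current.keys() | actual.keys()):
    delta = actual.get(key, 0) - current.get(key, 0)
    if delta != 0:
        deltas[key] = delta

# What the stats table holds once the combiner has summed the deltas in.
final = {key: current.get(key, 0) + deltas.get(key, 0)
         for key in sorted(current.keys() | actual.keys())}

print(deltas)  # {'bar': 1, 'foo': -1}
print(final)   # {'bar': 4, 'foo': 1, 'test': 1}
```

Because only deltas are written, a concurrent +1 or -1 from another client
is still summed in rather than overwritten. The main-table scan and the
stats read are not taken at the same instant, though, so small inaccuracies
remain possible.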






Re: another question on summing combiner

Posted by Josh Elser <jo...@gmail.com>.
If you were doing a batch job to just recompute the stats, I'd probably 
make a new table and then rename it, replacing your old stats table. 
This can also be problematic in making sure clients that are still 
writing data will correctly write to the new table. Can you quiesce 
ingest temporarily?
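
A rough sketch of the rebuild-and-rename idea and of the race it can hit
when ingest is not quiesced. Everything here is a hypothetical model: the
`tables` dict stands in for the table namespace, `main_rows` for the
counted entries of the main table, and replacing a dict key stands in for
the table rename.

```python
from collections import Counter

main_rows = ["foo", "bar", "bar", "bar", "bar", "test"]
tables = {"stats": {"foo": 2, "bar": 3, "test": 1}}  # skewed stats

# 1. The batch job recomputes the counts into a fresh table.
tables["stats_new"] = dict(Counter(main_rows))

# 2. A client that was not paused adds another "foo" after the scan and
#    writes its +1 to the old stats table.
main_rows.append("foo")
tables["stats"]["foo"] += 1

# 3. The new table replaces the old one; the late +1 is lost.
tables["stats"] = tables.pop("stats_new")

print(tables["stats"])            # {'foo': 1, 'bar': 4, 'test': 1}
print(Counter(main_rows)["foo"])  # 2
```

Pausing ingest between steps 1 and 3 would avoid the lost write in this
model.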

In short, this is hard to do correctly (and there are edge cases that 
could potentially happen that make the table inaccurate at a very low 
probability). Have you considered just running the system for a while 
and seeing how skewed your stats are?

It kind of sounds like the easier problem to solve is whether or not 
some record exists in your system and then you can know definitively 
whether or not you need to even process that record again (much less 
update the stats table).


Re: another question on summing combiner

Posted by z11373 <z1...@outlook.com>.
Anyone? I can elaborate more if my question was not clear.

Thanks,
Z


