Posted to dev@cassandra.apache.org by Benjamin Roth <be...@jaumo.com> on 2016/12/02 10:27:51 UTC

CASSANDRA-12888: Streaming and MVs

As I haven't received a single reply on this, I went ahead and implemented and
tested it on my own with our production cluster. Bringing up a new node was a
real pain, so I had to move on.

Result:
It works like a charm. I ran many dtests that relate in any way to storage,
streaming, bootstrap, etc., with good results.
The bootstrap finished in under 5:30h without a single error logged during
bootstrap. Afterwards, repairs run smoothly and the cluster seems to operate
quite well.

I still need:

   - Reviews (see 12888, 12905, 12984)
   - Some opinion on whether I handled the CDC case correctly. IMHO CDC is not
   required on bootstrap, and we don't need to send the mutations through the
   write path just to write the commit log - that would also break incremental
   repairs. Instead, for CDC the sstables are streamed as usual, but the
   mutations are additionally written to the commit log (see the sketch after
   this list). The worst case I see is that the node crashes and the commit
   logs for those streams are replayed, leading to duplicate writes, which is
   not really critical and not a regular case. Any better ideas?
   - Docs have to be updated (12985) if the patch is accepted
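
To make the CDC idea a bit more concrete, here is a minimal sketch (Java,
illustration only - this is not the actual patch, and the mutationsOf() helper
is made up) of the handling I have in mind for streamed sstables of a
CDC-enabled table:

    import java.util.Collection;

    import org.apache.cassandra.db.ColumnFamilyStore;
    import org.apache.cassandra.db.Mutation;
    import org.apache.cassandra.db.commitlog.CommitLog;
    import org.apache.cassandra.io.sstable.format.SSTableReader;

    public class CdcStreamSketch
    {
        // Sketch: streamed sstables are made live directly (no write path),
        // but for CDC the contained mutations are additionally appended to
        // the commit log so CDC consumers still see them.
        static void onStreamCompleted(ColumnFamilyStore cfs,
                                      Collection<SSTableReader> readers,
                                      boolean cdcEnabled)
        {
            cfs.addSSTables(readers);            // plain sstable streaming

            if (!cdcEnabled)
                return;

            for (SSTableReader reader : readers)
                for (Mutation mutation : mutationsOf(reader))
                    CommitLog.instance.add(mutation);   // commit log only
        }

        // Hypothetical helper for this sketch: iterate the partitions of a
        // streamed sstable as mutations.
        static Iterable<Mutation> mutationsOf(SSTableReader reader)
        {
            throw new UnsupportedOperationException("sketch only");
        }
    }

If the node crashes, replaying those commit log segments just re-applies data
that is already present in the streamed sstables - the duplicate-write case
mentioned above.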

I really appreciate ANY feedback. IMHO the impact of these fixes is immense,
and they may be a huge step towards getting MVs production ready.

Thank you very much,
Benjamin


---------- Forwarded message ----------
From: Benjamin Roth <be...@jaumo.com>
Date: 2016-11-29 17:04 GMT+01:00
Subject: Streaming and MVs
To: dev@cassandra.apache.org


I don't know where else to discuss this issue, so I am posting it here.

I have been trying to get Cassandra to run stably with MVs since the beginning
of July. Normal reads and writes work as expected, but when it comes to
repairs or bootstrapping it still feels far away from what I would call fast
and stable. The other day I just wanted to bootstrap a new node. I tried it
twice.
The first time, the bootstrap failed due to WriteTimeoutExceptions. I fixed
that by not timing out in streams, but then it turned out that the bootstrap
(a load of roughly 250-300 GB) didn't even finish in 24h. What if I really had
a problem and had to bring up some nodes fast? No way!

I think the root cause of it all is the way streams are handled on tables
with MVs: sending them through the regular write path implies many
bottlenecks and sometimes also redundant writes. Let me explain:

1. Bootstrap
During a bootstrap, all ranges from all keyspaces and all CFs that will belong
to the new node are streamed. MVs are treated like all other CFs, so all of
their ranges that move to the new node are streamed during bootstrap as well.
Sending the streams of the base tables through the write path has the
following negative impacts:

   - Writes are sent to the commit log. This is not necessary: if the node is
   stopped during bootstrap, the bootstrap simply starts over, so there is no
   need to recover from commit logs. Non-MV tables don't write a commit log
   during streaming anyway.
   - MV mutations are not applied instantly but are sent to the batchlog.
   This is of course necessary during range movements (if the PK of the MV
   differs from the base table), but the consequence is that the batchlog gets
   completely flooded. This leads to ridiculously large batchlogs (I observed
   batchlogs of 60 GB), zillions of compactions and quadrillions of
   tombstones. It is a pure resource killer, especially because the batchlog
   uses a CF as a queue.
   - Applying every mutation separately causes read-before-writes during MV
   mutation (see the sketch after this list). This is of course an order of
   magnitude slower than simply streaming down an SSTable. The effect gets
   worse as the bootstrap progresses and creates more and more (uncompacted)
   SSTables, many of which won't ever be compacted because the batchlog eats
   all the resources available for compaction.
   - Streaming down the MV tables AND applying the mutations of the base
   tables leads to redundant writes. The redundant writes are local if the PK
   of the MV equals the PK of the base table and - even worse - remote if not.
   Remote MV updates will impact nodes that aren't even part of the bootstrap.
   - CDC should also not be necessary during bootstrap, should it? TBD
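
To illustrate where these costs come from, this is roughly how I understand
the current stream-receive logic for a base table with MVs compared to a plain
table (simplified Java sketch, not the real StreamReceiveTask code;
mutationOf() is a made-up helper):

    import java.util.Collection;

    import org.apache.cassandra.db.ColumnFamilyStore;
    import org.apache.cassandra.db.Mutation;
    import org.apache.cassandra.db.rows.UnfilteredRowIterator;
    import org.apache.cassandra.io.sstable.ISSTableScanner;
    import org.apache.cassandra.io.sstable.format.SSTableReader;

    public class CurrentStreamReceiveSketch
    {
        static void receive(ColumnFamilyStore cfs,
                            Collection<SSTableReader> readers,
                            boolean hasViews)
        {
            if (!hasViews)
            {
                // Plain table: just make the streamed sstables live. Cheap.
                cfs.addSSTables(readers);
                return;
            }

            // Base table with MVs: replay every partition as a normal write.
            for (SSTableReader reader : readers)
            {
                try (ISSTableScanner scanner = reader.getScanner())
                {
                    while (scanner.hasNext())
                    {
                        try (UnfilteredRowIterator partition = scanner.next())
                        {
                            // Each apply() hits the commit log, takes the MV
                            // lock, reads the existing base row to build the
                            // view updates and goes through the batchlog for
                            // paired view replicas - once per partition.
                            mutationOf(partition).apply();
                        }
                    }
                }
            }
        }

        // Hypothetical helper for this sketch: build a Mutation from a
        // streamed partition.
        static Mutation mutationOf(UnfilteredRowIterator partition)
        {
            throw new UnsupportedOperationException("sketch only");
        }
    }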

2. Repair
The negative impact is similar to bootstrap, but ...

   - Sending repairs through the write path does not mark the streamed
   sstables as repaired - see CASSANDRA-12888 and the sketch after this list.
   Not doing so instantly solves that issue, much more simply than any other
   solution.
   - It changes the "repair design" a bit: repairing a base table no longer
   automatically repairs the MV. But is that bad at all? To be honest, as a
   newbie it was very hard for me to understand what I had to do to be sure
   that everything is repaired correctly. Recently I was told NOT to repair MV
   CFs but only the base tables. This means one cannot just call
   "nodetool repair $keyspace" - that is complicated, not transparent, and it
   sucks. I changed the behaviour in my own branch and ran the dtests for
   MVs. 2 tests failed:
      - base_replica_repair_test of course fails due to the design change
      - really_complex_repair_test fails because it intentionally times out
      the batchlog. IMHO this is a bearable situation. It is comparable to
      resurrected tombstones when running a repair after GCGS has expired; you
      would not expect that to be magically fixed either. The GCGS default is
      10 days, and I can expect that anybody also repairs their MVs during
      that period, not only the base table.
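
As a side note on the "not marked as repaired" point: as far as I understand
it, sstables written by a memtable flush start out with repairedAt == 0, while
sstables received directly from a repair stream carry the repairedAt of the
repair session. So data replayed through the write path always ends up flagged
as unrepaired. A tiny sketch (my own illustration, not project code) of how
one could sum up the data that is still flagged as unrepaired on a node:

    import org.apache.cassandra.db.ColumnFamilyStore;
    import org.apache.cassandra.io.sstable.format.SSTableReader;

    public class RepairedStateSketch
    {
        // Sum up the on-disk size of sstables that are not marked as repaired.
        static long unrepairedBytes(ColumnFamilyStore cfs)
        {
            long size = 0;
            for (SSTableReader sstable : cfs.getLiveSSTables())
                if (!sstable.isRepaired())      // repairedAt == 0
                    size += sstable.onDiskLength();
            return size;
        }
    }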

3. Rebuild
Same as bootstrap, isn't it?

Did I forget any cases?
What do you think?

-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Re: CASSANDRA-12888: Streaming and MVs

Posted by Benjamin Roth <be...@jaumo.com>.
Grmpf! 1000+ consecutive repairs must be wrong. I guess I mixed something up.
But it repaired over and over again for 1 or 2 days.




-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Re: CASSANDRA-12888: Streaming and MVs

Posted by Benjamin Roth <be...@jaumo.com>.
Hi Paulo,

First of all, thanks for your review!

I had the same concerns as you, but I thought it was being handled correctly
(which it is in some situations) until I found a case that creates the
inconsistencies you mentioned. It is a kind of split-brain syndrome when
multiple nodes fail between repairs. See here: https://cl.ly/3t0X1c0q1L1h.

I am not happy about it, but I support your decision. We should then add
another dtest to cover this scenario, as the existing dtests don't.

Some issues unfortunately remain:
- 12888 is not resolved
- MV repairs may still be f**** slow. Imagine an inconsistency of a single
cell (possibly also due to a validation race condition, see CASSANDRA-12991)
on a big partition. I had issues with reaper and a 30min timeout leading to
1000+ (yes!) consecutive repairs of a single subrange because it always
timed out, and I recognized this very late. When I deployed 12888 on my
system, this remaining subrange was repaired in a snap.
- I guess rebuild works the same as repair and has to go through the write
path, right?

=> An MV repair may induce so much overhead that it might be cheaper to kill
and replace an inconsistent node than to repair it. But that may introduce
inconsistencies again. All in all, it is not perfect, and all this does not
really un-frustrate me 100%.

Do you have any more thoughts?

Unfortunately I have very little time these days, as my second child was born
on Monday. So thanks for your support so far. Maybe I will have some ideas on
these issues during the next days, and I will probably work on that ticket
next week to come to a solution that is at least deployable. I'd also
appreciate your opinion on CASSANDRA-12991.




-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Re: CASSANDRA-12888: Streaming and MVs

Posted by Paulo Motta <pa...@gmail.com>.
Hello Benjamin,

Thanks for your effort on this investigation! For bootstraps and range
transfers, I think we can indeed simplify and stream base tables and MVs as
ordinary tables, unless there is some caveat I'm missing (I didn't find any
special case for bootstrap/range transfers on CASSANDRA-6477 or in the MV
design doc, please correct me if I'm wrong).

Regarding repair of base tables, applying mutations via the write path is a
matter of correctness, given that base table updates potentially need to
remove previously referenced keys in the views, so repairing only the base
table may leave unreferenced keys in the views, breaking the MV contract.
Furthermore, these unreferenced keys may be propagated to other replicas and
never removed if you repair only the view. If you don't do overwrites in the
base table, this is probably not a problem, but the DB cannot ensure this (at
least not before CASSANDRA-9779). Furthermore, as you already noticed,
repairing only the base table is probably faster, so I don't see a reason to
repair the base and MVs separately, since this is potentially more costly. I
believe your frustration is mostly due to the bug described on
CASSANDRA-12905, but after that and CASSANDRA-12888 are fixed, repair on the
base table should work just fine.

Based on this, I propose:
- Fix CASSANDRA-12905 with your original patch that retries acquiring the
MV lock instead of throwing WriteTimeoutException during streaming, since
this is blocking 3.10.
- Fix CASSANDRA-12888 by doing sstable-based streaming for base tables
while still applying MV updates in the paired replicas.
- Create new ticket to use ordinary streaming for non-repair MV stream
sessions and keep current behavior for MV streaming originating from repair
(see the sketch after this list).
- Create new ticket to include only the base tables and not MVs in
keyspace-level repair, since repairing the base already repairs the views
to avoid people shooting themselves in the foot.
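
To sketch what I mean by the streaming-related points (illustration only, not
actual code - names like streamedByRepair and applyThroughWritePath are
placeholders):

    import java.util.Collection;

    import org.apache.cassandra.db.ColumnFamilyStore;
    import org.apache.cassandra.io.sstable.format.SSTableReader;

    public class ProposedStreamReceiveSketch
    {
        // Proposed behaviour: only repair streams for base tables with MVs
        // keep going through the write path (so view updates, including the
        // removal of previously referenced keys, are generated on the paired
        // replicas); bootstrap, rebuild and other range transfers just add
        // the streamed sstables directly, like for ordinary tables.
        static void receive(ColumnFamilyStore cfs,
                            Collection<SSTableReader> readers,
                            boolean hasViews,
                            boolean streamedByRepair)
        {
            if (hasViews && streamedByRepair)
                applyThroughWritePath(cfs, readers);   // current behaviour
            else
                cfs.addSSTables(readers);              // plain sstable streaming
        }

        // Placeholder for the existing write-path replay of streamed data.
        static void applyThroughWritePath(ColumnFamilyStore cfs,
                                          Collection<SSTableReader> readers)
        {
            throw new UnsupportedOperationException("sketch only");
        }
    }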

Please let me know what you think. Any suggestions or feedback are
appreciated.

Cheers,

Paulo
