Posted to dev@pulsar.apache.org by Michael Marshall <mm...@apache.org> on 2022/04/14 17:26:49 UTC

Pulsar Community Meeting Notes 2022/04/14, (8:30 AM PST)

Hi Pulsar Community,

Below are the meeting notes from today's community meeting.

Disclaimer: I am the primary author of these notes. I took the notes
while participating in the meeting discussions. It is possible that I
missed or misunderstood information. If something is misattributed or
misrepresented, please send a correction to this list and consider
updating the Google doc.

Source Google doc:
https://docs.google.com/document/d/19dXkVXeU2q_nHmkG8zURjKnYlvD96TbKf5KjYyASsOE

Thanks,
Michael

2022/04/14, (8:30 AM PST)
-   Attendees:
-   Matteo Merli
-   Enrico Olivelli
-   Andrey Yegorov
-   Michael Marshall
-   Dave Fisher
-   Lari Hotari
-   Massimiliano Mirelli
-   Chris Bartholomew
-   Hang Chen
-   Aaron Williams
-   Nicolò Boschi
-   Leolinchen
-   Penghui Li

-   Discussions

-   Enrico: the 2.10 release process took a while. Do we want to talk
about this? For 2.11, we should try to apply the new process. Matteo:
we can release 2.11 3 months from now, and we’ll create the branch in 2
months. Matteo plans to set a date (by discussion on the mailing list)
and wants more scrutiny on the mailing list. Dave: we should slow down
cherry-picking to 2.8 and 2.9, as well. Enrico: we are finding many
fixes though, and 2.8, for example, has many users and many bug fixes.
The cherry-picked commits are all bug fixes. Michael: we should add
some documentation about this to help new committers. Matteo: this
documentation would help inform contributors too. Dave: where should
we put this? Website? Matteo: we could also put it in the PR template.

-   Michael: is 2.7.5 the last 2.7 release? Matteo: we could keep it
open for security bug fixes, like Log4Shell-type fixes. Lari: 2.7.5 rc 1
has test failures, so we’ll need an rc 2. The tests that are failing
on 2.7.5 are passing on 2.7.4. Matteo: thinking through LTS and the
cost to users of upgrading. There is a tension between shipping
new features and how frequently users have to upgrade. One issue: the
upgrade/downgrade compatibility is only guaranteed for one minor
version. An LTS could help to support those users without adding
features. We could offer guarantees from one LTS to the next LTS. We’d
define support so users could stick with a version without worrying
about getting left behind. What if 3.0, 4.0, and so on are LTS
releases, and 3.x is just for features? The guarantee then is that you
can go from 3.x to 4.0. Dave: what about current users on the 2.x
versions? Matteo: we can discuss how to deal with existing versions,
but we also need to figure out our preferred long term solution for
how to work in the future. Dave: I like the idea of guaranteeing
upgrade paths. Matteo: we could try to set a timeline for major
releases, not just for minor releases, e.g. every 2 years for a major
release. Discusses reasons for major releases and the nuance for how
we could use this. Dave: are bookkeeper upgrade and transactions the
major upgrade? Matteo: I didn’t have any feature in mind. I want to
give people an upgrade path and create clarity. Michael: clarifies
that you could upgrade from 3.0 to 4.0 then downgrade and it’d work.
Matteo: yes. Feature defaults won’t be able to change because of this.
Dave: relates well to creating a road map and telling people what is
coming. Enrico: creating a road map is very hard in open source. We
commit things that people contribute. In the ASF projects that I work on,
contributions are hard to predict. Matteo: I agree it is hard to know.
These major releases would be loosely timed. For example, auto
partitioning is a major feature, but it is a bunch of work.
Unpredictability is bad for the users. Michael: and you don’t want to
create a hard upgrade path. Is it possible to use geo-replication (or
something like it) to migrate clusters to simplify upgrades? Matteo:
there was a work-in-progress blue-green deployment proposal to spin up
a new cluster and slowly migrate producers and consumers to it.
Coordination would rely on topic termination to switch to the new
cluster. Not sure that it is a general solution. Michael: how would breaking
changes work for the major version upgrade? Matteo: we would do a
compatibility layer. Also, the Pulsar protocol hasn’t broken, and we
version the API in such a way that the broker and client can determine
whether the peer supports a given feature.
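
(Editorial illustration, not from the meeting.) The version-based
negotiation Matteo describes can be sketched roughly as follows. This
is a minimal sketch, assuming each peer advertises a single integer
protocol version; the feature names, version numbers, and helper names
are hypothetical and are not Pulsar's actual values or API.

```python
from dataclasses import dataclass

# Hypothetical table: feature name -> minimum protocol version needed.
# These names and numbers are illustrative only.
MIN_VERSION = {
    "batch_messages": 4,
    "chunked_messages": 12,
    "transactions": 15,
}


@dataclass(frozen=True)
class Peer:
    name: str
    protocol_version: int


def negotiated_version(a: Peer, b: Peer) -> int:
    """Both sides can only rely on what the older peer understands."""
    return min(a.protocol_version, b.protocol_version)


def supports(a: Peer, b: Peer, feature: str) -> bool:
    """True if both peers are new enough to use the given feature."""
    return negotiated_version(a, b) >= MIN_VERSION[feature]


old_client = Peer("client", 10)
new_broker = Peer("broker", 17)

print(supports(old_client, new_broker, "batch_messages"))  # True
print(supports(old_client, new_broker, "transactions"))    # False
```

With this kind of check, a newer broker can keep serving an older
client by simply not using features the client is too old for, which
is one way a protocol can evolve without breaking.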

-   PRs

-   Lari: Merged PR (https://github.com/apache/pulsar/pull/15067) to
fix ManagedCursorImpl’s mark delete update logic, but asked for
Matteo’s review. Lari plans to add more tests in the coming weeks to
catch regressions associated with the change.

-   Andrey: https://github.com/apache/pulsar/pull/15142 WIP pulsar +
bk 4.15-ish. Requests review of preliminary work, mentions that there
is a test failure he’s still investigating. Switched CI to use
Bookkeeper 4.16-SNAPSHOT to identify needed changes. Worked on tests
that broke. Some test classes were copied from bookkeeper, so he
replaced those with copy/pasted new ones. The work is iterative, and
there are still tests failing. Discussion with Matteo about tradeoffs
for test base classes and ways to improve testing classes. Matteo says
don’t worry about synchronizing tests between Pulsar/Bookkeeper. The
test utilities in BookKeeper are different. Pulsar tests assume that
BookKeeper works and are meant to test Pulsar’s usage of BookKeeper.
Matteo: how far do you think you are from completion? Andrey: hard to
say, tests are passing locally, but failing on remote CI.

-   Hang: https://github.com/apache/pulsar/issues/15111 (Bookie lost
data when skip write journal). Hang Chen says he has seen this many
times in production. Enrico: if you don’t write to the journal, this is a
possible behavior. The next bookkeeper release will include a code
change. Andrey: if you want to run without journal, increase write
quorum. Matteo: use different racks to increase durability and
decrease the chance of catastrophic failure. Enrico: there are some
problems in the bk protocol: even if you have multiple replicas, you are
going to lose data. 4.15 includes a change to the protocol for how the
bookkeeper responds. This improves a fix for a specific edge case. The
only fix is to upgrade. Andrey: reminder that 4.15 is in the process
of being released. Matteo: is there any failure that happened during
this time? Hang Chen: no failure during this time. Enrico: during
recovery, the process tries to find missing entries in the ledger.
Went on to discuss technical details of the improvement for 4.15.
Matteo: the error appears strange, and the missing entries don’t seem
to make sense. Mentions that rebuilding the index could be helpful.
(Missed some technical details about bookkeeper, see issue for more
context and discussion.)
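
(Editorial illustration, not from the meeting.) The replica-counting
argument behind Andrey's and Matteo's suggestions can be sketched as
below. It assumes the usual BookKeeper constraint that ensemble size
>= write quorum >= ack quorum, and that an acknowledged entry exists
on at least ack-quorum bookies. The function names and the
bookie-to-rack map are hypothetical, and the sketch does not model the
protocol edge case Enrico describes.

```python
def check_quorums(ensemble: int, write_quorum: int, ack_quorum: int) -> None:
    """Reject settings that violate ensemble >= write >= ack >= 1."""
    if not (ensemble >= write_quorum >= ack_quorum >= 1):
        raise ValueError("expected ensemble >= write_quorum >= ack_quorum >= 1")


def tolerated_copy_losses(ack_quorum: int) -> int:
    """An acked entry has at least ack_quorum copies, so losing
    ack_quorum - 1 of them still leaves one copy."""
    return ack_quorum - 1


def survives_single_rack_loss(bookies, rack_of) -> bool:
    """True if the copies are spread over at least two racks."""
    return len({rack_of[b] for b in bookies}) >= 2


# Hypothetical placement of three bookies across two racks.
rack_of = {"bk1": "rack-a", "bk2": "rack-a", "bk3": "rack-b"}

check_quorums(ensemble=3, write_quorum=3, ack_quorum=2)
print(tolerated_copy_losses(2))                             # 1
print(survives_single_rack_loss(["bk1", "bk2"], rack_of))   # False
print(survives_single_rack_loss(["bk1", "bk3"], rack_of))   # True
```

Without the journal, a bookie crash can drop writes that were
acknowledged but not yet flushed, so each crash may count as a lost
copy. Since the ack quorum cannot exceed the write quorum, raising the
write quorum makes room for more acknowledged copies.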

Re: Pulsar Community Meeting Notes 2022/04/14, (8:30 AM PST)

Posted by Matteo Merli <ma...@gmail.com>.
Thanks Michael for sending out the notes. Recording is available here:
https://streamnative.zoom.us/rec/share/Eg2E7WfSOfPaHMdSphlrP-fN2NBjh4aT06eVTxv6TbBk4ujTltCcPNvq9kwHqMT4.mBdaRHY5eUXJM5bz
Passcode: .H?wa4WM


--
Matteo Merli
<ma...@gmail.com>
