Posted to dev@pulsar.apache.org by Apache Pulsar Slack <ap...@gmail.com> on 2020/04/24 09:11:03 UTC
Slack digest for #dev - 2020-04-24

2020-04-23 14:43:20 UTC - Jianfeng Qiao: Anyone know, why op.addOpCount is assigned twice in create() and initiate() of OpAddEntry.java?
----
2020-04-23 15:22:03 UTC - Addison Higham: is there a doc on how to retrigger github tests
----
2020-04-23 15:23:29 UTC - Addison Higham: is it just `/pulsarbot run-failure-checks`?
----
2020-04-23 15:34:17 UTC - Yuvaraj Loganathan: Yes
----
2020-04-23 16:01:47 UTC - Patrik Kleindl: @Patrik Kleindl has joined the channel
----
2020-04-23 16:35:17 UTC - Sijie Guo: <https://github.com/apache/pulsar-test-infra/blob/master/pulsarbot/README.md>
----
2020-04-23 16:43:37 UTC - Addison Higham: one of the biggest remaining issues I now see when running in k8s:
- a bookie is lost/replaced (I mostly see this when trying to do bookie maintenance/restarts)
- bookie rescheduled
- broker never gets notified of new bookie
- broker doesn't have enough healthy bookies to form a new ensemble

I added <https://github.com/apache/pulsar/pull/6800> and plan on setting the interval to something like every 60 seconds, which doesn't seem extreme. The real question, though: what is going wrong such that either the ZK watch isn't firing or the ZK watch is being dropped somewhere? ZK watches seem critical to the health of Pulsar, so my biggest concern is that if this is something systemic, I may have larger issues.
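
The periodic-refresh idea above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `PeriodicBookieRefresh`, `fetch`, and `knownBookies` are hypothetical names, and the supplier stands in for whatever call reads the current bookie list. The point is only that a timer-driven re-read bounds how long a missed watch can leave the broker with a stale view.

```java
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch: re-read the bookie list on a timer as a fallback for missed watches.
class PeriodicBookieRefresh {
    // Last successfully fetched snapshot of bookie identifiers.
    private final AtomicReference<Set<String>> known = new AtomicReference<>(Set.of());
    // Hypothetical source of truth, e.g. a read of the bookie registration data.
    private final Supplier<Set<String>> fetch;

    PeriodicBookieRefresh(Supplier<Set<String>> fetch) {
        this.fetch = fetch;
    }

    // Replace the cached snapshot; keep the previous one if the fetch fails.
    void refreshOnce() {
        try {
            known.set(Set.copyOf(fetch.get()));
        } catch (RuntimeException e) {
            // Swallow the error so the scheduled task keeps running on the next tick.
        }
    }

    Set<String> knownBookies() {
        return known.get();
    }

    // Start refreshing every intervalSeconds (the thread mentions roughly 60).
    ScheduledExecutorService start(long intervalSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(this::refreshOnce, 0, intervalSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

A caller would construct it with a real fetch function, call `start(60)`, and shut the returned scheduler down on exit. The refresh only masks the symptom; it does not explain why the watch was lost.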
----
2020-04-23 16:47:56 UTC - Addison Higham: @Sijie Guo ^^  I watched your pulsar on k8s youtube video so I wonder if you have some context here, have you observed this problem?
----
2020-04-23 16:51:23 UTC - Chris Bartholomew: @Addison Higham I have definitely seen this, but have not figured out exactly why. Restarting the broker clears the problem.
----
2020-04-23 16:53:58 UTC - Addison Higham: yeah, I think I now know how to work around it in an automated fashion too, but it does raise some concerns for me that it could be a more systemic issue
----
2020-04-23 16:54:25 UTC - Chris Bartholomew: I run broker health checks as liveness probes to work around this
----
2020-04-23 16:56:33 UTC - Addison Higham: yeah, that is what I am doing now as well, but I still don't like how much downtime that can be, so I'm going to see if I can use that getBookieInfo change to bring the time down further
----
2020-04-23 16:57:22 UTC - Addison Higham: (my biggest annoyance with broker checks as liveness probes is it makes the logs so dang noisy... but that is a whole other problem, need to get pulsar doing more structured logging)
----
2020-04-23 17:19:45 UTC - Sijie Guo: @Addison Higham:

Did you see any errors in the broker log? One of the possibilities is from this issue - <https://github.com/apache/bookkeeper/pull/2301>

Since bookkeeper is deployed using a statefulset, the pod DNS is only resolvable when the pod is ready (if you have a readiness probe for the bookie pod, it will wait until the readiness check succeeds). So in some cases, you will see an NPE when resolving the network address, which causes bookies not to be added to the network topology.

> my biggest annoyance with broker checks as liveness probes
I usually don’t recommend using the broker health check as a liveness probe, as it can potentially bring down the whole cluster. The liveness of a broker shouldn’t depend on the health of an entire bookkeeper cluster. That being said, the broker should stay up even if the bookkeeper cluster is not writable.
----
2020-04-23 17:25:19 UTC - Addison Higham: aha, yeah so I did add a readiness probe as well, I will go dig in and see if I see that error. As far as broker with liveness probe, I also thought it might not be ideal, but until I can have the brokers better self heal, at least it gets me back to being able to serve writes.
----
2020-04-23 17:31:21 UTC - Addison Higham: @Sijie Guo have you tried using `publishNotReadyAddresses` on the service to fix that issue?
----
2020-04-23 18:51:48 UTC - Addison Higham: huh, so `publishNotReadyAddresses` fixed that exception, but it didn't fix the issue because of a cached DNS entry; it kept trying to hit the old IP
----
2020-04-23 19:28:03 UTC - matt_innerspace.io: 2.5.1 possible bug?  It seems the `https://pulsar.apache.org/admin/v3/functions/{tenant}/{namespace}/{functionName}` endpoint changed: the method signature is different, specifically the `functionConfig` parameter, which switched from a String (in 2.4.0) to an object (2.5.1) in `FunctionsApiV3Resource.java`, as shown below for 2.5.1:
```
    @POST
    @Path("/{tenant}/{namespace}/{functionName}")
    @Consumes(MediaType.MULTIPART_FORM_DATA)
    public void registerFunction(final @PathParam("tenant") String tenant,
                                 final @PathParam("namespace") String namespace,
                                 final @PathParam("functionName") String functionName,
                                 final @FormDataParam("data") InputStream uploadedInputStream,
                                 final @FormDataParam("data") FormDataContentDisposition fileDetail,
                                 final @FormDataParam("url") String functionPkgUrl,
                                 final @FormDataParam("functionConfig") FunctionConfig functionConfig) {
```
POSTs that worked before now return `400, Bad Request, b'{"reason":"Function config is not provided"}'`

The result of this is it's seemingly impossible to register a function via the REST API (from python).  Perhaps I'm missing something?
----
2020-04-23 19:43:40 UTC - matt_innerspace.io: logged here - <https://github.com/apache/pulsar/issues/6809>
----
2020-04-23 20:11:16 UTC - Devin G. Bost: What’s the most graceful way to handle error messages from a sink?
Wanting opinions.
----
2020-04-23 20:13:21 UTC - Sijie Guo: `publishNotReadyAddresses` is one solution. You can also tune the Java DNS cache TTL.
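
For reference, one way to shorten the JVM's DNS cache is the standard `networkaddress.cache.ttl` security property. The sketch below sets it programmatically; the class name `DnsCacheTtl` and the 5-second value are illustrative assumptions, not settings recommended anywhere in this thread. The same property can also be set in the JVM's `java.security` file.

```java
import java.security.Security;

// Sketch: shorten the JVM's cache of successful DNS lookups so a rescheduled
// pod's new IP is picked up sooner. Names and values here are illustrative.
class DnsCacheTtl {
    // Standard JVM security property: seconds to cache successful lookups.
    static final String TTL_PROPERTY = "networkaddress.cache.ttl";

    // Call this early in startup, before the first hostname lookup, since the
    // JVM typically reads the value when its address cache is initialized.
    static void setTtlSeconds(int seconds) {
        Security.setProperty(TTL_PROPERTY, Integer.toString(seconds));
    }

    public static void main(String[] args) {
        setTtlSeconds(5); // illustrative value only
        System.out.println(TTL_PROPERTY + "=" + Security.getProperty(TTL_PROPERTY));
    }
}
```

A lower TTL trades more DNS queries for faster recovery after an IP change, so the right value depends on how often bookies are rescheduled.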
----
2020-04-23 22:10:08 UTC - Sijie Guo: it is `string` in v2 admin.
----
2020-04-23 22:10:39 UTC - Sijie Guo: it is changed to FunctionConfig in v3 endpoint
----
2020-04-23 22:10:55 UTC - Sijie Guo: v2 or v3 is not related to pulsar versions.
----
2020-04-23 22:11:05 UTC - Sijie Guo: it is the version for http endpoint
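
To make the String-versus-object difference concrete, here is a rough sketch of a multipart body for the v3 endpoint. It rests on an assumption not confirmed in this digest: that the server binds `functionConfig` to a `FunctionConfig` object only when that form part is labeled `Content-Type: application/json`. The class and helper names are hypothetical, and the JSON fields and package URL in the usage are illustrative. The code only builds the body text and sends nothing.

```java
// Hypothetical sketch: build a multipart/form-data body whose functionConfig
// part is marked as JSON. Assumption: the v3 endpoint needs that content type
// to deserialize the part into a FunctionConfig object.
class FunctionConfigMultipart {
    // A form part carrying JSON, explicitly labeled application/json.
    static String jsonPart(String boundary, String name, String json) {
        return "--" + boundary + "\r\n"
            + "Content-Disposition: form-data; name=\"" + name + "\"\r\n"
            + "Content-Type: application/json\r\n"
            + "\r\n"
            + json + "\r\n";
    }

    // A plain text form part, with no explicit content type.
    static String textPart(String boundary, String name, String value) {
        return "--" + boundary + "\r\n"
            + "Content-Disposition: form-data; name=\"" + name + "\"\r\n"
            + "\r\n"
            + value + "\r\n";
    }

    // Full body: the "url" and "functionConfig" parts named in the method
    // signature quoted earlier, followed by the closing boundary.
    static String body(String boundary, String functionPkgUrl, String functionConfigJson) {
        return textPart(boundary, "url", functionPkgUrl)
            + jsonPart(boundary, "functionConfig", functionConfigJson)
            + "--" + boundary + "--\r\n";
    }

    public static void main(String[] args) {
        String json = "{\"tenant\":\"public\",\"namespace\":\"default\",\"name\":\"example\"}";
        System.out.print(body("----example-boundary", "file:///tmp/example.jar", json));
    }
}
```

The request itself would carry the header `Content-Type: multipart/form-data; boundary=----example-boundary`. A client that sends `functionConfig` as an unlabeled text part, which worked when the parameter was a String, may be what produces the "Function config is not provided" response.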
----
2020-04-23 22:26:51 UTC - Sijie Guo: I replied here <https://github.com/apache/pulsar/issues/6809#issuecomment-618703430>
----