Posted to users@pulsar.apache.org by Apache Pulsar Slack <ap...@gmail.com> on 2019/08/18 09:11:02 UTC

Slack digest for #general - 2019-08-18

2019-08-17 13:21:36 UTC - Chris Bartholomew: @Kendall Magesh-Davis Did you also delete the PersistentVolume that stores the bookie data?
----
2019-08-17 13:28:33 UTC - Kendall Magesh-Davis: No. Here’s a summary of commands:
```
kubectl cordon <nodeX>
kubectl drain <nodeX> --ignore-daemonsets --delete-local-data
aws autoscaling \
  terminate-instance-in-auto-scaling-group \
  --no-should-decrement-desired-capacity --instance-id=<nodeX instanceid>
```
----
2019-08-17 13:29:04 UTC - Kendall Magesh-Davis: Essentially, that K8s cluster was running low on resources, so I bumped up the instance type and was forcing it to replace one node with the new, bigger instance.
----
2019-08-17 13:40:21 UTC - Chris Bartholomew: In your values.yaml file that you used with helm, is persistence set to "yes"?
----
2019-08-17 13:56:56 UTC - Ming Fang: This issue sounds similar to <https://github.com/apache/pulsar/issues/3121>
----
2019-08-17 13:59:22 UTC - Kendall Magesh-Davis: No, it doesn’t appear so.
```bookkeeper:
  component: bookkeeper
  replicaCount: 3
  updateStrategy:
    type: OnDelete
  podManagementPolicy: OrderedReady
  # nodeSelector:
  #   cloud.google.com/gke-nodepool: default-pool
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8000"
  tolarations: []
  gracePeriod: 0
  resources:
    requests:
      memory: 128Mi
      cpu: 0.2
  volumes:
    journal:
      name: journal
      size: 5Gi
    ledgers:
      name: ledgers
      size: 5Gi
  configData:
    PULSAR_MEM: "\"-Xms128m -Xmx256m -XX:MaxDirectMemorySize=128m -Dio.netty.leakDetectionLevel=disabled -Dio.netty.recycler.linkCapacity=1024 -XX:+UseG1GC -XX:MaxGCPauseMillis=10 -XX:+ParallelRefProcEnabled -XX:+UnlockExperimentalVMOptions -XX:+AggressiveOpts -XX:+DoEscapeAnalysis -XX:ParallelGCThreads=32 -XX:ConcGCThreads=32 -XX:G1NewSizePercent=50 -XX:+DisableExplicitGC -XX:-ResizePLAB -XX:+ExitOnOutOfMemoryError -XX:+PerfDisableSharedMem -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintHeapAtGC -verbosegc -XX:G1LogLevel=finest\""
    dbStorage_writeCacheMaxSizeMb: "32"
    dbStorage_readAheadCacheMaxSizeMb: "32"
    journalMaxSizeMB: "2048"
    statsProviderClass: org.apache.bookkeeper.stats.prometheus.PrometheusMetricsProvider
    useHostNameAsBookieID: "true"
  service:
    annotations:
      publishNotReadyAddresses: "true"
    ports:
    - name: server
      port: 3181
  pdb:
    usePolicy: yes
    maxUnavailable: 1```
----
2019-08-17 14:00:01 UTC - Ming Fang: It looks like when the Bookie starts up, it checks zk and compares the “cookie” to local storage. If they don’t match, or local storage is empty, it errors out.  My workaround is to delete the node in zk.
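You can compare the two copies by hand with something like this (a rough sketch; I’m assuming the default data path here, since the bookie keeps its local copy of the cookie in a VERSION file under each journal/ledger dir):
```
bin/pulsar zookeeper-shell
get /ledgers/cookies/<bookie-id>
cat /pulsar/data/bookkeeper/journal/current/VERSION
```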
----
2019-08-17 14:11:03 UTC - Chris Bartholomew: In the public helm chart, this is a global setting, so it's not in the bookkeeper section. Right up near the top, you will see: ```## If persistence is enabled, components that have state will
## be deployed with PersistentVolumeClaims, otherwise, for test
## purposes, they will be deployed with emptyDir
persistence: no
```
----
2019-08-17 14:14:09 UTC - Chris Bartholomew: Right @Ming Fang. @Kendall Magesh-Davis, if persistence is not enabled in the helm config, the bookie data is lost if the node is reset. This means that Zookeeper knows about the bookie, but the bookie is missing all its data, so the sanity check fails. The error you are seeing is consistent with the data directories being empty. In fact, I have reproduced this exact error by wiping the data dirs and restarting the bookkeeper pod. Deleting the node in zk is probably the only way to recover from this. In production you would want to run with persistence enabled. That way PersistentVolumeClaims are mounted into the /pulsar/data/bookkeeper directory, so the bookie will recover in the event of node failure.
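For example, flip the global flag near the top of values.yaml and then helm upgrade the release:
```
## values.yaml (top level, not under bookkeeper)
persistence: yes
```
With that set, the bookie StatefulSet should get PersistentVolumeClaims for the journal and ledgers volumes you defined above, instead of emptyDir.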
----
2019-08-17 14:17:33 UTC - Kendall Magesh-Davis: Thanks for the help @Ming Fang and @Chris Bartholomew :thumbsup: I will test it out and let you know if I run into anything else.
----
2019-08-17 14:22:48 UTC - Way Dev: @Way Dev has joined the channel
----
2019-08-17 14:39:51 UTC - Ming Fang: There’s a similar problem on the Broker side too. <https://github.com/apache/pulsar/issues/4964>
----
2019-08-17 15:23:46 UTC - Chris Bartholomew: For reference, you can delete the node in zookeeper by logging into one of your pods, starting the zookeeper shell, finding the cookie, and then deleting it. This set of steps worked for me: ```kubectl exec -it <pod> /bin/bash
bin/pulsar zookeeper-shell
ls /ledgers/cookies
delete /ledgers/cookies/<cookie-for-bad-bookie>
```
----
2019-08-17 15:25:10 UTC - Ming Fang: I wonder if there’s a downside to having the Bookie do this automatically on startup
----
2019-08-17 15:34:15 UTC - Chris Bartholomew: This is a bad state to be in. It means you've lost all your data for this particular bookie, so it all needs to be re-replicated. Ideally, the data would have been preserved and the sanity check would pass.
----
2019-08-17 15:35:38 UTC - Ming Fang: I agree this is a bad state. But I think having to manually delete zk nodes makes the situation even worse
----
2019-08-17 15:38:17 UTC - Chris Bartholomew: For development/test purposes, I agree. In production, you would want to make sure the data directories are not ephemeral.
----
2019-08-17 15:43:53 UTC - Ming Fang: Even if the volume is not ephemeral, there is still a chance we can lose the volume for whatever reason. If the volume is lost, then yes, it’s bad. But if I then have to manually recover by editing zk, that makes it worse
+1 : Vladimir Shchur
----
2019-08-17 16:07:03 UTC - Chris Bartholomew: You can just increase the replica count by 1. That one will be considered a new bookie. It will initialize without any intervention and will restore quorum for your cluster. If you can recover the failed volume, great. If not, you can delete the bookie from zk. BTW, there is some info on this in the Bookkeeper admin docs (<https://bookkeeper.apache.org/docs/4.9.2/admin/bookies/>) under the heading "Missing disks or directories".
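Something like this should do it (assuming your helm release is named pulsar and you deployed from the chart directory; adjust the names to your setup):
```
helm upgrade pulsar ./pulsar --set bookkeeper.replicaCount=4
```
The new bookie writes a fresh cookie of its own, so it initializes without touching the failed one.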
----
2019-08-17 21:49:38 UTC - Kendall Magesh-Davis: Ahh, right on. The answer remains the same :smile:
```
namespaceCreate: yes
persistence: no```
----
2019-08-17 22:32:44 UTC - Ming Fang: I was able to get Bookkeeper to restart by adding `--force` to the `metaformat` command, <https://github.com/mingfang/terraform-provider-k8s/blob/d91a9cef710c21fbdf346629709d829222992041/modules/pulsar/bookkeeper/main.tf#L40>
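For reference, the init command ends up looking something like this (note that metaformat re-initializes the bookkeeper metadata in zk, so I’d only treat this as a dev/test workaround):
```
bin/bookkeeper shell metaformat --nonInteractive --force
```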
----
2019-08-18 02:20:48 UTC - bright: @bright has joined the channel
----
2019-08-18 02:21:17 UTC - GaoHang: @GaoHang has joined the channel
----
2019-08-18 02:26:47 UTC - 18505565928m0: @18505565928m0 has joined the channel
----
2019-08-18 02:27:48 UTC - 308027245: @308027245 has joined the channel
----
2019-08-18 02:29:54 UTC - liangliliang: @liangliliang has joined the channel
----
2019-08-18 02:30:12 UTC - anonymitaet: @anonymitaet has joined the channel
----
2019-08-18 02:39:10 UTC - lingya: @lingya has joined the channel
----
2019-08-18 02:41:26 UTC - pxhssg: @pxhssg has joined the channel
----
2019-08-18 03:32:56 UTC - chengyanan1008: @chengyanan1008 has joined the channel
----
2019-08-18 03:42:15 UTC - Zurich: @Zurich has joined the channel
----
2019-08-18 03:43:02 UTC - Ming Fang: I’m unable to localrun a source from outside a kubernetes cluster, going through the ingress controller to the pulsar proxy.
`./bin/pulsar-admin sources localrun --name generator --destinationTopicName generator_test -a ./connectors/pulsar-io-data-generator-2.4.0.nar --broker-service-url pulsar://192.168.2.249:6650`

It prints this error and continues to run but is not doing any work

localrun log

03:45:26.029 [pulsar-client-io-1-1] INFO  org.apache.pulsar.client.impl.ConnectionPool - [[id: 0x8ca52491, L:/250.2.43.2:43526 - R:192.168.2.249/192.168.2.249:6650]] Connected to server
03:45:26.061 [pulsar-client-io-1-1] WARN  org.apache.pulsar.client.impl.ClientCnx - [id: 0x8ca52491, L:/250.2.43.2:43526 - R:192.168.2.249/192.168.2.249:6650] Received error from server: org.apache.pulsar.client.api.PulsarClientException: Disconnected from server at pulsar-0.pulsar.example.svc.cluster.local/250.2.216.6:6650
03:45:26.061 [pulsar-client-io-1-1] WARN  org.apache.pulsar.client.impl.ClientCnx - [id: 0x8ca52491, L:/250.2.43.2:43526 - R:192.168.2.249/192.168.2.249:6650] Received unknown request id from server: 0

proxy log

03:45:26.033 [pulsar-discovery-io-2-1] INFO  org.apache.pulsar.proxy.server.ProxyConnection - [/250.2.146.2:37384] New connection opened
03:45:26.044 [pulsar-discovery-io-2-1] INFO  org.apache.pulsar.proxy.server.ProxyConnection - [/250.2.146.2:37384] complete connection, init proxy handler. authenticated with none role null, hasProxyToBrokerUrl: false
03:45:26.058 [pulsar-discovery-io-2-1] INFO  org.apache.pulsar.client.impl.ConnectionPool - [[id: 0x9ca51b7c, L:/250.2.216.5:48600 - R:pulsar-0.pulsar.example.svc.cluster.local/250.2.216.6:6650]] Connected to server
03:45:26.061 [pulsar-discovery-io-2-1] INFO  org.apache.pulsar.client.impl.ClientCnx - [id: 0x9ca51b7c, L:/250.2.216.5:48600 ! R:pulsar-0.pulsar.example.svc.cluster.local/250.2.216.6:6650] Disconnected
03:45:26.062 [pulsar-discovery-io-2-1] WARN  org.apache.pulsar.proxy.server.LookupProxyHandler - [/250.2.146.2:37384] Failed to get schema : org.apache.pulsar.client.api.PulsarClientException: Disconnected from server at pulsar-0.pulsar.example.svc.cluster.local/250.2.216.6:6650


Before I open an issue, can someone confirm if this setup should work?
----
2019-08-18 05:03:38 UTC - Ming Fang: I’m guessing proxy_to_broker_url needs to be set
----
2019-08-18 05:26:31 UTC - Sijie Guo: are you able to produce or consume messages from pulsar://192.168.2.249:6650?

a localrun pulsar function is no different from a consumer plus a producer, so it is better to check that the pulsar setup is good first.
----
2019-08-18 05:41:36 UTC - Ming Fang: Yes I’m able to produce and consume

`./bin/pulsar-client --url pulsar://192.168.2.249:6650 produce my-topic --messages "hello-pulsar"`
05:40:54.612 [main] INFO  org.apache.pulsar.client.impl.PulsarClientImpl - Client closing. URL: pulsar://192.168.2.249:6650
05:40:54.624 [pulsar-client-io-1-1] INFO  org.apache.pulsar.client.impl.ProducerImpl - [my-topic] [local-0-26] Closed Producer
05:40:54.628 [main] INFO  org.apache.pulsar.client.cli.PulsarClientTool - 1 messages successfully produced
----
2019-08-18 05:42:56 UTC - Ming Fang: When I run produce, this is on the proxy log
`05:42:06.340 [pulsar-discovery-io-2-1] INFO  org.apache.pulsar.proxy.server.ProxyConnection - [/250.2.89.5:41190] complete connection, init proxy handler. authenticated with none role null, hasProxyToBrokerUrl: true `
----
2019-08-18 05:43:50 UTC - Ming Fang: Compared to localrun, hasProxyToBrokerUrl is true and not false
----
2019-08-18 05:49:49 UTC - Sijie Guo: hasProxyToBrokerUrl just means the connection is in a different state. hasProxyToBrokerUrl == false means it is still looking up the topic metadata; hasProxyToBrokerUrl == true means it already knows the topic metadata and knows which target broker to connect to.
----
2019-08-18 05:50:01 UTC - Sijie Guo: I don’t think that is the reason localrun fails to start.
----
2019-08-18 05:50:13 UTC - Sijie Guo: `Failed to get schema : org.apache.pulsar.client.api.PulsarClientException: `
----
2019-08-18 05:50:29 UTC - Sijie Guo: this error message from localrun is the cause, I guess.
----
2019-08-18 05:50:38 UTC - Sijie Guo: Which function are you running?
----
2019-08-18 05:51:21 UTC - Ming Fang: I’m following the sql tutorial and trying to produce sample data using connectors/pulsar-io-data-generator-2.4.0.nar
----
2019-08-18 05:53:28 UTC - Sijie Guo: ok. let me check
pray : Ming Fang
----
2019-08-18 06:12:05 UTC - Sijie Guo: Oh I see. It seems that there is a bug in how the pulsar proxy forwards the get schema request
----
2019-08-18 06:12:57 UTC - Ming Fang: Thanks for taking the time to debug this
----
2019-08-18 06:15:28 UTC - Sijie Guo: I am creating a pull request for it.
100 : Ming Fang
----
2019-08-18 06:19:57 UTC - Ming Fang: Amazing turnaround time. Thanks!
----
2019-08-18 06:24:33 UTC - Ming Fang: I have a related question about localrun. Instead of the tcp proxy, can I run it through websocket?
`./bin/pulsar-admin sources localrun --name generator --destinationTopicName generator_test -a ./connectors/pulsar-io-data-generator-2.4.0.nar --broker-service-url ws://pulsar-websocket.192.168.2.249.nip.io:80`
----
2019-08-18 06:26:24 UTC - Sijie Guo: @Ming Fang unfortunately no. the function is using the java client which talks to the binary protocol port.
----
2019-08-18 06:27:21 UTC - Sijie Guo: Can you share your thoughts behind websocket? We can discuss whether this is a good feature to add to pulsar functions.
----
2019-08-18 06:28:36 UTC - Ming Fang: Good to know. I’m using Pulsar to build a solution that is almost like IoT. I want the data sources coming in from the internet via websocket
----
2019-08-18 06:31:59 UTC - Ming Fang: I think getting functions to work over websocket is very powerful. Imagine a javascript function runtime where many browsers run the functions.  That’ll be serverless at internet scale
----
2019-08-18 06:35:38 UTC - Sijie Guo: yeah, it makes sense to support websocket for javascript functions.
----
2019-08-18 06:36:05 UTC - Sijie Guo: I thought you were talking about the java/python functions :slightly_smiling_face:
----
2019-08-18 06:38:28 UTC - Ming Fang: I can see java/python functions benefiting also.  Imagine a Raspberry Pi running java/python to collect and process data from the field and then send results or even raw data back to a Pulsar cluster in the cloud.  While TCP will work most of the time, there are some restricted networks that only allow HTTP
----
2019-08-18 06:40:59 UTC - Sijie Guo: Makes sense. In that case, we might just need a java/python client that talks over websocket, so that java and python functions can work seamlessly
----
2019-08-18 06:41:27 UTC - Sijie Guo: Do you want to create feature requests for them? We can look into them in future releases.
----
2019-08-18 06:43:14 UTC - Ming Fang: Yes I’ll submit a feature request. Do you think just one request will do for java/python?
----
2019-08-18 06:43:56 UTC - Ming Fang: And do you think the javascript function runtime + websocket is an entirely different feature request?
----
2019-08-18 06:44:01 UTC - Sijie Guo: One request is fine. We can use that as a master issue for tracking the subtasks.
----
2019-08-18 06:44:21 UTC - Sijie Guo: yea javascript function support should be an entirely different feature request :slightly_smiling_face:
----
2019-08-18 06:44:40 UTC - Sijie Guo: it is a new language runtime.
----
2019-08-18 06:45:03 UTC - Ming Fang: ok I’ll do it in the morning. It’s almost 3am in NYC.  Thanks for your help. Now I can sleep :slightly_smiling_face:
----
2019-08-18 06:45:19 UTC - Sijie Guo: ah. good night :slightly_smiling_face:
----
2019-08-18 06:45:23 UTC - Sijie Guo: thank you
----
2019-08-18 06:45:33 UTC - Ming Fang: :+1:
----
2019-08-18 06:46:17 UTC - Ming Fang: Btw what time zone are you in?
----
2019-08-18 06:48:55 UTC - Sijie Guo: San Francisco :slightly_smiling_face:
----
2019-08-18 06:50:43 UTC - Ming Fang: I see, good night
night_with_stars : Sijie Guo
----