Posted to users@pulsar.apache.org by Apache Pulsar Slack <ap...@gmail.com> on 2018/01/31 17:39:37 UTC

Slack digest for #general - 2018-01-31

2018-01-30 18:52:44 UTC - Matteo Merli: Created <https://github.com/apache/incubator-pulsar/issues/1149> for the max speed override
----
2018-01-30 19:27:48 UTC - Jaebin Yoon: I'm running into an issue with consumer direct memory size related to receiverQueueSize. Since that queue size is on a per-partition basis, when the number of partitions changes and the consumers are somehow slow overall, all consumers can get out-of-direct-memory exceptions. When everything flows nicely it is fine, but when there is a bump, it puts all consumers in a bad state. I don't think it's reasonable to keep adjusting the queue size based on the number of partitions, so I think there should be a high-level queue size per consumer, not per partition.
----
2018-01-30 19:28:24 UTC - Matteo Merli: Yes, already have a fix for that : <https://github.com/apache/incubator-pulsar/pull/1123>
----
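For context on the fix referenced above: PR #1123 adds a shared cap on the total receiver queue size across all partitions of a consumer. The sketch below only illustrates that idea; the method name and the exact division logic are assumptions about how a shared cap bounds per-partition queues, not the actual Pulsar client API.

```java
// Hypothetical sketch of the shared-cap idea behind PR #1123 (names are
// illustrative, not the real Pulsar client API). With a total cap across
// partitions, each partition's receiver queue shrinks as the partition
// count grows, so client direct-memory use stays bounded regardless of
// how many partitions a topic has.
public class ReceiverQueueSizing {
    /** Effective per-partition queue size under a shared total cap. */
    static int perPartitionQueueSize(int receiverQueueSize,
                                     int maxTotalAcrossPartitions,
                                     int numPartitions) {
        int share = Math.max(1, maxTotalAcrossPartitions / numPartitions);
        return Math.min(receiverQueueSize, share);
    }

    public static void main(String[] args) {
        // 4800 partitions with a 1000-message queue each could buffer up to
        // 4.8M messages; a 50k total cap limits each partition to ~10.
        System.out.println(perPartitionQueueSize(1000, 50_000, 4800));
    }
}
```

The point of the cap is that the partition count can change underneath the consumer without the operator re-tuning the per-partition queue size by hand.
----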
2018-01-30 19:28:38 UTC - Jaebin Yoon: oh nice! :slightly_smiling_face:
----
2018-01-30 19:28:56 UTC - Matteo Merli: Waiting for Jenkins to be in the mood to pass CI so it can be merged
----
2018-01-30 19:29:12 UTC - Jaebin Yoon: ok. great to hear that!
----
2018-01-30 19:49:29 UTC - Jaebin Yoon: @Sijie Guo what's the reasonable performance of a bookie on a machine with a 10G interface, 16 cores, and 122G of RAM?
----
2018-01-30 19:49:39 UTC - Jaebin Yoon: in terms of MB per sec?
----
2018-01-30 19:50:25 UTC - Matteo Merli: For write throughput, I’ve seen it mostly limited by journal IO performance, on the order of 130 to 200 MBytes/s
----
2018-01-30 19:50:38 UTC - Jaebin Yoon: i see.
----
2018-01-30 19:50:44 UTC - Matteo Merli: (depending on message size, batching enabled or so)
----
2018-01-30 19:51:06 UTC - Jaebin Yoon: yeah it depends on many things but I just wanted to know the ballpark expectation.
----
2018-01-30 19:51:12 UTC - Jaebin Yoon: thanks @Matteo Merli
----
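As a rough illustration of the ballpark figures quoted above (a back-of-envelope calculation only, not a benchmark): the journal-limited 130–200 MBytes/s translates into message rates as follows, ignoring batching, replication, and protocol overhead.

```java
// Back-of-envelope only: convert the journal-limited 130–200 MB/s write
// figure into messages/sec for a given average entry size. Real numbers
// depend on batching, replication factor, and protocol overhead.
public class ThroughputEnvelope {
    static long messagesPerSecond(long bytesPerSecond, long avgMessageBytes) {
        return bytesPerSecond / avgMessageBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // At 1 KiB messages: 130 MB/s ≈ 133k msg/s, 200 MB/s ≈ 205k msg/s.
        System.out.println(messagesPerSecond(130 * mb, 1024));
        System.out.println(messagesPerSecond(200 * mb, 1024));
    }
}
```
----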
2018-01-30 19:53:59 UTC - Jaebin Yoon: It gets really bad when there are many slow consumers (I've seen the worst with consumers that keep asking for old data), but when it flows smoothly I don't see many read requests to bookies at all, since most of them are served out of the broker cache.
----
2018-01-30 19:55:18 UTC - Sijie Guo: @Jaebin Yoon you mean publishing is bad when there are many slow consumers?
----
2018-01-30 19:56:21 UTC - Jaebin Yoon: at that time, I was using the same RAID0, so yeah, it affected the publishing side and the broker in general.
----
2018-01-30 19:57:03 UTC - Jaebin Yoon: I haven't pushed it that hard since I separated the journals onto a ramdisk.
----
2018-01-30 19:58:52 UTC - Sijie Guo: oh I see. Wondering how many slow consumers there were. Is it possible to get some network-related metrics when this happens again? It would help us understand where the problem comes from (e.g. is the network overwhelmed?).
----
2018-01-30 20:00:46 UTC - Jaebin Yoon: at that time, I just ran a couple of consumers with lots of partitions (4800). I think @Matteo Merli tried to reproduce the condition before
----
2018-01-30 20:01:42 UTC - Sijie Guo: oh i see. will check with @Matteo Merli
----
2018-01-30 20:02:13 UTC - Sijie Guo: thanks @Jaebin Yoon
----
2018-01-30 20:02:20 UTC - Jaebin Yoon: thank YOU :slightly_smiling_face:
----
2018-01-30 20:07:58 UTC - Jaebin Yoon: @Jaebin Yoon uploaded a file: <https://apache-pulsar.slack.com/files/U8CM86831/F914RKZL2/-.txt|Untitled>
----
2018-01-30 20:08:20 UTC - Jaebin Yoon: I keep getting this error on consumers side.. haven't seen this before. any idea?
----
2018-01-30 20:10:38 UTC - Sijie Guo: @Jaebin Yoon checking
----
2018-01-30 20:14:14 UTC - Sijie Guo: do all consumers hit this error?
----
2018-01-30 20:45:34 UTC - Jaebin Yoon: I'm sorry.. I got pulled for lunch. :slightly_smiling_face: Some of them were fine. Let me try again to see if I can see those errors again.
----
2018-01-30 21:18:30 UTC - Jaebin Yoon: hmm I don't see them any more. When that happened, not all but multiple consumers got into that state.
----
2018-01-30 21:44:43 UTC - Sijie Guo: @Jaebin Yoon: @Matteo Merli and I are looking into this one to see what the possible cause could be.
----
2018-01-30 21:54:02 UTC - Jaebin Yoon: Thanks @Sijie Guo
----
2018-01-30 22:51:44 UTC - Matteo Merli: @Jaebin Yoon We haven’t found a direct explanation for the exception. I’m going to add some debug info to the logs for when this happens, to be able to debug further. In the meantime, if you see this same error again, please get a `tcpdump` capture (on port 6650) from the client. That would immensely help in finding the root cause.
----
2018-01-30 22:56:03 UTC - Jaebin Yoon: @Matteo Merli Yeah this was the first time for me as well.. I'll try to get the tcpdump when it happens again.
----
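A client-side capture of the kind requested above would typically look like the command below; treat it as a sketch — the interface name and output path are placeholders to adjust for the actual host.

```shell
# Capture Pulsar binary-protocol traffic on the client (port 6650).
# Interface (eth0) and output path are placeholders for the real host.
# -s 0 keeps full packets; -w writes a pcap file that can be shared.
sudo tcpdump -i eth0 -s 0 -w /tmp/pulsar-6650.pcap 'tcp port 6650'
```
----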
2018-01-30 22:58:41 UTC - Jaebin Yoon: I manually tossed the bundles on the hot brokers and those bundles were picked up by less busy brokers, so I could scale up as I wanted. It would've been really nice if this had happened automatically, though.
----
2018-01-30 23:47:24 UTC - Matteo Merli: :+1: Yes, that is the intended behavior, once the 1Gbps/10Gbps config is resolved
----
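For reference, the manual bundle move described above is normally done through the admin CLI; the namespace and bundle range below are placeholders, not values from this conversation.

```shell
# Hand off a hot bundle manually (namespace and bundle range are
# placeholders). Unloading closes the bundle on its current broker; the
# load manager then reassigns it, normally to a less loaded broker.
pulsar-admin namespaces unload --bundle 0x00000000_0x40000000 my-prop/us-west/my-ns
```
----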
2018-01-30 23:48:16 UTC - Matteo Merli: Do you guys know of any way to automatically get NIC info in EC2? Or do you just rely on manual configuration?
----
2018-01-31 01:26:26 UTC - Jaebin Yoon: There is currently no way to get that info from the VM directly.
----
2018-01-31 01:26:51 UTC - Jaebin Yoon: AWS doesn't provide it, so it has to be in the configuration
----
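Issue #1149, created at the top of this digest, tracks exactly this override. In later Pulsar releases the manual setting surfaced in `broker.conf` as the option below; the name comes from the eventual change, so treat this as a pointer rather than something available at the time of this conversation.

```properties
# broker.conf: override NIC speed auto-detection, since EC2 VMs do not
# expose the link speed. Value is in Gbps; set it to the instance's
# known network capacity.
loadBalancerOverrideBrokerNicSpeedGbps=10
```
----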
2018-01-31 02:23:04 UTC - Matteo Merli: Ok. @Jaebin Yoon I have a question about your bookie deployment. Is it embedded like the broker? Or is it using the regular scripts and config files? Can you share all the flags you’re passing to the JVM?
----
2018-01-31 02:26:03 UTC - Jaebin Yoon: It's embedded in a simple wrapper. Here are the params from the running JVM:
```/usr/lib/jvm/java-8-oracle/bin/java -server -Dsnappy.bufferSize=32768 -Dlog4j.configuration=file:///apps/pulsarbookie/conf/log4j.properties -XX:+UseCompressedOops -XX:+DisableExplicitGC -Xms24g -Xmx24g -XX:MaxDirectMemorySize=16g -verbose:gc -Xloggc:/logs/pulsarbookie/pulsarbookie-gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=30 -XX:GCLogFileSize=10M -XX:+PreserveFramePointer -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Djava.awt.headless=true -Dnetflix.environment=prod -Dnetflixconfiguration.nextGen=true -Dnetflix.ec2_instance_type=d2.4xlarge -Dnetflix.datacenter=cloud -Djute.maxbuffer=10485760 -Djava.net.preferIPv4Stack=true -XX:+UseG1GC -XX:MaxGCPauseMillis=10 -XX:+ParallelRefProcEnabled -XX:+UnlockExperimentalVMOptions -XX:+AggressiveOpts -XX:+DoEscapeAnalysis -XX:ParallelGCThreads=32 -XX:ConcGCThreads=32 -XX:G1NewSizePercent=50 -XX:+DisableExplicitGC -XX:-ResizePLAB -Dio.netty.leakDetectionLevel=disabled -Dio.netty.recycler.maxCapacity.default=1000 -Dio.netty.recycler.linkCapacity=1024 -cp /apps/pulsarbookie/lib/*:/apps/pulsarbookie/conf com.netflix.pulsarbookie.PulsarBookie```
----
2018-01-31 02:26:34 UTC - Matteo Merli: :+1:
----
2018-01-31 05:10:15 UTC - 宮田 泰宏: @宮田 泰宏 has joined the channel
----