Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2020/05/27 21:46:02 UTC

[GitHub] [pulsar] ckdarby opened a new issue #7058: Pulsar on EBS having poor performance

ckdarby opened a new issue #7058:
URL: https://github.com/apache/pulsar/issues/7058


   **Describe the bug**
   Reading backlog (from the earliest offset) on EBS-backed bookies yields only 60-100 MB/s of reads per partition, far below the 200-300 MB/s the bookie host can read directly off the EBS volume.
   
   **To Reproduce**
   Steps to reproduce the behavior:
   1. We're using the Pulsar Helm chart on AWS EKS: https://github.com/apache/pulsar-helm-chart/commit/6e9ad25ba322f6f0fc7c11c66fb88faa6d0218db
   2. Our values.yaml overrides look like this:
   
   ```yaml
   pulsar:
     namespace: cory-ebs-test
     components:
       pulsar_manager: false # UI is outdated and won't load without errors
     auth:
       authentication:
         enabled: true
     bookkeeper:
       resources:
         requests:
           memory: 11560Mi
           cpu: 1.5
       volumes:
         journal:
           size: 100Gi
         ledgers:
           size: 5Ti
       configData:
         # `BOOKIE_MEM` is used for `bookie shell`
         BOOKIE_MEM: >
           "
           -Xms1280m
           -Xmx10800m
           -XX:MaxDirectMemorySize=10800m
           "
         # we use `bin/pulsar` for starting bookie daemons
         PULSAR_MEM: >
           "
           -Xms10800m
           -Xmx10800m
           -XX:MaxDirectMemorySize=10800m
           "
         # configure the memory settings based on jvm memory settings
         dbStorage_writeCacheMaxSizeMb: "2500" #pulsar docs say 25%
         dbStorage_readAheadCacheMaxSizeMb: "2500" #pulsar docs say 25%
         dbStorage_rocksDB_writeBufferSizeMB: "64" #pulsar docs had 64
         dbStorage_rocksDB_blockCacheSize: "1073741824" #pulsar docs say 10%
         readBufferSizeBytes: "8096" #attempted doubling
     autorecovery:
       resources:
         requests:
           memory: 2048Mi
           cpu: 1
       configData:
         BOOKIE_MEM: >
           "
           -Xms1500m -Xmx1500m
           "
     broker:
       resources:
         requests:
           memory: 4096Mi
           cpu: 1
       configData:
         PULSAR_MEM: >
           "
           -Xms1024m -Xmx4096m -XX:MaxDirectMemorySize=4096m
           -Dio.netty.leakDetectionLevel=disabled
           -Dio.netty.recycler.linkCapacity=1024
           -XX:+ParallelRefProcEnabled
           -XX:+UnlockExperimentalVMOptions
           -XX:+DoEscapeAnalysis
           -XX:ParallelGCThreads=4
           -XX:ConcGCThreads=4
           -XX:G1NewSizePercent=50
           -XX:+DisableExplicitGC
           -XX:-ResizePLAB
           -XX:+ExitOnOutOfMemoryError
           -XX:+PerfDisableSharedMem
           "
     proxy:
       resources:
         requests:
           memory: 4096Mi
           cpu: 1
       configData:
         PULSAR_MEM: >
           "
           -Xms1024m -Xmx4096m -XX:MaxDirectMemorySize=4096m
           -Dio.netty.leakDetectionLevel=disabled
           -Dio.netty.recycler.linkCapacity=1024
           -XX:+ParallelRefProcEnabled
           -XX:+UnlockExperimentalVMOptions
           -XX:+DoEscapeAnalysis
           -XX:ParallelGCThreads=4
           -XX:ConcGCThreads=4
           -XX:G1NewSizePercent=50
           -XX:+DisableExplicitGC
           -XX:-ResizePLAB
           -XX:+ExitOnOutOfMemoryError
           -XX:+PerfDisableSharedMem
           "
       service:
         annotations:
           service.beta.kubernetes.io/aws-load-balancer-type: nlb
           external-dns.alpha.kubernetes.io/hostname: pulsar.internal.ckdarby
     toolset:
       resources:
         requests:
           memory: 1028Mi
           cpu: 1
       configData:
         PULSAR_MEM: >
           "
           -Xms640m
           -Xmx1028m
           -XX:MaxDirectMemorySize=1028m
           "
     grafana:
       service:
         annotations:
           external-dns.alpha.kubernetes.io/hostname: grafana.internal.ckdarby
       admin:
         user: admin
         password: 12345
   ```
   
   3. Produce messages to a multi-partitioned topic:
   - Partitioned into 8 partitions
   - Average message size is ~1.5 KB
   - Retention set to 7 days
   - We're storing ~2-8 TB of retained data at times
   
   4. Attempt to consume messages with the start offset set to earliest (thus skipping the RocksDB read cache and going to the backlog):
   
   We have tried the Flink Pulsar connector.
   Below is a run of Pulsar's perf reader from the toolset pod against a single partition of the topic:
   
   ```json
   {
     "confFile" : "/pulsar/conf/client.conf",
     "topic" : [ "persistent://public/cory/test-ebs-partition-5" ],
     "numTopics" : 1,
     "rate" : 0.0,
     "startMessageId" : "earliest",
     "receiverQueueSize" : 1000,
     "maxConnections" : 100,
     "statsIntervalSeconds" : 0,
     "serviceURL" : "pulsar://cory-ebs-test-pulsar-proxy:6650/",
     "authPluginClassName" : "org.apache.pulsar.client.impl.auth.AuthenticationToken",
     "authParams" : "file:///pulsar/tokens/client/token",
     "useTls" : false,
     "tlsTrustCertsFilePath" : ""
   }
   ```
   
   
   5. Check Grafana, the EBS graphs, etc.:
   - Really poor read performance from Pulsar: 60-100 MB/s on the partition
   - No bottlenecks visible anywhere
   
   **Expected behavior**
   Pulsar is reading only 60-100 MB/s off each partition.
   I would expect something closer to what the bookie is actually able to read off EBS directly: 200-300 MB/s.
   
   
   **Additional context**
   Here is a real example with everything I could pull. The perf reader starts at 18:31:17 UTC and ends at 18:46:37 UTC; all the graphs cover that window and are in UTC.
   
   **Perf Reader Output**
   ```text
   18:31:17.389 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 58250.685  msg/s -- 647.672 Mbit/s
   18:31:27.389 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 58523.641  msg/s -- 667.659 Mbit/s
   18:31:37.390 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 61314.984  msg/s -- 688.519 Mbit/s
   18:31:47.390 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 64920.905  msg/s -- 748.406 Mbit/s
   18:31:57.390 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 64340.229  msg/s -- 732.601 Mbit/s
   ...
   18:42:17.416 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 64034.036  msg/s -- 723.160 Mbit/s
   18:42:27.419 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 63048.031  msg/s -- 700.458 Mbit/s
   18:42:37.421 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 69958.533  msg/s -- 817.095 Mbit/s
   18:42:47.422 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 69898.133  msg/s -- 827.770 Mbit/s
   18:42:57.422 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 62989.179  msg/s -- 726.990 Mbit/s
   18:43:07.422 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 63500.736  msg/s -- 728.683 Mbit/s
   ...
   18:45:37.430 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 55052.395  msg/s -- 645.263 Mbit/s
   18:45:47.431 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 72004.353  msg/s -- 804.856 Mbit/s
   18:45:57.431 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 86224.170  msg/s -- 954.399 Mbit/s
   18:46:07.431 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 80231.708  msg/s -- 905.096 Mbit/s
   18:46:17.432 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 73065.824  msg/s -- 864.556 Mbit/s
   ```
   
   
   
   **Bookie reading directly from EBS**
   The disk cache was flushed beforehand, and this was captured before running the perf reader.
   ![Selection_292](https://user-images.githubusercontent.com/220283/83058097-d40bd200-a025-11ea-8f9e-bd058ba0468a.png)
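   For reference, the direct-disk read check was roughly of this shape (a sketch; the entry-log file name is a placeholder, not the exact commands used):
   
   ```bash
   # Drop the OS page cache so the read actually hits EBS (needs root on the node)
   sync && echo 3 > /proc/sys/vm/drop_caches
   
   # Sequential read of an entry log from the ledgers volume; dd reports throughput at the end
   # (the file name is a placeholder - any large entry log under the ledgers dir works)
   dd if=/pulsar/data/bookkeeper/ledgers/current/0.log of=/dev/null bs=1M
   ```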
   
   
   **EC2 instances**
   Count: 13
   Type: r5.large
   AZ: All in us-west-2c
   All within Kubernetes
   
   
   **EBS**
   ![Selection_009](https://user-images.githubusercontent.com/220283/83074778-b9932200-a040-11ea-92b5-0499eb19c32a.png)
   
   **Grafana Overview**
   ![Selection_008](https://user-images.githubusercontent.com/220283/83074863-e34c4900-a040-11ea-921c-e1ece4471452.png)
   
   **JVM**
   
   Bookie
   ![Selection_002](https://user-images.githubusercontent.com/220283/83074952-070f8f00-a041-11ea-82b0-cc87438c7dd9.png)
   
   Broker
   ![Selection_003](https://user-images.githubusercontent.com/220283/83074964-0ecf3380-a041-11ea-8265-156c292671c7.png)
   
   Recovery
   ![Selection_004](https://user-images.githubusercontent.com/220283/83074971-155dab00-a041-11ea-88b6-c9b13411f91d.png)
   
   Zookeeper
   ![Selection_005](https://user-images.githubusercontent.com/220283/83074988-1b538c00-a041-11ea-9200-3df3b9d121bf.png)
   
   **Bookie**
   ![Selection_006](https://user-images.githubusercontent.com/220283/83075039-31614c80-a041-11ea-8865-8fe7a2bb7acd.png)
   
   ![Selection_007](https://user-images.githubusercontent.com/220283/83075048-358d6a00-a041-11ea-88ad-6b9cb5894d73.png)
   
   **Specifically public/cory/test-ebs-partition-5**
   ![Selection_001](https://user-images.githubusercontent.com/220283/83075150-5f469100-a041-11ea-9ee7-b078fb65b99d.png)
    
   
   





[GitHub] [pulsar] ashwallace commented on issue #7058: Pulsar on EBS having poor performance

ashwallace commented on issue #7058:
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-635024332


   EC2 r5.large has a baseline EBS throughput of ~81 MB/s at 128 KB I/O sizes, and it looks like you're more or less at that number.
   Because of this, you likely won't benefit from a provisioned IOPS EBS volume type, nor from striping multiple gp2 volumes together.
   
   Therefore your options are:
   
   1) Horizontally scale your Pulsar cluster by adding another node to achieve higher aggregate performance for the whole cluster - this might also put you in a better reliability/availability position thanks to mirroring.
   
   2) Otherwise, change the AWS EC2 instance size or type, since the r5.large's baseline performance is the limit.
   Review the *second* table on this page to find an instance more suitable for your needs: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-optimized.html
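   Rough arithmetic behind that (assuming the ~81 MB/s figure comes from the 650 Mbps baseline listed for r5.large in that EBS-optimized table):
   
   ```text
   650 Mbit/s ÷ 8 bits/byte ≈ 81 MB/s            (r5.large EBS baseline)
   60-100 MB/s × 8          ≈ 480-800 Mbit/s     (roughly the range the perf reader reports)
   ```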
   





[GitHub] [pulsar] ckdarby commented on issue #7058: Pulsar on EBS having poor performance

ckdarby commented on issue #7058:
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-636275897


   > Should be big enough to hold a significant portion of the index database
   
   @sijie This is what is stored in /pulsar/data/bookkeeper/ledgers/current/locations, correct?
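   For reference, I'm checking its size on each bookie with something like:
   
   ```bash
   # Total on-disk size of the entry-location index (RocksDB) for this bookie
   du -sh /pulsar/data/bookkeeper/ledgers/current/locations
   ```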





[GitHub] [pulsar] ashwallace edited a comment on issue #7058: Pulsar on EBS having poor performance

ashwallace edited a comment on issue #7058:
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-637255249


   Great work!
   I interpreted the graph as a single EC2 instance, but it must have been the aggregate of all 13 instances.
   Make sure to resize EC2 again to optimize cost.





[GitHub] [pulsar] ckdarby commented on issue #7058: Pulsar on EBS having poor performance

ckdarby commented on issue #7058:
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-635048127


   @ashwallace But that doesn't explain why, on the pod itself, I'm able to read from EBS at 200 MB/s using `dd`.
   
   I will change the cluster to an instance type with a higher baseline throughput and check the results.





[GitHub] [pulsar] ashwallace commented on issue #7058: Pulsar on EBS having poor performance

ashwallace commented on issue #7058:
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-637255249


   Great work!
   I interpreted the graph as a single EC2 instance, but it must have been the aggregate of all 13 instances.
   Make sure to resize EC2 again to optimize cost.





[GitHub] [pulsar] sijie commented on issue #7058: Pulsar on EBS having poor performance

sijie commented on issue #7058:
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-636240424


   @ckdarby 
   
   For reading from the backlog:
   
   1) Tune the RocksDB block cache size so the entry index can be held in memory.
   
   ```
   # Size of RocksDB block-cache. For best performance, this cache
   # should be big enough to hold a significant portion of the index
   # database which can reach ~2GB in some cases
   # Default is to use 10% of the direct memory size
   dbStorage_rocksDB_blockCacheSize=
   ```
   
   2) On the broker side, increase the batch read size.
   
   ```
   # Max number of entries to read from BookKeeper in a single dispatcher read request
   # Default is 100
   dispatcherMaxReadBatchSize=
   ```
   
   Then check the I/O wait metrics on the ledgers directory to see if there is any bottleneck.
   
   If there is a bottleneck, try using multiple ledger directories on the ledger disk. This can increase parallelism, but make sure you also have multiple partitions, because the entries of the same partition always go to the same ledger directory.
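   In terms of the Helm values overrides in the issue, these two knobs could be applied roughly like this (a sketch only; it assumes the chart forwards configData entries to bookkeeper.conf/broker.conf the same way it does for the settings already in the values file, and the numbers are placeholders, not recommendations):
   
   ```yaml
   pulsar:
     bookkeeper:
       configData:
         # RocksDB block cache, sized to hold most of the entry-location index (bytes)
         dbStorage_rocksDB_blockCacheSize: "2147483648"
     broker:
       configData:
         # Read more entries from BookKeeper per dispatcher read when draining backlog
         dispatcherMaxReadBatchSize: "1000"
   ```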
   
   
   





[GitHub] [pulsar] ckdarby commented on issue #7058: Pulsar on EBS having poor performance

ckdarby commented on issue #7058:
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-636369850


   @sijie Firstly, thank you for everything :bow:.
   
   The /pulsar/data/bookkeeper/ledgers/current/locations directories are very small. None of the bookies has a locations directory larger than 50 MB in total, and I have dbStorage_rocksDB_blockCacheSize set to 1 GB at the moment.
   
   Happy to report that I changed:
   dispatcherMaxReadBatchSize=10000
   
   Results were mixed, ranging from decent to extremely good.
   
   Extremely good results such as:
   ```
   03:26:05.111 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 243598.835  msg/s -- 2787.767 Mbit/s
   03:26:15.111 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 242916.023  msg/s -- 2779.953 Mbit/s
   03:26:25.112 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 241419.952  msg/s -- 2762.832 Mbit/s
   03:26:35.112 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 236405.226  msg/s -- 2705.443 Mbit/s
   ```
   
   The extremely good results are probably when the broker & bookie happen to be on the same node, or when all three (proxy, broker, and bookie) happen to be on the same node.
   
   Other results on different partitions (not recorded exactly) looked something like:
   ```
   Read throughput: 100k  msg/s -- 1250 Mbit/s
   ```
   
   This is probably when the proxy, broker, and bookie are all on different nodes, or at least when the broker & bookie are on different nodes.
   
   Still much better read throughput off EBS than before :+1:
   
   Thanks for everything :)





[GitHub] [pulsar] ashwallace edited a comment on issue #7058: Pulsar on EBS having poor performance

ashwallace edited a comment on issue #7058:
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-635024332


   @ckdarby @sijie 
   EC2 r5.large has a baseline EBS throughput of ~81 MB/s at 128 KB I/O sizes, and it looks like you're more or less at that number.
   Because of this, you likely won't benefit from a provisioned IOPS EBS volume type, from striping multiple gp2 volumes together, or from BookKeeper config tweaks.
   
   Therefore your options are:
   
   1) Horizontally scale BookKeeper by adding another node to achieve higher aggregate performance for the whole cluster - this might also put you in a better reliability/availability position thanks to mirroring.
   
   2) Otherwise, change the AWS EC2 instance size or type, since the r5.large's baseline performance is the limit.
   Review the *second* table on this page to find an instance more suitable for your needs: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-optimized.html
   





[GitHub] [pulsar] ckdarby edited a comment on issue #7058: Pulsar on EBS having poor performance

ckdarby edited a comment on issue #7058:
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-635048127


   @ashwallace But that doesn't explain why, on the pod itself, I'm able to read from EBS at 200 MB/s using `dd`.
   
   I will change the cluster to an instance type with a higher baseline throughput and check the results.





[GitHub] [pulsar] ashwallace edited a comment on issue #7058: Pulsar on EBS having poor performance

ashwallace edited a comment on issue #7058:
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-635024332


   @ckdarby @sijie 
   EC2 r5.large has a baseline EBS throughput of ~81 MB/s at 128 KB I/O sizes, and it looks like you're more or less at that number.
   Because of this, you likely won't benefit from a provisioned IOPS EBS volume type, nor from striping multiple gp2 volumes together.
   
   Therefore your options are:
   
   1) Horizontally scale BookKeeper by adding another node to achieve higher aggregate performance for the whole cluster - this might also put you in a better reliability/availability position thanks to mirroring.
   
   2) Otherwise, change the AWS EC2 instance size or type, since the r5.large's baseline performance is the limit.
   Review the *second* table on this page to find an instance more suitable for your needs: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-optimized.html
   





[GitHub] [pulsar] ckdarby commented on issue #7058: Pulsar on EBS having poor performance

ckdarby commented on issue #7058:
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-635684753


   @ashwallace Upgraded to c5.4xlarge.
   
   ```text
   23:47:19.989 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 59245.241  msg/s -- 678.398 Mbit/s
   23:47:29.989 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 56308.718  msg/s -- 648.628 Mbit/s
   23:47:39.996 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 52245.146  msg/s -- 593.771 Mbit/s
   23:47:49.996 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 59516.311  msg/s -- 657.495 Mbit/s
   23:47:59.996 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 49080.654  msg/s -- 538.890 Mbit/s
   23:48:09.996 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - Read throughput: 61823.387  msg/s -- 683.237 Mbit/s
   ```
   
   Still seeing the same kind of issue. Tomorrow I can rerun the test and repost all the graphs.
   
   @sijie I'm thinking this pretty much rules out EBS/AWS as the issue. I saw the same thing even after taking Ash's advice to get a better EBS baseline throughput. The per-op I/O size tends to average around 150 KiB. Are there any specific bookie parameters that determine how much it reads off the ledger into the read cache at a time?
   
   This now mostly looks like a bookie tuning issue: the prefetch on a read-cache miss during backlog reads isn't being very aggressive. Let me know, or pull in someone better suited on the Pulsar side.
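   The closest-looking knobs I can find are these, sketched here as values overrides (the numbers are just illustrative) - is dbStorage_readAheadCacheBatchSize the setting that controls how much is prefetched on a cache miss?
   
   ```yaml
   pulsar:
     bookkeeper:
       configData:
         # Size of the read-ahead cache (we already set this to 2500 above)
         dbStorage_readAheadCacheMaxSizeMb: "2500"
         # Number of entries read ahead after a read-cache miss
         dbStorage_readAheadCacheBatchSize: "1000"
   ```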





[GitHub] [pulsar] ckdarby commented on issue #7058: Pulsar on EBS having poor performance

ckdarby commented on issue #7058:
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-635081278


   @ashwallace Thanks for everything. I'll nuke the cluster, change to an instance type with a higher baseline, and let you know.
   
   Here are my other comments...
   
   > In your graphs, the bookie is also exceeding the baseline (bursting) often (150 MB/s peak observed).
   
   That view you're referencing is the bookie overview; it includes all four bookies.
   
   What it looks like on just the "public/cory/test-ebs-partition-5" topic, which is what the Pulsar perf reader was run against, can be seen [here](https://user-images.githubusercontent.com/220283/83075150-5f469100-a041-11ea-9ee7-b078fb65b99d.png).
   
   Much less than 150 MB/s :(.
   
   > Generally, no application will attain the performance that a benchmarking tool would
   
   Of course, but I was hoping for even half of that, which would be closer to ~100 MB/s.
   
   > Interestingly enough, the Pulsar documentation itself recommends i3 instance types which have their own NVMe disks
   
   That guide covers the non-Kubernetes, bare-metal deployment. On Kubernetes, AWS's managed offering (EKS) uses EBS gp2 as the default PVC storage class. With Kubernetes I typically try to avoid specialised nodes like the i3 type, because then you have to limit which pods land on them - you only want pods that will actually leverage the NVMe SSDs.
   
   
   
   





[GitHub] [pulsar] sijie commented on issue #7058: Pulsar on EBS having poor performance

sijie commented on issue #7058:
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-636364920


   Yes, that's correct.





[GitHub] [pulsar] ckdarby closed issue #7058: Pulsar on EBS having poor performance

ckdarby closed issue #7058:
URL: https://github.com/apache/pulsar/issues/7058


   





[GitHub] [pulsar] sijie commented on issue #7058: Pulsar on EBS having poor performance

sijie commented on issue #7058:
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-637999749


   @ckdarby awesome!





[GitHub] [pulsar] ckdarby commented on issue #7058: Pulsar on EBS having poor performance

ckdarby commented on issue #7058:
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-635011079


   Talked to @cdbartholomew today and he mentioned that he appears to be hitting the same issue, but on Azure.
   
   Read performance is nowhere near what the disk is capable of.
   
   @sijie If there is anything else you need me to pull from the cluster, just let me know and I'll grab the info, or I'll retry with whatever configuration you suggest.
   
   I'll reach out to my AWS Solutions Architect to find out if they're aware of any obvious optimisation tweaks that can be done at the node level when interacting with EBS.





[GitHub] [pulsar] ckdarby commented on issue #7058: Pulsar on EBS having poor performance

ckdarby commented on issue #7058:
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-634959473


   @sijie I was finally able to put together the issue that I mentioned on Slack.





[GitHub] [pulsar] ckdarby commented on issue #7058: Pulsar on EBS having poor performance

ckdarby commented on issue #7058:
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-637851015


   100% solved - ran against production today.
   
   We decreased dispatcherMaxReadBatchSize from 10000 to 500 for our Flink job, because reading from all 8 partitions at 10000 was OOM'ing the brokers with the memory allocation we had set.
   
   Running in production, the Flink graphs show 4x the throughput we were getting before, and Pulsar's Grafana shows matching graphs with 4x the throughput.
   
   ![image (1)](https://user-images.githubusercontent.com/220283/83577167-8dc0e200-a501-11ea-9232-c69bb960e6cc.png)
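   In values.yaml terms, the final broker-side override looks roughly like this (same structure as the overrides in the issue body; sized to fit our broker memory):
   
   ```yaml
   pulsar:
     broker:
       configData:
         # 10000 OOM'd the brokers when all 8 partitions were read at once; 500 fits our memory allocation
         dispatcherMaxReadBatchSize: "500"
   ```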
   





[GitHub] [pulsar] ashwallace commented on issue #7058: Pulsar on EBS having poor performance

ashwallace commented on issue #7058:
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-635066612


   > @ashwallace But that doesn't explain why, on the pod itself, I'm able to read from EBS at 200 MB/s using `dd`.
   > 
   > I will change the cluster to an instance type with a higher baseline throughput and check the results.
   
   @ckdarby All EC2 instances and EBS volumes have a burst period (approx 30 mins a day - the timing and availability of bursting depends on the usage pattern). `dd` only has a single job to do on the disk, so it can exceed the baseline (burst) without any tweaking. Pulsar is a bit more sophisticated, so it will never look like `dd` performance.
   
   In your graphs, the bookie is also exceeding the baseline (bursting) often (150 MB/s peak observed). Your workload's storage utilization appears pretty consistent though, so it is likely you are consuming burst capacity as quickly as it becomes available and therefore not sustaining those higher throughputs. These are promising indicators that you may benefit from an EC2 instance with a higher baseline.
   
   Side notes: generally, no application will attain the performance that a benchmarking tool would, simply because apps have more to do, and the unit of data an app operates on may not align with the peak testing scenarios. Database servers get close, but I wouldn't expect pub/sub messaging to. For this reason you should not over-focus on attaining these peaks, and throughput shouldn't be your only performance goal. Generally speaking, smaller I/O sizes have lower (better) response time but lower (worse) throughput, while larger I/O has higher (better) throughput but higher (worse) response time. Which is best for your actual production needs? Don't forget the application metrics, i.e. how many messages/sec you actually need on your worst day - work backwards from there.
   
   Interestingly enough, the Pulsar documentation itself recommends i3 instance types, which have their own NVMe disks and will go well beyond EBS performance: https://pulsar.apache.org/docs/v2.0.1-incubating/deployment/aws-cluster/
   
   I hope this helps you find the best instance size/instance count/cost ratio for your needs. Keen to hear your next findings.
   
   
   
   
   
   





[GitHub] [pulsar] ckdarby edited a comment on issue #7058: Pulsar on EBS having poor performance

ckdarby edited a comment on issue #7058:
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-636275897


   > Should be big enough to hold a significant portion of the index database
   
   @sijie This is what is stored in /pulsar/data/bookkeeper/ledgers/current/locations, correct?





[GitHub] [pulsar] sijie commented on issue #7058: Pulsar on EBS having poor performance

sijie commented on issue #7058:
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-637184154


   @ckdarby Glad that you have sorted things out.

