Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2020/05/27 21:46:02 UTC
[GitHub] [pulsar] ckdarby opened a new issue #7058: Pulsar on EBS having poor performance
URL: https://github.com/apache/pulsar/issues/7058
**Describe the bug**
Reading backlog from EBS-backed bookies yields far lower throughput (60-100 MB/s per partition) than the underlying EBS volumes can sustain (200-300 MB/s).
**To Reproduce**
Steps to reproduce the behavior:
1. We installed Pulsar via the Helm chart on AWS EKS: https://github.com/apache/pulsar-helm-chart/commit/6e9ad25ba322f6f0fc7c11c66fb88faa6d0218db
2. Our values.yaml overrides look like this:
```yaml
pulsar:
  namespace: cory-ebs-test
  components:
    pulsar_manager: false # UI is outdated and won't load without errors
  auth:
    authentication:
      enabled: true
  bookkeeper:
    resources:
      requests:
        memory: 11560Mi
        cpu: 1.5
    volumes:
      journal:
        size: 100Gi
      ledgers:
        size: 5Ti
    configData:
      # `BOOKIE_MEM` is used for `bookie shell`
      BOOKIE_MEM: >
        "
        -Xms1280m
        -Xmx10800m
        -XX:MaxDirectMemorySize=10800m
        "
      # we use `bin/pulsar` for starting bookie daemons
      PULSAR_MEM: >
        "
        -Xms10800m
        -Xmx10800m
        -XX:MaxDirectMemorySize=10800m
        "
      # configure the memory settings based on jvm memory settings
      dbStorage_writeCacheMaxSizeMb: "2500" # pulsar docs say 25%
      dbStorage_readAheadCacheMaxSizeMb: "2500" # pulsar docs say 25%
      dbStorage_rocksDB_writeBufferSizeMB: "64" # pulsar docs had 64
      dbStorage_rocksDB_blockCacheSize: "1073741824" # pulsar docs say 10%
      readBufferSizeBytes: "8096" # attempted doubling
  autorecovery:
    resources:
      requests:
        memory: 2048Mi
        cpu: 1
    configData:
      BOOKIE_MEM: >
        "
        -Xms1500m -Xmx1500m
        "
  broker:
    resources:
      requests:
        memory: 4096Mi
        cpu: 1
    configData:
      PULSAR_MEM: >
        "
        -Xms1024m -Xmx4096m -XX:MaxDirectMemorySize=4096m
        -Dio.netty.leakDetectionLevel=disabled
        -Dio.netty.recycler.linkCapacity=1024
        -XX:+ParallelRefProcEnabled
        -XX:+UnlockExperimentalVMOptions
        -XX:+DoEscapeAnalysis
        -XX:ParallelGCThreads=4
        -XX:ConcGCThreads=4
        -XX:G1NewSizePercent=50
        -XX:+DisableExplicitGC
        -XX:-ResizePLAB
        -XX:+ExitOnOutOfMemoryError
        -XX:+PerfDisableSharedMem
        "
  proxy:
    resources:
      requests:
        memory: 4096Mi
        cpu: 1
    configData:
      PULSAR_MEM: >
        "
        -Xms1024m -Xmx4096m -XX:MaxDirectMemorySize=4096m
        -Dio.netty.leakDetectionLevel=disabled
        -Dio.netty.recycler.linkCapacity=1024
        -XX:+ParallelRefProcEnabled
        -XX:+UnlockExperimentalVMOptions
        -XX:+DoEscapeAnalysis
        -XX:ParallelGCThreads=4
        -XX:ConcGCThreads=4
        -XX:G1NewSizePercent=50
        -XX:+DisableExplicitGC
        -XX:-ResizePLAB
        -XX:+ExitOnOutOfMemoryError
        -XX:+PerfDisableSharedMem
        "
    service:
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-type: nlb
        external-dns.alpha.kubernetes.io/hostname: pulsar.internal.ckdarby
  toolset:
    resources:
      requests:
        memory: 1028Mi
        cpu: 1
    configData:
      PULSAR_MEM: >
        "
        -Xms640m
        -Xmx1028m
        -XX:MaxDirectMemorySize=1028m
        "
  grafana:
    service:
      annotations:
        external-dns.alpha.kubernetes.io/hostname: grafana.internal.ckdarby
    admin:
      user: admin
      password: 12345
```
3. Produce messages to a multi-partitioned topic:
- 8 partitions
- Average message size is ~1.5 KB
- Retention set to 7 days
- We're storing ~2-8 TB of retention at times
4. Attempt to consume messages with the offset set to earliest (thus skipping the RocksDB read cache and going to the backlog):
- Tried the Flink Pulsar connector
- Ran Pulsar's perf reader from the toolset pod against a single partition
```json
{
"confFile" : "/pulsar/conf/client.conf",
"topic" : [ "persistent://public/cory/test-ebs-partition-5" ],
"numTopics" : 1,
"rate" : 0.0,
"startMessageId" : "earliest",
"receiverQueueSize" : 1000,
"maxConnections" : 100,
"statsIntervalSeconds" : 0,
"serviceURL" : "pulsar://cory-ebs-test-pulsar-proxy:6650/",
"authPluginClassName" : "org.apache.pulsar.client.impl.auth.AuthenticationToken",
"authParams" : "file:///pulsar/tokens/client/token",
"useTls" : false,
"tlsTrustCertsFilePath" : ""
}
```
5. Check Grafana, the EBS graphs, etc.:
- Really poor performance from Pulsar: 60-100 MB/s on the partition
- No visible bottlenecks
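As an aside, the bookkeeper cache values in the overrides above can be sanity-checked against the sizing comments (25% of direct memory for the write and read-ahead caches, 10% for the RocksDB block cache); a hypothetical quick check against the configured `-XX:MaxDirectMemorySize=10800m`:

```shell
# Sanity-check the cache sizing comments against the configured
# -XX:MaxDirectMemorySize=10800m (all values in MB).
DIRECT_MB=10800
echo "25% for dbStorage_writeCacheMaxSizeMb:     $((DIRECT_MB * 25 / 100)) MB"  # configured: 2500
echo "25% for dbStorage_readAheadCacheMaxSizeMb: $((DIRECT_MB * 25 / 100)) MB"  # configured: 2500
echo "10% for dbStorage_rocksDB_blockCacheSize:  $((DIRECT_MB * 10 / 100)) MB"  # configured: 1024 MB (1073741824 bytes)
```

The configured 2500 MB is a little under the 25% guideline (2700 MB), and 1 GiB is just under 10% (1080 MB), so the overrides are in the right ballpark.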
**Expected behavior**
Pulsar is reading 60-100 MB/s off each partition. I would expect closer to the 200-300 MB/s the bookie can actually read off EBS.
**Additional context**
Here is a real example of everything I could pull; the perf reader starts at 18:31:17 UTC and ends at 18:46:37 UTC. All graphs cover that window and are in UTC.
**Perf Reader Output**
```text
18:31:17.389 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 58250.685 msg/s -- 647.672 Mbit/s
18:31:27.389 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 58523.641 msg/s -- 667.659 Mbit/s
18:31:37.390 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 61314.984 msg/s -- 688.519 Mbit/s
18:31:47.390 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 64920.905 msg/s -- 748.406 Mbit/s
18:31:57.390 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 64340.229 msg/s -- 732.601 Mbit/s
...
18:42:17.416 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 64034.036 msg/s -- 723.160 Mbit/s
18:42:27.419 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 63048.031 msg/s -- 700.458 Mbit/s
18:42:37.421 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 69958.533 msg/s -- 817.095 Mbit/s
18:42:47.422 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 69898.133 msg/s -- 827.770 Mbit/s
18:42:57.422 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 62989.179 msg/s -- 726.990 Mbit/s
18:43:07.422 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 63500.736 msg/s -- 728.683 Mbit/s
...
18:45:37.430 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 55052.395 msg/s -- 645.263 Mbit/s
18:45:47.431 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 72004.353 msg/s -- 804.856 Mbit/s
18:45:57.431 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 86224.170 msg/s -- 954.399 Mbit/s
18:46:07.431 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 80231.708 msg/s -- 905.096 Mbit/s
18:46:17.432 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 73065.824 msg/s -- 864.556 Mbit/s
```
**Bookie reading directly from EBS**
The disk cache was flushed beforehand; this was captured before running the perf reader.
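The flush-then-read check can be sketched like this (a hypothetical reproduction run inside the bookie pod; paths and sizes are illustrative, GNU dd assumed):

```shell
set -eu
# Write a 256 MiB test file on the ledgers volume (path is illustrative).
TESTFILE="${TESTFILE:-/tmp/ebs-read-test.bin}"
dd if=/dev/zero of="$TESTFILE" bs=1M count=256 status=none
# Drop the page cache so the read hits the disk, not memory
# (needs root; harmless to skip if not permitted).
sync; echo 3 > /proc/sys/vm/drop_caches 2>/dev/null || true
# Sequential read; dd prints the achieved throughput when it finishes.
dd if="$TESTFILE" of=/dev/null bs=1M
rm -f "$TESTFILE"
```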
![Selection_292](https://user-images.githubusercontent.com/220283/83058097-d40bd200-a025-11ea-8f9e-bd058ba0468a.png)
**EC2 instances**
Amount: 13
Type: r5.large
AZ: All in us-west-2c
All within Kubernetes
**EBS**
![Selection_009](https://user-images.githubusercontent.com/220283/83074778-b9932200-a040-11ea-92b5-0499eb19c32a.png)
**Grafana Overview**
![Selection_008](https://user-images.githubusercontent.com/220283/83074863-e34c4900-a040-11ea-921c-e1ece4471452.png)
**JVM**
Bookie
![Selection_002](https://user-images.githubusercontent.com/220283/83074952-070f8f00-a041-11ea-82b0-cc87438c7dd9.png)
Broker
![Selection_003](https://user-images.githubusercontent.com/220283/83074964-0ecf3380-a041-11ea-8265-156c292671c7.png)
Recovery
![Selection_004](https://user-images.githubusercontent.com/220283/83074971-155dab00-a041-11ea-88b6-c9b13411f91d.png)
Zookeeper
![Selection_005](https://user-images.githubusercontent.com/220283/83074988-1b538c00-a041-11ea-9200-3df3b9d121bf.png)
**Bookie**
![Selection_006](https://user-images.githubusercontent.com/220283/83075039-31614c80-a041-11ea-8865-8fe7a2bb7acd.png)
![Selection_007](https://user-images.githubusercontent.com/220283/83075048-358d6a00-a041-11ea-88ad-6b9cb5894d73.png)
**Specifically public/cory/test-ebs-partition-5**
![Selection_001](https://user-images.githubusercontent.com/220283/83075150-5f469100-a041-11ea-9ee7-b078fb65b99d.png)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [pulsar] ashwallace commented on issue #7058: Pulsar on EBS having poor performance
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-635024332
EC2 r5.large has a baseline EBS throughput of ~81 MB/s at 128 KB IO sizes. Looks like you're more or less at that number.
Because of this, you may not benefit from a provisioned IOPS EBS volume type, nor from striping multiple gp2 volumes together, nor from bookkeeper config tweaks.
Therefore your options are:
1) Horizontally scale bookkeeper by adding another node to achieve higher aggregate performance for the whole cluster - this might also put you in a better reliability/availability position thanks to mirroring.
2) Otherwise, change the AWS EC2 instance size or type, given the baseline performance limit of the r5.large.
Review the *second* table on this page to find an instance more suitable for your needs: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-optimized.html
[GitHub] [pulsar] ckdarby commented on issue #7058: Pulsar on EBS having poor performance
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-636275897
> Should be big enough to hold a significant portion of the index database
@sijie This is what is stored in /pulsar/data/bookkeeper/ledgers/current/locations correct?
[GitHub] [pulsar] ashwallace edited a comment on issue #7058: Pulsar on EBS having poor performance
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-637255249
Great work!
I interpreted the graph as a single EC2 instance, but it must have been the aggregate of all 13 instances.
Make sure to resize EC2 again to optimize cost.
[GitHub] [pulsar] ckdarby commented on issue #7058: Pulsar on EBS having poor performance
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-635048127
@ashwallace But that doesn't explain why, using `dd` on the pod itself, I'm able to read from EBS at 200 MB/s.
I will change the cluster to an instance type with a higher throughput baseline and check the results.
[GitHub] [pulsar] sijie commented on issue #7058: Pulsar on EBS having poor performance
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-636240424
@ckdarby
for reading from backlog:
1) tune rocksdb block cache size for holding the entry index in memory.
```
# Size of RocksDB block-cache. For best performance, this cache
# should be big enough to hold a significant portion of the index
# database which can reach ~2GB in some cases
# Default is to use 10% of the direct memory size
dbStorage_rocksDB_blockCacheSize=
```
2) on the broker side, increase the batch read size:
```
dispatcherMaxReadBatchSize
```
Then check the IO wait metrics on the ledgers directory to see if there is any bottleneck.
If there is a bottleneck, try using multiple directories on the ledger disk. This can increase parallelism, but you also need multiple partitions, because entries of the same partition always go to the same ledger directory.
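Taken together, the two suggestions might look like this (values are illustrative placeholders, not recommendations):

```conf
# bookkeeper.conf - hold more of the entry-location index in memory
dbStorage_rocksDB_blockCacheSize=2147483648    # bytes (2 GiB here)

# broker.conf - read larger batches from bookies when dispatching backlog
dispatcherMaxReadBatchSize=1000
```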
[GitHub] [pulsar] ckdarby commented on issue #7058: Pulsar on EBS having poor performance
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-636369850
@sijie Firstly, thank you for everything :bow:.
/pulsar/data/bookkeeper/ledgers/current/locations is very small: none of the bookies' locations directories total more than 50 MB, and I currently have dbStorage_rocksDB_blockCacheSize set to 1 GB.
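A hypothetical way to repeat that measurement on a bookie (path taken from the thread; override `LOCATIONS_DIR` when running elsewhere):

```shell
# Size of the RocksDB entry-location index on a bookie
# (the directory the block-cache advice refers to).
LOCATIONS_DIR="${LOCATIONS_DIR:-/pulsar/data/bookkeeper/ledgers/current/locations}"
du -sh "$LOCATIONS_DIR"
```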
Happy to report that I changed:
dispatcherMaxReadBatchSize=10000
and saw mixed results, ranging from decent to extremely good.
Extremely good results such as:
```
03:26:05.111 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 243598.835 msg/s -- 2787.767 Mbit/s
03:26:15.111 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 242916.023 msg/s -- 2779.953 Mbit/s
03:26:25.112 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 241419.952 msg/s -- 2762.832 Mbit/s
03:26:35.112 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 236405.226 msg/s -- 2705.443 Mbit/s
```
The extremely good results are probably when the broker & bookie (or all three: proxy, broker, and bookie) happen to land on the same node.
Other results on different partitions (not recorded exactly) looked something like:
```
Read throughput: 100k msg/s -- 1250 Mbit/s
```
This is probably when the proxy, broker, and bookie are all on different nodes, or at least the broker & bookie are.
Still much better throughput off EBS than before :+1:
Thanks for everything :)
[GitHub] [pulsar] ckdarby commented on issue #7058: Pulsar on EBS having poor performance
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-635684753
@ashwallace Upgraded to c5.4xlarge:
```text
23:47:19.989 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 59245.241 msg/s -- 678.398 Mbit/s
23:47:29.989 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 56308.718 msg/s -- 648.628 Mbit/s
23:47:39.996 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 52245.146 msg/s -- 593.771 Mbit/s
23:47:49.996 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 59516.311 msg/s -- 657.495 Mbit/s
23:47:59.996 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 49080.654 msg/s -- 538.890 Mbit/s
23:48:09.996 [main] INFO org.apache.pulsar.testclient.PerformanceReader - Read throughput: 61823.387 msg/s -- 683.237 Mbit/s
```
Still seeing the same kind of issue; tomorrow I can rerun the test and repost all the graphs.
@sijie I'm thinking this pretty much rules out EBS/AWS as the issue - I saw the same thing even after taking Ash's advice to get a better EBS baseline throughput. The reads tend to average about 150 KiB per op. Are there any bookie parameters that control how much it reads off the ledger into the read cache at a time?
This now mostly looks like a bookie tuning issue: the prefetch on a read-cache miss against the backlog isn't very aggressive. Let me know, or pull in whoever is better suited on the Pulsar side.
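For reference, DbLedgerStorage does expose a read-ahead batch setting along these lines; the name and default reflect my understanding of BookKeeper 4.x and should be verified against the deployed version:

```conf
# bookkeeper.conf (DbLedgerStorage)
# Entries prefetched into the read cache on a cache miss; assumed default 1000.
dbStorage_readAheadCacheBatchSize=1000
```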
[GitHub] [pulsar] ckdarby commented on issue #7058: Pulsar on EBS having poor performance
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-635081278
@ashwallace Thanks for everything. I'll nuke the cluster, change to a type with a higher baseline, and let you know.
Here are my other comments...
>In your graphs, bookie is also exceeding the baseline (bursting) often too (150MBs peak observed).
That view you're referencing is the bookie overview; it includes all four bookies.
What it looks like on just the "public/cory/test-ebs-partition-5" topic - the one the perf reader ran against - can be seen [here](https://user-images.githubusercontent.com/220283/83075150-5f469100-a041-11ea-9ee7-b078fb65b99d.png).
Much less than 150 MB/s :(.
>Generally, no application will attain the performance that a benchmarking tool would
Of course, but I was hoping for even half, which would be closer to ~100 MB/s.
>Interestingly enough, the pulsar documentation itself recommends i3 instance types which have their own NVMe disks
That's the non-Kubernetes, bare-metal approach. On AWS's managed Kubernetes (EKS), the default PVC storage class is EBS gp2. With Kubernetes I typically avoid specialised nodes like the i3 type, because you then have to restrict scheduling so that only pods that can leverage the NVMe SSDs land on them.
[GitHub] [pulsar] sijie commented on issue #7058: Pulsar on EBS having poor performance
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-636364920
yes that's correct.
[GitHub] [pulsar] ckdarby closed issue #7058: Pulsar on EBS having poor performance
URL: https://github.com/apache/pulsar/issues/7058
[GitHub] [pulsar] sijie commented on issue #7058: Pulsar on EBS having poor performance
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-637999749
@ckdarby awesome!
[GitHub] [pulsar] ckdarby commented on issue #7058: Pulsar on EBS having poor performance
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-635011079
Talked to @cdbartholomew today and he mentioned he appears to be hitting the same issue, but on Azure:
read performance nowhere close to the disk's capability.
@sijie If there is anything else you need me to pull from the cluster just let me know and I'll grab the info or retry with whatever configuration you can suggest.
I'll reach out to my AWS Solutions Architect to find out if they're aware of any obvious optimisation tweaks that can be done at the node level when interacting with EBS.
[GitHub] [pulsar] ckdarby commented on issue #7058: Pulsar on EBS having poor performance
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-634959473
@sijie I was finally able to compile the issue that I mentioned on Slack.
[GitHub] [pulsar] ckdarby commented on issue #7058: Pulsar on EBS having poor performance
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-637851015
100% solved; ran against production today.
We decreased dispatcherMaxReadBatchSize from 10000 to 500 for our Flink job, because reading from all 8 partitions at 10000 was OOM'ing the brokers with the memory allocation we had set.
Running in production with graphs from Flink, this is 4x what we were getting before, and Pulsar's Grafana shows matching graphs with 4x the throughput.
![image (1)](https://user-images.githubusercontent.com/220283/83577167-8dc0e200-a501-11ea-9232-c69bb960e6cc.png)
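As a sketch, that final override could be expressed in the chart's values.yaml like this (key placement assumed from the broker section in the issue body; the value mirrors the thread, not a general recommendation):

```yaml
broker:
  configData:
    # Larger batches speed up backlog reads but are buffered in broker
    # direct memory per in-flight read; with 8 partitions, 10000 entries
    # per read exceeded our allocation, so 500 was the safe setting.
    dispatcherMaxReadBatchSize: "500"
```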
[GitHub] [pulsar] ashwallace commented on issue #7058: Pulsar on EBS having poor performance
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-635066612
> @ashwallace But that doesn't explain why when on the pod itself using `dd` I'm able to access the EBS at 200 mbyte/s.
>
> I will change the cluster to one that has throughput at a higher baseline and check results.
@ckdarby All EC2 instances and EBS volumes have a burst period (approximately 30 minutes a day; the timing and availability of bursting depend on the usage pattern). `dd` has only a single job to do on disk, so it can exceed the baseline (burst) without any tweaking. Pulsar is more sophisticated, so it will never look like `dd` performance.
In your graphs, the bookie is also often exceeding the baseline (bursting) - a 150 MB/s peak is visible. Your storage utilization looks pretty consistent, though, so you are likely consuming burst as quickly as it becomes available and therefore not sustaining those higher throughputs. These are promising indicators that you would benefit from an EC2 instance with a higher baseline.
Side notes: generally, no application will attain the performance of a benchmarking tool, simply because applications have more to do, and the unit of data they operate on may not align with peak testing scenarios. Database servers get close, but I wouldn't expect pub/sub messaging to. For that reason, don't over-focus on attaining these peaks, and don't make throughput your only performance goal. Generally speaking, smaller IOs have lower (better) response time but lower (worse) throughput, while larger IOs have higher (better) throughput but higher (worse) response time. Which is best for your actual production needs? Don't forget the application metrics: how many messages/sec do you actually need on your worst day? Work backwards from there.
Interestingly enough, the Pulsar documentation itself recommends i3 instance types, which have their own NVMe disks and will go well beyond EBS performance. https://pulsar.apache.org/docs/v2.0.1-incubating/deployment/aws-cluster/
I hope this helps you find the best instance size/instance count/cost ratio for your needs. Keen to hear your next findings.
[GitHub] [pulsar] sijie commented on issue #7058: Pulsar on EBS having poor performance
URL: https://github.com/apache/pulsar/issues/7058#issuecomment-637184154
@ckdarby Glad that you have sorted things out.