Posted to users@pulsar.apache.org by Apache Pulsar Slack <ap...@gmail.com> on 2020/03/19 09:11:04 UTC

Slack digest for #general - 2020-03-19

2020-03-18 12:34:11 UTC - ankur marchattiwar: @ankur marchattiwar has joined the channel
----
2020-03-18 16:09:13 UTC - ben gordon: Anyone have any idea when Key_Shared / <https://github.com/apache/pulsar/issues/4077> is targeted for GA?
----
2020-03-18 16:51:25 UTC - Evan Furman: Decreasing `dbStorage_readAheadCacheBatchSize` seems to impact performance negatively
----
2020-03-18 16:51:41 UTC - Chris: Had the same experience
----
2020-03-18 16:51:57 UTC - Evan Furman: Did you try increasing it?
----
2020-03-18 16:52:13 UTC - Evan Furman: the default is 1k, I believe
----
2020-03-18 16:52:15 UTC - Chris: About to. Had to wait to restart the bookies.
----
2020-03-18 16:54:15 UTC - Evan Furman: These are the only values I’m setting explicitly that differ from the published AWS k8s templates.

```  dbStorage_writeCacheMaxSizeMb: "4096" # Write cache size (direct memory)
  dbStorage_readAheadCacheMaxSizeMb: "4096" # Read cache size (direct memory)
  dbStorage_rocksDB_blockCacheSize: "4294967296"```
----
2020-03-18 17:25:56 UTC - Evan Furman: The published templates deploy 3 brokers / 2 bookies. I just deployed 3 bookies / 2 brokers and am seeing far better performance.
----
2020-03-18 18:13:44 UTC - Sijie Guo: pulsar manager crawls the metrics from brokers and caches them locally. It keeps the metrics for some time, but eventually they expire.

But pulsar manager doesn’t use these metrics to display the topics, so this sounds like a bug. Do you mind creating a GitHub issue?
----
2020-03-18 18:15:24 UTC - Sijie Guo: The feature has already been used in production for several months; it is already a GA feature. The issue was kept open because some future enhancements are not done yet.

Are you looking for features that are not implemented in the Key_Shared subscription?
----
2020-03-18 18:18:49 UTC - Alexander Ursu: Might anyone know the cause of this error on one of my bookies? I'm running a cluster with docker swarm, and recently many of my services restarted, all of which seemed to have recovered except for this one bookie, which keeps failing with this error every time. Any help or tips are appreciated!
```18:13:56.986 [main] INFO  org.apache.bookkeeper.proto.BookieNettyServer - Shutting down BookieNettyServer
18:13:57.117 [main] ERROR org.apache.bookkeeper.server.Main - Failed to build bookie server
org.apache.bookkeeper.bookie.BookieException$InvalidCookieException: Cookie [4
bookieHost: "10.0.3.144:3181"
journalDir: "data/bookkeeper/journal"
ledgerDirs: "1\tdata/bookkeeper/ledgers"
instanceId: "1d1b5b24-c68a-4b6a-82fa-eb5277c03129"
] is not matching with [4
bookieHost: "10.0.3.12:3181"
journalDir: "data/bookkeeper/journal"
ledgerDirs: "1\tdata/bookkeeper/ledgers"
instanceId: "1d1b5b24-c68a-4b6a-82fa-eb5277c03129"
]```
----
2020-03-18 18:20:47 UTC - Sijie Guo: Use `useHostNameAsBookieID` or set `advertisedAddress` for your bookies. Bookies require a stable identifier to use as the bookie id. Otherwise it will cause a cookie mismatch.
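For reference, a minimal bookkeeper.conf sketch of those two options (the hostname is only a placeholder; pick whichever fits your deployment):
```# bookkeeper.conf -- sketch only, adjust to your deployment
# Option 1: derive the bookie id from the stable hostname instead of the (changing) IP
useHostNameAsBookieID=true

# Option 2: pin an explicit, stable address for this bookie (placeholder hostname)
# advertisedAddress=bookie-1.pulsar.example.com```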
----
2020-03-18 18:24:18 UTC - Alexander Ursu: ah nice, will try both of these.

Will this help in both cases: when there is a failure and it tries to recover, and when there is a planned redeploy for an update?
----
2020-03-18 18:25:37 UTC - Sijie Guo: correct
----
2020-03-18 19:12:02 UTC - Pradeesh: @Roman Popenov @Sijie Guo the pulsar UI doesn't reflect any of the changes we make through pulsar-admin. Do you guys know why? It's connected to our cluster, etc.
----
2020-03-18 19:31:12 UTC - Pradeesh: clusters page just keeps loading forever
----
2020-03-18 19:31:54 UTC - Roman Popenov: I assume the manager UI?
----
2020-03-18 19:32:08 UTC - Roman Popenov: What is it saying in the logs of the manager?
----
2020-03-18 19:32:26 UTC - Sijie Guo: I just realized there might be some regression in the read-ahead behavior. Which version are you guys using right now?
----
2020-03-18 19:32:37 UTC - Chris: 2.5.0
----
2020-03-18 19:32:40 UTC - Evan Furman: same
----
2020-03-18 19:33:00 UTC - Sijie Guo: do you have a screenshot?
----
2020-03-18 19:33:38 UTC - Sijie Guo: okay. let me double check and get back in the afternoon.
heart : Chris
----
2020-03-18 19:33:51 UTC - Evan Furman: ok, thank you @Sijie Guo
----
2020-03-18 19:34:05 UTC - Evan Furman: cc: @Tim Corbett
----
2020-03-18 19:34:40 UTC - Aaron Stockton: sure.
```{
  "msgRateIn": 0,
  "msgThroughputIn": 0,
  "msgRateOut": 965.23360343787,
  "msgThroughputOut": 282848.6791504887,
  "averageMsgSize": 0,
  "storageSize": 17448198360,
  "backlogSize": 17424162672,
  "publishers": [],
  "subscriptions": {
    "public/default/topic-03-16|subscription-shared": {
      "msgRateOut": 0,
      "msgThroughputOut": 0,
      "msgRateRedeliver": 0,
      "msgBacklog": 37531019,
      "blockedSubscriptionOnUnackedMsgs": false,
      "msgDelayed": 0,
      "unackedMessages": 0,
      "type": "Shared",
      "msgRateExpired": 0,
      "lastExpireTimestamp": 0,
      "lastConsumedFlowTimestamp": 1584485354884,
      "lastConsumedTimestamp": 0,
      "lastAckedTimestamp": 0,
      "consumers": [],
      "isReplicated": false
    },
    "public/default/topic-03-16|subscription": {
      "msgRateOut": 965.23360343787,
      "msgThroughputOut": 282848.6791504887,
      "msgRateRedeliver": 0,
      "msgBacklog": 59274720,
      "blockedSubscriptionOnUnackedMsgs": false,
      "msgDelayed": 0,
      "unackedMessages": 185362,
      "type": "Exclusive",
      "activeConsumerName": "test-worker",
      "msgRateExpired": 0,
      "lastExpireTimestamp": 0,
      "lastConsumedFlowTimestamp": 1584559876718,
      "lastConsumedTimestamp": 1584559876763,
      "lastAckedTimestamp": 1584559877267,
      "consumers": [
        {
          "msgRateOut": 965.23360343787,
          "msgThroughputOut": 282848.6791504887,
          "msgRateRedeliver": 0,
          "consumerName": "test-worker",
          "availablePermits": 0,
          "unackedMessages": 185362,
          "blockedConsumerOnUnackedMsgs": false,
          "lastAckedTimestamp": 1584559877267,
          "lastConsumedTimestamp": 1584559876763,
          "metadata": {},
          "connectedSince": "2020-03-18T19:29:38.255Z",
          "address": "/10.100.152.109:43304"
        }
      ],
      "isReplicated": false
    }
  },
  "replication": {},
  "deduplicationStatus": "Disabled",
  "bytesInCounter": 17222433265,
  "msgInCounter": 58586508
}```
----
2020-03-18 20:23:14 UTC - Sijie Guo: Can you also show the result of `topic stats-internal`
----
2020-03-18 20:44:51 UTC - Aaron Stockton: 
----
2020-03-18 21:04:23 UTC - Sijie Guo: ```    "public%2Fdefault%2Ftopic-03-16%7Csubscription": {
      "markDeletePosition": "8007:78849",
      "readPosition": "8055:75358",
      "waitingReadOp": false,
      "pendingReadOps": 0,
      "messagesConsumedCounter": -477611,
      "cursorLedger": 14785,
      "cursorLedgerLastEntry": 244,
      "individuallyDeletedMessages": "[]",
      "lastLedgerSwitchTimestamp": "2020-03-18T19:29:38.238Z",
      "state": "Open",
      "numberOfEntriesSinceFirstNotAckedMessage": 103771,
      "totalNonContiguousDeletedMessagesRange": 0,
      "properties": {}
    }```
----
2020-03-18 21:04:52 UTC - Sijie Guo: markDeletePosition is `8007:78849` and the readPosition is `8055:75358`.
----
2020-03-18 21:05:13 UTC - Sijie Guo: so it seems that the broker doesn’t receive the acks
----
2020-03-18 21:23:10 UTC - Aaron Stockton: so the consumer stats report them as sent, and using a SHARED subscription works fine
----
2020-03-18 21:25:31 UTC - Sijie Guo: okay, this sounds like a problem. Can you create a GitHub issue for that? @Penghui Li can help look into it to see if that is a bug or not.
----
2020-03-18 22:52:33 UTC - Pradeesh: 
----
2020-03-18 22:53:00 UTC - Pradeesh: I don't think our cluster is connected… even though it says it is
----
2020-03-18 22:53:12 UTC - Pradeesh: no logs at all if I click on anything in the UI
----
2020-03-18 22:53:19 UTC - Pradeesh: the last logs I see are:
----
2020-03-18 22:53:28 UTC - Pradeesh: ```2020-03-17 22:09:33.450 UTC [49803] ERROR:  relation "users" does not exist at character 141
2020-03-17 22:09:33.450 UTC [49803] STATEMENT:  SELECT access_token AS accessToken, user_id AS userId, name, description, email,phone_number AS phoneNumber, location, company, expire FROM users WHERE access_token = $1
2020-03-17 22:09:33.474 UTC [49803] ERROR:  relation "users" does not exist at character 141
2020-03-17 22:09:33.474 UTC [49803] STATEMENT:  SELECT access_token AS accessToken, user_id AS userId, name, description, email,phone_number AS phoneNumber, location, company, expire FROM users WHERE access_token = $1
2020-03-17 22:09:33.496 UTC [49803] ERROR:  relation "users" does not exist at character 141
2020-03-17 22:09:33.496 UTC [49803] STATEMENT:  SELECT access_token AS accessToken, user_id AS userId, name, description, email,phone_number AS phoneNumber, location, company, expire FROM users WHERE access_token = $1
2020-03-17 22:09:33.519 UTC [49803] ERROR:  relation "users" does not exist at character 141
2020-03-17 22:09:33.519 UTC [49803] STATEMENT:  SELECT access_token AS accessToken, user_id AS userId, name, description, email,phone_number AS phoneNumber, location, company, expire FROM users WHERE access_token = $1
2020-03-17 22:09:33.545 UTC [49803] ERROR:  relation "users" does not exist at character 141
2020-03-17 22:09:33.545 UTC [49803] STATEMENT:  SELECT access_token AS accessToken, user_id AS userId, name, description, email,phone_number AS phoneNumber, location, company, expire FROM users WHERE access_token = $1
2020-03-17 22:09:33.570 UTC [49803] ERROR:  relation "users" does not exist at character 141
2020-03-17 22:09:33.570 UTC [49803] STATEMENT:  SELECT access_token AS accessToken, user_id AS userId, name, description, email,phone_number AS phoneNumber, location, company, expire FROM users WHERE access_token = $1
2020-03-17 22:09:33.594 UTC [49803] ERROR:  relation "users" does not exist at character 141
2020-03-17 22:09:33.594 UTC [49803] STATEMENT:  SELECT access_token AS accessToken, user_id AS userId, name, description, email,phone_number AS phoneNumber, location, company, expire FROM users WHERE access_token = $1
2020-03-17 22:09:33.622 UTC [49803] ERROR:  relation "users" does not exist at character 141
2020-03-17 22:09:33.622 UTC [49803] STATEMENT:  SELECT access_token AS accessToken, user_id AS userId, name, description, email,phone_number AS phoneNumber, location, company, expire FROM users WHERE access_token = $1
2020-03-17 22:09:33.654 UTC [49803] ERROR:  relation "users" does not exist at character 141
2020-03-17 22:09:33.654 UTC [49803] STATEMENT:  SELECT access_token AS accessToken, user_id AS userId, name, description, email,phone_number AS phoneNumber, location, company, expire FROM users WHERE access_token = $1
2020-03-17 22:09:33.686 UTC [49803] ERROR:  relation "users" does not exist at character 141
2020-03-17 22:09:33.686 UTC [49803] STATEMENT:  SELECT access_token AS accessToken, user_id AS userId, name, description, email,phone_number AS phoneNumber, location, company, expire FROM users WHERE access_token = $1
2020-03-17 22:09:33.713 UTC [49803] ERROR:  relation "users" does not exist at character 141
2020-03-17 22:09:33.713 UTC [49803] STATEMENT:  SELECT access_token AS accessToken, user_id AS userId, name, description, email,phone_number AS phoneNumber, location, company, expire FROM users WHERE access_token = $1
2020-03-17 22:09:33.741 UTC [49803] ERROR:  relation "users" does not exist at character 141```
----
2020-03-18 23:02:06 UTC - Sijie Guo: How did you start the pulsar manager?
----
2020-03-18 23:02:27 UTC - Pradeesh: it's running in a container
----
2020-03-18 23:03:18 UTC - Pradeesh: I can log into the database etc. from the container
----
2020-03-18 23:03:22 UTC - Pradeesh: ```/pulsar-manager # psql -h 172.27.223.56 -U pulsar -p 5432
psql: could not connect to server: Connection refused
	Is the server running on host "172.27.223.56" and accepting
	TCP/IP connections on port 5432?
/pulsar-manager #
/pulsar-manager #
/pulsar-manager #
/pulsar-manager # psql -h 127.0.0.1 -U pulsar -p 5432
psql (11.6)
Type "help" for help.

pulsar=#
pulsar=#
pulsar=#
pulsar=#
pulsar=# \dt
Did not find any relations.
pulsar=#
pulsar=#
pulsar=#
pulsar=# \l
                             List of databases
      Name      | Owner  | Encoding | Collate |  Ctype  | Access privileges
----------------+--------+----------+---------+---------+-------------------
 postgres       | pulsar | UTF8     | C       | C.UTF-8 |
 pulsar         | pulsar | UTF8     | C       | C.UTF-8 |
 pulsar_manager | pulsar | UTF8     | C       | C.UTF-8 | =Tc/pulsar       +
                |        |          |         |         | pulsar=CTc/pulsar
 template0      | pulsar | UTF8     | C       | C.UTF-8 | =c/pulsar        +
                |        |          |         |         | pulsar=CTc/pulsar
 template1      | pulsar | UTF8     | C       | C.UTF-8 | =c/pulsar        +
                |        |          |         |         | pulsar=CTc/pulsar
(5 rows)

pulsar=#```
----
2020-03-18 23:10:48 UTC - Sijie Guo: Which version are you using?
----
2020-03-18 23:58:19 UTC - Aaron Stockton: <https://github.com/apache/pulsar/issues/6552>
----
2020-03-19 00:26:57 UTC - Sijie Guo: how many topics do you use in your test?
----
2020-03-19 00:27:57 UTC - Chris: Single topic, 5 partitions. Haven’t tried raising the partition count actually. 
----
2020-03-19 00:30:41 UTC - Sijie Guo: okay that should be fine. how did you run the test again?
----
2020-03-19 00:30:53 UTC - Sijie Guo: using pulsar-perf?
----
2020-03-19 00:31:27 UTC - Evan Furman: We are using dotpulsar but can also replicate with pulsar-perf
----
2020-03-19 00:32:12 UTC - Chris: Java+python but can replicate with perf 
----
2020-03-19 00:35:17 UTC - Evan Furman: The common thread is we are both using the aws kubernetes templates. I haven’t been able to reproduce with the ansible/terraform cluster running directly on ec2
----
2020-03-19 00:36:53 UTC - Chris: Not true actually. I’m using the generic Kubernetes templates. 
----
2020-03-19 00:37:11 UTC - Evan Furman: ah ok, i am using the aws ones
----
2020-03-19 00:37:54 UTC - Sijie Guo: I checked the read-ahead behavior. I thought it was the regression we had in BookKeeper 4.9.1, but that is already fixed in 4.10.0, which is the version used in Pulsar 2.5.0.
----
2020-03-19 00:38:50 UTC - Sijie Guo: @Evan Furman “I haven’t been able to reproduce with the ansible/terraform cluster running directly on ec2”

You mean that you didn’t see the problem on an EC2 cluster?
----
2020-03-19 00:39:00 UTC - Evan Furman: correct
----
2020-03-19 00:39:11 UTC - Evan Furman: not yet at least..
----
2020-03-19 00:39:34 UTC - Evan Furman: and we are using the same testing mechanism
----
2020-03-19 00:42:46 UTC - Sijie Guo: Okay.

Can you guys try the following tests to help me collect a few baseline metrics?

open one terminal for running the consumer:
```bin/pulsar-perf consume -s <subscription> <topic>```
open another terminal for running the producer:
```bin/pulsar-perf produce -r 1000000 -s 1000 <topic>```
Run both commands and give me the output.
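For example, with placeholder topic and subscription names filled in:
```# consumer terminal (placeholder names)
bin/pulsar-perf consume -s perf-sub persistent://public/default/perf-test

# producer terminal: target 1M msg/s with 1000-byte payloads
bin/pulsar-perf produce -r 1000000 -s 1000 persistent://public/default/perf-test```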
----
2020-03-19 00:46:04 UTC - Evan Furman: Sure, I need to rebuild my cluster so will be a bit
----
2020-03-19 00:51:47 UTC - Sijie Guo: thanks
----
2020-03-19 00:53:39 UTC - Chris: 
----
2020-03-19 00:54:05 UTC - Chris: 
----
2020-03-19 01:29:39 UTC - Evan Furman: ```01:27:54.578 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Starting Pulsar perf producer with config: {
  "confFile" : "/pulsar/conf/client.conf",
  "topics" : [ "<persistent://public/default/mp-test-topic>" ],
  "numTestThreads" : 1,
  "msgRate" : 1000000,
  "msgSize" : 1000,
  "numTopics" : 1,
  "numProducers" : 1,
  "serviceURL" : "<pulsar://broker:6650/>",
  "authPluginClassName" : "",
  "authParams" : "",
  "maxOutstanding" : 1000,
  "maxPendingMessagesAcrossPartitions" : 50000,
  "maxConnections" : 100,
  "numMessages" : 0,
  "statsIntervalSeconds" : 0,
  "compression" : "NONE",
  "payloadFilename" : null,
  "payloadDelimiter" : "\\n",
  "batchTimeMillis" : 1.0,
  "batchMaxMessages" : 1000,
  "batchMaxBytes" : 4194304,
  "testTime" : 0,
  "warmupTimeSeconds" : 1.0,
  "tlsTrustCertsFilePath" : "",
  "encKeyName" : null,
  "encKeyFile" : null,
  "delay" : 0,
  "exitOnFailure" : false
}
01:27:54.591 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Dumping latency stats to perf-producer-1584581274591.hgrm
01:27:54.592 [pulsar-perf-producer-exec-1-1] INFO  org.apache.pulsar.testclient.PerformanceProducer - Started performance test thread 0
01:27:55.025 [pulsar-perf-producer-exec-1-1] INFO  org.apache.pulsar.testclient.PerformanceProducer - Adding 1 publishers on topic <persistent://public/default/mp-test-topic>
01:27:55.197 [pulsar-client-io-2-1] INFO  org.apache.pulsar.client.impl.ConnectionPool - [[id: 0x6757c4f9, L:/10.3.19.152:47016 - R:broker.default.svc.cluster.local/172.20.89.30:6650]] Connected to server
01:27:55.386 [pulsar-client-io-2-1] INFO  org.apache.pulsar.client.impl.ConnectionPool - [[id: 0xec8756c1, L:/10.3.19.152:47018 - R:broker.default.svc.cluster.local/172.20.89.30:6650]] Connected to server
01:27:55.461 [pulsar-client-io-2-1] INFO  org.apache.pulsar.client.impl.ConnectionPool - [[id: 0xe29c3703, L:/10.3.19.152:57062 - R:/10.3.19.116:6650]] Connected to server
01:27:55.637 [pulsar-client-io-2-1] INFO  org.apache.pulsar.client.impl.ConnectionPool - [[id: 0x9dcf7dda, L:/10.3.19.152:57064 - R:/10.3.19.116:6650]] Connected to server
01:27:55.640 [pulsar-client-io-2-1] INFO  org.apache.pulsar.client.impl.ProducerImpl - [<persistent://public/default/mp-test-topic>] [null] Creating producer on cnx [id: 0x9dcf7dda, L:/10.3.19.152:57064 - R:/10.3.19.116:6650]
01:27:56.162 [pulsar-client-io-2-1] INFO  org.apache.pulsar.client.impl.ProducerImpl - [<persistent://public/default/mp-test-topic>] [us-east-1-0] Created producer on cnx [id: 0x9dcf7dda, L:/10.3.19.152:57064 - R:/10.3.19.116:6650]
01:27:56.174 [pulsar-perf-producer-exec-1-1] INFO  org.apache.pulsar.testclient.PerformanceProducer - Created 1 producers
01:27:56.196 [pulsar-timer-5-1] WARN  com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider
01:28:04.631 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:  47532.7  msg/s ---    362.6 Mbit/s --- failure      0.0 msg/s --- Latency: mean:  17.270 ms - med:  16.394 - 95pct:  22.004 - 99pct:  26.122 - 99.9pct:  81.750 - 99.99pct:  84.207 - Max:  99.081
01:28:14.668 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:  63405.9  msg/s ---    483.7 Mbit/s --- failure      0.0 msg/s --- Latency: mean:  15.685 ms - med:  15.414 - 95pct:  20.824 - 99pct:  33.796 - 99.9pct:  74.588 - 99.99pct:  78.595 - Max:  82.599
01:28:24.688 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:  59477.3  msg/s ---    453.8 Mbit/s --- failure      0.0 msg/s --- Latency: mean:  16.713 ms - med:  15.912 - 95pct:  21.470 - 99pct:  36.266 - 99.9pct:  83.922 - 99.99pct:  85.164 - Max:  95.270
^C01:28:27.234 [Thread-1] INFO  org.apache.pulsar.testclient.PerformanceProducer - Aggregated throughput stats --- 1855833 records sent --- 56840.924 msg/s --- 433.662 Mbit/s
01:28:27.245 [Thread-1] INFO  org.apache.pulsar.testclient.PerformanceProducer - Aggregated latency stats --- Latency: mean:  16.533 ms - med:  15.959 - 95pct:  21.564 - 99pct:  32.730 - 99.9pct:  81.398 - 99.99pct:  85.142 - 99.999pct:  85.181 - Max:  99.081```
----
2020-03-19 01:30:06 UTC - Alan Fu: @Alan Fu has joined the channel
----
2020-03-19 01:30:19 UTC - Evan Furman: ```01:28:25.053 [main] INFO  org.apache.pulsar.testclient.PerformanceConsumer - Starting Pulsar performance consumer with config: {
  "confFile" : "/pulsar/conf/client.conf",
  "topic" : [ "<persistent://public/default/mp-test-topic>" ],
  "numTopics" : 1,
  "numConsumers" : 1,
  "subscriberName" : "evan-test",
  "subscriptionType" : "Exclusive",
  "rate" : 0.0,
  "receiverQueueSize" : 1000,
  "replicatedSubscription" : false,
  "acknowledgmentsGroupingDelayMillis" : 100,
  "maxConnections" : 100,
  "statsIntervalSeconds" : 0,
  "serviceURL" : "<pulsar://broker:6650/>",
  "authPluginClassName" : "",
  "authParams" : "",
  "tlsTrustCertsFilePath" : "",
  "encKeyName" : null,
  "encKeyFile" : null
}
01:28:25.567 [main] INFO  org.apache.pulsar.testclient.PerformanceConsumer - Adding 1 consumers on topic <persistent://public/default/mp-test-topic>
01:28:25.778 [pulsar-client-io-1-1] INFO  org.apache.pulsar.client.impl.ConnectionPool - [[id: 0xad1aca80, L:/10.3.19.152:47172 - R:broker.default.svc.cluster.local/172.20.89.30:6650]] Connected to server
01:28:25.848 [pulsar-client-io-1-1] INFO  org.apache.pulsar.client.impl.ConnectionPool - [[id: 0x3a1695db, L:/10.3.19.152:47174 - R:broker.default.svc.cluster.local/172.20.89.30:6650]] Connected to server
01:28:25.856 [pulsar-client-io-1-1] INFO  org.apache.pulsar.client.impl.ConnectionPool - [[id: 0x378b2a30, L:/10.3.19.152:57218 - R:/10.3.19.116:6650]] Connected to server
01:28:25.857 [pulsar-client-io-1-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [<persistent://public/default/mp-test-topic][evan-test]> Subscribing to topic on cnx [id: 0x378b2a30, L:/10.3.19.152:57218 - R:/10.3.19.116:6650]
01:28:26.017 [pulsar-client-io-1-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [<persistent://public/default/mp-test-topic][evan-test]> Subscribed to topic on /10.3.19.116:6650 -- consumer: 0
01:28:26.017 [main] INFO  org.apache.pulsar.testclient.PerformanceConsumer - Start receiving from 1 consumers on 1 topics
01:28:26.094 [pulsar-client-io-1-1] WARN  com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider
01:28:36.029 [main] INFO  org.apache.pulsar.testclient.PerformanceConsumer - Throughput received: 7576.940  msg/s -- 57.807 Mbit/s --- Latency: mean: 38.711 ms - med: 27 - 95pct: 140 - 99pct: 167 - 99.9pct: 174 - 99.99pct: 175 - Max: 175
01:28:46.062 [main] INFO  org.apache.pulsar.testclient.PerformanceConsumer - Throughput received: 0.000  msg/s -- 0.000 Mbit/s --- Latency: mean: 0.000 ms - med: 0 - 95pct: 0 - 99pct: 0 - 99.9pct: 0 - 99.99pct: 0 - Max: 0
^C01:28:49.609 [Thread-1] INFO  org.apache.pulsar.testclient.PerformanceConsumer - Aggregated throughput stats --- 75770 records received --- 3211.781 msg/s --- 24.504 Mbit/s
01:28:49.614 [Thread-1] INFO  org.apache.pulsar.testclient.PerformanceConsumer - Aggregated latency stats --- Latency: mean: 38.711 ms - med: 27 - 95pct: 140 - 99pct: 167 - 99.9pct: 174 - 99.99pct: 175 - 99.999pct: 175 - Max: 175```
----
2020-03-19 01:56:03 UTC - Sijie Guo: I see.

Can you do a backlog read test using pulsar-perf?

1. Create a subscription `<subscription>` on `<topic>`
> bin/pulsar-perf consume -s <subscription> <topic>
2. Stop the consumer.

3. Start producing the messages. and let it run at least 5 minutes to accumulate backlog.

```bin/pulsar-perf produce -r 1000000 -s 1000 <topic>```
4. Stop the producer.

5. Start the consumer and let the consumer run for a while and copy the output to me

```bin/pulsar-perf consume -s <subscription> <topic>```

----
2020-03-19 02:14:58 UTC - Chris: At this message size, it'll take quite a few TB to actually reach enough messages to cause a slowdown. I think my peak backlog was < 100k
----
2020-03-19 02:15:08 UTC - Chris: And speeds were fine. I'll attach the file for reference anyway
----
2020-03-19 02:15:41 UTC - Chris: 
----
2020-03-19 02:22:19 UTC - Sijie Guo: okay, what is the unit of `I think my peak backlog was < 100k`?
----
2020-03-19 02:22:21 UTC - Sijie Guo: messages?
----
2020-03-19 02:23:09 UTC - Sijie Guo: or can you mirror your test using pulsar-perf?
----
2020-03-19 02:24:49 UTC - Chris: Messages, yeah. I think there might've been another issue in there where the producer kept dropping its connection.
----
2020-03-19 02:24:59 UTC - Chris: Haven't seen that one before.
----
2020-03-19 02:32:08 UTC - Sijie Guo: what is the message size of 100k?
----
2020-03-19 02:32:39 UTC - Chris: That's interesting. I wasn't able to reproduce it at all. I'm getting a good 500k msgs/s on the consume side. It also has lit up all the bookies like I'd expect.
----
2020-03-19 02:33:06 UTC - Chris: That was about 100GB of 1000-byte messages, which would suggest that the backlog calculation is super wrong.
----
2020-03-19 02:34:42 UTC - Sijie Guo: backlog is the number of messages not number of bytes.
----
2020-03-19 02:35:02 UTC - Sijie Guo: you mean you are able to use perf consume to get 500k msgs/s?
----
2020-03-19 02:35:16 UTC - Chris: Correct
----
2020-03-19 02:37:07 UTC - tuteng: I think you can try refreshing the page
----
2020-03-19 02:37:14 UTC - Sijie Guo: okay, that’s interesting. How did you produce messages in your test?
----
2020-03-19 02:37:49 UTC - Chris: With perf produce. Basically, I ran the test you suggested but it didn't reproduce the bug. It seems to have worked just fine.
----
2020-03-19 02:38:40 UTC - Sijie Guo: I mean when you run your own test, how did you produce messages?
----
2020-03-19 02:39:33 UTC - Chris: Ah. Those are inserted by a python process reading real time data. Lots of very small messages, around 100-200 bytes.
----
2020-03-19 02:39:50 UTC - tuteng: If you use the latest code, you need to initialize the account and password <https://github.com/apache/pulsar-manager>
----
2020-03-19 02:39:54 UTC - Penghui Li: @Aaron Stockton Thanks for your feedback, I will take a look.
----
2020-03-19 02:40:35 UTC - Chris: Since we've got a bit of retention, I just rewind the cursor for pulsar-perf and have it try to consume the backlog.
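(A minimal sketch of that kind of cursor rewind with pulsar-admin, assuming placeholder topic and subscription names and a time-based reset:)
```# rewind the subscription 24 hours so pulsar-perf has a backlog to drain
bin/pulsar-admin topics reset-cursor persistent://public/default/my-topic -s my-subscription --time 24h```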
----
2020-03-19 02:44:15 UTC - Sijie Guo: I see. Did you enable batching when producing messages from your python process?
----
2020-03-19 02:46:05 UTC - Chris: If it's enabled by default then yes, otherwise no.
----
2020-03-19 02:51:36 UTC - Sijie Guo: okay - it is enabled by default.

Can you repeat the test I suggested using perf, but changing the message size from `1000` to `100`?

```bin/pulsar-perf produce -r 1000000 -s 100 <topic>```
This should mirror your workload using the python client. Then we can know if this issue is because of the message size.
----
2020-03-19 02:56:53 UTC - Chris: I've tried that in the meantime already, including trying to use it with partitions. It seemed to be alright.
----
2020-03-19 02:58:14 UTC - Chris: I think the important bit I've not tried yet is the full number of messages in the backlog. I've set the perf producer to send 200mm messages into the test topic. That'll take some time, so I'll probably get back with the results tomorrow.
----
2020-03-19 02:59:12 UTC - Sijie Guo: Okay :ok_hand: 
----
2020-03-19 02:59:33 UTC - Sijie Guo: Ping me once you have the results 
----
2020-03-19 03:03:58 UTC - Chris: I've got some less clean results here, using the actual topic that's slow. I've just connected pulsar-perf using our actual subscription. This subscription has a backlog of around 400mm currently.
----
2020-03-19 04:00:08 UTC - Penghui Li: @Aaron Stockton I have pushed a PR to fix this issue, please help take a look. <https://github.com/apache/pulsar/pull/6558>
----
2020-03-19 04:55:26 UTC - Pradeesh: using version 0.1.0
Actually, this is the error I see continuously @Sijie Guo
```2020-03-19 04:54:26.041 UTC [15266] FATAL:  role "postgres" does not exist
2020-03-19 04:54:27.032 UTC [15267] FATAL:  role "postgres" does not exist
2020-03-19 04:54:28.024 UTC [15268] FATAL:  role "postgres" does not exist
2020-03-19 04:54:29.028 UTC [15269] FATAL:  role "postgres" does not exist
2020-03-19 04:54:30.047 UTC [15270] FATAL:  role "postgres" does not exist
2020-03-19 04:54:31.042 UTC [15271] FATAL:  role "postgres" does not exist
2020-03-19 04:54:32.038 UTC [15273] FATAL:  role "postgres" does not exist```
----
2020-03-19 06:12:35 UTC - tuteng: Can you show me the command?
----
2020-03-19 06:13:26 UTC - tuteng: Are you running Pulsar in cluster mode or standalone mode? Is auth enabled?
----
2020-03-19 06:15:19 UTC - Pushkar Sawant: My cluster is intermittently seeing a “Could not send message to broker within given timeout” error. The timeout is set to the default (30s).
In the broker logs I see the following sequence of log entries for the topic. All these entries happen within 1 second. Has anyone experienced this? Cluster utilization is well below 25% for Broker, Bookkeeper and Zookeeper.
```1. Create a new producer
2. Ledger closer as it's full
3. Producer closed.
4. New producer created.```
----
2020-03-19 07:27:49 UTC - Sijie Guo: @tuteng he is using 0.1.0, so it seems that the users table is not initialized.
----
2020-03-19 07:28:33 UTC - Sijie Guo: Can you paste some of the log about this?
----
2020-03-19 07:35:02 UTC - tuteng: I think I understand now.
Are you using your own postgres database? If so, you should first initialize the table structure. <https://github.com/apache/pulsar-manager/blob/branch-0.1.0/docker/init_db.sql>
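A minimal sketch of running that initialization against the bundled Postgres (placeholder connection details; the script path is the one in the pulsar-manager repo linked above):
```# load the pulsar-manager schema into the target database (adjust host/user/db as needed)
psql -h 127.0.0.1 -U pulsar -d pulsar_manager -f docker/init_db.sql```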
----
2020-03-19 08:43:29 UTC - Pushkar Sawant: ```Mar 19 05:44:28 prod-pulsar-broker-new-05 pulsar[4198]: 05:44:28.046 [pulsar-io-21-6] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.1.11.92:52764][<persistent://1663/Salesforce/Lead>] Creating producer. producerId=7926
Mar 19 05:44:28 prod-pulsar-broker-new-05 pulsar[4198]: 05:44:28.047 [ForkJoinPool.commonPool-worker-2] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.1.11.92:52764] Created new producer: Producer{topic=PersistentTopic{topic=<persistent://1663/Salesforce/Lead>}, client=/10.1.11.92:52764, producerName=prod-pulsar-new-28-96718, producerId=7926}
Mar 19 05:44:28 prod-pulsar-broker-new-05 pulsar[4198]: 05:44:28.063 [bookkeeper-ml-workers-OrderedExecutor-1-0] INFO  org.apache.bookkeeper.mledger.impl.OpAddEntry - [1663/Salesforce/persistent/Lead] Closing ledger 682134 for being full
Mar 19 05:44:28 prod-pulsar-broker-new-05 pulsar[4198]: 05:44:28.072 [pulsar-io-21-6] INFO  org.apache.pulsar.broker.service.ServerCnx - [PersistentTopic{topic=<persistent://1663/Salesforce/Lead}][prod-pulsar-new-28-96718]> Closing producer on cnx /10.1.11.92:52764
Mar 19 05:44:28 prod-pulsar-broker-new-05 pulsar[4198]: 05:44:28.072 [pulsar-io-21-6] INFO  org.apache.pulsar.broker.service.ServerCnx - [PersistentTopic{topic=<persistent://1663/Salesforce/Lead}][prod-pulsar-new-28-96718]> Closed producer on cnx /10.1.11.92:52764
Mar 19 05:44:28 prod-pulsar-broker-new-05 pulsar[4198]: 05:44:28.118 [pulsar-io-21-8] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.1.10.235:43906][<persistent://1663/Salesforce/OpportunityHistory>] Creating producer. producerId=12529
Mar 19 05:44:28 prod-pulsar-broker-new-05 pulsar[4198]: 05:44:28.120 [ForkJoinPool.commonPool-worker-0] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.1.10.235:43906] Created new producer: Producer{topic=PersistentTopic{topic=<persistent://1663/Salesforce/OpportunityHistory>}, client=/10.1.10.235:43906, producerName=prod-pulsar-new-28-96719, producerId=12529}
Mar 19 05:44:28 prod-pulsar-broker-new-05 pulsar[4198]: 05:44:28.129 [pulsar-io-21-8] INFO  org.apache.pulsar.broker.service.ServerCnx - [PersistentTopic{topic=<persistent://1663/Salesforce/OpportunityHistory}][prod-pulsar-new-28-96719]> Closing producer on cnx /10.1.10.235:43906
Mar 19 05:44:28 prod-pulsar-broker-new-05 pulsar[4198]: 05:44:28.129 [pulsar-io-21-8] INFO  org.apache.pulsar.broker.service.ServerCnx - [PersistentTopic{topic=<persistent://1663/Salesforce/OpportunityHistory}][prod-pulsar-new-28-96719]> Closed producer on cnx /10.1.10.235:43906
Mar 19 05:44:28 prod-pulsar-broker-new-05 pulsar[4198]: 05:44:28.183 [pulsar-io-21-7] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.1.14.129:58086][<persistent://1663/Salesforce/Lead>] Creating producer. producerId=8058
Mar 19 05:44:28 prod-pulsar-broker-new-05 pulsar[4198]: 05:44:28.184 [ForkJoinPool.commonPool-worker-0] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.1.14.129:58086] Created new producer: Producer{topic=PersistentTopic{topic=<persistent://1663/Salesforce/Lead>}, client=/10.1.14.129:58086, producerName=prod-pulsar-new-28-96720, producerId=8058}
Mar 19 05:44:28 prod-pulsar-broker-new-05 pulsar[4198]: 05:44:28.185 [bookkeeper-ml-workers-OrderedExecutor-1-0] WARN  org.apache.bookkeeper.client.BookieWatcherImpl - New ensemble: [10.1.11.229:3181, 10.1.12.46:3181] is not adhering to Placement Policy. quarantinedBookies: []
Mar 19 05:44:28 prod-pulsar-broker-new-05 pulsar[4198]: 05:44:28.188 [pulsar-ordered-OrderedExecutor-5-0-EventThread] INFO  org.apache.bookkeeper.client.LedgerCreateOp - Ensemble: [10.1.11.229:3181, 10.1.12.46:3181] for ledger: 686525
Mar 19 05:44:28 prod-pulsar-broker-new-05 pulsar[4198]: 05:44:28.188 [pulsar-ordered-OrderedExecutor-5-0-EventThread] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [1663/Salesforce/persistent/Lead] Created new ledger 686525
Mar 19 05:44:28 prod-pulsar-broker-new-05 pulsar[4198]: 05:44:28.200 [pulsar-io-21-7] INFO  org.apache.pulsar.broker.service.ServerCnx - [PersistentTopic{topic=<persistent://1663/Salesforce/Lead}][prod-pulsar-new-28-96720]> Closing producer on cnx /10.1.14.129:58086
Mar 19 05:44:28 prod-pulsar-broker-new-05 pulsar[4198]: 05:44:28.200 [pulsar-io-21-7] INFO  org.apache.pulsar.broker.service.ServerCnx - [PersistentTopic{topic=<persistent://1663/Salesforce/Lead}][prod-pulsar-new-28-96720]> Closed producer on cnx /10.1.14.129:58086```
----
2020-03-19 08:43:36 UTC - Pushkar Sawant: Affected topic: <persistent://1663/Salesforce/Lead>
----
2020-03-19 09:02:57 UTC - Sijie Guo: @tuteng do we document this somewhere?
----
2020-03-19 09:07:37 UTC - tuteng: Yes, we have a simple document <https://github.com/apache/pulsar-manager/blob/master/src/README.md#use-custom-databases>
----
2020-03-19 09:08:33 UTC - tuteng: But, this needs to be improved.
----