Posted to users@pulsar.apache.org by Apache Pulsar Slack <ap...@gmail.com> on 2018/05/29 09:11:02 UTC

Slack digest for #general - 2018-05-29

2018-05-28 09:14:20 UTC - Idan: @Ali Ahmed great, thanks. I'll go there and share results
----
2018-05-28 09:14:26 UTC - Idan: btw: we must have persistent and durable
----
2018-05-28 09:14:34 UTC - Idan: all messages must me kept
----
2018-05-28 09:14:39 UTC - Idan: that still will allow us 5ms latency?
----
2018-05-28 09:15:27 UTC - Ali Ahmed: yes
----
2018-05-28 09:15:28 UTC - Idan: @Matteo Merli I'm struggling to build the producer using the new API. perhaps you can guide: 
            Producer<String> producer = client.newProducer()
                    .topic(topic)
                    .create();
----
2018-05-28 09:15:34 UTC - Idan: great
----
2018-05-28 09:15:53 UTC - Idan: @Ali Ahmed for what throughput? we're expecting 100,000 msg per sec
----
2018-05-28 09:16:01 UTC - Idan: that will still allow us 5ms?
----
2018-05-28 09:17:24 UTC - Ali Ahmed: it can be; that depends on your batch and payload size.
----
2018-05-28 09:18:25 UTC - Idan: does the client have a connection pool under the hood? because flushing this queue for every single message
----
2018-05-28 09:18:39 UTC - Idan: on high-load
----
2018-05-28 09:18:42 UTC - Idan: can take lots of resources
----
2018-05-28 09:20:16 UTC - Ali Ahmed: here is how to create a producer
```
Producer<byte[]> producer = client.newProducer()
        .topic(topic)
        .enableBatching(true)
        .sendTimeout(10, TimeUnit.SECONDS)
        .producerName("my-producer")
        .create();
```
----
2018-05-28 09:22:24 UTC - Ali Ahmed: are you saying one vert.x instance is being invoked 100,000 times a second?
----
2018-05-28 09:22:36 UTC - Idan: not one.. we can have multiple
----
2018-05-28 09:22:43 UTC - Idan: multiple instances
----
2018-05-28 09:22:58 UTC - Idan: but the queue eventually will hit load of 100,000 msg per sec
----
2018-05-28 09:23:18 UTC - Idan: we are a gaming platform
----
2018-05-28 09:23:27 UTC - Idan: having millions of players around the globe
----
2018-05-28 09:24:12 UTC - Idan: we had doubts about using Kafka, but pulsar looks like the new age
----
2018-05-28 09:24:31 UTC - Ali Ahmed: pulsar can publish tens of millions of messages per second with low latency and guaranteed durability, but I can't give a simple answer
----
2018-05-28 09:25:00 UTC - Ali Ahmed: at these high rates there are config switches that need to be tuned
----
2018-05-28 09:25:27 UTC - Idan: Is there a doc somewhere explaining the fields of this request?
----
2018-05-28 09:25:56 UTC - Idan: @Ali Ahmed that sounds great. I'm trying to find my way within your docs to understand how to use the system wisely
----
2018-05-28 09:26:24 UTC - Idan: btw: how would you best suggest to test pulsar locally?
----
2018-05-28 09:26:25 UTC - Ali Ahmed: docs are still work in progress but a draft is here  
<https://pulsar.incubator.apache.org/docs/latest/clients/Java/>
----
2018-05-28 09:28:16 UTC - Ali Ahmed: depends on what you plan to do. I generally spin up a small cluster in AWS for testing purposes; local instances are really meant for some sanity/function testing
----
2018-05-28 09:28:32 UTC - Ali Ahmed: I can’t recommend a local cluster for production traffic
----
2018-05-28 09:28:39 UTC - Idan: of course not. just for testing
----
2018-05-28 09:28:46 UTC - Idan: to get my consumers-producers going
----
2018-05-28 09:28:50 UTC - Ali Ahmed: sure
----
2018-05-28 09:29:11 UTC - Ali Ahmed: local is fine for that
----
2018-05-28 09:31:09 UTC - Ali Ahmed: To come back to my original point, the good thing about pulsar relative to kafka is that it's configurable in production: you can scale brokers and bookies up and down and change configs without any downtime, since there is no repartitioning or rebalancing.
----
2018-05-28 09:31:52 UTC - Ali Ahmed: so I recommend starting small and, depending on the workload in prod, adjusting the number of nodes and configs
----
2018-05-28 09:37:08 UTC - Idan: @Ali Ahmed thanks for pointing this out.
----
2018-05-28 09:38:57 UTC - Ali Ahmed: @Idan Just to let you know 2.0 release is forthcoming very soon
----
2018-05-28 09:39:17 UTC - Idan: yes. I'm waiting for it. right now working on the rc1 client
----
2018-05-28 10:02:03 UTC - Mate Varga: Hi,
would someone be so kind as to provide me a list of companies using Pulsar in production right now? Of course I don't need a 'complete' list, just a few names.
----
2018-05-28 10:08:39 UTC - Ali Ahmed: ```yahoo, <http://taxistartup.com|taxistartup.com>```
----
2018-05-28 10:11:30 UTC - Mate Varga: Thanks.
----
2018-05-28 10:13:10 UTC - Mate Varga: I'm asking this because we need to decide what kind of messaging infra we're going to introduce internally, and feature-wise Pulsar is just great, but it's also quite new and it'd be quite inconvenient if it were left without maintenance. :slightly_smiling_face:
----
2018-05-28 10:14:33 UTC - Idan: @Mate Varga same thoughts here:)
----
2018-05-28 10:16:57 UTC - Ali Ahmed: I can’t say much here but there are commercial enterprise supported deployments being done and there is an aggressive roadmap for features and performance improvements planned.
----
2018-05-28 10:23:17 UTC - Matti-Pekka Laaksonen: I have been experimenting with Pulsar, though only with a local setup on my laptop. I work for a public transport authority, and we'd like to use Pulsar (or Kafka, but Pulsar seems easier to manage) for processing real-time data from our vehicles. Mostly this means arrival and departure time predictions for stops, but also GPS locations for each vehicle every second, and we are talking about a few thousand messages per second here. This is just to give some background, but I think my questions are quite general in nature.

I have written some small Java services that incrementally consume individual messages, process them and produce them into pulsar for further processing. When we get a raw estimate for a single stop event, there are going to be three to five linear processing steps before the prediction is published, and the message is fed to a new topic in Pulsar at each step. Pulsar promises a latency of 5ms and this is what I have been seeing in my laptop-based environment. But doesn't this mean that a synchronous consumer-producer loop can only handle 200 messages per second? And if the messages arrive in batches of 200 messages, the last one actually has a latency of 1000 ms? So can synchronous consumers/producers be used only for very light workloads, and are asynchronous consumers and producers the more normal case?
----
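The arithmetic behind those numbers can be sketched quickly (a back-of-envelope check only, taking the 5 ms per-publish latency quoted above as given):

```java
// Back-of-envelope math for a synchronous consume-process-publish loop,
// assuming the ~5 ms per-publish latency quoted in the thread.
public class SyncLoopMath {
    public static void main(String[] args) {
        double publishLatencyMs = 5.0;

        // Waiting for each send to complete before issuing the next one
        // caps throughput at 1000 / 5 = 200 messages per second.
        double maxThroughputPerSec = 1000.0 / publishLatencyMs;

        // If 200 messages arrive at once and are sent one by one,
        // the last one waits behind the other 199: 200 * 5 = 1000 ms.
        double lastMessageLatencyMs = 200 * publishLatencyMs;

        System.out.println(maxThroughputPerSec + " msg/s, tail latency "
                + lastMessageLatencyMs + " ms");
    }
}
```

This one-message-per-round-trip cap is exactly what overlapping many in-flight sends (async producers/consumers, batching) removes.
----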
2018-05-28 10:31:13 UTC - Mate Varga: @Matti-Pekka Laaksonen what kind of coupling do you have between producers and consumers? the ~200 msgs/s limit applies if you have a single producer and a single consumer and they communicate through a single topic and the producer somehow waits with the next message until the consumer can process the previous message
----
2018-05-28 10:31:50 UTC - Mate Varga: but this is essentially synchronous RPC over a high-latency infrastructure, you should not be using a 'queue' for this I think
----
2018-05-28 10:32:15 UTC - Matti-Pekka Laaksonen: @Mate Varga Yes, this is exactly the case, and I quite quickly realized it is not the smartest use for Pulsar
----
2018-05-28 10:32:37 UTC - Mate Varga: Let's say I'd like to start prototyping now and use Pulsar in prod in a few weeks for non-business critical apps. Is 2.0 the way to go?
----
2018-05-28 10:34:05 UTC - Ali Ahmed: 2.0 rc is already out; the official release is in the next few days
----
2018-05-28 10:34:46 UTC - Matti-Pekka Laaksonen: This also leads to another conceptual problem I've come across. Some processing steps simply transform the message in some way. However, sometimes we need to maintain a consistent state that we update with the Pulsar messages. One example use case is handling the forecasts for all currently ongoing trips. This requires a HashMap-like structure that holds all the stop predictions for each trip that is currently under way
----
2018-05-28 10:35:28 UTC - Mate Varga: Use grpc or Avro RPC or Thrift for synchronous RPC -- you can get close to 10 microseconds in latency, but it's a completely different use case.
----
2018-05-28 10:35:54 UTC - Mate Varga: Thanks. :thumbsup:
----
2018-05-28 10:37:55 UTC - Matti-Pekka Laaksonen: I have been juggling different options, but the current idea is to have a single Java process that uses async consumers and producers to read data from Pulsar and write it back, but the state is only touched by a single thread. Does this sound reasonable at all?
----
2018-05-28 10:39:49 UTC - Mate Varga: Almost :slightly_smiling_face: your state will be your 'hashmap' and the in-flight messages.
----
2018-05-28 10:40:19 UTC - Matti-Pekka Laaksonen: @Matti-Pekka Laaksonen uploaded a file: <https://apache-pulsar.slack.com/files/UATSA58QK/FAX1GHHB6/-.java|Untitled>
----
2018-05-28 10:40:26 UTC - Mate Varga: if the 'output' of your processing pipeline is the 'hashmap', then it's absolutely reasonable to only update it from a single consumer
----
2018-05-28 10:41:31 UTC - Matti-Pekka Laaksonen: This snippet is the current implementation of the async consumer/processor-loop. The whole CompletableFuture API is also new to me, so this is quite a big bite to take at once
----
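Since the uploaded snippet is not visible in this digest, here is a minimal stand-alone sketch of the receive -> process -> publish chain with CompletableFuture. A LinkedBlockingQueue stands in for the Pulsar topic so the example runs without a broker; with the real client the receive step would be `consumer.receiveAsync()` and the publish step `producer.sendAsync()`. The single-threaded executor keeps the shared state touched by exactly one thread, as discussed above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.*;

// Sketch of the async consume -> process -> publish chain. The queues
// stand in for Pulsar topics (receiveAsync/sendAsync in the real client);
// all state updates run on one single-threaded executor, so the shared
// map is only ever touched by one thread.
public class AsyncLoopSketch {
    static final BlockingQueue<String> in = new LinkedBlockingQueue<>();
    static final BlockingQueue<String> out = new LinkedBlockingQueue<>();
    static final Map<String, String> state = new HashMap<>();
    static final ExecutorService stateThread = Executors.newSingleThreadExecutor();

    static CompletableFuture<String> receiveAsync() {
        // Stand-in for consumer.receiveAsync()
        return CompletableFuture.supplyAsync(() -> {
            try { return in.take(); }
            catch (InterruptedException e) { throw new CompletionException(e); }
        });
    }

    static CompletableFuture<Void> processOne() {
        return receiveAsync()
            .thenApplyAsync(msg -> {            // update state on the single thread
                state.put("latest", msg);
                return msg.toUpperCase();       // the "processing" step
            }, stateThread)
            .thenAccept(out::add);              // stand-in for producer.sendAsync()
    }

    public static void main(String[] args) throws Exception {
        in.add("trip-42: eta 13:05");
        processOne().get(5, TimeUnit.SECONDS);
        System.out.println(out.take());         // TRIP-42: ETA 13:05
        stateThread.shutdown();
    }
}
```

In a real service, `processOne` would re-arm itself (call `processOne()` again in a completion callback) so many messages are in flight at once instead of one per round trip.
----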
2018-05-28 10:41:35 UTC - Ali Ahmed: @Matti-Pekka Laaksonen I'll give some general thoughts here: what you're describing is a generic synchronous producer-consumer loop that will hold true for any system. In the case of a system like pulsar you really need to think about scaling out to many producers and consumers, which pulsar handles very well.
----
2018-05-28 10:42:49 UTC - Ali Ahmed: for compute, especially stateful compute, pulsar is introducing some features in alpha preview as part of 2.0 that will be the right way to go
----
2018-05-28 10:43:00 UTC - Matti-Pekka Laaksonen: @Mate Varga Yes, the output is actually the HashMap (likely a cache from Google's Guava library in the production version) that we'll serialize periodically. But in addition to this I think we'll publish every incremental update also
----
2018-05-28 10:43:52 UTC - Mate Varga: Do you have any (early) discussion or design doc on this?
----
2018-05-28 10:44:32 UTC - Ali Ahmed: <https://streaml.io/blog/pulsar-functions/>
----
2018-05-28 10:45:08 UTC - Matti-Pekka Laaksonen: In general, it would be nice to see an example of a well scaling consumer-producer process that reads messages from Pulsar and writes them back to it
----
2018-05-28 10:45:42 UTC - Mate Varga: Ah, you meant the functions. Cool, thanks.
----
2018-05-28 10:46:50 UTC - Ali Ahmed: if people are trying to simulate high capacity workloads on pulsar you can use this as a starting point
<https://github.com/openmessaging/openmessaging-benchmark>
----
2018-05-28 10:47:13 UTC - Ali Ahmed: there are examples for pulsar already there
----
2018-05-28 10:49:33 UTC - Mate Varga: @Ali Ahmed - we're using Docker Swarm as our container orchestrator. You don't have docs for Swarm, which is ok, but -- what are the recommendations for a small production deployment? The docs for bare metal say that we'd need 6 VMs (we definitely won't). Should I try following the Kubernetes docs (which recommend 2 BookKeepers, 3 ZKs, 3 Pulsar brokers, plus monitoring)?
----
2018-05-28 10:50:46 UTC - Ali Ahmed: the kubernetes template sounds reasonable
+1 : Mate Varga, Ali Ahmed
----
2018-05-28 11:06:04 UTC - Ivan Kelly: @Mate Varga there are some docker-compose files in review (not merged yet). If this is going to be production, you should have 3 BKs; 2 pulsar brokers should be enough
----
2018-05-28 11:06:37 UTC - Mate Varga: thanks, 2 BK was a bit suspicious :slightly_smiling_face:
----
2018-05-28 11:07:18 UTC - Ivan Kelly: ya, if you lose one, then all writes have only one copy, which is asking for trouble
----
2018-05-28 11:08:03 UTC - Ivan Kelly: from a "voting" point of view, 2 is fine though, because BK doesn't implement consensus itself, it uses ZK
----
2018-05-28 11:08:40 UTC - Mate Varga: BTW, does Pulsar support individual message _deletion_? (not just ack)
----
2018-05-28 11:09:27 UTC - Mate Varga: GDPR is a pain because we cannot use e.g. Kafka to store all of our time-series data because we might need to delete messages
----
2018-05-28 11:09:36 UTC - Mate Varga: so I wonder how Pulsar behaves in this aspect
----
2018-05-28 11:10:44 UTC - Ivan Kelly: no, you're going to find it hard to find a highly performant store that does deletes. Deletion is hard
----
2018-05-28 11:11:16 UTC - Ivan Kelly: but what you can do is encrypt the personal info with a key which is unique to the user
----
2018-05-28 11:11:42 UTC - Ivan Kelly: then if you need to delete the user, just delete the key
----
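The "delete the key" idea Ivan describes (sometimes called crypto-shredding) can be sketched like this; the class and method names are illustrative, and a real deployment would keep the keys in a proper key store (or, as suggested below, a compacted Pulsar topic) rather than an in-memory map:

```java
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Sketch of "delete the key instead of the data": each user gets their
// own AES key, the immutable log stores only ciphertext, and forgetting
// a user means dropping their key.
public class CryptoShred {
    static final Map<String, SecretKey> keysByUser = new HashMap<>();
    static final SecureRandom rng = new SecureRandom();

    static SecretKey newKey() {
        try {
            KeyGenerator kg = KeyGenerator.getInstance("AES");
            kg.init(256);
            return kg.generateKey();
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    static byte[] encryptForUser(String user, byte[] plaintext) throws Exception {
        SecretKey key = keysByUser.computeIfAbsent(user, u -> newKey());
        byte[] iv = new byte[12];                 // fresh nonce per record
        rng.nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ct = c.doFinal(plaintext);
        byte[] record = new byte[iv.length + ct.length]; // store IV with the record
        System.arraycopy(iv, 0, record, 0, iv.length);
        System.arraycopy(ct, 0, record, iv.length, ct.length);
        return record;
    }

    // "Deleting" the user: their ciphertext in the log becomes garbage.
    static void forgetUser(String user) { keysByUser.remove(user); }

    public static void main(String[] args) throws Exception {
        byte[] record = encryptForUser("user-123", "ip=10.0.0.1".getBytes());
        forgetUser("user-123");
        System.out.println(keysByUser.containsKey("user-123")); // false
        System.out.println(record.length > 0);                  // true
    }
}
```

Once `forgetUser` runs, the ciphertext still sitting on disk can no longer be decrypted, which is the practical equivalent of deletion.
----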
2018-05-28 11:12:27 UTC - Mate Varga: we do encryption but it's... not that trivial; there are caveats, like how you filter and search effectively over encrypted data
----
2018-05-28 11:12:45 UTC - Ivan Kelly: you could even use another pulsar topic for this, with compaction and a 90-day retention period
----
2018-05-28 11:12:53 UTC - Mate Varga: this = ?
----
2018-05-28 11:13:07 UTC - Ivan Kelly: to store the encryption keys
----
2018-05-28 11:13:38 UTC - Ivan Kelly: there's a part in GDPR about only keeping data for 90 days too, so you'd have to roll the keys regularly also
----
2018-05-28 11:13:56 UTC - Ivan Kelly: from what I hear, this is what spotify is doing internally
----
2018-05-28 11:14:17 UTC - Mate Varga: ah ok, we have our own model for handling encryption keys (that's more sophisticated than this above because it is medical data), so that is OK
----
2018-05-28 11:14:48 UTC - Mate Varga: so we basically cannot persist user keys :slightly_smiling_face:
----
2018-05-28 11:17:32 UTC - Byron: @Matteo Merli i am attempting to build a static Go binary. getting a whole host of errors. dynamically linking builds correctly.
----
2018-05-28 11:17:36 UTC - Ivan Kelly: anyhow, my point is, deleting individual pieces of info, to the point that it no longer exists on disk, is hard
----
2018-05-28 11:17:59 UTC - Mate Varga: yes, right now I am thinking on just storing this kind of data in a separate document store (if needed)
----
2018-05-28 11:18:13 UTC - Mate Varga: e.g. Datomic or something similar that has a relatively good 'time concept'
----
2018-05-28 11:18:32 UTC - Mate Varga: and use the messaging infrastructure with non-infinite retention period
----
2018-05-28 11:18:45 UTC - Ivan Kelly: is all the data personal?
----
2018-05-28 11:18:50 UTC - Ivan Kelly: @Byron he likely won't be online for 5 hours or so
----
2018-05-28 11:20:10 UTC - Ivan Kelly: from what I've read on GDPR, it's aimed at places where, of all the data, only a small bit is identifiable personal data, so it encourages this data to be pulled out and stored apart from the rest.
----
2018-05-28 11:20:24 UTC - Ivan Kelly: the pseudo anonymization part
----
2018-05-28 11:21:10 UTC - Byron: Thanks, I figured.. it is early :wink:
----
2018-05-28 11:21:22 UTC - Byron: Just leaving it there for when he reads it
----
2018-05-28 11:22:10 UTC - Ivan Kelly: For example, let's say you're tracking a user over the internet. Their IP is identifiable, but the sites they browse aren't. So the IP should be moved to some other store, and a reference to it put in the tracking stream.
----
2018-05-28 11:22:58 UTC - Mate Varga: We're storing personal medical health records. So we have sensitive data and even more sensitive data, basically.
----
2018-05-28 11:23:30 UTC - Ivan Kelly: mailing list could be another good option (and also will share the solution with anyone else having the same issue)
----
2018-05-28 11:23:43 UTC - Mate Varga: Pseudoanonymization is really ineffective at scale.
----
2018-05-28 11:30:53 UTC - Ivan Kelly: ya, I get the impression that the EU was thinking in the database-normalization mindset from 15 years ago
----
2018-05-28 14:25:05 UTC - Mate Varga: Another question: is there a way to completely 'reset' the state of Pulsar? The use case is automatic system testing where we need a well-known state before each test case.
----
2018-05-28 14:34:00 UTC - Ivan Kelly: this is on multiple VMs/machines?
----
2018-05-28 14:34:55 UTC - Ivan Kelly: stop all services, delete the zookeeper journal and data directories, delete the bookkeeper journal and ledger directories.
----
2018-05-28 14:35:42 UTC - Ivan Kelly: bring zookeeper up, cluster init metadata, and then boot bk and pulsar
----
2018-05-28 14:40:47 UTC - Mate Varga: hm, we'd probably use the single-container dockerized Pulsar
----
2018-05-28 14:41:01 UTC - Mate Varga: and one test takes somewhere between 2-15 seconds
----
2018-05-28 14:42:10 UTC - Mate Varga: restarting pulsar takes at least 5 seconds as far as I can see, which is a bit too much (we reset other state as well, flush RDBMSs, etc. but that usually takes a few hundred msec)
----
2018-05-28 14:43:08 UTC - Mate Varga: so I'm looking for something similar to <https://kafka.apache.org/10/documentation/streams/developer-guide/app-reset-tool>
----
2018-05-28 14:44:51 UTC - Ivan Kelly: so you just want the topics to be empty?
----
2018-05-28 14:45:01 UTC - Ivan Kelly: you could delete and recreate the namespaces
----
2018-05-28 14:45:38 UTC - Ivan Kelly: or even use different namespaces for each test
----
2018-05-28 14:48:22 UTC - Mate Varga: &gt; you could delete and recreate the namespaces
that might be an option. Just emptying topics might not be ideal because the app itself might create topics, so leaving them around creates coupling between individual test cases.
----
2018-05-28 14:48:41 UTC - Mate Varga: &gt; or even use different namespaces for each test
hm, i need to learn more about how Pulsar is using namespaces
----
2018-05-28 14:48:47 UTC - Mate Varga: thanks!
----
2018-05-28 14:51:14 UTC - Ivan Kelly: if you can use separate namespaces, it'll give you the best isolation
----
2018-05-28 14:51:23 UTC - Ivan Kelly: or even different tenants
----
2018-05-28 14:52:51 UTC - Mate Varga: how expensive is it to create namespaces and/or tenants?
----
2018-05-28 14:53:02 UTC - Mate Varga: (in a practically empty pulsar cluster)
----
2018-05-28 14:56:20 UTC - Ivan Kelly: it's a few znodes in zookeeper
+1 : Mate Varga, Ali Ahmed
----
2018-05-28 16:52:24 UTC - Igor Zubchenok: Will it all work in 3 nodes configuration (with broker+bookkeeper+zookeeper at every node) if one node is dead when I set for `managedLedgerDefaultEnsembleSize=3`, `managedLedgerDefaultWriteQuorum=3` and `managedLedgerDefaultAckQuorum=2`?
----
2018-05-28 16:55:18 UTC - Sijie Guo: No, in general you need managedLedgerDefaultEnsembleSize + 1 bookies. Or you can reduce the ensemble size or write quorum size
----
2018-05-28 16:56:53 UTC - Igor Zubchenok: If I set `managedLedgerDefaultWriteQuorum=2`, so I have to set `managedLedgerDefaultAckQuorum=1` to avoid any timeouts caused by bookkeeper autorecovery, right?
----
2018-05-28 16:59:57 UTC - Sijie Guo: No, you can still set the ack quorum to 2. When a bookie goes down, the ledgers stored on the down bookie will do ensemble changes to store on the other 2 bookies. You will see a hiccup during the ensemble change, but it is very minimal in general
----
2018-05-28 17:04:44 UTC - Matteo Merli: Igor, was the earlier problem due to fsyncing on the single HDD for journal and storage?
----
2018-05-28 17:04:44 UTC - Igor Zubchenok: When a bookie is down we now have downtime as long as 10 minutes (in our case we create tens of thousands of topics per hour), so I would like to try to avoid any ensemble changes if one bookie is down.
----
2018-05-28 17:05:49 UTC - Igor Zubchenok: &gt;Igor, was the earlier problem due to fsyncing on the single HDD for journal and storage?
we use 3 nodes Intel Xeon E3 with SSD for production
----
2018-05-28 17:06:55 UTC - Sijie Guo: if so you can enable delayEnsembleChange; that way you can still keep the current 3/2/2 setting and allow one bookie to go down without triggering ensemble changes.
----
2018-05-28 17:09:32 UTC - Igor Zubchenok: if `managedLedgerDefaultWriteQuorum=2` then data is stored on two bookies, and if one of the two is down, the broker will not get acks from 2 bookies (only from one) - this will cause a timeout or an exception.
----
2018-05-28 17:11:02 UTC - Sijie Guo: Sorry my previous comment is for 3/3/2
----
2018-05-28 17:12:02 UTC - Sijie Guo: So you still get 2 guaranteed copies, tolerating one bookie going down without triggering ensemble changes
----
2018-05-28 17:12:20 UTC - Igor Zubchenok: The earlier problem was gone (or we were not able to reproduce it) when we disabled fsync. Then we pushed to production and now we have something similar.
----
2018-05-28 17:13:14 UTC - Igor Zubchenok: With config 3/3/2, will new topics/ledgers be created if one of the three bookies is down?
----
2018-05-28 17:15:47 UTC - Sijie Guo: Creation will not succeed. Creation requires ensembleSize + 1 bookies to tolerate one going down.
----
2018-05-28 17:17:39 UTC - Igor Zubchenok: So let's summarize: for complete one-node fault tolerance I need 4 nodes minimum, config 3/3/2, and delayEnsembleChange enabled?
----
2018-05-28 17:19:44 UTC - Sijie Guo: Yes. But if you have 4 nodes, I don’t think you need delayEnsembleChange.
----
2018-05-28 17:20:36 UTC - Igor Zubchenok: does enabling delayEnsembleChange mean disabling BK autorecovery?
----
2018-05-28 17:23:45 UTC - Igor Zubchenok: (cannot find delayEnsembleChange in configs)
----
2018-05-28 17:24:36 UTC - Sijie Guo: No, auto recovery is independent. Delay ensemble change means that if it doesn’t break the ack quorum, the bk client doesn’t proactively do ensemble changes. But the auto recovery will still happen.
----
2018-05-28 17:25:18 UTC - Sijie Guo: It is a new setting in bk. So it is not updated in the broker’s template conf 
----
2018-05-28 17:26:34 UTC - Igor Zubchenok: Pulsar 1.22 uses BK 4.3.x; it seems delayEnsembleChange is only available in BK 4.5.x
----
2018-05-28 17:28:45 UTC - Igor Zubchenok: *Summary: 4 nodes minimal and configuration 3/3/2 for complete one node fault tolerance.*

(maybe it will help someone, we currently are failing with 3 nodes 2/2/2 configuration)
+1 : Sijie Guo, Artem Shaban, Vasily Yanov
----
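To make that summary concrete, the corresponding broker.conf fragment would look roughly like this (these three settings are the ones named in the discussion above; `delayEnsembleChange` is a BookKeeper client setting that, per Sijie, is not yet in the broker's template conf, so it is left out here):

```
# 3/3/2: write each entry to an ensemble of 3 bookies and wait for 2 acks.
# Creating *new* ledgers while surviving one bookie failure requires
# ensembleSize + 1 = 4 bookie nodes in the cluster.
managedLedgerDefaultEnsembleSize=3
managedLedgerDefaultWriteQuorum=3
managedLedgerDefaultAckQuorum=2
```
----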
2018-05-28 19:59:39 UTC - Karthik Ramasamy: @Idan @Mate Varga current companies running pulsar in production include Yahoo, Yahoo Japan, TaxiStartup, and there are a bunch of others that we don’t know about - we are collecting more information
----
2018-05-28 21:31:10 UTC - Igor Zubchenok: Ouch, are we (TaxiStartup) only 1 of 3?
----
2018-05-28 23:59:47 UTC - Karthik Ramasamy: <https://twitter.com/danielfejo/status/988427636270563328>
----
2018-05-29 00:01:54 UTC - Karthik Ramasamy: Mercado Libre
----
2018-05-29 00:02:16 UTC - Karthik Ramasamy: more pilots in several media companies and banks are ongoing
----
2018-05-29 05:15:28 UTC - JD Suarez: @Matteo Merli @Jerry Peng Thx for the confirmation. I was able to deploy my function fine without using the yaml file, but using yaml I was getting an error. It seems like "jar" is not a valid attribute; the error message spat out around 16 valid configuration attributes, but "jar" was not among them.
----
2018-05-29 05:56:55 UTC - Jerry Peng: @JD Suarez yup currently you have to specify the jar package via CLI arguments. But you can specify the rest of the configs in a yaml file
----
2018-05-29 05:57:32 UTC - Jerry Peng: e.g.
```
./bin/pulsar-admin functions create --functionConfigFile pulsar-functions/java-examples/src/main/resources/example-function-config.yaml --jar pulsar-functions/java-examples/target/pulsar-functions-api-examples.jar
```
----