Posted to users@pulsar.apache.org by Apache Pulsar Slack <ap...@gmail.com> on 2018/10/02 09:11:03 UTC
Slack digest for #general - 2018-10-02

2018-10-01 11:30:42 UTC - Jean-Bernard van Zuylen: @Sijie Guo Pull request available: <https://github.com/apache/pulsar/pull/2690>
----
2018-10-01 14:54:19 UTC - Sijie Guo: :+1: 
----
2018-10-01 15:23:02 UTC - Grant Wu: I am confused as to why people are worried about the size of the docker image.
----
2018-10-01 15:23:21 UTC - Grant Wu: There is no docker image for Functions, as I understand it.  There’s a docker image for Pulsar.  But Pulsar does not invoke a Docker image for functions.
----
2018-10-01 15:23:30 UTC - Grant Wu: I am also confused as to why people are concerned about the overhead of function calls.
----
2018-10-01 15:23:46 UTC - Grant Wu: I feel like any such overhead is dwarfed by wanting to use higher-level languages like JavaScript
----
2018-10-01 15:26:14 UTC - Grant Wu: Or, hell, the overhead from doing IPC
----
2018-10-01 15:30:01 UTC - Matteo Merli: There’s no IPC involved though, the runtime has a consumer/producer instance in the function process. It’s the same as using the pub-sub API directly.
----
2018-10-01 15:30:39 UTC - Grant Wu: Er, doesn’t that still need to talk to Pulsar
----
2018-10-01 15:30:52 UTC - Grant Wu: Over the binary protocol
----
2018-10-01 15:31:15 UTC - Grant Wu: Like, directly using the pub-sub API requires talking over a protocol
----
2018-10-01 15:31:50 UTC - Matteo Merli: Sure, but that’s already optimized for high throughput through batching
----
2018-10-01 15:32:08 UTC - Grant Wu: Sure, I’m just saying that the overhead of _invoking a function_ is likely to be minimal still
----
2018-10-01 15:32:30 UTC - Matteo Merli: And the use of flow control to push many messages to a consumer instance
----
2018-10-01 15:34:25 UTC - Grant Wu: Although, I don’t know much about the Python/NodeJS interpreters, maybe invoking a function is relatively expensive.  I doubt it, though, especially in the case of invoking the same function over and over
----
2018-10-01 15:44:16 UTC - Grant Wu: Is a docker container started to run the Pulsar function?  That’s not the impression I got, but, correct me if my assumption is wrong
----
2018-10-01 16:35:38 UTC - Sanjeev Kulkarni: The overhead of invoking is fairly small even for interpreted languages like Python.
----
2018-10-01 16:36:36 UTC - Grant Wu: I believe that too, I’m just… willing to entertain the possibility :stuck_out_tongue: My philosophy re: performance is “pls bring measurements”
----
2018-10-01 16:36:47 UTC - Sanjeev Kulkarni: However, what could be large is any kind of overhead inside the function logic itself. For instance, if the function is looking up a value in a database, then we need to make sure that connection setup doesn’t happen on a per-message basis. This is where initializing those connections in constructors and reusing them in function code makes sense
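
A minimal sketch of that pattern, not taken from the Pulsar SDK: the class name `EnrichFunction`, its `process()` method, and the in-memory SQLite table are all hypothetical stand-ins. The connection is opened once in the constructor and reused for every message.

```python
import sqlite3


class EnrichFunction:
    """Hypothetical function class: one instance handles many messages."""

    def __init__(self):
        # Opened once per function instance, not once per message.
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
        self.db.execute("INSERT INTO users VALUES (1, 'alice')")
        self.db.execute("INSERT INTO users VALUES (2, 'bob')")
        self.connections_opened = 1

    def process(self, user_id):
        # Reuses the connection created in the constructor.
        row = self.db.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return row[0] if row else None


fn = EnrichFunction()
results = [fn.process(user_id) for user_id in (1, 2, 3, 1)]
```

Four messages are processed here, but only one connection is ever opened.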
----
2018-10-01 16:36:54 UTC - Grant Wu: That’s a good point!
----
2018-10-01 16:39:00 UTC - Sanjeev Kulkarni: WRT docker, I think the point is, when pulsar supports submitting functions to kubernetes (<https://github.com/apache/pulsar/pull/1950>), then every function submission starts a kubernetes job which will need to download a docker image from somewhere to init the pods. These pods only need the stuff for running functions and need not have anything for running pulsar itself. Thus, it might make sense to have a functions-only docker image, which could be very small.
----
2018-10-01 16:39:44 UTC - Grant Wu: Ah, okay, I didn’t realize that was done
----
2018-10-01 16:40:09 UTC - Grant Wu: Wouldn’t the docker images already be downloaded to run Pulsar though
----
2018-10-01 16:40:17 UTC - Grant Wu: i.e. wouldn’t they be available locally
----
2018-10-01 16:40:27 UTC - Grant Wu: Or am I misunderstanding the architecture
----
2018-10-01 16:41:30 UTC - Grant Wu: I guess it’s not necessarily the case
----
2018-10-01 16:41:43 UTC - Grant Wu: Because it might not be on a machine which has downloaded the images before
----
2018-10-01 16:42:14 UTC - Sanjeev Kulkarni: It might. I’m not sure if kubernetes master caches some images locally so that they need not be downloaded from the internet. But even if there is caching, imagine starting a function with parallelism of 100, and suddenly you see 100 copies of this large image being copied within a local network. That might starve other applications, even if only momentarily.
----
2018-10-01 16:42:30 UTC - Grant Wu: I see, makes sense.
----
2018-10-01 16:45:45 UTC - Dave Southwell: Noob question.  Last week I set up a staging pulsar instance and today I came in to find my /tmp was full of ./librocksdbjni956141922235128626.so type files.  Did I mis-configure something?  I mean clearly I have, but can someone point me in the right direction?
----
2018-10-01 16:47:46 UTC - Matteo Merli: Did you have any repeated crashes there? The `librocksdbjnixxxx.so` files are automatically extracted by the RocksDB jar. Under normal conditions, these files should be removed on shutdown
----
2018-10-01 16:47:58 UTC - Grant Wu: Is that related to <https://apache-pulsar.slack.com/archives/C5ZSVEN4E/p1538348632000100>
----
2018-10-01 16:48:52 UTC - Dave Southwell: I'll check the logs for crashes.
----
2018-10-01 17:26:06 UTC - Guillem: @Grant Wu i think a lot of questions here are coming from seeing the parallelism between pulsar and serverless
----
2018-10-01 17:26:32 UTC - Grant Wu: I'm seeing no parallelism
----
2018-10-01 17:26:34 UTC - Guillem: in serverless, you usually instantiate a container when you need to run a function, and destroy it if it hasn't been used for a while
----
2018-10-01 17:26:49 UTC - Guillem: as such, you want to minimize the footprint of the image for a number of reasons
----
2018-10-01 17:27:14 UTC - Guillem: disk usage, that potential download/caching of the image, time to start a container itself
----
2018-10-01 17:28:29 UTC - Grant Wu: Pulsar functions and serverless, sure
----
2018-10-01 17:28:33 UTC - Guillem: also not sure what other services may run in a pulsar container with the current image
----
2018-10-01 17:28:45 UTC - Guillem: so maybe there's something running even if you want that container for functions only
----
2018-10-01 17:28:51 UTC - Guillem: and that may be using some memory
----
2018-10-01 17:28:57 UTC - Grant Wu: Presumably nothing else would be turned on
----
2018-10-01 17:29:03 UTC - Guillem: (although this is just an assumption, i don't know the inner details)
----
2018-10-01 17:29:04 UTC - Matteo Merli: Nope, nothing is running in the container — if you don’t start it
----
2018-10-01 17:29:09 UTC - Guillem: ok, cool
----
2018-10-01 17:29:27 UTC - Guillem: then it would only be a matter of optimizing disk space and container startup times i guess
----
2018-10-01 17:30:04 UTC - Matteo Merli: > in serverless, you usually instantiate a container when you need to run a function, and destroy it if it hasn’t been used for a while

This is achieved through the parallelism setting of the function. It controls how many instances you have active. It’s still a manual setting for now.
----
2018-10-01 17:30:48 UTC - Guillem: yep, i'm aware of it @Matteo Merli and i think in the world of pulsar, it makes sense to keep those containers running unless you explicitly kill them
----
2018-10-01 17:31:10 UTC - Guillem: it's not really 1:1 with serverless, so keeping those containers alive for stream processing makes a lot of sense to me
----
2018-10-01 17:31:23 UTC - Guillem: it's probably more an issue of optimizing resources and also managing the scaling easily
----
2018-10-01 17:36:51 UTC - Guillem: so in terms of how functions work, can somebody tell me if my current understanding of how pulsar does it is correct?
- the worker container (i think this happens in the brokers?) will use the client library specific to the runtime to connect to the pulsar queue (source as input, sink as output)
- then the container will instantiate the class that you embed your function into so it does 'preload' things like what was discussed before around DB connections and so on
- then, when a new message is received in the pulsar source, it will be sent to the process() method of the instantiated class and the output piped to the output
- at the end, the class will still be instantiated and waiting for new messages to arrive to call the process() method again
----
2018-10-01 17:40:48 UTC - Sanjeev Kulkarni: It’s a little different from that. This is a brief summary of the workflow
----
2018-10-01 17:41:55 UTC - Sanjeev Kulkarni: 1. User submits a pulsar function using the rest api. The rest call can be serviced by a broker that has the functions_worker config enabled. Or it could go to a server that is dedicated to handling function requests
----
2018-10-01 17:43:16 UTC - Sanjeev Kulkarni: 2. Function_workers are configured to use some kind of runtime. They could be configured to use threadruntime (applicable for java functions only), process runtime or kubernetes runtime. So depending on the runtime and the function, this worker ensemble collectively starts the requested number of function instances amongst them
----
2018-10-01 17:44:08 UTC - Sanjeev Kulkarni: 3. The runtime decides what kind of action to take. Threadruntime just starts a new thread to service the function. ProcessRuntime starts a new process and Kubernetes runtime launches a k8s job
----
2018-10-01 17:44:49 UTC - Sanjeev Kulkarni: 4. What is started by these runtimes is a function instance that is nothing but a wrapped (producer -> function -> consumer) application.
----
2018-10-01 17:46:20 UTC - Sanjeev Kulkarni: As such, the producer/consumer of the function instance gets the data directly from pulsar using the pulsar api
----
2018-10-01 17:46:52 UTC - Sanjeev Kulkarni: that is different from the usual serverless architecture where the producer sits outside the serverless function and pipes the data to it
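
The workflow above can be sketched roughly as follows. This is a toy model under stated assumptions, not Pulsar code: `queue.Queue` objects stand in for the input and output topics, and `run_instance` and `start_instances` are hypothetical names for a function instance and a thread-style runtime. Each instance pulls its own input, applies the function, and publishes the result itself.

```python
import queue
import threading


def run_instance(fn, source, sink):
    """Toy function instance: read a message, apply fn, write the result."""
    while True:
        msg = source.get()
        if msg is None:  # sentinel used here to stop the instance
            break
        sink.put(fn(msg))


def start_instances(fn, source, sink, parallelism):
    """Toy thread-style runtime: one thread per requested instance."""
    threads = [
        threading.Thread(target=run_instance, args=(fn, source, sink))
        for _ in range(parallelism)
    ]
    for t in threads:
        t.start()
    return threads


input_topic = queue.Queue()   # stand-in for the input topic
output_topic = queue.Queue()  # stand-in for the output topic

workers = start_instances(str.upper, input_topic, output_topic, parallelism=2)
for word in ["a", "b", "c"]:
    input_topic.put(word)
for _ in workers:
    input_topic.put(None)  # one stop sentinel per instance
for t in workers:
    t.join()

outputs = sorted(output_topic.get() for _ in range(3))
```

With two instances the output order is not deterministic, which is why the results are sorted before being inspected.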
----
2018-10-01 18:41:32 UTC - Nicolas Ha: Is there a web healthcheck endpoint for Pulsar? Or just an endpoint that responds when the broker is alive?
This would be useful for CI/Healthcheck. It would be awesome if it conformed to the Kubernetes livenessProbe too
----
2018-10-01 18:41:59 UTC - Nicolas Ha: Pretty sure I asked a while back and there wasn’t one - if that’s not the case should I create a ticket?
----
2018-10-01 18:46:17 UTC - Ali Ahmed: @Nicolas Ha yes, there is. You can use “http://{broker-host}:8080/admin/brokers/configuration”
----
2018-10-01 18:46:29 UTC - Ali Ahmed: and check for 200 Ok
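
A small sketch of such a check using only the Python standard library. The function name `broker_is_alive` and the timeout value are assumptions; the URL path is the one quoted above, and whether it is the right liveness signal for a given deployment is not confirmed here.

```python
import urllib.error
import urllib.request


def broker_is_alive(host, port=8080, timeout=2.0):
    """Return True if the admin endpoint quoted above answers HTTP 200."""
    url = f"http://{host}:{port}/admin/brokers/configuration"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False
```

A CI job could call `broker_is_alive("localhost")` in a retry loop until it returns True before running tests.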
----
2018-10-01 18:46:57 UTC - Nicolas Ha: that would work for me yes, thank you :slightly_smiling_face: Do you know by any chance if it requires authentication?
----
2018-10-01 18:52:14 UTC - Nicolas Ha: (I’ll try and see)
----
2018-10-01 19:05:09 UTC - Nicolas Ha: no need for auth it seems :slightly_smiling_face: thanks Ahmed
----
2018-10-02 05:05:59 UTC - Nathanial Murphy: So I'm trying to backfill my pulsar instance with data from a datasource on a single partition. What are the common bottlenecks with pulsar that I can avoid to speed this up? I need to keep this topic to a single partition for the strict/total ordering guarantees
----
2018-10-02 05:18:48 UTC - Nathanial Murphy: also, second question - are there any plans to support distributed transactions across topics like kafka currently does?
----
2018-10-02 05:32:25 UTC - Matteo Merli: Make sure you publish asynchronously, to pipeline messages from client to broker and achieve higher throughput 
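
To illustrate why asynchronous publishing helps, here is a toy simulation, not the Pulsar client: `ToyBroker`, `publish_sync`, and `publish_pipelined` are hypothetical names, and the round trip is faked with `time.sleep`. A blocking send pays one round trip per message, while async sends keep many messages in flight and still preserve the order in which they were handed to the broker.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ToyBroker:
    """Fake broker: ordering is fixed at send time, acks arrive later."""

    def __init__(self, round_trip=0.02):
        self.log = []
        self.round_trip = round_trip
        self._lock = threading.Lock()
        self._net = ThreadPoolExecutor(max_workers=32)

    def send_async(self, msg):
        with self._lock:
            self.log.append(msg)  # position in the topic is decided here
            offset = len(self.log) - 1
        # The ack completes after a simulated network round trip.
        return self._net.submit(self._ack, offset)

    def _ack(self, offset):
        time.sleep(self.round_trip)
        return offset

    def send(self, msg):
        # Blocking send: wait for the ack before returning.
        return self.send_async(msg).result()


def publish_sync(broker, msgs):
    # One full round trip per message.
    return [broker.send(m) for m in msgs]


def publish_pipelined(broker, msgs):
    # Hand over everything first, then wait for the acks.
    futures = [broker.send_async(m) for m in msgs]
    return [f.result() for f in futures]
```

In this model, 100 messages at a 20 ms round trip take about 2 s with `publish_sync` and a small fraction of that with `publish_pipelined`, since up to 32 acks are awaited concurrently.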
----
2018-10-02 05:33:01 UTC - Matteo Merli: Yes, there are plans to get into that as well
----
2018-10-02 05:34:32 UTC - Nathanial Murphy: you're the real mvp @Matteo Merli
last question - what's the easiest way to get the last published message to a topic
----
2018-10-02 05:50:31 UTC - Matteo Merli: from a producer’s perspective? you mean after a crash or when publishing?
----
2018-10-02 05:57:14 UTC - Nathanial Murphy: After a crash. I need to be able to know where to resume on both the Pulsar and the mysql binlog side.
----
2018-10-02 05:59:41 UTC - Matteo Merli: Take a look at <http://pulsar.apache.org/docs/en/cookbooks-deduplication/>
----
2018-10-02 06:00:31 UTC - Matteo Merli: and <https://streaml.io/blog/pulsar-effectively-once> for a more prosaic version
----
2018-10-02 06:01:17 UTC - Matteo Merli: Once the deduplication is enabled, you can use `long lastSequenceId = producer.getLastSequenceId();` to fetch what was the last message published by a particular producer
----
2018-10-02 06:05:18 UTC - Nathanial Murphy: can I use that sequence ID to look up a message ID?
----
2018-10-02 06:05:40 UTC - Nathanial Murphy: I'm trying to resume another stream from a separate system to get data into pulsar - in this case, a (filename, byte offset) tuple from a mysql binlog. This information is available in the last message my producer published to a given topic. I can guarantee that my producer is the only one writing to this topic, and that my topic only covers one partition.
----
2018-10-02 06:27:50 UTC - Matteo Merli: You can assign any meaning to the sequence id, as long as it’s monotonically increasing (jumping ahead is fine)
----
2018-10-02 06:28:32 UTC - Matteo Merli: I mean, when you publish a message, you can specify the sequence id.
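
A toy model of how sequence-id deduplication and resume-after-crash fit together. `ToyDedupTopic` and its methods are hypothetical, not the Pulsar API, and the `-1` returned when nothing has been published yet is an assumption of this sketch. The topic remembers the highest sequence id accepted per producer and drops anything at or below it.

```python
class ToyDedupTopic:
    """Fake topic that drops messages whose sequence id is not new."""

    def __init__(self):
        self.messages = []
        self.last_sequence_id = {}  # producer name -> highest accepted id

    def publish(self, producer, sequence_id, payload):
        last = self.last_sequence_id.get(producer, -1)
        if sequence_id <= last:
            return False  # replayed message: dropped as a duplicate
        # Gaps are allowed: ids only need to keep increasing.
        self.messages.append(payload)
        self.last_sequence_id[producer] = sequence_id
        return True

    def get_last_sequence_id(self, producer):
        return self.last_sequence_id.get(producer, -1)


topic = ToyDedupTopic()
binlog = ["e0", "e1", "e2", "e3", "e4"]  # stand-in for source events

# First run: the producer publishes three events, then "crashes".
for i in range(3):
    topic.publish("binlog-reader", i, binlog[i])

# After restart: ask where we left off, then resume the source from there.
resume_from = topic.get_last_sequence_id("binlog-reader") + 1
replayed = topic.publish("binlog-reader", 1, binlog[1])  # stale replay
for i in range(resume_from, len(binlog)):
    topic.publish("binlog-reader", i, binlog[i])
```

Here the sequence id is simply the index of the source event, so the last sequence id doubles as the resume position in the source.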
----
2018-10-02 06:33:21 UTC - Nathanial Murphy: Okay, sure. I'd rather not encode a filepath in a sequenceID though. Is there any way of getting the last published message on a topic and/or partition?
----
2018-10-02 06:36:24 UTC - Matteo Merli: Not directly. You could use a Reader, but it can only be positioned on the oldest message, a specific message, or the latest message (but excluded). There’s currently no option to position on the latest message “included” :confused:
----
2018-10-02 06:40:48 UTC - Nathanial Murphy: hm. I could apply an incredibly aggressive compaction scheme to minimise the number of reads as it doesn't make sense to "compact" this topic
----
2018-10-02 06:41:04 UTC - Nathanial Murphy: idk. You've given me a lot to ruminate on. Thanks.
----