You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pulsar.apache.org by Lari Hotari <lh...@apache.org> on 2023/06/20 07:33:22 UTC

[DISCUSS] Pluggable Pulsar Functions runtime to support new runtimes

Dear Pulsar Community Members,

I would like to initiate a discussion on making the Pulsar Functions
runtime "pluggable". In doing so, we can ensure that the addition of new
runtime types becomes more straightforward.

This use case will allow us to add support for Pulsar Functions based on
various platforms such as:

* Pulsar Client Reactive
* Node.js / JavaScript
* WebAssembly (WASM)
* Spring Pulsar & Reactive Spring

One of the weak points in the current Pulsar Functions runtime is the
default handling of messages individually. Individual message processing
can be slow and inefficient in cases where the main function of the
Pulsar Function (or Sink) is to do backend API calls.

Although pipelining (processing multiple in-flight messages) is possible
in current Pulsar Functions and Sinks, it often leads to complex and
error-prone solutions, especially when there's a need to combine
key-based ordered processing with retry and backoff implementations.

The Reactive Pulsar Client provides an inbuilt solution for implementing
pipelining. With its ReactiveMessagePipelineBuilder, we can configure
concurrency levels with key-ordered processing support. This capability
could potentially eliminate the need to use key-shared subscriptions to
scale Pulsar processing. If a reactive Pulsar Function were primarily to
serve as a router for API calls, we could adjust the concurrency level
to hundreds or even thousands, provided the backend could handle the
load.

With a pluggable Pulsar Functions runtime, we could introduce new
runtime types without the need for implementing each type in the
upstream project. This strategy could likely lead to new opportunities
for innovative ideas and contributions in this field.

I am interested to know your thoughts on making the Pulsar Functions
runtime pluggable so that we can add new runtime types.

Best Regards,

-Lari

Re: [DISCUSS] Pluggable Pulsar Functions runtime to support new runtimes

Posted by Lari Hotari <lh...@apache.org>.
On 2023/06/20 09:12:28 Enrico Olivelli wrote:
> > I am interested to know your thoughts on making the Pulsar Functions
> > runtime pluggable so that we can add new runtime types.
> 
> I see that RuntimeFactory [1] is already customizable.
> What can we do more ?
> Are you talking about providing alternative implementations for
> JavaInstanceRunnable [2] ?

My intention was to first focus on the use case before getting into the details of how it would exactly be implemented. With a pluggable solution, I mean having a solution in place where you could possibly add .nar files to some directory and add support for new runtime types by implementing some plugin specification. The current solution doesn't contain this property.
A pluggable solution would make it easier for contributing new runtime types. 
Let's say if we would want to add support for these technologies:
* functions written in Node.js / JavaScript
* functions using WebAssembly (WASM), for example implemented in Rust

Makes sense?

-Lari

Re: [DISCUSS] Pluggable Pulsar Functions runtime to support new runtimes

Posted by Enrico Olivelli <eo...@gmail.com>.
Lari,

Il giorno mar 20 giu 2023 alle ore 09:33 Lari Hotari
<lh...@apache.org> ha scritto:
>
> Dear Pulsar Community Members,
>
> I would like to initiate a discussion on making the Pulsar Functions
> runtime "pluggable". In doing so, we can ensure that the addition of new
> runtime types becomes more straightforward.
>
> This use case will allow us to add support for Pulsar Functions based on
> various platforms such as:
>
> * Pulsar Client Reactive
> * Node.js / JavaScript
> * WebAssembly (WASM)
> * Spring Pulsar & Reactive Spring
>
> One of the weak points in the current Pulsar Functions runtime is the
> default handling of messages individually. Individual message processing
> can be slow and inefficient in cases where the main function of the
> Pulsar Function (or Sink) is to do backend API calls.
>
> Although pipelining (processing multiple in-flight messages) is possible
> in current Pulsar Functions and Sinks, it often leads to complex and
> error-prone solutions, especially when there's a need to combine
> key-based ordered processing with retry and backoff implementations.
>
> The Reactive Pulsar Client provides an inbuilt solution for implementing
> pipelining. With its ReactiveMessagePipelineBuilder, we can configure
> concurrency levels with key-ordered processing support. This capability
> could potentially eliminate the need to use key-shared subscriptions to
> scale Pulsar processing. If a reactive Pulsar Function were primarily to
> serve as a router for API calls, we could adjust the concurrency level
> to hundreds or even thousands, provided the backend could handle the
> load.
>
> With a pluggable Pulsar Functions runtime, we could introduce new
> runtime types without the need for implementing each type in the
> upstream project. This strategy could likely lead to new opportunities
> for innovative ideas and contributions in this field.
>
> I am interested to know your thoughts on making the Pulsar Functions
> runtime pluggable so that we can add new runtime types.

I see that RuntimeFactory [1] is already customizable.
What can we do more ?
Are you talking about providing alternative implementations for
JavaInstanceRunnable [2] ?

Thanks or bringing up this topic
Enrico

[1] https://github.com/apache/pulsar/blob/82237d3684fe506bcb6426b3b23f413422e6e4fb/pulsar-functions/runtime/src/main/java/org/apache/pulsar/functions/runtime/RuntimeFactory.java#L35
[2] https://github.com/apache/pulsar/blob/f7c0b3c49c9ad8c28d0b00aa30d727850eb8bc04/pulsar-functions/instance/src/main/java/org/apache/pulsar/functions/instance/JavaInstanceRunnable.java#L116

>
> Best Regards,
>
> -Lari

Re: [DISCUSS] Pluggable Pulsar Functions runtime to support new runtimes

Posted by Baodi Shi <ba...@apache.org>.
Hi, Lari.

Thanks for your proposal. Sounds good to me.

Supporting function pluggable essentially decouples the scheduling layer
and the computing layer. This can not only extend the function runtime in
more languages but even use this feature to implement a new non-function
runtime plugin, such as the EventSouring development model.

Although pipelining (processing multiple in-flight messages) is possible
> in current Pulsar Functions and Sinks, it often leads to complex and
> error-prone solutions, especially when there's a need to combine
> key-based ordered processing with retry and backoff implementations.

At present, the scheduling logic of the function will not be related to the
key of the message, I understand that it belongs to the internal feature of
the runtime, and this can also be enhanced based on the current runtime? I
don't quite understand what this has to do with runtime pluginization. Can
you explain?


In addition, the Pulsar Functions have a missing piece in how functions are
> mapped to instances. It's not very efficient to even run each and every
> function as a separate deployable entity. The cost of each independent JVMs
> is high. It would be also better to have a model where where could be a
> group of functions that are provided by one instance and always run
> together. Having this option could bring down the cost and also improve the
> developer experience. The framework shouldn't require the developer that
> each individual function is deployed in a separate .jar file which gets run
> in a separate JVM.


This means that we need to enhance the function interface to that the user
tells the runtime layer which topic the function subscription cares about.
In addition, if we want to implement concurrent sequential processing based
on keys, we can also let the function register the key of interest. These
can be implemented in the current runtime, which has nothing to do with
plug-in runtime, right?


I wonder what are the future plans after plugging in the runtime? Do you
want to develop a new runtime that implements many of the benefits you
enumerate, or will you continue to enhance it on the original runtime
(Java).



Thanks,
Baodi Shi


On Jun 20, 2023 at 15:33:22, Lari Hotari <lh...@apache.org> wrote:

> Dear Pulsar Community Members,
>
> I would like to initiate a discussion on making the Pulsar Functions
> runtime "pluggable". In doing so, we can ensure that the addition of new
> runtime types becomes more straightforward.
>
> This use case will allow us to add support for Pulsar Functions based on
> various platforms such as:
>
> * Pulsar Client Reactive
> * Node.js / JavaScript
> * WebAssembly (WASM)
> * Spring Pulsar & Reactive Spring
>
> One of the weak points in the current Pulsar Functions runtime is the
> default handling of messages individually. Individual message processing
> can be slow and inefficient in cases where the main function of the
> Pulsar Function (or Sink) is to do backend API calls.
>
> Although pipelining (processing multiple in-flight messages) is possible
> in current Pulsar Functions and Sinks, it often leads to complex and
> error-prone solutions, especially when there's a need to combine
> key-based ordered processing with retry and backoff implementations.
>
> The Reactive Pulsar Client provides an inbuilt solution for implementing
> pipelining. With its ReactiveMessagePipelineBuilder, we can configure
> concurrency levels with key-ordered processing support. This capability
> could potentially eliminate the need to use key-shared subscriptions to
> scale Pulsar processing. If a reactive Pulsar Function were primarily to
> serve as a router for API calls, we could adjust the concurrency level
> to hundreds or even thousands, provided the backend could handle the
> load.
>
> With a pluggable Pulsar Functions runtime, we could introduce new
> runtime types without the need for implementing each type in the
> upstream project. This strategy could likely lead to new opportunities
> for innovative ideas and contributions in this field.
>
> I am interested to know your thoughts on making the Pulsar Functions
> runtime pluggable so that we can add new runtime types.
>
> Best Regards,
>
> -Lari
>

Re: [DISCUSS] Pluggable Pulsar Functions runtime to support new runtimes

Posted by Lari Hotari <lh...@apache.org>.
On 2023/06/21 07:21:31 Asaf Mesika wrote:
> Lari, would it be possible to explain in more detail the paint points
> you're describing?
 
Well the point of the pluggable Function runtime types is to support other technologies. Let's forget the reactive messaging solution for a moment.
With a pluggable solution, I mean having a solution in place where you could possibly add .nar files to some directory and add support for new runtime types by implementing some plugin specification. The current solution doesn't contain this property.
A pluggable solution would make it easier for contributing new runtime types.
Let's say if we would want to add support for these technologies:
* functions written in Node.js / JavaScript
* functions using WebAssembly (WASM), for example implemented in Rust that also compiles to WASM.

> You say processing messages individually is slow; hence, processing them in
> batches is better. I guess it's especially useful if you need to group a
> batch based on a key. What I don't understand is how the framework today
> limits you from using something like a reactive client which does the
> batching inside.

I didn't say anything about batches. It's about pipelining. That means that you have multiple messages "in flight". That is different than batching. The most well known example of pipelining is HTTP pipelining [1].
Pulsar Functions already supports async functions which are functions that have a method that returns a CompetableFuture type. To limit the amount of messages "in flight", the worker config includes a setting "maxPendingAsyncRequests" [2] which defaults to 1000. It is odd that the setting is at worker config level and not at the function level.
Reactive Streams is not about batching. One of the clear benefits over plain async programming is that there's a well defined way for handling backpressure. For any high scale system handling backpressure (== flow control) is one of the core concerns.

In this case, if there was a pluggable Pulsar Functions runtime, it would be possible to add a runtime type optimized for Reactive Pulsar. That could also enable using Spring Pulsar in Reactive mode with the rest of Reactive Spring. 

The current .nar plugin packaging is a mess. If you take a look of what goes inside a .nar file, it is a mess. There are classes that shouldn't be there. The .nar plugin creation is a very slow and inefficient. I can provide details if you are interested to know. 

With pluggable Pulsar Functions runtime, it would also be possible to create a cleaner packaging for JVM functions. Packaging for different ecosystems like Quarkus and Spring Boot could be optimized for those ecosystems and not the other way around where Pulsar's outdated .nar packaging is dictating the options.

In addition, the Pulsar Functions have a missing piece in how functions are mapped to instances. It's not very efficient to even run each and every function as a separate deployable entity. The cost of each independent JVMs is high. It would be also better to have a model where where could be a group of functions that are provided by one instance and always run together. Having this option could bring down the cost and also improve the developer experience. The framework shouldn't require the developer that each individual function is deployed in a separate .jar file which gets run in a separate JVM. 

So you asked if there is pain with Pulsar Functions. There definitely is. Instead of causing more fragmentation in the ecosystem with multiple pluggable infrastructure layers, we should make the core upstream offering better. 

I'd also like to see a deployment option for Pulsar Functions where you could choose to not deploy Pulsar Functions with pulsar-admin and instead package the functions in an application that you deploy in Kubernetes with helm or whatever way you choose to do that.
This could also be taken into account when designing the pluggable Pulsar Functions runtime. 

StreamNative's Function Mesh [3] takes a different approach to Pulsar Function life cycle management. That might be a good fit in many cases. 
However, we should have a way where Pulsar Functions could be deployed without any central management solution, as ordinary applications. 

Perhaps everyone is happy with the current way Pulsar Functions are. If everyone is already satisfied, things won't improve. Do we want to make Pulsar more popular and easier for our users? Do we care about supporting node.js / Javascript / Typescript or new languages like Rust? If we do, we better start thinking of adding that support. I would like to propose that we make adding new runtime types easy by making it "pluggable". That could mean multiple things and that's why we are having this discussion. I hope others could also chime in.

-Lari

[1] https://en.wikipedia.org/wiki/HTTP_pipelining
[2] https://github.com/apache/pulsar/blob/f7c0b3c49c9ad8c28d0b00aa30d727850eb8bc04/pulsar-functions/runtime/src/main/java/org/apache/pulsar/functions/worker/WorkerConfig.java#L725-L729
[3] https://github.com/streamnative/function-mesh

Re: [DISCUSS] Pluggable Pulsar Functions runtime to support new runtimes

Posted by Asaf Mesika <as...@gmail.com>.
Lari, would it be possible to explain in more detail the paint points
you're describing?

You say processing messages individually is slow; hence, processing them in
batches is better. I guess it's especially useful if you need to group a
batch based on a key. What I don't understand is how the framework today
limits you from using something like a reactive client which does the
batching inside.

On Tue, Jun 20, 2023 at 10:33 AM Lari Hotari <lh...@apache.org> wrote:

> Dear Pulsar Community Members,
>
> I would like to initiate a discussion on making the Pulsar Functions
> runtime "pluggable". In doing so, we can ensure that the addition of new
> runtime types becomes more straightforward.
>
> This use case will allow us to add support for Pulsar Functions based on
> various platforms such as:
>
> * Pulsar Client Reactive
> * Node.js / JavaScript
> * WebAssembly (WASM)
> * Spring Pulsar & Reactive Spring
>
> One of the weak points in the current Pulsar Functions runtime is the
> default handling of messages individually. Individual message processing
> can be slow and inefficient in cases where the main function of the
> Pulsar Function (or Sink) is to do backend API calls.
>
> Although pipelining (processing multiple in-flight messages) is possible
> in current Pulsar Functions and Sinks, it often leads to complex and
> error-prone solutions, especially when there's a need to combine
> key-based ordered processing with retry and backoff implementations.
>
> The Reactive Pulsar Client provides an inbuilt solution for implementing
> pipelining. With its ReactiveMessagePipelineBuilder, we can configure
> concurrency levels with key-ordered processing support. This capability
> could potentially eliminate the need to use key-shared subscriptions to
> scale Pulsar processing. If a reactive Pulsar Function were primarily to
> serve as a router for API calls, we could adjust the concurrency level
> to hundreds or even thousands, provided the backend could handle the
> load.
>
> With a pluggable Pulsar Functions runtime, we could introduce new
> runtime types without the need for implementing each type in the
> upstream project. This strategy could likely lead to new opportunities
> for innovative ideas and contributions in this field.
>
> I am interested to know your thoughts on making the Pulsar Functions
> runtime pluggable so that we can add new runtime types.
>
> Best Regards,
>
> -Lari
>