Posted to dev@pulsar.apache.org by Matteo Merli <ma...@gmail.com> on 2024/03/05 23:02:37 UTC

[DISCUSS] Retire pulsar-all Docker image and spin-off Python Functions runtime

The `pulsar-all` Docker image is a convenience image built on top of the
base `pulsar` image; it includes all the Pulsar IO connectors as well as
the tiered storage offloaders.

The Dockerfile for `pulsar-all` can be found here:
https://github.com/apache/pulsar/blob/master/docker/pulsar-all/Dockerfile

The resulting image is very big:

```
apachepulsar/pulsar-all                   3.1.2     3d1aa250bf6c   2 months ago        3.68GB
```

This poses a challenge in many ways:
 1. Our CI pipeline needs to build these images and cache them across
different stages of the pipeline.
 2. It takes release managers a lot of time to build and push these
images to Docker Hub.
 3. Users running this image in production see very long download times,
which can affect the availability of the system (e.g., a higher chance of
a second broker crashing while a restart is taking a very long time).
 4. It's very unlikely that any one user will require all the connectors;
most likely, they would use just 2-3 of them.

The problem is that `pulsar-all` was introduced at a time when there were
~3 Pulsar IO connectors. Right now we have 35 connectors, with a total
size of 1.9 GB.

The proposal here is to drop this image altogether. Users will be able to
construct their own targeted images in a very simple way:

```
FROM apachepulsar/pulsar:latest
RUN mkdir -p connectors && \
    cd connectors && \
    wget https://downloads.apache.org/pulsar/pulsar-3.2.0/connectors/pulsar-io-elastic-search-3.2.0.nar
```
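
A slightly more general sketch of the same approach, for users who need
more than one connector. It assumes that `wget` is available in the base
image (as the snippet above does) and that every connector follows the
`pulsar-io-<name>-<version>.nar` naming seen in the URL above. The
`PULSAR_VERSION` and `CONNECTORS` build arguments, and the `kafka` entry,
are illustrative choices, not part of the proposal:

```
# Pin the version once; the same value selects the base image tag and the
# connector download path.
ARG PULSAR_VERSION=3.2.0
FROM apachepulsar/pulsar:${PULSAR_VERSION}

# An ARG declared before FROM is not visible after it, so redeclare it here.
ARG PULSAR_VERSION
# Space-separated list of connector names (hypothetical default).
ARG CONNECTORS="elastic-search kafka"

# Download each requested connector into ./connectors and fail the build
# if any download fails.
RUN mkdir -p connectors && \
    cd connectors && \
    for name in ${CONNECTORS}; do \
        wget "https://downloads.apache.org/pulsar/pulsar-${PULSAR_VERSION}/connectors/pulsar-io-${name}-${PULSAR_VERSION}.nar" || exit 1; \
    done
```

With this layout, something like
`docker build --build-arg CONNECTORS="elastic-search" -t my-pulsar .`
would produce an image containing only the selected connector.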



### Pulsar Functions Python Runtime

To support the Python functions runtime, we have been loading the Pulsar
base image with quite a few dependencies, from the `pulsar-client` Python
SDK to gRPC, which is quite a heavy package with many transitive
dependencies.

Given that the vast majority of users run the `pulsar` base image for
brokers and not for Python functions, it would make sense to split the
Python support into a different image, like `pulsar-functions-python`,
which extends the base image and adds all the needed Python dependencies.

This way it will be very easy for users to select the appropriate image,
and we wouldn't be shipping a large set of Python dependencies to users
who don't need them.
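
A rough sketch of what such a `pulsar-functions-python` image could look
like, only to illustrate the layering. It assumes a future Debian/Ubuntu
based `pulsar` base image that no longer ships Python, that the base image
runs as a non-root user (UID 10000 is assumed here), and that a
system-wide `pip3 install` is allowed in that image. The package list is
illustrative; the real set of dependencies would have to come from the
functions runtime itself:

```
# Hypothetical slimmed-down base image without Python; the tag is a placeholder.
FROM apachepulsar/pulsar:latest

# Installing OS packages requires root.
USER root

# Install the Python interpreter and pip, then drop the apt package lists
# to keep the layer small.
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Python client SDK plus gRPC, the dependencies named in this proposal.
RUN pip3 install --no-cache-dir pulsar-client grpcio

# Switch back to the unprivileged user (assumed UID).
USER 10000
```

On an Alpine-based base image the package installation step would use
`apk` instead of `apt-get`.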


What are people's opinions with respect to this?

Matteo

--
Matteo Merli
<ma...@gmail.com>

Re: [DISCUSS] Retire pulsar-all Docker image and spin-off Python Functions runtime

Posted by Matteo Merli <ma...@gmail.com>.
Using the Alpine image (PIP-324, in progress) and removing Python, I was
able to get the `pulsar` base image down to ~350 MB.

There could be additional space savings by removing unused JVM modules from
the image.


--
Matteo Merli
<ma...@gmail.com>


On Thu, Mar 7, 2024 at 10:09 AM Girish Sharma <sc...@gmail.com>
wrote:

> +1
> We have recently been struggling with building a Pulsar image in house
> (lots of app sec constraints etc.). A much reduced, minimal image would
> certainly help there.
>
> Any estimates on the size reduction in the base pulsar image after
> removing the Python-related content? Is there scope to slim down the base
> pulsar image further by removing anything not essential to running a
> broker (or a bookie or zk)?
>
> Regards
>

Re: [DISCUSS] Retire pulsar-all Docker image and spin-off Python Functions runtime

Posted by Girish Sharma <sc...@gmail.com>.
+1
We have recently been struggling with building a Pulsar image in house
(lots of app sec constraints etc.). A much reduced, minimal image would
certainly help there.

Any estimates on the size reduction in the base pulsar image after
removing the Python-related content? Is there scope to slim down the base
pulsar image further by removing anything not essential to running a
broker (or a bookie or zk)?

Regards


-- 
Girish Sharma

Re: [DISCUSS] Retire pulsar-all Docker image and spin-off Python Functions runtime

Posted by Neng Lu <fr...@gmail.com>.
+1

This can reduce the image size significantly and thus improve the
efficiency and reduce the cost.


-- 
Best Regards,
Neng

Re: [DISCUSS] Retire pulsar-all Docker image and spin-off Python Functions runtime

Posted by Enrico Olivelli <eo...@gmail.com>.
+1

Great idea

Enrico


Re: [DISCUSS] Retire pulsar-all Docker image and spin-off Python Functions runtime

Posted by Zixuan Liu <no...@gmail.com>.
+1

This is a good idea. We must then provide documentation on building your
own connector image and Python functions runtime image.

Thanks,
Zixuan


Re: [DISCUSS] Retire pulsar-all Docker image and spin-off Python Functions runtime

Posted by Nicolò Boschi <bo...@gmail.com>.
+1, great ideas

Let's make sure there's a dedicated section in the docs on how to "migrate"
from pulsar-all:3.2.0 to "build your own -all image"

Nicolò Boschi



Re: [DISCUSS] Retire pulsar-all Docker image and spin-off Python Functions runtime

Posted by Matteo Merli <ma...@gmail.com>.
I was proposing `pulsar-functions-python`, though I'm open to any other
name
--
Matteo Merli
<ma...@gmail.com>


On Tue, Mar 5, 2024 at 6:43 PM Dave Fisher <wa...@apache.org> wrote:

> What would be the name of the image that contains the functions runtime?
>
> Best,
> Dave
>

Re: [DISCUSS] Retire pulsar-all Docker image and spin-off Python Functions runtime

Posted by Dave Fisher <wa...@apache.org>.
What would be the name of the image that contains the functions runtime?

Best,
Dave



Re: [DISCUSS] Retire pulsar-all Docker image and spin-off Python Functions runtime

Posted by Lari Hotari <lh...@apache.org>.
These are very welcome changes! Let's go ahead asap.

-Lari


Re: [DISCUSS] Retire pulsar-all Docker image and spin-off Python Functions runtime

Posted by Ran Gao <rg...@apache.org>.
+1

Most users don't need all the built-in connectors; the image is too bloated.

Best Regards,
Ran Gao
