Posted to dev@arrow.apache.org by Weston Steimel <we...@gmail.com> on 2023/01/02 07:17:51 UTC

Re: [DISCUSS] Python Wheel Size

Apologies for being very late to this discussion, but if anyone is still
interested in this work, I attempted something like this quite a while ago
at https://github.com/westonsteimel/pyarrow-parquet.  Eventually I gave up
on that approach (build times, among other things) and instead moved to
taking the published wheels and stripping them down to only what I wanted
at https://github.com/westonsteimel/pyarrow-slim.  I haven't updated that
in quite some time, but perhaps it can serve as a useful starting point.
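
For anyone curious what "stripping down a published wheel" can look like
in general, here is a minimal sketch. It is not the code from either
repository above; the function name strip_wheel and its parameters are
hypothetical. It relies only on the fact that a wheel is a zip archive
whose .dist-info/RECORD file lists every member with a sha256 hash and
size, so RECORD has to be regenerated after files are dropped.

```python
import base64
import csv
import hashlib
import io
import zipfile


def strip_wheel(src, dst, drop_prefixes):
    """Copy wheel `src` to `dst` without members under `drop_prefixes`.

    Hypothetical helper for illustration. Returns the member paths kept,
    in the order they are listed in the regenerated RECORD.
    """
    drop_prefixes = tuple(drop_prefixes)
    rows = []  # new RECORD rows: (path, "sha256=<digest>", size)
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        record_name = next(
            n for n in zin.namelist() if n.endswith(".dist-info/RECORD")
        )
        for info in zin.infolist():
            name = info.filename
            # Skip directory entries, the old RECORD, and unwanted files.
            if info.is_dir() or name == record_name:
                continue
            if name.startswith(drop_prefixes):
                continue
            data = zin.read(name)
            zout.writestr(info, data, compress_type=zipfile.ZIP_DEFLATED)
            # RECORD uses urlsafe base64 sha256 digests without padding.
            digest = base64.urlsafe_b64encode(
                hashlib.sha256(data).digest()
            ).rstrip(b"=").decode("ascii")
            rows.append((name, "sha256=" + digest, str(len(data))))
        # RECORD lists itself with an empty hash and size.
        rows.append((record_name, "", ""))
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerows(rows)
        zout.writestr(record_name, buf.getvalue())
    return [row[0] for row in rows]
```

A call might pass something like ["pyarrow/tests/"] as drop_prefixes to
remove the bundled unit tests; that prefix is an assumption about the
wheel layout and should be checked against the actual archive. Dropping
shared libraries this way is riskier, since any remaining extension
module that links against them will fail to import.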

Thanks,
--Weston Steimel

On Mon, 10 Oct 2022, 13:08 Wes McKinney, <we...@gmail.com> wrote:

> We've discussed this in the past, I think. In addition to having many
> optional components enabled, the pyarrow wheel also includes the unit
> tests directory, which keeps growing in size. I think if we made a
> pyarrow-slim wheel with support only for core Arrow (IPC, etc.) and
> Parquet file reading, it might be possible to trim the size by a
> significant percentage.
>
> Rusty -- if you would like to push this forward I would suggest
> creating an alternative wheel build script to the one that we use,
> modifying flags / adding other customizations (e.g. trimming unit
> tests) to produce a wheel that we could build and possibly upload as
> "pyarrow-slim" on PyPI.
>
> On Mon, Oct 3, 2022 at 8:55 AM Antoine Pitrou <an...@python.org> wrote:
> >
> >
> > Hi Rusty,
> >
> > Le 02/10/2022 à 22:51, Rusty Conover a écrit :
> > > Hi Arrow Team,
> > >
> > > I'm using Apache Arrow with AWS Lambda Functions.
> > >
> > > The primary motivation is AWS Athena's user-defined functions[1]. Those
> > > functions process and return Arrow IPC segments.
> > >
> > > * The published Python wheels for Apache Arrow include almost every
> > > feature of Arrow. (Gandiva, Plasma, Flight)
> >
> > Gandiva isn't compiled in the Python wheels. Plasma is reasonably small
> > (but is also being deprecated soon). Flight is more sizable. However,
> > most of the size seems to be in Arrow itself and Parquet. A large part
> > of the size is probably attributable to the Arrow compute engine and
> > functions, and also perhaps to filesystem implementations such as S3 and
> > GCS (due to the large third-party dependencies that they bundle).
> >
> > > Would it be possible to create a new Python package (i.e.,
> > > "pyarrow-slim") that would disable some of the functionality but result
> > > in smaller python wheels?
> >
> > Perhaps. The first step would be to allow disabling more components in
> > PyArrow, though. Otherwise I'm afraid the size reduction wouldn't be
> > terrific.
> >
> > Regards
> >
> > Antoine.
>