Posted to dev@arrow.apache.org by Sascha Steinbiss <sa...@debian.org> on 2021/06/11 09:26:30 UTC

Debian packaging for Arrow

Hi Arrow community!

I am a Debian Developer looking to package Arrow officially in Debian as
a dependency for a specific tool I want to get into Debian as well.

I do have a working package based on the JFrog packaging groundwork [0]
but had to make various changes mostly to avoid downloading dependencies
from the Internet (which is not allowed during the Debian build
process). So, mostly setting -DARROW_DEPENDENCY_SOURCE=SYSTEM and tuning
enabled/disabled features based on what we have and what we don't.
Result is at [1].
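
A minimal sketch of such a configuration follows. It only assembles the
flags and prints the resulting command instead of running it.
ARROW_DEPENDENCY_SOURCE=SYSTEM is the setting named above; ARROW_ORC=OFF
and ARROW_S3=OFF are assumed here to be the switches for the two features
that cannot be enabled yet, and the cpp/ and build/ directories are
placeholders. The debian/rules in [1] may well pass more or different
options.

```shell
#!/bin/sh
# Sketch only: print a CMake invocation that resolves dependencies from
# system packages instead of downloading them at build time.

arrow_cmake_flags() {
  # Flag named in the mail: take every dependency from the system.
  echo "-DARROW_DEPENDENCY_SOURCE=SYSTEM"
  # Assumed switches for features whose libraries are not packaged yet.
  echo "-DARROW_ORC=OFF"
  echo "-DARROW_S3=OFF"
}

# Dry run: show the command line rather than executing cmake.
# Source and build directories are placeholders.
echo "cmake -S cpp -B build $(arrow_cmake_flags | tr '\n' ' ')"
```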

It looks like I can build all packages built by the JFrog packaging with
no problems (at least for amd64). Build log attached. The only exceptions
here are ORC and S3 support, which are missing because the ORC library
[2] and the AWS C++ SDK [3] are not packaged yet. But apart from that it
looks like everything works.

Just so you know, nothing has been officially uploaded yet. The package
is still in preparation and only used internally within my organization
so far.

Being quite far in the packaging process, I have some questions:

1.) Would somebody from the upstream team be interested in collaborating
to keep Arrow maintained in Debian? I would be able to review updates
and sponsor uploads.

2.) One quite scary thing left is documenting all copyright and license
occurrences in the codebase. It looks like there is a fair bit of
embedded code coming from various sources and with varying levels of
modification. The debian/copyright file in the JFrog packaging only
contains a number of TODOs so I guess this is still up to me to finish
before I can think of doing an upload.
Is the LICENSE.txt in the Arrow source root directory complete, and does it
list _all_ third-party licenses and copyright holders in the release tarball?
If so, I could use it as a template and just reformat it as required by
Debian. That would be nice to know, otherwise that would mean a lot of
digging and probably still missing something. Missed license or
copyright holder mentions are the most common reason why new packages
are rejected during the initial, mandatory manual review for new
packages, BTW, so I'd like to avoid unnecessary review iterations ;)

Thanks!

Best regards
Sascha

[0]
https://apache.jfrog.io/artifactory/arrow/debian/pool/bullseye/main/a/apache-arrow/apache-arrow_4.0.0-1.debian.tar.xz
[1] https://salsa.debian.org/satta/arrow/-/tree/master/debian
[2] https://github.com/apache/orc
[3] https://github.com/aws/aws-sdk-cpp


Re: Debian packaging for Arrow

Posted by Sascha Steinbiss <sa...@debian.org>.
Hi Mauricio,

> This is a great idea, thanks a lot!
> Will you use a PPA? Let me know so I can test these binaries.

Well, Debian does not have any "formal" PPA hosting, that's more of an
Ubuntu thing ;)

For testing, I'd be happy to provide the binary packages (built for
latest Debian, aka bullseye), sure.

Here's a repo to add to your sources.list:

  deb http://deb.sacchar.in/ bullseye main

I have attached the public key for this repo to this email if you want
to give it a try. The repo carries the following packages:

$ aptly repo search arrow
apache-arrow-apt-source_4.0.0-1_all
gir1.2-arrow-1.0_4.0.0-1_amd64
gir1.2-arrow-cuda-1.0_4.0.0-1_amd64
gir1.2-arrow-dataset-1.0_4.0.0-1_amd64
gir1.2-gandiva-1.0_4.0.0-1_amd64
gir1.2-parquet-1.0_4.0.0-1_amd64
gir1.2-plasma-1.0_4.0.0-1_amd64
libarrow-cuda-dev_4.0.0-1_amd64
libarrow-cuda-glib-dev_4.0.0-1_amd64
libarrow-cuda-glib400_4.0.0-1_amd64
libarrow-cuda-glib400-dbgsym_4.0.0-1_amd64
libarrow-cuda400_4.0.0-1_amd64
libarrow-cuda400-dbgsym_4.0.0-1_amd64
libarrow-dataset-dev_4.0.0-1_amd64
libarrow-dataset-glib-dev_4.0.0-1_amd64
libarrow-dataset-glib-doc_4.0.0-1_amd64
libarrow-dataset-glib400_4.0.0-1_amd64
libarrow-dataset-glib400-dbgsym_4.0.0-1_amd64
libarrow-dataset400_4.0.0-1_amd64
libarrow-dataset400-dbgsym_4.0.0-1_amd64
libarrow-dev_4.0.0-1_amd64
libarrow-flight-dev_4.0.0-1_amd64
libarrow-flight400_4.0.0-1_amd64
libarrow-flight400-dbgsym_4.0.0-1_amd64
libarrow-glib-dev_4.0.0-1_amd64
libarrow-glib-doc_4.0.0-1_all
libarrow-glib400_4.0.0-1_amd64
libarrow-glib400-dbgsym_4.0.0-1_amd64
libarrow-python-dev_4.0.0-1_amd64
libarrow-python-flight-dev_4.0.0-1_amd64
libarrow-python-flight400_4.0.0-1_amd64
libarrow-python-flight400-dbgsym_4.0.0-1_amd64
libarrow-python400_4.0.0-1_amd64
libarrow-python400-dbgsym_4.0.0-1_amd64
libarrow400_4.0.0-1_amd64
libarrow400-dbgsym_4.0.0-1_amd64
libgandiva-dev_4.0.0-1_amd64
libgandiva-glib-dev_4.0.0-1_amd64
libgandiva-glib-doc_4.0.0-1_amd64
libgandiva-glib400_4.0.0-1_amd64
libgandiva-glib400-dbgsym_4.0.0-1_amd64
libgandiva400_4.0.0-1_amd64
libgandiva400-dbgsym_4.0.0-1_amd64
libparquet-dev_4.0.0-1_amd64
libparquet-glib-dev_4.0.0-1_amd64
libparquet-glib-doc_4.0.0-1_all
libparquet-glib400_4.0.0-1_amd64
libparquet-glib400-dbgsym_4.0.0-1_amd64
libparquet400_4.0.0-1_amd64
libparquet400-dbgsym_4.0.0-1_amd64
libplasma-dev_4.0.0-1_amd64
libplasma-glib-dev_4.0.0-1_amd64
libplasma-glib-doc_4.0.0-1_amd64
libplasma-glib400_4.0.0-1_amd64
libplasma-glib400-dbgsym_4.0.0-1_amd64
libplasma400_4.0.0-1_amd64
libplasma400-dbgsym_4.0.0-1_amd64
plasma-store-server_4.0.0-1_amd64
plasma-store-server-dbgsym_4.0.0-1_amd64
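
For anyone who wants to try the test repository, the sources.list entry
above could be generated along the following lines. This is a sketch under
assumptions: the keyring path and the file name mentioned in the comments
are hypothetical, and the attached public key would have to be saved at
that path first. The signed-by option limits the key to this one
repository.

```shell
#!/bin/sh
# Sketch only: generate an APT source entry for the test repository.
# The keyring path and file names below are hypothetical.

REPO_URL="http://deb.sacchar.in/"
SUITE="bullseye"
COMPONENT="main"
KEYRING="/usr/share/keyrings/arrow-test-archive-keyring.gpg"  # hypothetical path

apt_source_line() {
  # signed-by ties the repository to exactly this key.
  echo "deb [signed-by=$KEYRING] $REPO_URL $SUITE $COMPONENT"
}

# Print the entry. As root it could be written to a file such as
# /etc/apt/sources.list.d/arrow-test.list (hypothetical name),
# followed by 'apt update'.
apt_source_line
```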


What I am proposing in the long run is to come up with a source package
that will eventually be built by Debian's build servers and then
provided as part of the official Debian distribution, without having to
set up a third-party APT source like a PPA.


Cheers
Sascha

Re: Debian packaging for Arrow

Posted by Mauricio Vargas <ma...@uc.cl.INVALID>.
This is a great idea, thanks a lot!
Will you use a PPA? Let me know so I can test these binaries.


Re: Debian packaging for Arrow

Posted by Sutou Kouhei <ko...@clear-code.com>.
Hi,

In <bd...@debian.org>
  "Re: Debian packaging for Arrow" on Sat, 12 Jun 2021 20:25:17 +0200,
  Sascha Steinbiss <sa...@debian.org> wrote:

> I checked but could not find any: searching in the Debian package index
> lists nothing related to Apache Arrow [0].

Yes. I know that Apache Arrow packages don't exist in the
official Debian repository.

Sorry. I should have said that I hope that we have Apache
Arrow packages in the official Debian repository.

>> Could you create a "diff -ru" output between [0] and [1]?
> 
> Sure, attached. I think some changed lines can also be reverted to your
> original state, I may have oversimplified some of them during hacking.

Thanks. It seems that debian/ in apache/arrow has some
missing Build-Depends/Depends.

> Also, when you say that I upload "your" Debian package, do you mean the
> .debs? Because for something to get accepted into Debian, we need to
> upload only _source_ packages, not binary packages. Each package must be
> built on Debian servers from source.

No. I wanted to ask whether you can reuse apache/arrow's

https://apache.jfrog.io/artifactory/arrow/debian/pool/bullseye/main/a/apache-arrow/apache-arrow_4.0.1-1.debian.tar.xz

to update
https://salsa.debian.org/satta/arrow/-/tree/master/debian .

> What do you think about the following, more established approach:

Thanks for the suggestion.

I don't want to mirror the apache/arrow code base to
https://salsa.debian.org/satta/arrow because it increases
the maintenance cost. And I don't want to change the upstream of
debian/ from apache/arrow to
https://salsa.debian.org/satta/arrow.

We provide .deb packages not only for Debian GNU/Linux but
also for Ubuntu. We may need some different versions of debian/ to
support them. We generate debian/ dynamically instead of
keeping several copies. For example,

https://github.com/apache/arrow/blob/master/dev/tasks/linux-packages/apache-arrow/debian/control.in#L15

is used to share debian/control between platforms that have
libc-ares-dev and platforms that don't.
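
The idea of generating debian/control from a template can be sketched as
follows. Note that this is an illustration only: the @CARES_BUILD_DEP@
placeholder and the render_control function are hypothetical and are not
the syntax or tooling that control.in in apache/arrow actually uses.

```shell
#!/bin/sh
# Sketch only: render a debian/control fragment from a template so that one
# source serves platforms with and without libc-ares-dev.
# The @CARES_BUILD_DEP@ placeholder is hypothetical, not Arrow's real syntax.

render_control() {
  # $1 = "yes" if the target platform ships libc-ares-dev, anything else if not.
  if [ "$1" = "yes" ]; then
    # Replace the placeholder with the real build dependency.
    sed 's/@CARES_BUILD_DEP@/libc-ares-dev,/'
  else
    # Drop the placeholder line entirely.
    sed '/@CARES_BUILD_DEP@/d'
  fi
}

template='Build-Depends:
  cmake,
  @CARES_BUILD_DEP@
  pkg-config'

printf '%s\n' "$template" | render_control yes
printf '%s\n' "$template" | render_control no
```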

And we have nightly build infrastructure against the latest
master:

https://lists.apache.org/list.html?builds@arrow.apache.org
https://github.com/apache/arrow/blob/master/dev/tasks/linux-packages/github.linux.amd64.yml

This is useful to fix a packaging and/or code base problem
as soon as possible. If we choose the following approach,

> 0) You clone the salsa repository [2] locally and keep it in sync with
> the version on salsa.

then I think that we can't use this infrastructure.

> 2) You import the new tarball into your local packaging repo with 'gbp
> import-orig', update debian/changelog to reflect the new version, update
> debian/copyright if there are new files, refresh patches, etc.

Can we copy the contents of

https://apache.jfrog.io/artifactory/arrow/debian/pool/bullseye/main/a/apache-arrow/apache-arrow_4.0.1-1.debian.tar.xz

into https://salsa.debian.org/satta/arrow to do
this?

> 3) You build a new package with git-buildpackage in a local chroot (e.g.
> with sbuild or cowbuilder, ...) to make sure that everything builds
> correctly.

Can we use GitLab CI for this instead of a local
machine?


> What do you think? I know that this would mean moving the Debian
> packaging workflow outside of your Arrow repository, but I think it
> would make life easier in the long run.

I have some concerns, which I described above.

> Another option would be that you just send me a source package for each
> version you'd like to see uploaded (*.orig.tar.gz, *.dsc and
> *.debian.tar.xz) and I would use that for review and upload. But then
> any change that I might want to do would need to eventually be fed back
> into your upstream repository, and I think we can do without the extra
> round-trip if we keep everything Debian-related in one place.

I had this approach in mind in my first e-mail, but I couldn't
fully describe it. Sorry.


Thanks,
-- 
kou

Re: Debian packaging for Arrow

Posted by Sascha Steinbiss <sa...@debian.org>.
Hi Sutou,

cool, thanks for your comments! Let's see if I can elaborate a bit more
on my ideas.

> I'm the original author of the Debian packages for Debian.
> I'm positive that Apache Arrow package exists in the
> official Debian repository.

I checked but could not find any: searching in the Debian package index
lists nothing related to Apache Arrow [0].

Also, I had filed an RFP (request for packaging) a long time ago [1],
and if there had been such a package, I am sure the maintainer would
have closed the RFP and directed me towards the existing package ;)

>> I do have a working package based on the JFrog packaging groundwork [0]
>> but had to make various changes mostly to avoid downloading dependencies
>> from the Internet (which is not allowed during the Debian build
>> process). So, mostly setting -DARROW_DEPENDENCY_SOURCE=SYSTEM and tuning
>> enabled/disabled features based on what we have and what we don't.
>> Result is at [1].
> 
> Could you create a "diff -ru" output between [0] and [1]?

Sure, attached. I think some changed lines can also be reverted to your
original state, I may have oversimplified some of them during hacking.
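
For reference, a comparison of this kind can be produced roughly as shown
below. This is a sketch: the two throwaway directories stand in for the
unpacked debian/ trees from [0] and [1], and their contents are made up
for the demonstration.

```shell
#!/bin/sh
# Sketch only: produce a recursive unified diff between two debian/ trees.

packaging_diff() {
  # $1 = upstream debian/ directory, $2 = modified debian/ directory.
  # diff exits with 1 when the trees differ, which is expected here,
  # so only exit codes above 1 (real errors) count as failures.
  status=0
  diff -ru "$1" "$2" || status=$?
  [ "$status" -le 1 ]
}

# Demonstration with two throwaway trees (made-up contents).
work=$(mktemp -d)
mkdir -p "$work/upstream/debian" "$work/salsa/debian"
echo "original rules" > "$work/upstream/debian/rules"
echo "modified rules" > "$work/salsa/debian/rules"

packaging_diff "$work/upstream/debian" "$work/salsa/debian" > "$work/debian.diff"
cat "$work/debian.diff"
```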

dh_auto_test is currently disabled because the rules target downloads
stuff. Maybe we want to package that as well to avoid 'git clone' there.

>> The only exception here are ORC and S3 support, which are
>> missing because the ORC library [2] and the AWS C++ SDK
>> [3] are not packaged yet.
> 
> Do you have a plan to package them? If they exist in the
> official Debian repository, we can use them.

I don't really have such a plan, sorry. They have in turn numerous
dependencies which would also need to be packaged separately, and I do
not need them myself. It would be easy to enable support for these
features as soon as _somebody_ packages them eventually. Which I am
pretty confident will happen, as I guess AWS is not going to go away soon ;)

>> 1.) Would somebody from the upstream team be interested in collaborating
>> to keep Arrow maintained in Debian? I would be able to review updates
>> and sponsor uploads.
> 
> I'm interested in it. How about the following way?
> 
>   1. You open pull requests for each of your improvements
>      to https://github.com/apache/arrow/ .
> 
>   2. We mention you on GitHub when we open a pull request
>      that is related to Debian packages such as
>      https://github.com/apache/arrow/pull/10514 .
> 
>   3. You upload our Debian package to the official Debian
>      repository when we release a new version.
>      You can notice a new release on this mailing list.

Interesting -- that's not how it usually works. Debian packaging code is
not expected to live within the upstream code repository but within a
dedicated packaging repository (see [2] as an example) which contains
the upstream code (version-tracked in a separate branch), the debian
directory and an additional pristine-tar branch to produce byte-identical
replicas of the original upstream tarball. Most currently popular and
reliable Debian development tooling (such as git-buildpackage)
implicitly expects and requires this layout. The packaging repo is
typically also supposed (but not required) to live on salsa.debian.org,
the official Debian development GitLab. But usually most upstream
projects do not want to have these Debian-specific branches cluttering
their repo space.

Also, when you say that I upload "your" Debian package, do you mean the
.debs? Because for something to get accepted into Debian, we need to
upload only _source_ packages, not binary packages. Each package must be
built on Debian servers from source.

So... No offense, but I don't think merging my packaging code into yours
is the best idea.
What do you think about the following, more established approach:

0) You clone the salsa repository [2] locally and keep it in sync with
the version on salsa.

1) You release a new version via GitHub. That means there will be a new
release tarball to download via uscan.

2) You import the new tarball into your local packaging repo with 'gbp
import-orig', update debian/changelog to reflect the new version, update
debian/copyright if there are new files, refresh patches, etc.

3) You build a new package with git-buildpackage in a local chroot (e.g.
with sbuild or cowbuilder, ...) to make sure that everything builds
correctly.

4) You push your changeset to the salsa repo, tag a Debian version and
ping me to review the packaging. I will then build a source package,
sign it and upload it to be built on Debian's build farm for all platforms.

That is the workflow for releasing a new version; it would of course be
similar for other updates (bugfixes in the packaging, etc). I would make
sure you get all the necessary permissions to work on the salsa repository.
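
Steps 0) to 4) above can be sketched as a dry run that prints each command
rather than executing it. The assumptions here: the version number is only
an example, cowbuilder (via --git-pbuilder) is picked as the chroot
builder although sbuild would do equally well, and the remote is called
origin. Setting DRY_RUN=0 would execute the commands for real, which
requires git-buildpackage and devscripts to be installed.

```shell
#!/bin/sh
# Sketch only: the release workflow from steps 0) to 4) as a dry run.
# With DRY_RUN=1 (the default) commands are printed, not executed.

DRY_RUN="${DRY_RUN:-1}"
NEW_VERSION="${NEW_VERSION:-4.0.1}"   # example upstream version

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

release_steps() {
  # 0) Clone the packaging repository [2] (done once).
  run gbp clone https://salsa.debian.org/satta/arrow
  run cd arrow
  # 1) + 2) Fetch the new upstream tarball via uscan and import it.
  run gbp import-orig --uscan --pristine-tar
  # 2) Record the new version in debian/changelog.
  run dch -v "${NEW_VERSION}-1" "New upstream release."
  # 3) Build in a clean chroot (cowbuilder via git-pbuilder here).
  run gbp buildpackage --git-pbuilder
  # 4) Tag the Debian version and push everything for review.
  run gbp buildpackage --git-tag-only
  run git push origin --all
  run git push origin --tags
}

release_steps
```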

What do you think? I know that this would mean moving the Debian
packaging workflow outside of your Arrow repository, but I think it
would make life easier in the long run.

Another option would be that you just send me a source package for each
version you'd like to see uploaded (*.orig.tar.gz, *.dsc and
*.debian.tar.xz) and I would use that for review and upload. But then
any change that I might want to do would need to eventually be fed back
into your upstream repository, and I think we can do without the extra
round-trip if we keep everything Debian-related in one place.

[...]
>> Is the LICENSE.txt in the Arrow source root directory complete and lists
>> _all_ third-party licenses and copyright holders in the release tarball?
> 
> No. Most of them are covered but some of them only exist in
> source code such as
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/mman.h .

Okay, looks like I'd have to actually look through everything, gathering
and documenting licenses. Might take a while :D
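
A first pass over the tree could look roughly like the following sketch,
which lists every file that carries its own copyright notice. It is a
plain grep heuristic run on a throwaway tree with made-up file contents,
not a replacement for manual review; a dedicated tool such as Debian's
licensecheck would likely give better results.

```shell
#!/bin/sh
# Sketch only: list files that carry their own copyright notice, as a first
# pass for filling in debian/copyright.

files_with_copyright() {
  # $1 = source tree. -r recurse, -I skip binary files, -i ignore case,
  # -l print only the names of matching files.
  grep -rIil 'copyright' "$1" | sort
}

# Demonstration on a throwaway tree (made-up contents).
tree=$(mktemp -d)
mkdir -p "$tree/cpp/src"
printf '// Copyright (c) Example Author\n' > "$tree/cpp/src/embedded.h"
printf '// no notice here\n' > "$tree/cpp/src/plain.h"

files_with_copyright "$tree"
```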

Thanks
Sascha


[0]
https://packages.debian.org/search?suite=default&section=all&arch=any&searchon=names&keywords=arrow
[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021
[2] https://salsa.debian.org/satta/arrow

Re: Debian packaging for Arrow

Posted by Sutou Kouhei <ko...@clear-code.com>.
Hi,

I'm the original author of the Debian packages for Debian.
I'm positive that Apache Arrow package exists in the
official Debian repository.

> I do have a working package based on the JFrog packaging groundwork [0]
> but had to make various changes mostly to avoid downloading dependencies
> from the Internet (which is not allowed during the Debian build
> process). So, mostly setting -DARROW_DEPENDENCY_SOURCE=SYSTEM and tuning
> enabled/disabled features based on what we have and what we don't.
> Result is at [1].

Could you create a "diff -ru" output between [0] and [1]?

> The only exception here are ORC and S3 support, which are
> missing because the ORC library [2] and the AWS C++ SDK
> [3] are not packaged yet.

Do you have a plan to package them? If they exist in the
official Debian repository, we can use them.

> 1.) Would somebody from the upstream team be interested in collaborating
> to keep Arrow maintained in Debian? I would be able to review updates
> and sponsor uploads.

I'm interested in it. How about the following way?

  1. You open pull requests for each of your improvements
     to https://github.com/apache/arrow/ .

  2. We mention you on GitHub when we open a pull request
     that is related to Debian packages such as
     https://github.com/apache/arrow/pull/10514 .

  3. You upload our Debian package to the official Debian
     repository when we release a new version.
     You can notice a new release on this mailing list.


> 2.) One quite scary thing left is documenting all copyright and license
> occurrences in the codebase. It looks like there is a fair bit of
> embedded code coming from various sources and with varying levels of
> modification. The debian/copyright file in the JFrog packaging only
> contains a number of TODOs so I guess this is still up to me to finish
> before I can think of doing an upload.

I think so too.

> Is the LICENSE.txt in the Arrow source root directory complete and lists
> _all_ third-party licenses and copyright holders in the release tarball?

No. Most of them are covered but some of them only exist in
source code such as
https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/mman.h .

I put many TODOs in debian/copyright because of this...


Thanks,
-- 
kou
