Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2019/11/26 04:52:22 UTC

[DISCUSS][C++/Python] Bazel example

As previously discussed [1], I took on the effort of trying to
come up with a demo for using Bazel as a build system for C++/Python.  The
results [2] are a bit of a mixed bag.

I was able to construct an example that runs on my Mac that can compile and
run most of the tests in "src/arrow" as well as the IPC read/write test,
and a Python test (test_array.py).  I also have C++ Flight compiling.  A
demonstration of how different library locations can be selected is also
available [3].  It would take a lot more work to reach parity with the
functionality that CMake currently provides.

After going through this exercise I put together a list of pros and cons
below.

I would like to hear from other devs:
1.  Their opinions on setting this up as an alternative system (I'm willing
to invest some more time in it).
2.  What they think the minimum bar for merging a PR like this should be.

Pros:
1.  Being able to run "bazel test python/..." and having compilation of all
python dependencies just work is a nice experience.
2.  Because of the granular compilation units, it can improve developer
velocity. Unit tests can depend only on the sub-components they are meant
to test. They don't need to compile and relink arrow.so.
3.  The built-in documentation it provides about visibility and
relationships between components is nice (it's uncovered some "interesting
dependencies").  I didn't make heavy use of it, but its concept of
"visibility" makes things more explicit about what external consumers
should be depending on, and what inter-project components should depend on
(e.g. explicitly limit the scope of vendored code).
4.  Extensions are essentially Python, which might be easier to work with
than CMake.
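
Pro 2 can be illustrated with a toy model.  The sketch below is plain
Python, not Bazel; the target names (buffer, array, ipc, arrow) and both
dependency graphs are hypothetical and are not taken from the PR.  It
only shows why fine-grained targets limit which tests must be rebuilt
after a change.

```python
def reverse_closure(deps, changed):
    """Return every target that transitively depends on `changed`."""
    affected = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for target, target_deps in deps.items():
            if current in target_deps and target not in affected:
                affected.add(target)
                frontier.append(target)
    return affected


# Hypothetical fine-grained layout: each test depends on one component.
granular = {
    "buffer": [],
    "array": ["buffer"],
    "ipc": ["array"],
    "buffer_test": ["buffer"],
    "array_test": ["array"],
    "ipc_test": ["ipc"],
}

# Hypothetical monolithic layout: every test links the whole library.
monolithic = {
    "arrow": [],
    "buffer_test": ["arrow"],
    "array_test": ["arrow"],
    "ipc_test": ["arrow"],
}

# A change to the IPC code touches "ipc" in one layout and "arrow" in the other.
granular_rebuilt = sorted(
    t for t in reverse_closure(granular, "ipc") if t.endswith("_test"))
monolithic_rebuilt = sorted(
    t for t in reverse_closure(monolithic, "arrow") if t.endswith("_test"))

print(granular_rebuilt)
print(monolithic_rebuilt)
```

In this model a change to the IPC code rebuilds only ipc_test in the
granular layout, while the monolithic layout rebuilds all three tests.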

Cons:
1.  Bazel is opinionated on C++ layout.  In particular it requires some
workarounds to deal with circular .h/.cc dependencies.  The two main ways
of doing this are either increasing the size of compilable units [4] to
span all dependencies in the cycle, or creating separate
header/implementation targets; I've used both strategies in the PR.  One
could argue that it would be nice to reduce circular dependencies in
general.
2.  Bazel's Python support still seems lacking.  To make the test work, I
needed to explicitly list all transitive dependencies of the pip-installed
packages by hand.
3.  Bazel in general doesn't seem to have wide adoption so any
customization probably won't have a whole lot of support (I've been told
there are some adapters with CMake that can leverage some of the existing
code).
4.  It is more verbose to configure than CMake (each compilation unit needs
to be spelled out with dependencies).
5.  The "packaging" story of different build artifacts still needs to be
explored.

Thanks,
Micah


[1]
https://lists.apache.org/thread.html/26c2a9e7e35ffc6f6ff68fbbfb38a0a33002b8e7210e8d323566f447@%3Cdev.arrow.apache.org%3E
[2] https://github.com/apache/arrow/pull/5897/files
[3]
https://github.com/apache/arrow/pull/5897/files#diff-85ecc9fdaae4c714198a1c31c7748f2a
[4]
https://github.com/apache/arrow/pull/5897/files#diff-c23198ffa8af9adf6825cb9c6f6e135b

Re: [DISCUSS][C++/Python] Bazel example

Posted by Micah Kornfield <em...@gmail.com>.
>
> I don't get how this is a cycle.  It only means Bazel is too limited to
> distinguish between a header dependency and a C++ module?


Agreed, this isn't a true cycle, but Bazel is opinionated about this (i.e.
it forces workarounds).  In the example I highlighted it might have been
cleaner to take the approach of combining the two ".cc" files and ".h" files
into a single Bazel target.  Within Google, there is a fairly strong
convention of one ".h" and one ".cc" per build target.


> Do you mean that long compile times are ok because we can ask
> contributors to buy 16-core monsters?


No, this was my poor attempt at humor.  I apologize if it offended you or
anyone else.  The hardware I use for my Arrow development is old enough
that I've just started accepting slow build times.

Getting back to potentially merging this, we discussed Bazel on the sync
call.  One option is to not add this to the Arrow CI builds and let Google
projects that depend on the binding be responsible for keeping it working.
This has the potential for bit-rot, but might be a good compromise, and it
would let other developers try it out to see if they like it.

Cheers,
Micah

On Wed, Nov 27, 2019 at 6:52 AM Antoine Pitrou <an...@python.org> wrote:

>
> Le 27/11/2019 à 06:16, Micah Kornfield a écrit :
> >
> >>  Can you give an example of circular dependency?  Can this be solved by
> >> having more "type_fwd.h" headers for forward declarations of opaque
> >> types?
> >
> > I think the type_fwd.h might contribute to the problem. The solution
> > would be more granular header/compilation units when possible (or
> > combining targets appropriately).  An example of the problem is
> > expression.h/.cc and operation.h/.cc in the compute library.  Because
> > operation.cc depends on expression.h and expression.cc relies on
> > expression.h there is cycle between the two targets.
>
> I don't get how this is a cycle.  It only means Bazel is too limited to
> distinguish between a header dependency and a C++ module?
>
> For me, a cycle would be something like "expression.h includes
> operation.h which includes expression.h" (I've actually already seen
> things like this, though not in Arrow AFAIR).
>
> > I thought computer
> > upgrades where something to look forward to ;)
>
> Do you mean that long compile times are ok because we can ask
> contributors to buy 16-core monsters?
>
> Regards
>
> Antoine.
>

Re: [DISCUSS][C++/Python] Bazel example

Posted by Antoine Pitrou <an...@python.org>.
Le 27/11/2019 à 06:16, Micah Kornfield a écrit :
> 
>>  Can you give an example of circular dependency?  Can this be solved by
>> having more "type_fwd.h" headers for forward declarations of opaque types?
> 
> I think the type_fwd.h might contribute to the problem. The solution would
> be more granular header/compilation units when possible (or combining
> targets appropriately).  An example of the problem is expression.h/.cc and
> operation.h/.cc in the compute library.  Because operation.cc depends on
> expression.h and expression.cc relies on expression.h there is cycle
> between the two targets.

I don't get how this is a cycle.  It only means Bazel is too limited to
distinguish between a header dependency and a C++ module?

For me, a cycle would be something like "expression.h includes
operation.h which includes expression.h" (I've actually already seen
things like this, though not in Arrow AFAIR).

> I thought computer
> upgrades where something to look forward to ;)

Do you mean that long compile times are ok because we can ask
contributors to buy 16-core monsters?

Regards

Antoine.

Re: [DISCUSS][C++/Python] Bazel example

Posted by Micah Kornfield <em...@gmail.com>.
Hi Antoine,


> My question would be: what happens after the PR is merged?  Are
> developers supposed to keep the Bazel setup working in addition to
> CMake?  Or is there a dedicated maintainer (you? :-)) to fix regressions
> when they happen?

In the short term, I would be willing to be a dedicated maintainer for Mac
(and, once I get Linux support working, for Linux as well).  I'd like to
classify the support as very experimental (not advertised in the
documentation yet).  If other devs find Bazel useful, I would expect others to help with
maintenance naturally.  If it gets too much for me to maintain, I'm willing
to drop support completely, since it won't be a critical part of the build
infrastructure.  Once the setup is more complete, I would plan on adding a
CI target for it as well.


>  Can you give an example of circular dependency?  Can this be solved by
> having more "type_fwd.h" headers for forward declarations of opaque types?

I think the type_fwd.h might contribute to the problem. The solution would
be more granular header/compilation units when possible (or combining
targets appropriately).  An example of the problem is expression.h/.cc and
operation.h/.cc in the compute library.  Because operation.cc depends on
expression.h and expression.cc relies on expression.h, there is a cycle
between the two targets.  I fixed this by making a new header only target
for expression.h, which the operation target depends on.   Then the
expression target depends on the operation target.  An alternative approach
would be to combine "expression.*" and "operation.*" into a single target.
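
The header-only-target workaround described above can be sketched in
plain Python.  The declarations mimic the shape of BUILD rules, but
cc_library here is a stub so the snippet runs on its own; the target
names are hypothetical and not taken from the PR.  The sketch also
assumes the cycle arises because operation.cc includes expression.h
while expression.cc includes operation.h, which is what the fix (the
expression target depending on the operation target) implies.

```python
def make_workspace():
    """Return (targets, cc_library); the stub only records name -> deps."""
    targets = {}

    def cc_library(name, srcs=(), hdrs=(), deps=()):
        # Strip the leading ":" that marks a same-package label.
        targets[name] = [d.lstrip(":") for d in deps]

    return targets, cc_library


def find_cycle(targets):
    """Return one dependency cycle as a list of names, or None."""
    WHITE, GREY, BLACK = 0, 1, 2
    state = {name: WHITE for name in targets}

    def visit(name, path):
        state[name] = GREY
        path.append(name)
        for dep in targets[name]:
            if state[dep] == GREY:
                return path[path.index(dep):] + [dep]
            if state[dep] == WHITE:
                found = visit(dep, path)
                if found:
                    return found
        path.pop()
        state[name] = BLACK
        return None

    for name in targets:
        if state[name] == WHITE:
            found = visit(name, [])
            if found:
                return found
    return None


# One target per .h/.cc pair: the two targets depend on each other.
naive, cc_library = make_workspace()
cc_library(name="expression", srcs=["expression.cc"],
           hdrs=["expression.h"], deps=[":operation"])
cc_library(name="operation", srcs=["operation.cc"],
           hdrs=["operation.h"], deps=[":expression"])

# Workaround: split expression.h into its own header-only target.
split, cc_library = make_workspace()
cc_library(name="expression_hdr", hdrs=["expression.h"])
cc_library(name="operation", srcs=["operation.cc"],
           hdrs=["operation.h"], deps=[":expression_hdr"])
cc_library(name="expression", srcs=["expression.cc"],
           deps=[":expression_hdr", ":operation"])

naive_cycle = find_cycle(naive)
split_cycle = find_cycle(split)
print(naive_cycle)
print(split_cycle)
```

Under these assumptions the first layout reports the cycle
expression -> operation -> expression, and the split layout reports none.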


> (also, generally, it would be desirable to use more of these, since our
> compile times have become egregious as of late - I'm currently
> considering replacing my 8-core desktop CPU with a beefier one :-/)

I'm not a huge fan of this approach in general, but since I haven't been
able to contribute on a day-to-day basis to the C++ code base, I'll let the
active contributors decide the best course here.  I thought computer
upgrades were something to look forward to ;)

> This sounds really like a bummer. Do you have to spell those out by
> hand?  Or is there some tool that infers dependencies and generates the
> declarations for you?

Yes, I had to spell them out by hand.  There is an internal tool at Google
that helps with it (I didn't use it for this PR). There has been some
discussion of open-sourcing the tool [1], but I wouldn't expect it any time
soon.  Luckily things are fairly well modularized at the moment, so while
tedious, it was not tremendously painful.  Another solution
would be to have larger targets (e.g. one per directory) that use globs
which would make it less painful, but this loses some of the benefits
mentioned above.
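
For a rough idea of what inferring dependencies could look like, here is
a toy sketch in plain Python.  It is not the Google tool mentioned above
and not part of Bazel; the function name infer_deps, the file names, and
the target names are all hypothetical.  It scans quoted #include lines
and maps each header to the target that owns it.

```python
import re

# Matches lines such as: #include "arrow/buffer.h"  (angle-bracket includes are ignored)
INCLUDE_RE = re.compile(r'^\s*#\s*include\s+"([^"]+)"', re.MULTILINE)


def infer_deps(target_files, sources):
    """Suggest deps per target from quoted includes.

    target_files: target name -> list of files the target owns
    sources:      file name -> file contents
    """
    owner = {f: t for t, files in target_files.items() for f in files}
    deps = {}
    for target, files in target_files.items():
        found = set()
        for f in files:
            for header in INCLUDE_RE.findall(sources.get(f, "")):
                dep = owner.get(header)
                if dep is not None and dep != target:
                    found.add(dep)
        deps[target] = sorted(found)
    return deps


# Hypothetical in-memory source tree.
sources = {
    "buffer.h": "",
    "buffer.cc": '#include "buffer.h"\n',
    "array.h": '#include "buffer.h"\n',
    "array.cc": '#include "array.h"\n#include <vector>\n',
}
target_files = {
    "buffer": ["buffer.h", "buffer.cc"],
    "array": ["array.h", "array.cc"],
}

suggested = infer_deps(target_files, sources)
print(suggested)
```

A real tool would also have to handle include paths, generated headers,
and third-party libraries, which this sketch ignores.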

[1] https://github.com/bazelbuild/bazel/issues/6871

On Tue, Nov 26, 2019 at 1:27 AM Antoine Pitrou <an...@python.org> wrote:

>
> Hi Micah,
>
> Le 26/11/2019 à 05:52, Micah Kornfield a écrit :
> >
> > After going through this exercise I put together a list of pros and cons
> > below.
> >
> > I would like to hear from other devs:
> > 1.  Their opinions on setting this up as an alternative system (I'm
> > willing to invest some more time in it).
> > 2. What people think the minimum bar for merging a PR like this should
> > be?
>
> My question would be: what happens after the PR is merged?  Are
> developers supposed to keep the Bazel setup working in addition to
> CMake?  Or is there a dedicated maintainer (you? :-)) to fix regressions
> when they happen?
>
> > Pros:
> > 1.  Being able to run "bazel test python/..." and having compilation of
> > all python dependencies just work is a nice experience.
> > 2.  Because of the granular compilation units, it can improve developer
> > velocity. Unit tests can depend only on the sub-components they are meant
> > to test. They don't need to compile and relink arrow.so.
> > 3.  The built-in documentation it provides about visibility and
> > relationships between components is nice (its uncovered some "interesting
> > dependencies").  I didn't make heavy use of it, but its concept of
> > "visibility" makes things more explicit about what external consumers
> > should be depending on, and what inter-project components should depend
> > on (e.g. explicitly limit the scope of vendored code).
> > 4.  Extensions are essentially python, which might be easier to work with
> > then CMake
>
> Those sound nice.
>
> > Cons:
> > 1.  Bazel is opinionated on C++ layout.  In particular it requires some
> > workarounds to deal with circular .h/.cc dependencies.  The two main ways
> > of doing this are either increasing the size of compilable units [4] to
> > span all dependencies in the cycle, or creating separate
> > header/implementation targets, I've used both strategies in the PR.  One
> > could argue that it would be nice to reduce circular dependencies in
> > general.
>
> Can you give an example of circular dependency?  Can this be solved by
> having more "type_fwd.h" headers for forward declarations of opaque types?
>
> (also, generally, it would be desirable to use more of these, since our
> compile times have become egregious as of late - I'm currently
> considering replacing my 8-core desktop CPU with a beefier one :-/)
>
> > 4.  It is more verbose to configure then CMake (each compilation unit
> > needs to be spelled out with dependencies).
>
> This sounds really like a bummer. Do you have to spell those out by
> hand?  Or is there some tool that infers dependencies and generates the
> declarations for you?
>
> Regards
>
> Antoine.
>

Re: [DISCUSS][C++/Python] Bazel example

Posted by Antoine Pitrou <an...@python.org>.
Hi Micah,

Le 26/11/2019 à 05:52, Micah Kornfield a écrit :
> 
> After going through this exercise I put together a list of pros and cons
> below.
> 
> I would like to hear from other devs:
> 1.  Their opinions on setting this up as an alternative system (I'm willing
> to invest some more time in it).
> 2. What people think the minimum bar for merging a PR like this should be?

My question would be: what happens after the PR is merged?  Are
developers supposed to keep the Bazel setup working in addition to
CMake?  Or is there a dedicated maintainer (you? :-)) to fix regressions
when they happen?

> Pros:
> 1.  Being able to run "bazel test python/..." and having compilation of all
> python dependencies just work is a nice experience.
> 2.  Because of the granular compilation units, it can improve developer
> velocity. Unit tests can depend only on the sub-components they are meant
> to test. They don't need to compile and relink arrow.so.
> 3.  The built-in documentation it provides about visibility and
> relationships between components is nice (its uncovered some "interesting
> dependencies").  I didn't make heavy use of it, but its concept of
> "visibility" makes things more explicit about what external consumers
> should be depending on, and what inter-project components should depend on
> (e.g. explicitly limit the scope of vendored code).
> 4.  Extensions are essentially python, which might be easier to work with
> then CMake

Those sound nice.

> Cons:
> 1.  Bazel is opinionated on C++ layout.  In particular it requires some
> workarounds to deal with circular .h/.cc dependencies.  The two main ways
> of doing this are either increasing the size of compilable units [4] to
> span all dependencies in the cycle, or creating separate
> header/implementation targets, I've used both strategies in the PR.  One
> could argue that it would be nice to reduce circular dependencies in
> general.

Can you give an example of circular dependency?  Can this be solved by
having more "type_fwd.h" headers for forward declarations of opaque types?

(also, generally, it would be desirable to use more of these, since our
compile times have become egregious as of late - I'm currently
considering replacing my 8-core desktop CPU with a beefier one :-/)

> 4.  It is more verbose to configure then CMake (each compilation unit needs
> to be spelled out with dependencies).

This sounds really like a bummer. Do you have to spell those out by
hand?  Or is there some tool that infers dependencies and generates the
declarations for you?

Regards

Antoine.