You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "wjones127 (via GitHub)" <gi...@apache.org> on 2023/03/14 13:55:10 UTC

[GitHub] [arrow] wjones127 opened a new issue, #15280: [C++] Move Acero out of libarrow

wjones127 opened a new issue, #15280:
URL: https://github.com/apache/arrow/issues/15280

   ### Describe the enhancement requested
   
   `libarrow_acero`? `libacero`?
   
   The core libarrow is largely stable while Acero remains experimental. It would be nice to eventually let these libraries evolve somewhat independently, like arrow-rs and datafusion.
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ildipo commented on issue #15280: [C++] Move Acero out of libarrow

Posted by "ildipo (via GitHub)" <gi...@apache.org>.
ildipo commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1473742744

   yes that is correct - it uses both


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on issue #15280: [C++] Move Acero out of libarrow

Posted by GitBox <gi...@apache.org>.
icexelloss commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1380465061

   For what it's worth, as a Acero user I currently don't use dataset and don't see myself using it in the near future. So I am +1 not putting dataset into libacero and have a seperate so files for it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on issue #15280: [C++] Move Acero out of libarrow

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1473853892

   @ildipo where does substrait uses dataset? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on issue #15280: [C++] Move Acero out of libarrow

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1468149349

   The above PR doesn't not close this issue - only a very first step towards it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rok commented on issue #15280: [C++] Move Acero out of libarrow

Posted by GitBox <gi...@apache.org>.
rok commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1377212553

   How usable are compute kernels now without Acero? How usable is Acero without kernels?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on issue #15280: [C++] Move Acero out of libarrow

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1400626347

   CC @benibus 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on issue #15280: [C++] Move Acero out of libarrow

Posted by GitBox <gi...@apache.org>.
wjones127 commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1377952001

   > Both datasets and substrait will need to depend on libacero. It would be a bit weird to have libarrow_dataset -> libacero -> libarrow_compute. I'm ok with the weirdness though.
   
   Yeah that's why I was contemplating renaming `libarrow_dataset` to `libacero_dataset`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on issue #15280: [C++] Move Acero out of libarrow

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1490660120

   Agree that wasn't what I initially hoped, but I think the argument for doing it like this is:
   
    1. We don't want to move `libarrow_dataset` to `libacero_dataset`, which causes more breaking changes for users than strictly necessary
    2. We have no intention of having Acero usable without Arrow (unlike the Parquet codebase, which theoretically should be splittable into its own repo if we ever want to). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on issue #15280: [C++] Move Acero out of libarrow

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1377202034

   What part would we want to move to a separate library? 
   Right now Acero lives in `arrow/compute/exec/`, right? Would we only move that, or the full of `arrow/compute` ? (xref ARROW-8891 - *"Split non-cast compute kernels into a separate shared library"*)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ildipo commented on issue #15280: [C++] Move Acero out of libarrow

Posted by "ildipo (via GitHub)" <gi...@apache.org>.
ildipo commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1472749351

   @icexelloss @westonpace 
   
   I think we need a slightly different solution:
   * substrait uses dataset
   * dataset uses acero
   * acero uses arrow
   
   my plan is to introduce a libacero (from compute/exec) which depends only on libarrow and leave libsubstrait as is, make libdataset depend on libacero


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on issue #15280: [C++] Move Acero out of libarrow

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1490617887

   The current PR https://github.com/apache/arrow/pull/34711 creates a `src/arrow/acero` (so still part of the arrow namespace), adn not to `src/acero`. Both have been mentioned above, I don't know if there was a clear conclusion on that question?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss closed issue #15280: [C++] Move Acero out of libarrow

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss closed issue #15280: [C++] Move Acero out of libarrow
URL: https://github.com/apache/arrow/issues/15280


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] paleolimbot commented on issue #15280: [C++] Move Acero out of libarrow

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1377365910

   I would love to see this split, although the ability to distribute it separately (e.g., so that a separate R or Python 'acero' package could exist) is something I hope this would be a step in the direction of.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on issue #15280: [C++] Move Acero out of libarrow

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1472854515

   Do you mean "substrait uses dataset AND acero"?  Then I agree.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss closed issue #15280: [C++] Move Acero out of libarrow

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss closed issue #15280: [C++] Move Acero out of libarrow
URL: https://github.com/apache/arrow/issues/15280


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on issue #15280: [C++] Move Acero out of libarrow

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1377946455

   `libacero` will be easier than `libkernels` (there are still parts of the core lib that depend on compute kernels, I think) though both seem valid.
   
   The only gray area I can see is group-by.  There is currently a standalone group-by implementation that runs outside of Acero, that is marked internal, that I am [trying to get rid of](https://github.com/apache/arrow/pull/14867).
   
   My vote would be that "group-by" is a part of Acero.  Note, the `hash_` aggregate kernels themselves could remain as part of compute/kernels.  Aggregations would still be usable outside of Acero.
   
   If that's the case then you could also move `arrow/compute/row` to Acero (and probably move it to `arrow/compute/exec/row` while you are at it).
   
   > I prefer to libacero because Gandiva uses libgandiva not libarrow_gandiva.
   ADBC also uses libadbc_* not libarrow_adbc_*.
   
   Both datasets and substrait will need to depend on libacero.  It would be a bit weird to have `libarrow_dataset` -> `libacero` -> `libarrow_compute`.  I'm ok with the weirdness though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on issue #15280: [C++] Move Acero out of libarrow

Posted by GitBox <gi...@apache.org>.
wjones127 commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1382059493

   cc @jorisvandenbossche @zeroshade 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on issue #15280: [C++] Move Acero out of libarrow

Posted by GitBox <gi...@apache.org>.
wjones127 commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1379489392

   I spent some time looking at the necessary changes, and it occured to me with have a bit of a mess with namespaces as well. We introduced a library called `libarrow_substrait` and that contains the namespace `arrow::engine`. Nothing else uses that namespace. Acero (i.e. the code in `src/arrow/compute/exec` all lives in the `arrow::compute` namespace.
   
   Although I'd like to have this issue somewhat limited in scope, it would be nice to know where we are trying to go with the library structure. I created a document outlining the current structure, an a couple ideas of a slightly more sane structure:
   
   https://docs.google.com/document/d/1IcQfqFlvMgM3GreNbAvax0_2XKFHRdmcs9nqTPQlhOs/edit?usp=sharing
   
   I welcome any comments or alternative suggestions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rok commented on issue #15280: [C++] Move Acero out of libarrow

Posted by GitBox <gi...@apache.org>.
rok commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1377265855

   So then another option is to split further to `libkernels`, `libacero`. I don't have an opinion on the split though :).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on issue #15280: [C++] Move Acero out of libarrow

Posted by GitBox <gi...@apache.org>.
wjones127 commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1377555897

   > What part would we want to move to a separate library?
   Right now Acero lives in arrow/compute/exec/, right? Would we only move that, or the full of arrow/compute ? (xref [ARROW-8891](https://issues.apache.org/jira/browse/ARROW-8891) - "Split non-cast compute kernels into a separate shared library")
   
   Yes, just `arrow/compute/exec/`. Splitting compute functions is its own issue and much messier.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ildipo commented on issue #15280: [C++] Move Acero out of libarrow

Posted by "ildipo (via GitHub)" <gi...@apache.org>.
ildipo commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1474031929

   https://github.com/ildipo/arrow/blob/main/cpp/src/arrow/engine/substrait/relation_internal.cc


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on issue #15280: [C++] Move Acero out of libarrow

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1479522711

   > @ildipo where does substrait uses dataset? (This is slightly surprising to me because I thought substrait code mostly just read in substrait plan protobuf and turn into an Acero declaration)
   
   @icexelloss the `scan` and `write` nodes are provided by datasets.  You can use `source` (e.g. an in-memory source) with plain Acero but if you want to scan files you need datasets.  Substrait's `ReadRel` when `LocalFiles` will use a `scan` node.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on issue #15280: [C++] Move Acero out of libarrow

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1379596541

   Thanks for the detailed analysis and the suggestions.  `engine` echos back to a time when we were thinking there might be potentially more formats like Substrait that could drive the engine.  It was an early standin for "acero" as the name wasn't agreed at the time.  I'm fine getting rid of the term "engine" and merging substrait & acero and agree that is a good idea.
   
   I'm a little hesitant about lumping "datasets" with Acero.  Datasets is somewhat heavyweight, it requires parquet, orc, csv, filesystems, etc.  None of those things are relevant to the core Acero.  One could imagine Acero being used without any need for file I/O.  For example, flight input, apply some compute operations, flight output.
   
   So my preference would be second chart you drew that has libacero and libacero_dataset.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on issue #15280: [C++] Move Acero out of libarrow

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1377223717

   > How usable is Acero without kernels?
   
   Acero would then of course have to depend on libarrow (or libarrow_compute) for the kernels.
   
   > How usable are compute kernels now without Acero?
   
   Very usable (if you are OK with the numpy paradigm of having eager functions for all operations). The current integration in pandas to work with arrow data is built on top of plain compute kernels. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on issue #15280: [C++] Move Acero out of libarrow

Posted by GitBox <gi...@apache.org>.
wjones127 commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1376425281

   `libarrow_dataset` depends on Acero, and is tightly integrated with it. Should it be moved into `libacero`? or renamed `libacero_dataset`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on issue #15280: [C++] Move Acero out of libarrow

Posted by GitBox <gi...@apache.org>.
kou commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1376855728

   I prefer to `libacero` because Gandiva uses `libgandiva` not `libarrow_gandiva`.
   ADBC also uses `libadbc_*` not `libarrow_adbc_*`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ildipo commented on issue #15280: [C++] Move Acero out of libarrow

Posted by "ildipo (via GitHub)" <gi...@apache.org>.
ildipo commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1452705159

   Since at TS we need this I'll be doing this refactoring with the solution n2 in the google doc.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on issue #15280: [C++] Move Acero out of libarrow

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss commented on issue #15280:
URL: https://github.com/apache/arrow/issues/15280#issuecomment-1490749617

   I agree with Will here that Acero is still better to live under the Arrow umbrella since there are internal dependency (e.g., dataset depend on acero)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org