Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/05/09 21:45:08 UTC

[GitHub] [arrow] westonpace commented on issue #35508: [C++][Python] Adding data to tdigest in pyarrow

westonpace commented on issue #35508:
URL: https://github.com/apache/arrow/issues/35508#issuecomment-1540931214

   Yes, internally, most aggregate functions are implemented in an incremental map/reduce style.  A lot of the pieces needed to expose this are in place, but not everything.
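   To make the "incremental map/reduce style" concrete, here is a sketch in Python of the consume/merge/finalize pattern that Arrow's aggregate kernels follow internally.  This is illustrative only (using a simple mean rather than a t-digest, and plain Python rather than the actual C++ kernel API):

```python
class MeanState:
    """Per-thread partial state for an incremental mean aggregate.

    Mirrors the shape of Arrow's internal aggregate kernels:
    consume (map), merge (reduce), finalize (emit result).
    """

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def consume(self, batch):
        # "map" step: fold one batch of values into this state
        self.total += sum(batch)
        self.count += len(batch)

    def merge(self, other):
        # "reduce" step: combine partial states (e.g. from other threads)
        self.total += other.total
        self.count += other.count

    def finalize(self):
        # emit the final result from the accumulated state
        return self.total / self.count


s1, s2 = MeanState(), MeanState()
s1.consume([1.0, 2.0, 3.0])
s2.consume([4.0, 5.0])
s1.merge(s2)
print(s1.finalize())  # 3.0
```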
   
   However, C++ changes would be required (I think).
   
   You could choose to do this completely outside of Acero using the function registry directly.  You would end up creating something that looks quite a bit like the aggregate node, so I'd recommend starting by looking at that and getting familiar with it.
   
   On the other hand, I think I'd prefer an approach reusing Acero, since it already does this.  Acero can take streaming (and potentially infinite) inputs.  This is not very well exposed in pyarrow today (I think you can use an iterator of batches as the input to a scanner).  It might be nice to be able to create a push-based pyarrow/acero data source, though.
   
   The other problem is that you would need an aggregate node that emitted its results every time a batch arrived.  This shouldn't be too hard and could probably be done with a flag on the aggregate node.  The one wrinkle, I think, is that the aggregate node today keeps one state per thread.  For this to work you would want to synchronize all the threads on a single state (or run without threads).
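   Sketching the desired behavior in Python: an aggregate that emits a result on every batch, with all threads synchronized on a single shared state rather than one state per thread.  (This is a conceptual sketch of the proposed C++ change, not an existing pyarrow API; the mean again stands in for a t-digest.)

```python
import threading


class StreamingMean:
    """Aggregate with one shared state that emits a result per batch."""

    def __init__(self):
        # All threads synchronize on this single state, unlike the
        # current aggregate node's one-state-per-thread design.
        self._lock = threading.Lock()
        self.total = 0.0
        self.count = 0

    def push_batch(self, batch):
        with self._lock:
            self.total += sum(batch)
            self.count += len(batch)
            # emit the running result every time a batch arrives
            return self.total / self.count


agg = StreamingMean()
print(agg.push_batch([1.0, 2.0]))  # 1.5
print(agg.push_batch([3.0]))       # 2.0
```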

