Posted to github@arrow.apache.org by "ozankabak (via GitHub)" <gi...@apache.org> on 2023/02/24 18:32:02 UTC

[GitHub] [arrow-rs] ozankabak commented on issue #3740: Support for Async CSV Writer

ozankabak commented on issue #3740:
URL: https://github.com/apache/arrow-rs/issues/3740#issuecomment-1444218489

   > I don't think this is avoidable, arrow is a columnar data format, it fundamentally assumes batching to amortise dispatch overheads. Row-based streaming would require a completely different architecture, likely using a JIT?
   
   @tustvold, I think there may be some terminology-related confusion here w.r.t. batching. I am sure @metesynnada was not trying to say he wants to avoid batching entirely. I think what he envisions (albeit maybe not conveyed clearly) is simply an API that operates with an async writer, so that non-IO operations can carry on while the actual write to the object store is taking place.
   
   The current API (i.e. the `put` function) is already `async`, and AFAICT it performs the actual write in a separate thread. If this is indeed true, it already doesn't block other non-IO operations. Given that we want to serialize synchronously for performance reasons, it doesn't really matter where we do it; the API seems sufficient to me as is. I just had a discussion with @metesynnada on this; he seems to agree and can comment further if I'm missing something.
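
To make the pattern above concrete, here is a minimal sketch of "serialize each batch synchronously, then hand the bytes to a separate IO thread". It deliberately uses only the Rust standard library: `Batch`, `serialize_csv`, `spawn_writer` and `write_batches` are hypothetical names made up for illustration, and the channel plus thread only stand in for an async `put`. It is not the arrow-rs or `object_store` API, and it assumes all columns in a batch have the same length.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a record batch: named columns of i64 values.
/// Assumption: every column has the same number of rows.
struct Batch {
    columns: Vec<(String, Vec<i64>)>,
}

/// Serialize a whole batch to CSV bytes synchronously (CPU-bound, no IO).
fn serialize_csv(batch: &Batch, with_header: bool) -> Vec<u8> {
    let mut out = String::new();
    if with_header {
        let names: Vec<&str> = batch.columns.iter().map(|(n, _)| n.as_str()).collect();
        out.push_str(&names.join(","));
        out.push('\n');
    }
    let rows = batch.columns.first().map_or(0, |(_, v)| v.len());
    for r in 0..rows {
        let fields: Vec<String> = batch.columns.iter().map(|(_, v)| v[r].to_string()).collect();
        out.push_str(&fields.join(","));
        out.push('\n');
    }
    out.into_bytes()
}

/// Hypothetical sink: a dedicated IO thread that appends every buffer it receives.
/// Returns the sending half and a handle that yields everything written.
fn spawn_writer() -> (mpsc::Sender<Vec<u8>>, thread::JoinHandle<Vec<u8>>) {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    let handle = thread::spawn(move || {
        let mut sink = Vec::new();
        // The loop ends once every sender has been dropped.
        for buf in rx {
            sink.extend_from_slice(&buf);
        }
        sink
    });
    (tx, handle)
}

/// Serialize batch by batch on the calling thread; the write happens elsewhere.
fn write_batches(batches: &[Batch]) -> String {
    let (tx, handle) = spawn_writer();
    for (i, batch) in batches.iter().enumerate() {
        // Only the first batch carries the header row.
        let bytes = serialize_csv(batch, i == 0);
        // `send` does not wait for the write, so serialization of the next
        // batch can proceed while the IO thread is still busy.
        tx.send(bytes).expect("writer thread exited early");
    }
    drop(tx); // signal end of input to the writer thread
    let written = handle.join().expect("writer thread panicked");
    String::from_utf8(written).expect("CSV output is valid UTF-8")
}
```

Calling `write_batches` with two batches produces a single CSV payload with one header row. The point of the sketch is only that batching is preserved (each batch is serialized in one synchronous pass) while the caller never waits on the write itself.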
   
   Since we are analyzing this part of the code anyway, one useful thing we can do is investigate whether it makes sense to avoid the new IO thread and use async primitives to do the actual writing within the same thread. I am not entirely sure what the advantages/disadvantages of doing that would be. @metesynnada can do some measurements to quantify this. Maybe you can share the reasoning behind the current choice?

