Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/26 12:32:03 UTC

[GitHub] [arrow-rs] alamb opened a new issue #82: – Async Sans IO: R/W into/to Arrow Arrays

alamb opened a new issue #82:
URL: https://github.com/apache/arrow-rs/issues/82


   *Note*: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-9275
   
   This issue can be considered epic-level: it spans other Arrow projects.
   
   *Drill down*
   
   Currently, traits like `ParquetReader` expose only a synchronous interface, which uses a `BufReader` with a fixed 8KB buffer. Over the network, this becomes a problem, one that can easily be solved with differential buffers. In addition to this shortcoming, an executor engine is needed to schedule work from async trait methods onto sync trait methods; it has to sit somewhere in between to make requests to external IO asynchronous. For on-disk IO the current approach is acceptable, since no reliable evented IO exists for on-disk IO on major platforms.
   
   All things considered, abstractions that expose asynchronous IO without taking any side among executors need to be provided.
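   
   On the fixed 8KB buffer: the standard library's `BufReader::new` uses a default capacity (currently 8 KiB), while `BufReader::with_capacity` lets the caller pick one. A minimal sketch, assuming a hypothetical `buffered` helper that is not part of any Arrow or Parquet crate. It only shows that the capacity does not have to be constant; it does not implement the differential buffers proposed here.

   ```rust
   use std::io::{BufReader, Read};

   /// Hypothetical helper (not an existing Arrow/Parquet API): wrap any
   /// reader with a caller-chosen buffer capacity instead of relying on
   /// the default capacity used by `BufReader::new`.
   pub fn buffered<R: Read>(source: R, capacity: usize) -> BufReader<R> {
       BufReader::with_capacity(capacity, source)
   }
   ```

   A remote source could then be wrapped with a larger capacity than a local file, for example `buffered(remote, 1024 * 1024)`, where the size is purely illustrative.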
   
    
   
   *Design Suggestions & Considerations*
   
   The design should apply and consider:
    * Sans IO (for more information about the Sans IO approach, please see [https://sans-io.readthedocs.io/]).
    * Not including any executor-specific data at all.
    * Tests should work with any executor with little to no modification.
    * Buffers should be sized accordingly, using differential buffers to optimize network round trips.
    * Sync IO shouldn't be touched, at any cost. If we try to unify the Sync IO traits or write overlapping implementations, that will make our life harder in the future. Sans IO should be compartmentalized.
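   
   To make the Sans IO idea above concrete, here is a minimal sketch, assuming a made-up framing (a 4-byte little-endian length followed by the payload) rather than the actual Parquet or Arrow layout. `FrameDecoder`, `Step`, `feed` and `poll` are hypothetical names, not existing arrow-rs APIs. The decoder never reads from a file or socket and knows nothing about executors; the caller obtains bytes however it likes and hands them over.

   ```rust
   /// What the decoder reports back to its caller.
   /// The decoder itself never performs IO.
   #[derive(Debug, PartialEq)]
   pub enum Step {
       /// At least this many more bytes are needed before progress is possible.
       NeedBytes(usize),
       /// One complete frame payload was decoded.
       Frame(Vec<u8>),
   }

   /// Hypothetical decoder for frames laid out as:
   /// 4-byte little-endian length, then that many payload bytes.
   #[derive(Default)]
   pub struct FrameDecoder {
       buf: Vec<u8>,
   }

   impl FrameDecoder {
       pub fn new() -> Self {
           Self::default()
       }

       /// Hand over bytes obtained by any means: a blocking read,
       /// an async read on any executor, or an in-memory slice in a test.
       pub fn feed(&mut self, bytes: &[u8]) {
           self.buf.extend_from_slice(bytes);
       }

       /// Try to make progress using only the bytes buffered so far.
       pub fn poll(&mut self) -> Step {
           const HEADER: usize = 4;
           if self.buf.len() < HEADER {
               return Step::NeedBytes(HEADER - self.buf.len());
           }
           let len = u32::from_le_bytes([
               self.buf[0], self.buf[1], self.buf[2], self.buf[3],
           ]) as usize;
           let total = HEADER + len;
           if self.buf.len() < total {
               return Step::NeedBytes(total - self.buf.len());
           }
           // Copy the payload out, then drop the consumed frame from the buffer.
           let payload = self.buf[HEADER..total].to_vec();
           self.buf.drain(..total);
           Step::Frame(payload)
       }
   }
   ```

   Because `poll` reports how many bytes are still missing, a sync or async driver can size its next read accordingly, and the same decoder can be tested against plain byte slices without any runtime.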
   
    
   
   *Notes*
   
   If the Sans IO approach is not taken, the project will:
    * use an extreme number of dependencies.
    * not be compatible with other Rust code at all.
    * break currently working code that uses array ingestion.
    * make integration tests harder.
    * be really hard to adapt (in user projects) once completion-based APIs stabilize in the future.
   
   This suggestion is not about the Flight format or any Flight-related information at the moment. It is purely about making on-disk and remote IO (provider backends like AWS etc.) async.
   
    
   
   *Open points*
   
   A couple of open points:
    * Identifying the traits that are going to be made async.
    * Designing internal routines.
    * Choosing the package name to expose.
    * Gathering traits into the designated packages for all file formats.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on issue #82: – Async Sans IO: R/W into/to Arrow Arrays

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #82:
URL: https://github.com/apache/arrow-rs/issues/82#issuecomment-826797471


   Comment from Mahmut Bulut(vertexclique) @ 2020-06-30T10:30:24.500+0000:
   <pre>[~nevi_me], [~andygrove], [~paddyhoran] I need input for this from you if possible.</pre>
   
   Comment from Neville Dipale(nevi_me) @ 2020-07-01T19:16:41.647+0000:
   <pre>Hi [~vertexclique], I'm out of my depth with Sans IO. Are you proposing a way of using async IO without being bound to a specific runtime (tokio, async-std, etc.)?
   
   There has been interest in async IO, so I presume that once we have a concrete implementation plan, we might be able to get more contributors to help (assuming it's a lot of effort).
   
   As you mention that this might potentially span across other projects; perhaps you could bring this up in the mailing list, to get more feedback from the wider community?</pre>
   
   Comment from Mahmut Bulut(vertexclique) @ 2020-07-03T11:00:31.593+0000:
   <pre>Yes, exactly Neville, so users can choose whatever they want to incorporate in their workloads, which enables plenty of projects with different workloads, scenarios, etc.
   
   And yes again, I feel like there should be a collaborative effort together to add APIs around crates. Spans a little wider than other tickets.
   
   Sure! I will send a similar email with similar content of this ticket. Tagging `[Rust]`. Thanks for the feedback, will send a mail asap.</pre>
   
   Comment from Andy Grove(andygrove) @ 2020-08-25T01:35:37.057+0000:
   <pre>[~vertexclique] For some reason I didn't see this issue until now. I am interested in discussing this further and especially how it relates to other issues we have open around async.
   
   Also pinging [~alamb] and [~jorgecarleitao] who have been involved in discussions related to this in the DataFusion crate.</pre>
   
   Comment from Andrew Lamb(alamb) @ 2020-08-25T14:12:09.216+0000:
   <pre>In general, I think the notion of implementing async Parquet and Arrow APIs that don't rely on tokio or other executors is a good idea. 
   
   I think in order to make the crate as widely useful as possible, it should also retain a synchronous API for use with the rust standard library.
   
   One pattern I have seen is using an `async` crate option that adds the appropriate async options (and possibly additional dependencies). For example, https://docs.rs/bzip2/0.4.1/bzip2/#async-io
   
   </pre>
   
   Comment from Max Burke(m18e) @ 2020-08-26T17:15:30.549+0000:
   <pre>I'd be a little concerned about over-generalizing out of the gate. Having done a similar song and dance with some internal code one of the things I like about tailoring to a specific runtime is that synchronization primitives taken from a particular runtime are able to leverage that runtime. Tokio's mutex, for example, will yield back to the executor if it contends on a mutex lock rather than tying up a pool thread, which can be a Very Good Thing with async-heavy workloads.
   
   I'm not sure what you mean in terms of specifics for the "sans-IO" method, I assume by this you mean the user would be expected to pass in implementations of the AsyncRead/etc. traits which will read from disk or network or memory or wherever? </pre>

