Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/04 13:11:27 UTC

[GitHub] [arrow-rs] alamb opened a new issue #1269: Provide an `async` ParquetWriter for arrow

alamb opened a new issue #1269:
URL: https://github.com/apache/arrow-rs/issues/1269


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   It is nice to be able to read Parquet files using Rust `async` IO (https://github.com/apache/arrow-rs/issues/111); it would be nice to be able to write them the same way.
   
    
   **Describe the solution you'd like**
   An `async` writer similar in spirit to the reader created in https://github.com/apache/arrow-rs/issues/111
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on issue #1269: Provide an `async` ParquetWriter for arrow

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1269:
URL: https://github.com/apache/arrow-rs/issues/1269#issuecomment-1030898575


   > I noticed that we have a buffer optimization in the sync write part. I'm not sure if we need to keep the buffer in the async write part.
   
   Are you referring to the buffer added in #1214? It would be really cool to keep that if possible, as a `RecordBatch` is likely to be significantly smaller than the ideal size for a row group.
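
To illustrate why the buffer matters, here is a minimal sketch of accumulating small batches until a row group's worth of rows is available. It is not the code from #1214: `Batch`, `RowGroupBuffer`, and `max_row_group_size` are hypothetical names, and a plain `Vec<i64>` stands in for an Arrow `RecordBatch` so the example needs only the standard library.

```rust
/// Hypothetical stand-in for an Arrow `RecordBatch`: just a chunk of rows.
struct Batch {
    rows: Vec<i64>,
}

/// Accumulates small batches and emits a "row group" once enough rows exist.
struct RowGroupBuffer {
    /// Assumed knob: the number of rows that make up one full row group.
    max_row_group_size: usize,
    /// Rows received so far that have not been emitted yet.
    buffered: Vec<i64>,
    /// Stand-in for the row groups that would be written to the output.
    flushed: Vec<Vec<i64>>,
}

impl RowGroupBuffer {
    fn new(max_row_group_size: usize) -> Self {
        assert!(max_row_group_size > 0, "row group size must be positive");
        RowGroupBuffer {
            max_row_group_size,
            buffered: Vec::new(),
            flushed: Vec::new(),
        }
    }

    /// Buffers the batch, emitting every full row group that becomes available.
    fn write(&mut self, batch: &Batch) {
        self.buffered.extend_from_slice(&batch.rows);
        while self.buffered.len() >= self.max_row_group_size {
            // Keep the tail in the buffer and emit the first full group.
            let rest = self.buffered.split_off(self.max_row_group_size);
            let group = std::mem::replace(&mut self.buffered, rest);
            self.flushed.push(group);
        }
    }

    /// Emits whatever is left as a final, possibly short, row group.
    fn close(mut self) -> Vec<Vec<i64>> {
        if !self.buffered.is_empty() {
            let group = std::mem::take(&mut self.buffered);
            self.flushed.push(group);
        }
        self.flushed
    }
}
```

With a limit of 5 rows, three 4-row batches come out as row groups of 5, 5, and 2 rows. An async writer could presumably keep the same bookkeeping and only await when a full group is handed to the output.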





[GitHub] [arrow-rs] xudong963 commented on issue #1269: Provide an `async` ParquetWriter for arrow

Posted by GitBox <gi...@apache.org>.
xudong963 commented on issue #1269:
URL: https://github.com/apache/arrow-rs/issues/1269#issuecomment-1030067641


   I'll try it @alamb 





[GitHub] [arrow-rs] xudong963 commented on issue #1269: Provide an `async` ParquetWriter for arrow

Posted by GitBox <gi...@apache.org>.
xudong963 commented on issue #1269:
URL: https://github.com/apache/arrow-rs/issues/1269#issuecomment-1030625184


   Thanks for the nice suggestion, @alamb! I'll start the task tomorrow.





[GitHub] [arrow-rs] alamb commented on issue #1269: Provide an `async` ParquetWriter for arrow

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1269:
URL: https://github.com/apache/arrow-rs/issues/1269#issuecomment-1030604503


   Thanks @xudong963 - I personally suggest starting with the API design: write an example program showing how you would use the feature, sorting out what types to use for streams, etc. Something like the following:
   
   ```rust
   async fn main() { 
     // get output stream
     let writer = AsyncParquet::new(...);
     // write batches to the writer somehow (not sure how??)
     for batch in batches {
       ....
     }
   }
   ```
   
      





[GitHub] [arrow-rs] xudong963 commented on issue #1269: Provide an `async` ParquetWriter for arrow

Posted by GitBox <gi...@apache.org>.
xudong963 commented on issue #1269:
URL: https://github.com/apache/arrow-rs/issues/1269#issuecomment-1031008590


   @tustvold Yep, thanks! I'll take a look at #1214 and reorganize my thoughts.





[GitHub] [arrow-rs] xudong963 commented on issue #1269: Provide an `async` ParquetWriter for arrow

Posted by GitBox <gi...@apache.org>.
xudong963 commented on issue #1269:
URL: https://github.com/apache/arrow-rs/issues/1269#issuecomment-1030858823


   I think we need to provide these new APIs and structs.
   
   ```rust
   pub struct AsyncArrowWriter<W: AsyncWrite + Unpin + Send> {
       writer: FileStream<W>,
       ...
   }
   
   pub struct FileStream<W: AsyncWrite + Unpin + Send> {
       writer: W
   }
   
   impl <...> FileStream {
       pub async fn write(&mut self, ...) -> Result<> {
           // 1. firstly, write header
           write_header().await?;
           // 2. secondly, write rowgroups
           write_row_groups().await?;
           // 3. thirdly, write metadata
           write_metadata().await?;
       }
   }
   ```
   
   I noticed that we have a **buffer** optimization in the sync write part. I'm not sure if we need to keep the buffer in the async write part.
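
The three steps above correspond to the Parquet file layout: the `PAR1` magic bytes, then the row groups, then the footer metadata followed by its 4-byte little-endian length and the magic bytes again. Below is a rough sketch of that ordering. It assumes a synchronous `std::io::Write` sink in place of `AsyncWrite` (the standard library has no async writer trait) and opaque byte slices in place of real encoded row groups and Thrift metadata. `FileSink` and its methods are hypothetical names, not part of the `parquet` crate.

```rust
use std::io::{self, Write};

/// Magic bytes that open and close every Parquet file.
const MAGIC: &[u8; 4] = b"PAR1";

/// Hypothetical sink that tracks how many bytes have been written so far.
struct FileSink<W: Write> {
    writer: W,
    bytes_written: u64,
}

impl<W: Write> FileSink<W> {
    fn new(writer: W) -> Self {
        FileSink { writer, bytes_written: 0 }
    }

    /// Step 1: the header is just the magic bytes.
    fn write_header(&mut self) -> io::Result<()> {
        self.put(MAGIC)
    }

    /// Step 2: append one already-encoded row group and return its file
    /// offset, which the footer metadata would need to record.
    fn write_row_group(&mut self, encoded: &[u8]) -> io::Result<u64> {
        let offset = self.bytes_written;
        self.put(encoded)?;
        Ok(offset)
    }

    /// Step 3: metadata bytes, their length as 4 little-endian bytes, magic.
    fn write_footer(&mut self, metadata: &[u8]) -> io::Result<()> {
        if metadata.len() > u32::MAX as usize {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "metadata too large",
            ));
        }
        let len = metadata.len() as u32;
        self.put(metadata)?;
        self.put(&len.to_le_bytes())?;
        self.put(MAGIC)
    }

    /// Writes all bytes and updates the running offset.
    fn put(&mut self, bytes: &[u8]) -> io::Result<()> {
        self.writer.write_all(bytes)?;
        self.bytes_written += bytes.len() as u64;
        Ok(())
    }

    /// Hands back the underlying writer once the file is complete.
    fn into_inner(self) -> W {
        self.writer
    }
}
```

Tracking the running offset is the part that ties the steps together: the footer can only be produced after every row group has been written, since it records where each one starts.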

