Posted to issues@arrow.apache.org by "piyushdubey (via GitHub)" <gi...@apache.org> on 2024/02/27 05:02:43 UTC

[I] Parquet to Arrow conversion [arrow]

piyushdubey opened a new issue, #40258:
URL: https://github.com/apache/arrow/issues/40258

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   Hello,
   
   I am trying to convert a Delta Table to an Arrow Stream.
   
   The table can have any number of Parquet files and may or may not be partitioned. I am using Parquet.Net to read the Parquet files.
   
   How should I think about the correspondence between Parquet files and RecordBatches? Should I create one RecordBatch per Parquet file? What should the overall Parquet-to-Arrow conversion logic look like? Any pointers?
   
   
   Here's a tentative algorithm I have in mind. 
   
   1. Iterate over the list of Parquet files.
   2. For each file, open each row group: `ParquetRowGroupReader reader = parquetReader.OpenRowGroupReader(rowGroupIndex);`
   3. Extract the columns and add them to a RecordBatch one by one.
   4. Write each RecordBatch to the stream with an ArrowStreamWriter.
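   
   For illustration, the steps above might be sketched as follows. This is a minimal sketch rather than a definitive implementation. It assumes Parquet.Net 4.x's async API (`ParquetReader.CreateAsync`, `ReadColumnAsync`) and the Apache.Arrow NuGet package, and it emits one RecordBatch per row group. The names `ParquetToArrow`, `ToArrowArray` and `ConvertAsync` are hypothetical. Only non-nullable `int`, `long` and `double` columns plus `string` columns are mapped; nullable value columns (which Parquet.Net returns as, e.g., `int?[]`) and other types would need extra cases. It also assumes every file has the same schema, since a single Arrow stream carries one schema.
   
   ```csharp
   using System;
   using System.Collections.Generic;
   using System.IO;
   using System.Threading.Tasks;
   using Apache.Arrow;
   using Apache.Arrow.Ipc;
   using Parquet;
   using ArrowField = Apache.Arrow.Field;
   using ArrowSchema = Apache.Arrow.Schema;

   public static class ParquetToArrow
   {
       // Hypothetical helper: maps a Parquet.Net column buffer to an Arrow array.
       // System.Array is fully qualified because Apache.Arrow also defines an Array type.
       public static IArrowArray ToArrowArray(System.Array data) => data switch
       {
           int[] v => new Int32Array.Builder().AppendRange(v).Build(),
           long[] v => new Int64Array.Builder().AppendRange(v).Build(),
           double[] v => new DoubleArray.Builder().AppendRange(v).Build(),
           string[] v => new StringArray.Builder().AppendRange(v).Build(),
           _ => throw new NotSupportedException($"Unsupported column type: {data.GetType()}"),
       };

       // Writes one RecordBatch per row group, for all files, into a single Arrow IPC stream.
       public static async Task ConvertAsync(IEnumerable<string> parquetPaths, Stream output)
       {
           ArrowStreamWriter writer = null;
           try
           {
               foreach (string path in parquetPaths)                        // step 1
               {
                   using Stream input = File.OpenRead(path);
                   using ParquetReader reader = await ParquetReader.CreateAsync(input);
                   var dataFields = reader.Schema.GetDataFields();

                   for (int i = 0; i < reader.RowGroupCount; i++)
                   {
                       using ParquetRowGroupReader rowGroup = reader.OpenRowGroupReader(i); // step 2
                       var fields = new List<ArrowField>();
                       var arrays = new List<IArrowArray>();
                       foreach (var dataField in dataFields)                // step 3
                       {
                           var column = await rowGroup.ReadColumnAsync(dataField);
                           IArrowArray array = ToArrowArray(column.Data);
                           fields.Add(new ArrowField(dataField.Name, array.Data.DataType, true));
                           arrays.Add(array);
                       }

                       int length = arrays.Count > 0 ? arrays[0].Length : 0;
                       using var batch = new RecordBatch(new ArrowSchema(fields, null), arrays, length);

                       // The stream's schema is taken from the first batch; the output stream is left open.
                       writer ??= new ArrowStreamWriter(output, batch.Schema, true);
                       await writer.WriteRecordBatchAsync(batch);           // step 4
                   }
               }
               if (writer != null) await writer.WriteEndAsync();
           }
           finally
           {
               writer?.Dispose();
           }
       }
   }
   ```
   
   A call such as `await ParquetToArrow.ConvertAsync(paths, outputStream)` would then produce the stream, where `paths` is whatever list of data files the Delta Table currently resolves to.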
   
   Appreciate any help with this.
   
   ### Component(s)
   
   C#


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C#] Parquet to Arrow conversion [arrow]

Posted by "CurtHagenlocher (via GitHub)" <gi...@apache.org>.
CurtHagenlocher closed issue #40258: [C#] Parquet to Arrow conversion
URL: https://github.com/apache/arrow/issues/40258


Re: [I] [C#] Parquet to Arrow conversion [arrow]

Posted by "adamreeve (via GitHub)" <gi...@apache.org>.
adamreeve commented on issue #40258:
URL: https://github.com/apache/arrow/issues/40258#issuecomment-2010974690

   As an alternative to Parquet.Net, you could use ParquetSharp, which wraps the Arrow C++ Parquet library and has built-in support for reading Parquet files as Arrow record batches: https://github.com/G-Research/ParquetSharp/blob/master/docs/Arrow.md
   
   Disclaimer: I'm a maintainer of ParquetSharp
   
   But otherwise your algorithm seems sensible. I'd only suggest you might want one RecordBatch per row group rather than per file, in case your files contain many row groups and could be large.
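   
   As a rough sketch of the ParquetSharp route, the following copies one Parquet file into an Arrow IPC stream. It assumes the ParquetSharp and Apache.Arrow NuGet packages and uses the `FileReader` / `GetRecordBatchReader` API described in the linked Arrow.md. The names `ParquetSharpToArrow` and `CopyToArrowStreamAsync` are hypothetical, and batch boundaries are whatever the reader produces.
   
   ```csharp
   using System.IO;
   using System.Threading.Tasks;
   using Apache.Arrow;
   using Apache.Arrow.Ipc;
   using ParquetSharp.Arrow;

   public static class ParquetSharpToArrow
   {
       // Reads a Parquet file as Arrow record batches and writes them to an Arrow IPC stream.
       public static async Task CopyToArrowStreamAsync(string parquetPath, Stream output)
       {
           using var fileReader = new FileReader(parquetPath);
           using var batchReader = fileReader.GetRecordBatchReader();

           // Third argument leaves the output stream open after the writer is disposed.
           using var writer = new ArrowStreamWriter(output, batchReader.Schema, true);

           RecordBatch batch;
           while ((batch = await batchReader.ReadNextRecordBatchAsync()) != null)
           {
               using (batch)
               {
                   await writer.WriteRecordBatchAsync(batch);
               }
           }
           await writer.WriteEndAsync();
       }
   }
   ```
   
   For a table made of several files, the same loop could be run per file against one shared writer, provided the files share a schema.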


Re: [I] [C#] Parquet to Arrow conversion [arrow]

Posted by "CurtHagenlocher (via GitHub)" <gi...@apache.org>.
CurtHagenlocher commented on issue #40258:
URL: https://github.com/apache/arrow/issues/40258#issuecomment-2016540074

   Marking this as closed. In the absence of an "official" Apache C# implementation of a Parquet reader, I think there isn't much to add here. Interestingly, early versions of Parquet.Net also had an Arrow API but it was removed because the maintainer didn't see any advantage to it. (Utilization of Arrow was much less widespread in those days.)


Re: [I] [C#] Parquet to Arrow conversion [arrow]

Posted by "piyushdubey (via GitHub)" <gi...@apache.org>.
piyushdubey commented on issue #40258:
URL: https://github.com/apache/arrow/issues/40258#issuecomment-2016885995

   Got it. Thanks @CurtHagenlocher 


Re: [I] [C#] Parquet to Arrow conversion [arrow]

Posted by "piyushdubey (via GitHub)" <gi...@apache.org>.
piyushdubey commented on issue #40258:
URL: https://github.com/apache/arrow/issues/40258#issuecomment-2016886238

   > As an alternative to Parquet.Net, you could use ParquetSharp, which wraps the Arrow C++ Parquet library and has built-in support for reading Parquet files as Arrow record batches: https://github.com/G-Research/ParquetSharp/blob/master/docs/Arrow.md
   > 
   > Disclaimer: I'm a maintainer of ParquetSharp
   > 
   > But otherwise your algorithm seems sensible. I'd only suggest you might want one RecordBatch per row group rather than per file, in case your files contain many row groups and could be large.
   
   Thanks @adamreeve. It looks like I will eventually need to move to ParquetSharp, since Arrow conversion with other Parquet parsers is pretty tedious.

