Posted to issues@arrow.apache.org by "simonthum (via GitHub)" <gi...@apache.org> on 2023/04/29 23:34:47 UTC

[GitHub] [arrow] simonthum opened a new issue, #35371: How to read an Arrow IPC with multiple record batches in C#

simonthum opened a new issue, #35371:
URL: https://github.com/apache/arrow/issues/35371

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I have a large-ish file with hundreds of record batches. I can read them one by one successfully.
   
   However, I would like to create an ML.NET DataFrame, which I suppose means I should first join the record batches into a single large one. I tried this:
   
   ````
   var rbBuilder = new RecordBatch.Builder(allocator);
   
   using (var stream = File.OpenRead("p2_uc.arrow"))
   using (var reader = new ArrowFileReader(stream, allocator))
   {
       RecordBatch recordBatch;
       while ((recordBatch = await reader.ReadNextRecordBatchAsync()) != null)
       {
           rbBuilder.Append(recordBatch);
       }
   }
   
   var df = DataFrame.FromArrowRecordBatch(rbBuilder.Build());
   ````
   
   However, I get an exception because the builder ends up with far too many fields. I suppose RecordBatch.Builder.Append is not intended for that job.
   
   I have not found any example of how to read an Arrow IPC file as a DataFrame. Is that supported at all?
   
   ### Component(s)
   
   C#


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] How to read an Arrow IPC with multiple record batches in C# [arrow]

Posted by "CurtHagenlocher (via GitHub)" <gi...@apache.org>.
CurtHagenlocher commented on issue #35371:
URL: https://github.com/apache/arrow/issues/35371#issuecomment-1849013636

   ArrowArrayConcatenator has now been made public, removing the need to call it via reflection. The in-memory layout of a collection of record batches is different from the in-memory layout of a single record batch, so any changes to make DataFrame work differently would need to happen on the DataFrame side.
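
   To illustrate the per-column concatenation discussed in this thread, here is a minimal sketch, not an official recipe. `RecordBatchConcat` and `ConcatenateRows` are hypothetical names introduced only for this example. The sketch assumes that all batches share the schema of the first one and that `ArrowArrayConcatenator.Concatenate` accepts a read-only list of arrays and uses a default allocator when none is passed.

   ```csharp
   using System;
   using System.Collections.Generic;
   using System.Linq;
   using Apache.Arrow;

   // Hypothetical helper, not part of Apache.Arrow.
   public static class RecordBatchConcat
   {
       // Stacks the rows of same-schema batches into one longer batch by
       // concatenating each column's chunks.
       public static RecordBatch ConcatenateRows(IReadOnlyList<RecordBatch> batches)
       {
           if (batches.Count == 0)
               throw new ArgumentException("At least one record batch is required.", nameof(batches));

           // Assumption: every batch has the same schema as the first one.
           Schema schema = batches[0].Schema;
           var columns = new List<IArrowArray>(schema.FieldsList.Count);

           for (int i = 0; i < schema.FieldsList.Count; i++)
           {
               int columnIndex = i;
               List<IArrowArray> chunks = batches.Select(b => b.Column(columnIndex)).ToList();

               // Builds one contiguous array from the chunks of this column.
               columns.Add(ArrowArrayConcatenator.Concatenate(chunks));
           }

           int totalRows = batches.Sum(b => b.Length);
           return new RecordBatch(schema, columns, totalRows);
       }
   }
   ```

   The batches collected in the `ReadNextRecordBatchAsync` loop from the original question could be added to a list and passed to `ConcatenateRows`, with the result handed to `DataFrame.FromArrowRecordBatch`. Because this builds new arrays rather than referencing the existing ones, it remains a workaround and does not address the memory-mapped use case raised in this thread.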


Re: [I] How to read an Arrow IPC with multiple record batches in C# [arrow]

Posted by "simonthum (via GitHub)" <gi...@apache.org>.
simonthum commented on issue #35371:
URL: https://github.com/apache/arrow/issues/35371#issuecomment-1858785303

   Hi, this is good to know, but I was actually looking for a more efficient way to work with many record batches. Is there another ticket for that?


[GitHub] [arrow] simonthum commented on issue #35371: How to read an Arrow IPC with multiple record batches in C#

Posted by "simonthum (via GitHub)" <gi...@apache.org>.
simonthum commented on issue #35371:
URL: https://github.com/apache/arrow/issues/35371#issuecomment-1529497092

   I went with a reflection-based call to ArrowArrayConcatenator for each column. It takes about four times as long as loading the record batches. I am still looking for an efficient method.


[GitHub] [arrow] simonthum commented on issue #35371: How to read an Arrow IPC with multiple record batches in C#

Posted by "simonthum (via GitHub)" <gi...@apache.org>.
simonthum commented on issue #35371:
URL: https://github.com/apache/arrow/issues/35371#issuecomment-1575616830

   Actually, the concatenation worked well with ArrowArrayConcatenator, but I consider it a workaround. If memory-mapped access is to work as (I think) intended, it should be possible to construct a DataFrame from a collection of RecordBatches. Or am I on the wrong track?
   
   So it might be that MDA.DataFrame (Microsoft.Data.Analysis.DataFrame) needs the enhancement, not this library.


[GitHub] [arrow] westonpace commented on issue #35371: How to read an Arrow IPC with multiple record batches in C#

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35371:
URL: https://github.com/apache/arrow/issues/35371#issuecomment-1538573072

   `RecordBatch.Builder.Append(RecordBatch)` doesn't work the way I think you are expecting. Instead of combining the rows from the two batches (making a longer record batch), it combines the columns from the two batches (making a wider record batch).
   
   Today there is no class that performs the concatenation you are describing, though it seems a reasonable enhancement request.


Re: [I] How to read an Arrow IPC with multiple record batches in C# [arrow]

Posted by "CurtHagenlocher (via GitHub)" <gi...@apache.org>.
CurtHagenlocher closed issue #35371: How to read an Arrow IPC with multiple record batches in C#
URL: https://github.com/apache/arrow/issues/35371

