Posted to issues@arrow.apache.org by "XinyuZeng (via GitHub)" <gi...@apache.org> on 2023/03/09 03:50:20 UTC

[GitHub] [arrow] XinyuZeng opened a new issue, #34509: [C++][Parquet] batch size in ArrowReaderProperties does not affect ReadTable API

XinyuZeng opened a new issue, #34509:
URL: https://github.com/apache/arrow/issues/34509

   ### Describe the enhancement requested
   
   `set_batch_size` in `ArrowReaderProperties` can be configured when building Parquet's `FileReader`. However, this option only affects `RecordBatchReader`. The `ReadTable` API still produces a contiguous table with a single batch, so `batch_size` has no effect on `ReadTable`. (I originally thought it was the batch size used during the Parquet scan.) The documentation should probably explain this in more detail to avoid confusion.
   
   ### Component(s)
   
   Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on issue #34509: [C++][Parquet] batch size in ArrowReaderProperties does not affect ReadTable API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #34509:
URL: https://github.com/apache/arrow/issues/34509#issuecomment-1620520424

   ```C++
     /// Set number of records to read per batch for the RecordBatchReader.
     virtual void set_batch_size(int64_t batch_size) = 0;
   ```
   
   @XinyuZeng Hi, after checking the code: the doc comment on `set_batch_size` says it applies to `RecordBatchReader`, while `ReadTable` and `ReadColumn` read all the data in a file or in some specific columns. So I think this behavior is expected. Would you mind checking that?




[GitHub] [arrow] lidavidm commented on issue #34509: [C++][Parquet] batch size in ArrowReaderProperties does not affect ReadTable API

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on issue #34509:
URL: https://github.com/apache/arrow/issues/34509#issuecomment-1623620458

   I think it was just the most straightforward way to implement it. I remember looking briefly into the underlying readers and concluding that we would have to manage the chunking ourselves in that case.




[GitHub] [arrow] XinyuZeng commented on issue #34509: [C++][Parquet] batch size in ArrowReaderProperties does not affect ReadTable API

Posted by "XinyuZeng (via GitHub)" <gi...@apache.org>.
XinyuZeng commented on issue #34509:
URL: https://github.com/apache/arrow/issues/34509#issuecomment-1621177935

   What confused me before is this: [class ArrowReaderProperties](https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet21ArrowReaderPropertiesE) is described as "Properties for configuring FileReader behavior." But [set_batch_size](https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet21ArrowReaderProperties14set_batch_sizeE7int64_t) is useless for FileReader. Adding a few lines to the docs may help users who only want to use the FileReader understand these options.




[GitHub] [arrow] pitrou commented on issue #34509: [C++][Parquet] batch size in ArrowReaderProperties does not affect ReadTable API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #34509:
URL: https://github.com/apache/arrow/issues/34509#issuecomment-1623615380

   The silly thing is that `FileReaderImpl::GetRecordBatchGenerator` seems to read one table per row group and then slice that table into batch-sized record batches. @lidavidm Do you remember why that is?
   ```c++
       const int64_t batch_size = self->properties().batch_size();
       return self->DecodeRowGroups(self, {row_group}, column_indices, cpu_executor)
           .Then([batch_size](const std::shared_ptr<Table>& table)
                     -> ::arrow::Result<RecordBatchGenerator> {
             ::arrow::TableBatchReader table_reader(*table);
             table_reader.set_chunksize(batch_size);
             ARROW_ASSIGN_OR_RAISE(auto batches, table_reader.ToRecordBatches());
             return ::arrow::MakeVectorGenerator(std::move(batches));
           });
   ```
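
   The slicing behavior described above can be modeled without Arrow at all. The following is a minimal sketch using only the standard library, not the actual implementation: `RowRange`, `SliceRowGroup`, and `SliceFile` are hypothetical names introduced for illustration, and a row group is reduced to its row count. It assumes, as in the snippet above, that each row group is decoded in full and then cut into pieces of at most `batch_size` rows.

   ```c++
   #include <algorithm>
   #include <cassert>
   #include <cstdint>
   #include <vector>

   // Hypothetical stand-in for a record batch: only its position and size matter here.
   struct RowRange {
     int64_t offset;  // index of the first row, counted from the start of the file
     int64_t length;  // number of rows in this batch
   };

   // Cut one fully decoded row group into ranges of at most `batch_size` rows.
   std::vector<RowRange> SliceRowGroup(int64_t row_group_offset,
                                       int64_t row_group_rows,
                                       int64_t batch_size) {
     assert(batch_size > 0);
     std::vector<RowRange> batches;
     for (int64_t done = 0; done < row_group_rows; done += batch_size) {
       batches.push_back(
           {row_group_offset + done, std::min(batch_size, row_group_rows - done)});
     }
     return batches;
   }

   // Slice every row group independently, so in this model no batch spans
   // two row groups and the last batch of a row group may be short.
   std::vector<RowRange> SliceFile(const std::vector<int64_t>& row_group_sizes,
                                   int64_t batch_size) {
     std::vector<RowRange> all;
     int64_t offset = 0;
     for (int64_t rows : row_group_sizes) {
       std::vector<RowRange> part = SliceRowGroup(offset, rows, batch_size);
       all.insert(all.end(), part.begin(), part.end());
       offset += rows;
     }
     return all;
   }
   ```

   For example, `SliceFile({10, 5}, 4)` yields five ranges with lengths 4, 4, 2, 4, 1: the third range is short because the first row group ends there. In this model the batch size only bounds the size of the output pieces; each row group is still decoded in full first.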




[GitHub] [arrow] mapleFU commented on issue #34509: [C++][Parquet] batch size in ArrowReaderProperties does not affect ReadTable API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #34509:
URL: https://github.com/apache/arrow/issues/34509#issuecomment-1621180979

   Hmm, I see what you mean. Let me give it a try.




[GitHub] [arrow] mapleFU commented on issue #34509: [C++][Parquet] batch size in ArrowReaderProperties does not affect ReadTable API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #34509:
URL: https://github.com/apache/arrow/issues/34509#issuecomment-1620516170

   I believe `batch_size()` is only used by the interfaces that Acero might use. For Parquet internals, or even for APIs such as `ReadTable`, it has no effect.




[GitHub] [arrow] pitrou closed issue #34509: [C++][Parquet] batch size in ArrowReaderProperties does not affect ReadTable API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou closed issue #34509: [C++][Parquet] batch size in ArrowReaderProperties does not affect ReadTable API
URL: https://github.com/apache/arrow/issues/34509

