You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/10 18:27:41 UTC

[GitHub] [arrow] bkietz opened a new pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

bkietz opened a new pull request #7704:
URL: https://github.com/apache/arrow/pull/7704


   Previously, if any field reader for a row group yielded multiple chunks for a single row group an error would be raised.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453150052



##########
File path: cpp/src/parquet/arrow/reader.cc
##########
@@ -781,6 +770,9 @@ Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_group_in
                                             const std::vector<int>& column_indices,
                                             std::unique_ptr<RecordBatchReader>* out) {
   // column indices check
+  ARROW_ASSIGN_OR_RAISE(std::ignore, manifest_.GetFieldIndices(column_indices));

Review comment:
       it would be nice if RETURN_NOT_OK could be used here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453337740



##########
File path: cpp/src/parquet/arrow/arrow_schema_test.cc
##########
@@ -644,6 +649,78 @@ TEST_F(TestConvertParquetSchema, ParquetRepeatedNestedSchema) {
   ASSERT_NO_FATAL_FAILURE(CheckFlatSchema(arrow_schema));
 }
 
+TEST_F(TestConvertParquetSchema, ColumnSubselection) {
+  std::vector<NodePtr> parquet_fields;
+  std::vector<std::shared_ptr<Field>> arrow_fields;
+  {
+    //   optional int32 leaf1;
+    //   repeated group outerGroup {
+    //     optional int32 leaf2;
+    //     repeated group innerGroup {
+    //       optional int32 leaf3;
+    //     }
+    //   }
+    parquet_fields.push_back(

Review comment:
       On consideration, I think I'll scrap the new overload of `FromParquetSchema`. It's straightforward to retrieve the specific fields by index from a "whole" arrow schema and doesn't currently warrant a dedicated utility.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453149446



##########
File path: cpp/src/parquet/arrow/reader.cc
##########
@@ -301,62 +302,50 @@ class FileReaderImpl : public FileReader {
 
 class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
  public:
-  RowGroupRecordBatchReader(std::vector<std::unique_ptr<ColumnReaderImpl>> field_readers,
-                            std::shared_ptr<::arrow::Schema> schema, int64_t batch_size)
-      : field_readers_(std::move(field_readers)),
-        schema_(std::move(schema)),
-        batch_size_(batch_size) {}
-
   ~RowGroupRecordBatchReader() override {}
 
   std::shared_ptr<::arrow::Schema> schema() const override { return schema_; }
 
-  static Status Make(const std::vector<int>& row_groups,
-                     const std::vector<int>& column_indices, FileReaderImpl* reader,
-                     int64_t batch_size,
-                     std::unique_ptr<::arrow::RecordBatchReader>* out) {
-    std::vector<int> field_indices;
-    if (!reader->manifest_.GetFieldIndices(column_indices, &field_indices)) {
-      return Status::Invalid("Invalid column index");
-    }
+  static ::arrow::Result<std::unique_ptr<RowGroupRecordBatchReader>> Make(
+      const std::vector<int>& row_group_indices, const std::vector<int>& column_indices,

Review comment:
       not present before but adding some docs here might be useful.  especially around lifecyle of reader.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

bkietz commented on pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#issuecomment-657306850


   Thanks @wesm, @emkornfield!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453238293



##########
File path: cpp/src/parquet/arrow/schema.h
##########
@@ -163,24 +165,28 @@ struct PARQUET_EXPORT SchemaManifest {
     return it->second;
   }
 
-  bool GetFieldIndices(const std::vector<int>& column_indices, std::vector<int>* out) {
+  ::arrow::Result<std::vector<int>> GetFieldIndices(
+      const std::vector<int>& column_indices) {
     // Coalesce a list of schema field indices which are the roots of the
     // columns referred to by a list of column indices
     const schema::GroupNode* group = descr->group_node();
     std::unordered_set<int> already_added;
-    out->clear();
-    for (auto& column_idx : column_indices) {
+
+    std::vector<int> out;
+    for (int column_idx : column_indices) {
+      if (column_idx < 0 || column_idx >= descr->num_columns()) {
+        return ::arrow::Status::IndexError("Column index ", column_idx, " is not valid");
+      }
       auto field_node = descr->GetColumnRoot(column_idx);
       auto field_idx = group->FieldIndex(*field_node);
       if (field_idx < 0) {

Review comment:
       https://github.com/apache/arrow/pull/7704#issuecomment-657119271
   In this case the passed data is `column_indices`, which *are* checked. The field index is looked up from the group using a node pulled from that same group, so it seems redundant to check it again. I intended the assertion as a test for that intuition. If there is a way for `field_idx == -1` to occur then I agree that we should raise an exception rather than abort, but otherwise shouldn't the redundant check be removed?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453580484



##########
File path: cpp/src/parquet/arrow/reader.cc
##########
@@ -780,11 +741,29 @@ Status GetReader(const SchemaField& field, const std::shared_ptr<ReaderContext>&
 Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_group_indices,
                                             const std::vector<int>& column_indices,
                                             std::unique_ptr<RecordBatchReader>* out) {
-  // column indices check
-  for (auto row_group_index : row_group_indices) {
+  // row group indices check
+  for (int row_group_index : row_group_indices) {
     RETURN_NOT_OK(BoundsCheckRowGroup(row_group_index));
   }
 
+  // column indices check
+  ARROW_ASSIGN_OR_RAISE(std::vector<int> field_indices,
+                        manifest_.GetFieldIndices(column_indices));
+
+  std::shared_ptr<::arrow::Schema> batch_schema;
+  RETURN_NOT_OK(GetSchema(&batch_schema));
+
+  // filter to only arrow::Fields which contain the selected physical columns
+  {
+    ::arrow::FieldVector selected_fields;
+
+    for (int field_idx : field_indices) {
+      selected_fields.push_back(batch_schema->field(field_idx));
+    }
+
+    batch_schema = ::arrow::schema(std::move(selected_fields));

Review comment:
       should this preserve the metadata of the original `batch_schema`? (I don't know where the `batch_schema` is further used, though)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453149894



##########
File path: cpp/src/parquet/arrow/reader.cc
##########
@@ -301,62 +302,50 @@ class FileReaderImpl : public FileReader {
 
 class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
  public:
-  RowGroupRecordBatchReader(std::vector<std::unique_ptr<ColumnReaderImpl>> field_readers,
-                            std::shared_ptr<::arrow::Schema> schema, int64_t batch_size)
-      : field_readers_(std::move(field_readers)),
-        schema_(std::move(schema)),
-        batch_size_(batch_size) {}
-
   ~RowGroupRecordBatchReader() override {}
 
   std::shared_ptr<::arrow::Schema> schema() const override { return schema_; }
 
-  static Status Make(const std::vector<int>& row_groups,
-                     const std::vector<int>& column_indices, FileReaderImpl* reader,
-                     int64_t batch_size,
-                     std::unique_ptr<::arrow::RecordBatchReader>* out) {
-    std::vector<int> field_indices;
-    if (!reader->manifest_.GetFieldIndices(column_indices, &field_indices)) {
-      return Status::Invalid("Invalid column index");
-    }
+  static ::arrow::Result<std::unique_ptr<RowGroupRecordBatchReader>> Make(
+      const std::vector<int>& row_group_indices, const std::vector<int>& column_indices,
+      int64_t batch_size, FileReaderImpl* reader) {
+    std::unique_ptr<RowGroupRecordBatchReader> out(new RowGroupRecordBatchReader);
 
-    std::vector<std::unique_ptr<ColumnReaderImpl>> field_readers(field_indices.size());
-    std::vector<std::shared_ptr<Field>> fields;
+    RETURN_NOT_OK(FromParquetSchema(
+        reader->parquet_reader()->metadata()->schema(), default_arrow_reader_properties(),
+        /*key_value_metadata=*/nullptr, column_indices, &out->schema_));
 
-    auto included_leaves = VectorToSharedSet(column_indices);
-    for (size_t i = 0; i < field_indices.size(); ++i) {
-      RETURN_NOT_OK(reader->GetFieldReader(field_indices[i], included_leaves, row_groups,
-                                           &field_readers[i]));
-      fields.push_back(field_readers[i]->field());
-    }
-    out->reset(new RowGroupRecordBatchReader(std::move(field_readers),
-                                             ::arrow::schema(fields), batch_size));
-    return Status::OK();
+    using ::arrow::RecordBatchIterator;
+
+    auto row_group_index_to_batch_iterator =
+        [=](const int* i) -> ::arrow::Result<RecordBatchIterator> {

Review comment:
       If it is intentional (it looks like this escapes the scope) I would prefer to be explicit about the values being captured.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453241420



##########
File path: cpp/src/parquet/arrow/arrow_schema_test.cc
##########
@@ -644,6 +649,78 @@ TEST_F(TestConvertParquetSchema, ParquetRepeatedNestedSchema) {
   ASSERT_NO_FATAL_FAILURE(CheckFlatSchema(arrow_schema));
 }
 
+TEST_F(TestConvertParquetSchema, ColumnSubselection) {
+  std::vector<NodePtr> parquet_fields;
+  std::vector<std::shared_ptr<Field>> arrow_fields;
+  {
+    //   optional int32 leaf1;
+    //   repeated group outerGroup {
+    //     optional int32 leaf2;
+    //     repeated group innerGroup {
+    //       optional int32 leaf3;
+    //     }
+    //   }
+    parquet_fields.push_back(
+        PrimitiveNode::Make("leaf1", Repetition::OPTIONAL, ParquetType::INT32));
+    parquet_fields.push_back(GroupNode::Make(
+        "outerGroup", Repetition::REPEATED,
+        {PrimitiveNode::Make("leaf2", Repetition::OPTIONAL, ParquetType::INT32),
+         GroupNode::Make(
+             "innerGroup", Repetition::REPEATED,
+             {PrimitiveNode::Make("leaf3", Repetition::OPTIONAL, ParquetType::INT32)})}));
+
+    auto inner_group_fields = {::arrow::field("leaf3", INT32, true)};

Review comment:
       Re JSON util: I'm not convinced that would be significantly more concise than our (de-facto) DSL with `schema`, `field`, and friends. Here's what I'm thinking of:
   
   ```json
   {
     "leaf1": "int32",
     "outerGroup": {
       "nested": "struct",
       "fields": [{
         "leaf2": "int32",
         "innerGroup": {
           "nested": "list not null",
           "fields": [{
             "inner-erGroup": {
               "nested": "struct not null",
               "fields": {
                 "leaf3": "int32"
               }
             }
           }]
         }
       }]
     }
   }
   ```
   
   vs
   
   ```c++
   schema({
     field("leaf1", int32()),
     field("outerGroup", struct_({
       field("leaf2", int32()),
       field("innerGroup", list(
         field("inner-erGroup", struct_({
           field("leaf3", int32())
         }), false)
       ), false),
     }))
   })
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453149053



##########
File path: cpp/src/parquet/arrow/arrow_schema_test.cc
##########
@@ -644,6 +649,78 @@ TEST_F(TestConvertParquetSchema, ParquetRepeatedNestedSchema) {
   ASSERT_NO_FATAL_FAILURE(CheckFlatSchema(arrow_schema));
 }
 
+TEST_F(TestConvertParquetSchema, ColumnSubselection) {
+  std::vector<NodePtr> parquet_fields;
+  std::vector<std::shared_ptr<Field>> arrow_fields;
+  {
+    //   optional int32 leaf1;
+    //   repeated group outerGroup {
+    //     optional int32 leaf2;
+    //     repeated group innerGroup {
+    //       optional int32 leaf3;
+    //     }
+    //   }
+    parquet_fields.push_back(
+        PrimitiveNode::Make("leaf1", Repetition::OPTIONAL, ParquetType::INT32));
+    parquet_fields.push_back(GroupNode::Make(
+        "outerGroup", Repetition::REPEATED,
+        {PrimitiveNode::Make("leaf2", Repetition::OPTIONAL, ParquetType::INT32),
+         GroupNode::Make(
+             "innerGroup", Repetition::REPEATED,
+             {PrimitiveNode::Make("leaf3", Repetition::OPTIONAL, ParquetType::INT32)})}));
+
+    auto inner_group_fields = {::arrow::field("leaf3", INT32, true)};

Review comment:
       please spell out types in some form where you are using bracket initializations.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] wesm closed pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

wesm closed pull request #7704:
URL: https://github.com/apache/arrow/pull/7704


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453344524



##########
File path: cpp/src/parquet/arrow/arrow_schema_test.cc
##########
@@ -644,6 +649,78 @@ TEST_F(TestConvertParquetSchema, ParquetRepeatedNestedSchema) {
   ASSERT_NO_FATAL_FAILURE(CheckFlatSchema(arrow_schema));
 }
 
+TEST_F(TestConvertParquetSchema, ColumnSubselection) {
+  std::vector<NodePtr> parquet_fields;
+  std::vector<std::shared_ptr<Field>> arrow_fields;
+  {
+    //   optional int32 leaf1;
+    //   repeated group outerGroup {
+    //     optional int32 leaf2;
+    //     repeated group innerGroup {
+    //       optional int32 leaf3;
+    //     }
+    //   }
+    parquet_fields.push_back(
+        PrimitiveNode::Make("leaf1", Repetition::OPTIONAL, ParquetType::INT32));
+    parquet_fields.push_back(GroupNode::Make(
+        "outerGroup", Repetition::REPEATED,
+        {PrimitiveNode::Make("leaf2", Repetition::OPTIONAL, ParquetType::INT32),
+         GroupNode::Make(
+             "innerGroup", Repetition::REPEATED,
+             {PrimitiveNode::Make("leaf3", Repetition::OPTIONAL, ParquetType::INT32)})}));
+
+    auto inner_group_fields = {::arrow::field("leaf3", INT32, true)};

Review comment:
       https://issues.apache.org/jira/browse/ARROW-9422




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] wesm commented on pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

wesm commented on pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#issuecomment-657279970


   I made some comments in https://github.com/apache/arrow/pull/7534 that may have been intended for this patch


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453149152



##########
File path: cpp/src/parquet/arrow/arrow_schema_test.cc
##########
@@ -644,6 +649,78 @@ TEST_F(TestConvertParquetSchema, ParquetRepeatedNestedSchema) {
   ASSERT_NO_FATAL_FAILURE(CheckFlatSchema(arrow_schema));
 }
 
+TEST_F(TestConvertParquetSchema, ColumnSubselection) {
+  std::vector<NodePtr> parquet_fields;
+  std::vector<std::shared_ptr<Field>> arrow_fields;
+  {
+    //   optional int32 leaf1;
+    //   repeated group outerGroup {
+    //     optional int32 leaf2;
+    //     repeated group innerGroup {
+    //       optional int32 leaf3;
+    //     }
+    //   }
+    parquet_fields.push_back(
+        PrimitiveNode::Make("leaf1", Repetition::OPTIONAL, ParquetType::INT32));
+    parquet_fields.push_back(GroupNode::Make(
+        "outerGroup", Repetition::REPEATED,
+        {PrimitiveNode::Make("leaf2", Repetition::OPTIONAL, ParquetType::INT32),
+         GroupNode::Make(
+             "innerGroup", Repetition::REPEATED,
+             {PrimitiveNode::Make("leaf3", Repetition::OPTIONAL, ParquetType::INT32)})}));
+
+    auto inner_group_fields = {::arrow::field("leaf3", INT32, true)};

Review comment:
       P.S. we really need a JSON utility to construct a schema from json., it would make this much easier to read.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453238293



##########
File path: cpp/src/parquet/arrow/schema.h
##########
@@ -163,24 +165,28 @@ struct PARQUET_EXPORT SchemaManifest {
     return it->second;
   }
 
-  bool GetFieldIndices(const std::vector<int>& column_indices, std::vector<int>* out) {
+  ::arrow::Result<std::vector<int>> GetFieldIndices(
+      const std::vector<int>& column_indices) {
     // Coalesce a list of schema field indices which are the roots of the
     // columns referred to by a list of column indices
     const schema::GroupNode* group = descr->group_node();
     std::unordered_set<int> already_added;
-    out->clear();
-    for (auto& column_idx : column_indices) {
+
+    std::vector<int> out;
+    for (int column_idx : column_indices) {
+      if (column_idx < 0 || column_idx >= descr->num_columns()) {
+        return ::arrow::Status::IndexError("Column index ", column_idx, " is not valid");
+      }
       auto field_node = descr->GetColumnRoot(column_idx);
       auto field_idx = group->FieldIndex(*field_node);
       if (field_idx < 0) {

Review comment:
       https://github.com/apache/arrow/pull/7704#issuecomment-657119271
   > I prefer checks to assertions for 'in the wild' errors.  Especially around code that is acting on passed data.
   
   In this case the passed data is `column_indices`, which *are* checked. The field index is looked up from the group using a node pulled from that same group, so it seems redundant to check it again. I intended the assertion as a test for that intuition. If there is a way for `field_idx == -1` to occur then I agree that we should raise an exception rather than abort, but otherwise shouldn't the redundant check be removed?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453239911



##########
File path: cpp/src/parquet/arrow/arrow_schema_test.cc
##########
@@ -644,6 +649,78 @@ TEST_F(TestConvertParquetSchema, ParquetRepeatedNestedSchema) {
   ASSERT_NO_FATAL_FAILURE(CheckFlatSchema(arrow_schema));
 }
 
+TEST_F(TestConvertParquetSchema, ColumnSubselection) {
+  std::vector<NodePtr> parquet_fields;
+  std::vector<std::shared_ptr<Field>> arrow_fields;
+  {
+    //   optional int32 leaf1;
+    //   repeated group outerGroup {
+    //     optional int32 leaf2;
+    //     repeated group innerGroup {
+    //       optional int32 leaf3;
+    //     }
+    //   }
+    parquet_fields.push_back(
+        PrimitiveNode::Make("leaf1", Repetition::OPTIONAL, ParquetType::INT32));
+    parquet_fields.push_back(GroupNode::Make(
+        "outerGroup", Repetition::REPEATED,
+        {PrimitiveNode::Make("leaf2", Repetition::OPTIONAL, ParquetType::INT32),
+         GroupNode::Make(
+             "innerGroup", Repetition::REPEATED,
+             {PrimitiveNode::Make("leaf3", Repetition::OPTIONAL, ParquetType::INT32)})}));
+
+    auto inner_group_fields = {::arrow::field("leaf3", INT32, true)};
+    auto inner_group_type = ::arrow::struct_(inner_group_fields);
+    auto outer_group_fields = {
+        ::arrow::field("leaf2", INT32, true),
+        ::arrow::field(
+            "innerGroup",
+            ::arrow::list(::arrow::field("innerGroup", inner_group_type, false)), false)};
+    auto outer_group_type = ::arrow::struct_(outer_group_fields);
+
+    arrow_fields.push_back(::arrow::field("leaf1", INT32, true));
+    arrow_fields.push_back(::arrow::field(
+        "outerGroup",
+        ::arrow::list(::arrow::field("outerGroup", outer_group_type, false)), false));
+  }
+  std::shared_ptr<::arrow::Schema> arrow_schema;
+  std::vector<int> column_indices;
+
+  column_indices = {};
+  ASSERT_OK(
+      ConvertSchema(parquet_fields, /*key_value_metadata=*/nullptr, &column_indices));
+  arrow_schema = ::arrow::schema({});
+  ASSERT_NO_FATAL_FAILURE(CheckFlatSchema(arrow_schema));
+
+  column_indices = {0, 1, 2};
+  ASSERT_OK(
+      ConvertSchema(parquet_fields, /*key_value_metadata=*/nullptr, &column_indices));
+  arrow_schema = ::arrow::schema(arrow_fields);
+  ASSERT_NO_FATAL_FAILURE(CheckFlatSchema(arrow_schema));
+
+  column_indices = {0};
+  ASSERT_OK(
+      ConvertSchema(parquet_fields, /*key_value_metadata=*/nullptr, &column_indices));
+  arrow_schema = ::arrow::schema({arrow_fields[0]});
+  ASSERT_NO_FATAL_FAILURE(CheckFlatSchema(arrow_schema));
+
+  column_indices = {1};
+  ASSERT_OK(
+      ConvertSchema(parquet_fields, /*key_value_metadata=*/nullptr, &column_indices));
+  arrow_schema = ::arrow::schema({arrow_fields[1]});
+  ASSERT_NO_FATAL_FAILURE(CheckFlatSchema(arrow_schema));
+
+  column_indices = {2};

Review comment:
       will do




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453341545



##########
File path: cpp/src/parquet/arrow/schema.h
##########
@@ -163,24 +165,28 @@ struct PARQUET_EXPORT SchemaManifest {
     return it->second;
   }
 
-  bool GetFieldIndices(const std::vector<int>& column_indices, std::vector<int>* out) {
+  ::arrow::Result<std::vector<int>> GetFieldIndices(
+      const std::vector<int>& column_indices) {
     // Coalesce a list of schema field indices which are the roots of the
     // columns referred to by a list of column indices
     const schema::GroupNode* group = descr->group_node();
     std::unordered_set<int> already_added;
-    out->clear();
-    for (auto& column_idx : column_indices) {
+
+    std::vector<int> out;
+    for (int column_idx : column_indices) {
+      if (column_idx < 0 || column_idx >= descr->num_columns()) {
+        return ::arrow::Status::IndexError("Column index ", column_idx, " is not valid");
+      }
       auto field_node = descr->GetColumnRoot(column_idx);
       auto field_idx = group->FieldIndex(*field_node);
       if (field_idx < 0) {

Review comment:
       https://issues.apache.org/jira/browse/ARROW-9421




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453150926



##########
File path: cpp/src/parquet/arrow/schema.h
##########
@@ -163,24 +165,28 @@ struct PARQUET_EXPORT SchemaManifest {
     return it->second;
   }
 
-  bool GetFieldIndices(const std::vector<int>& column_indices, std::vector<int>* out) {
+  ::arrow::Result<std::vector<int>> GetFieldIndices(
+      const std::vector<int>& column_indices) {
     // Coalesce a list of schema field indices which are the roots of the
     // columns referred to by a list of column indices
     const schema::GroupNode* group = descr->group_node();
     std::unordered_set<int> already_added;
-    out->clear();
-    for (auto& column_idx : column_indices) {
+
+    std::vector<int> out;
+    for (int column_idx : column_indices) {
+      if (column_idx < 0 || column_idx >= descr->num_columns()) {
+        return ::arrow::Status::IndexError("Column index ", column_idx, " is not valid");
+      }
       auto field_node = descr->GetColumnRoot(column_idx);
       auto field_idx = group->FieldIndex(*field_node);
       if (field_idx < 0) {

Review comment:
       not your code, but is this really checking for equality to -1?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453223708



##########
File path: cpp/src/parquet/arrow/schema.h
##########
@@ -163,24 +165,28 @@ struct PARQUET_EXPORT SchemaManifest {
     return it->second;
   }
 
-  bool GetFieldIndices(const std::vector<int>& column_indices, std::vector<int>* out) {
+  ::arrow::Result<std::vector<int>> GetFieldIndices(
+      const std::vector<int>& column_indices) {
     // Coalesce a list of schema field indices which are the roots of the

Review comment:
       will do




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453148787



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -2877,11 +2877,22 @@ TEST(TestArrowReaderAdHoc, LARGE_MEMORY_TEST(LargeStringColumn)) {
   array.reset();
 
   auto reader = ParquetFileReader::Open(std::make_shared<BufferReader>(tables_buffer));
-
   std::unique_ptr<FileReader> arrow_reader;
   ASSERT_OK(FileReader::Make(default_memory_pool(), std::move(reader), &arrow_reader));
   ASSERT_OK_NO_THROW(arrow_reader->ReadTable(&table));
   ASSERT_OK(table->ValidateFull());
+
+  // ARROW-9297: ensure RecordBatchReader also works
+  reader = ParquetFileReader::Open(std::make_shared<BufferReader>(tables_buffer));
+  ASSERT_OK(FileReader::Make(default_memory_pool(), std::move(reader), &arrow_reader));
+  std::shared_ptr<::arrow::RecordBatchReader> batch_reader;
+  auto all_row_groups = ::arrow::internal::Iota(reader->metadata()->num_row_groups());
+  ASSERT_OK_NO_THROW(arrow_reader->GetRecordBatchReader(all_row_groups, &batch_reader));
+  ASSERT_OK_AND_ASSIGN(auto batched_table,

Review comment:
       nm.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453149711



##########
File path: cpp/src/parquet/arrow/reader.cc
##########
@@ -301,62 +302,50 @@ class FileReaderImpl : public FileReader {
 
 class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
  public:
-  RowGroupRecordBatchReader(std::vector<std::unique_ptr<ColumnReaderImpl>> field_readers,
-                            std::shared_ptr<::arrow::Schema> schema, int64_t batch_size)
-      : field_readers_(std::move(field_readers)),
-        schema_(std::move(schema)),
-        batch_size_(batch_size) {}
-
   ~RowGroupRecordBatchReader() override {}
 
   std::shared_ptr<::arrow::Schema> schema() const override { return schema_; }
 
-  static Status Make(const std::vector<int>& row_groups,
-                     const std::vector<int>& column_indices, FileReaderImpl* reader,
-                     int64_t batch_size,
-                     std::unique_ptr<::arrow::RecordBatchReader>* out) {
-    std::vector<int> field_indices;
-    if (!reader->manifest_.GetFieldIndices(column_indices, &field_indices)) {
-      return Status::Invalid("Invalid column index");
-    }
+  static ::arrow::Result<std::unique_ptr<RowGroupRecordBatchReader>> Make(
+      const std::vector<int>& row_group_indices, const std::vector<int>& column_indices,
+      int64_t batch_size, FileReaderImpl* reader) {
+    std::unique_ptr<RowGroupRecordBatchReader> out(new RowGroupRecordBatchReader);
 
-    std::vector<std::unique_ptr<ColumnReaderImpl>> field_readers(field_indices.size());
-    std::vector<std::shared_ptr<Field>> fields;
+    RETURN_NOT_OK(FromParquetSchema(
+        reader->parquet_reader()->metadata()->schema(), default_arrow_reader_properties(),

Review comment:
       its probably not great that we don't propagate reader properties here?  I'm not sure I see if this was a problem in the old code as well?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453150104



##########
File path: cpp/src/parquet/arrow/reader.cc
##########
@@ -819,10 +812,7 @@ Status FileReaderImpl::ReadRowGroups(const std::vector<int>& row_groups,
 
   // We only need to read schema fields which have columns indicated
   // in the indices vector
-  std::vector<int> field_indices;
-  if (!manifest_.GetFieldIndices(indices, &field_indices)) {
-    return Status::Invalid("Invalid column index");
-  }
+  ARROW_ASSIGN_OR_RAISE(auto field_indices, manifest_.GetFieldIndices(indices));

Review comment:
       please spell out type.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453715480



##########
File path: cpp/src/parquet/arrow/reader.cc
##########
@@ -780,11 +741,29 @@ Status GetReader(const SchemaField& field, const std::shared_ptr<ReaderContext>&
 Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_group_indices,
                                             const std::vector<int>& column_indices,
                                             std::unique_ptr<RecordBatchReader>* out) {
-  // column indices check
-  for (auto row_group_index : row_group_indices) {
+  // row group indices check
+  for (int row_group_index : row_group_indices) {
     RETURN_NOT_OK(BoundsCheckRowGroup(row_group_index));
   }
 
+  // column indices check
+  ARROW_ASSIGN_OR_RAISE(std::vector<int> field_indices,
+                        manifest_.GetFieldIndices(column_indices));
+
+  std::shared_ptr<::arrow::Schema> batch_schema;
+  RETURN_NOT_OK(GetSchema(&batch_schema));
+
+  // filter to only arrow::Fields which contain the selected physical columns
+  {
+    ::arrow::FieldVector selected_fields;
+
+    for (int field_idx : field_indices) {
+      selected_fields.push_back(batch_schema->field(field_idx));
+    }
+
+    batch_schema = ::arrow::schema(std::move(selected_fields));

Review comment:
       It should. I'll fix this in #7534




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

bkietz commented on pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#issuecomment-657260443


   @emkornfield PTAL


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#issuecomment-656821476


   https://issues.apache.org/jira/browse/ARROW-9297


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] emkornfield commented on pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

emkornfield commented on pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#issuecomment-657119271


   I prefer checks to assertions for 'in the wild' errors.  Especially around
   code that is acting on passed data.
   
   On Saturday, July 11, 2020, Benjamin Kietzman <no...@github.com>
   wrote:
   
   > *@bkietz* commented on this pull request.
   > ------------------------------
   >
   > In cpp/src/parquet/arrow/schema.h
   > <https://github.com/apache/arrow/pull/7704#discussion_r453228008>:
   >
   > >      // Coalesce a list of schema field indices which are the roots of the
   >      // columns referred to by a list of column indices
   >      const schema::GroupNode* group = descr->group_node();
   >      std::unordered_set<int> already_added;
   > -    out->clear();
   > -    for (auto& column_idx : column_indices) {
   > +
   > +    std::vector<int> out;
   > +    for (int column_idx : column_indices) {
   > +      if (column_idx < 0 || column_idx >= descr->num_columns()) {
   > +        return ::arrow::Status::IndexError("Column index ", column_idx, " is not valid");
   > +      }
   >        auto field_node = descr->GetColumnRoot(column_idx);
   >        auto field_idx = group->FieldIndex(*field_node);
   >        if (field_idx < 0) {
   >
   > Worth noting: replacing this check with an assertion locally broke no
   > tests in C++ or python. I'll leave it as an assertion for now and see what
   > CI thinks
   >
   > —
   > You are receiving this because you commented.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/arrow/pull/7704#discussion_r453228008>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AEIKYDUCHLANLHU73PTVCEDR3C6MTANCNFSM4OW2OZ6A>
   > .
   >
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453580484



##########
File path: cpp/src/parquet/arrow/reader.cc
##########
@@ -780,11 +741,29 @@ Status GetReader(const SchemaField& field, const std::shared_ptr<ReaderContext>&
 Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_group_indices,
                                             const std::vector<int>& column_indices,
                                             std::unique_ptr<RecordBatchReader>* out) {
-  // column indices check
-  for (auto row_group_index : row_group_indices) {
+  // row group indices check
+  for (int row_group_index : row_group_indices) {
     RETURN_NOT_OK(BoundsCheckRowGroup(row_group_index));
   }
 
+  // column indices check
+  ARROW_ASSIGN_OR_RAISE(std::vector<int> field_indices,
+                        manifest_.GetFieldIndices(column_indices));
+
+  std::shared_ptr<::arrow::Schema> batch_schema;
+  RETURN_NOT_OK(GetSchema(&batch_schema));
+
+  // filter to only arrow::Fields which contain the selected physical columns
+  {
+    ::arrow::FieldVector selected_fields;
+
+    for (int field_idx : field_indices) {
+      selected_fields.push_back(batch_schema->field(field_idx));
+    }
+
+    batch_schema = ::arrow::schema(std::move(selected_fields));

Review comment:
       should this preserver the metadata of the original `batch_schema`? (I don't know where the `batch_schema` is further used, though)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453224951



##########
File path: cpp/src/parquet/arrow/schema.h
##########
@@ -163,24 +165,28 @@ struct PARQUET_EXPORT SchemaManifest {
     return it->second;
   }
 
-  bool GetFieldIndices(const std::vector<int>& column_indices, std::vector<int>* out) {
+  ::arrow::Result<std::vector<int>> GetFieldIndices(
+      const std::vector<int>& column_indices) {
     // Coalesce a list of schema field indices which are the roots of the
     // columns referred to by a list of column indices
     const schema::GroupNode* group = descr->group_node();
     std::unordered_set<int> already_added;
-    out->clear();
-    for (auto& column_idx : column_indices) {
+
+    std::vector<int> out;
+    for (int column_idx : column_indices) {
+      if (column_idx < 0 || column_idx >= descr->num_columns()) {
+        return ::arrow::Status::IndexError("Column index ", column_idx, " is not valid");
+      }
       auto field_node = descr->GetColumnRoot(column_idx);
       auto field_idx = group->FieldIndex(*field_node);
       if (field_idx < 0) {

Review comment:
       You're right that the only negative value which `FieldIndex` returns is -1. IIUC this is may be overly defensive, though; is there a situation where we might fail to find a `Node` within a `GroupNode` when we'd just retrieved it from that same `GroupNode`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453223663



##########
File path: cpp/src/parquet/arrow/reader.cc
##########
@@ -301,62 +302,50 @@ class FileReaderImpl : public FileReader {
 
 class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
  public:
-  RowGroupRecordBatchReader(std::vector<std::unique_ptr<ColumnReaderImpl>> field_readers,
-                            std::shared_ptr<::arrow::Schema> schema, int64_t batch_size)
-      : field_readers_(std::move(field_readers)),
-        schema_(std::move(schema)),
-        batch_size_(batch_size) {}
-
   ~RowGroupRecordBatchReader() override {}
 
   std::shared_ptr<::arrow::Schema> schema() const override { return schema_; }
 
-  static Status Make(const std::vector<int>& row_groups,
-                     const std::vector<int>& column_indices, FileReaderImpl* reader,
-                     int64_t batch_size,
-                     std::unique_ptr<::arrow::RecordBatchReader>* out) {
-    std::vector<int> field_indices;
-    if (!reader->manifest_.GetFieldIndices(column_indices, &field_indices)) {
-      return Status::Invalid("Invalid column index");
-    }
+  static ::arrow::Result<std::unique_ptr<RowGroupRecordBatchReader>> Make(
+      const std::vector<int>& row_group_indices, const std::vector<int>& column_indices,
+      int64_t batch_size, FileReaderImpl* reader) {
+    std::unique_ptr<RowGroupRecordBatchReader> out(new RowGroupRecordBatchReader);

Review comment:
       In this case the constructor isn't private, so I'll refactor to use `make_unique` here




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453227795



##########
File path: cpp/src/parquet/arrow/reader.cc
##########
@@ -781,6 +770,9 @@ Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_group_in
                                             const std::vector<int>& column_indices,
                                             std::unique_ptr<RecordBatchReader>* out) {
   // column indices check
+  ARROW_ASSIGN_OR_RAISE(std::ignore, manifest_.GetFieldIndices(column_indices));

Review comment:
       Just to clarify doesn't need to be for this pr




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453149597



##########
File path: cpp/src/parquet/arrow/reader.cc
##########
@@ -301,62 +302,50 @@ class FileReaderImpl : public FileReader {
 
 class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
  public:
-  RowGroupRecordBatchReader(std::vector<std::unique_ptr<ColumnReaderImpl>> field_readers,
-                            std::shared_ptr<::arrow::Schema> schema, int64_t batch_size)
-      : field_readers_(std::move(field_readers)),
-        schema_(std::move(schema)),
-        batch_size_(batch_size) {}
-
   ~RowGroupRecordBatchReader() override {}
 
   std::shared_ptr<::arrow::Schema> schema() const override { return schema_; }
 
-  static Status Make(const std::vector<int>& row_groups,
-                     const std::vector<int>& column_indices, FileReaderImpl* reader,
-                     int64_t batch_size,
-                     std::unique_ptr<::arrow::RecordBatchReader>* out) {
-    std::vector<int> field_indices;
-    if (!reader->manifest_.GetFieldIndices(column_indices, &field_indices)) {
-      return Status::Invalid("Invalid column index");
-    }
+  static ::arrow::Result<std::unique_ptr<RowGroupRecordBatchReader>> Make(
+      const std::vector<int>& row_group_indices, const std::vector<int>& column_indices,
+      int64_t batch_size, FileReaderImpl* reader) {
+    std::unique_ptr<RowGroupRecordBatchReader> out(new RowGroupRecordBatchReader);

Review comment:
       *rant* it annoys me that that make_unique can't access private constructors.
   nit: please construct using empty parens at the end.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] wesm commented on pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

wesm commented on pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#issuecomment-657280807


   +1. Merging this and will leave it to you to rebase the other PR


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453149320



##########
File path: cpp/src/parquet/arrow/arrow_schema_test.cc
##########
@@ -644,6 +649,78 @@ TEST_F(TestConvertParquetSchema, ParquetRepeatedNestedSchema) {
   ASSERT_NO_FATAL_FAILURE(CheckFlatSchema(arrow_schema));
 }
 
+TEST_F(TestConvertParquetSchema, ColumnSubselection) {
+  std::vector<NodePtr> parquet_fields;
+  std::vector<std::shared_ptr<Field>> arrow_fields;
+  {
+    //   optional int32 leaf1;
+    //   repeated group outerGroup {
+    //     optional int32 leaf2;
+    //     repeated group innerGroup {
+    //       optional int32 leaf3;
+    //     }
+    //   }
+    parquet_fields.push_back(
+        PrimitiveNode::Make("leaf1", Repetition::OPTIONAL, ParquetType::INT32));
+    parquet_fields.push_back(GroupNode::Make(
+        "outerGroup", Repetition::REPEATED,
+        {PrimitiveNode::Make("leaf2", Repetition::OPTIONAL, ParquetType::INT32),
+         GroupNode::Make(
+             "innerGroup", Repetition::REPEATED,
+             {PrimitiveNode::Make("leaf3", Repetition::OPTIONAL, ParquetType::INT32)})}));
+
+    auto inner_group_fields = {::arrow::field("leaf3", INT32, true)};
+    auto inner_group_type = ::arrow::struct_(inner_group_fields);
+    auto outer_group_fields = {
+        ::arrow::field("leaf2", INT32, true),
+        ::arrow::field(
+            "innerGroup",
+            ::arrow::list(::arrow::field("innerGroup", inner_group_type, false)), false)};
+    auto outer_group_type = ::arrow::struct_(outer_group_fields);
+
+    arrow_fields.push_back(::arrow::field("leaf1", INT32, true));
+    arrow_fields.push_back(::arrow::field(
+        "outerGroup",
+        ::arrow::list(::arrow::field("outerGroup", outer_group_type, false)), false));
+  }
+  std::shared_ptr<::arrow::Schema> arrow_schema;
+  std::vector<int> column_indices;
+
+  column_indices = {};
+  ASSERT_OK(
+      ConvertSchema(parquet_fields, /*key_value_metadata=*/nullptr, &column_indices));
+  arrow_schema = ::arrow::schema({});
+  ASSERT_NO_FATAL_FAILURE(CheckFlatSchema(arrow_schema));
+
+  column_indices = {0, 1, 2};
+  ASSERT_OK(
+      ConvertSchema(parquet_fields, /*key_value_metadata=*/nullptr, &column_indices));
+  arrow_schema = ::arrow::schema(arrow_fields);
+  ASSERT_NO_FATAL_FAILURE(CheckFlatSchema(arrow_schema));
+
+  column_indices = {0};
+  ASSERT_OK(
+      ConvertSchema(parquet_fields, /*key_value_metadata=*/nullptr, &column_indices));
+  arrow_schema = ::arrow::schema({arrow_fields[0]});
+  ASSERT_NO_FATAL_FAILURE(CheckFlatSchema(arrow_schema));
+
+  column_indices = {1};
+  ASSERT_OK(
+      ConvertSchema(parquet_fields, /*key_value_metadata=*/nullptr, &column_indices));
+  arrow_schema = ::arrow::schema({arrow_fields[1]});
+  ASSERT_NO_FATAL_FAILURE(CheckFlatSchema(arrow_schema));
+
+  column_indices = {2};

Review comment:
       does it pay to test at least one pair?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453149789



##########
File path: cpp/src/parquet/arrow/reader.cc
##########
@@ -301,62 +302,50 @@ class FileReaderImpl : public FileReader {
 
 class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
  public:
-  RowGroupRecordBatchReader(std::vector<std::unique_ptr<ColumnReaderImpl>> field_readers,
-                            std::shared_ptr<::arrow::Schema> schema, int64_t batch_size)
-      : field_readers_(std::move(field_readers)),
-        schema_(std::move(schema)),
-        batch_size_(batch_size) {}
-
   ~RowGroupRecordBatchReader() override {}
 
   std::shared_ptr<::arrow::Schema> schema() const override { return schema_; }
 
-  static Status Make(const std::vector<int>& row_groups,
-                     const std::vector<int>& column_indices, FileReaderImpl* reader,
-                     int64_t batch_size,
-                     std::unique_ptr<::arrow::RecordBatchReader>* out) {
-    std::vector<int> field_indices;
-    if (!reader->manifest_.GetFieldIndices(column_indices, &field_indices)) {
-      return Status::Invalid("Invalid column index");
-    }
+  static ::arrow::Result<std::unique_ptr<RowGroupRecordBatchReader>> Make(
+      const std::vector<int>& row_group_indices, const std::vector<int>& column_indices,
+      int64_t batch_size, FileReaderImpl* reader) {
+    std::unique_ptr<RowGroupRecordBatchReader> out(new RowGroupRecordBatchReader);
 
-    std::vector<std::unique_ptr<ColumnReaderImpl>> field_readers(field_indices.size());
-    std::vector<std::shared_ptr<Field>> fields;
+    RETURN_NOT_OK(FromParquetSchema(
+        reader->parquet_reader()->metadata()->schema(), default_arrow_reader_properties(),
+        /*key_value_metadata=*/nullptr, column_indices, &out->schema_));
 
-    auto included_leaves = VectorToSharedSet(column_indices);
-    for (size_t i = 0; i < field_indices.size(); ++i) {
-      RETURN_NOT_OK(reader->GetFieldReader(field_indices[i], included_leaves, row_groups,
-                                           &field_readers[i]));
-      fields.push_back(field_readers[i]->field());
-    }
-    out->reset(new RowGroupRecordBatchReader(std::move(field_readers),
-                                             ::arrow::schema(fields), batch_size));
-    return Status::OK();
+    using ::arrow::RecordBatchIterator;
+
+    auto row_group_index_to_batch_iterator =
+        [=](const int* i) -> ::arrow::Result<RecordBatchIterator> {

Review comment:
       this binding everything by value here intentional?  If so could you add a comment why?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453237839



##########
File path: cpp/src/parquet/arrow/reader.cc
##########
@@ -301,62 +302,50 @@ class FileReaderImpl : public FileReader {
 
 class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
  public:
-  RowGroupRecordBatchReader(std::vector<std::unique_ptr<ColumnReaderImpl>> field_readers,
-                            std::shared_ptr<::arrow::Schema> schema, int64_t batch_size)
-      : field_readers_(std::move(field_readers)),
-        schema_(std::move(schema)),
-        batch_size_(batch_size) {}
-
   ~RowGroupRecordBatchReader() override {}
 
   std::shared_ptr<::arrow::Schema> schema() const override { return schema_; }
 
-  static Status Make(const std::vector<int>& row_groups,
-                     const std::vector<int>& column_indices, FileReaderImpl* reader,
-                     int64_t batch_size,
-                     std::unique_ptr<::arrow::RecordBatchReader>* out) {
-    std::vector<int> field_indices;
-    if (!reader->manifest_.GetFieldIndices(column_indices, &field_indices)) {
-      return Status::Invalid("Invalid column index");
-    }
+  static ::arrow::Result<std::unique_ptr<RowGroupRecordBatchReader>> Make(
+      const std::vector<int>& row_group_indices, const std::vector<int>& column_indices,
+      int64_t batch_size, FileReaderImpl* reader) {
+    std::unique_ptr<RowGroupRecordBatchReader> out(new RowGroupRecordBatchReader);
 
-    std::vector<std::unique_ptr<ColumnReaderImpl>> field_readers(field_indices.size());
-    std::vector<std::shared_ptr<Field>> fields;
+    RETURN_NOT_OK(FromParquetSchema(
+        reader->parquet_reader()->metadata()->schema(), default_arrow_reader_properties(),

Review comment:
       No, this is an issue with my change. Easy enough to fix; I need to use `reader->properties()` instead.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453225253



##########
File path: cpp/src/parquet/arrow/reader.cc
##########
@@ -301,62 +302,50 @@ class FileReaderImpl : public FileReader {
 
 class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
  public:
-  RowGroupRecordBatchReader(std::vector<std::unique_ptr<ColumnReaderImpl>> field_readers,
-                            std::shared_ptr<::arrow::Schema> schema, int64_t batch_size)
-      : field_readers_(std::move(field_readers)),
-        schema_(std::move(schema)),
-        batch_size_(batch_size) {}
-
   ~RowGroupRecordBatchReader() override {}
 
   std::shared_ptr<::arrow::Schema> schema() const override { return schema_; }
 
-  static Status Make(const std::vector<int>& row_groups,
-                     const std::vector<int>& column_indices, FileReaderImpl* reader,
-                     int64_t batch_size,
-                     std::unique_ptr<::arrow::RecordBatchReader>* out) {
-    std::vector<int> field_indices;
-    if (!reader->manifest_.GetFieldIndices(column_indices, &field_indices)) {
-      return Status::Invalid("Invalid column index");
-    }
+  static ::arrow::Result<std::unique_ptr<RowGroupRecordBatchReader>> Make(
+      const std::vector<int>& row_group_indices, const std::vector<int>& column_indices,
+      int64_t batch_size, FileReaderImpl* reader) {
+    std::unique_ptr<RowGroupRecordBatchReader> out(new RowGroupRecordBatchReader);
 
-    std::vector<std::unique_ptr<ColumnReaderImpl>> field_readers(field_indices.size());
-    std::vector<std::shared_ptr<Field>> fields;
+    RETURN_NOT_OK(FromParquetSchema(
+        reader->parquet_reader()->metadata()->schema(), default_arrow_reader_properties(),
+        /*key_value_metadata=*/nullptr, column_indices, &out->schema_));
 
-    auto included_leaves = VectorToSharedSet(column_indices);
-    for (size_t i = 0; i < field_indices.size(); ++i) {
-      RETURN_NOT_OK(reader->GetFieldReader(field_indices[i], included_leaves, row_groups,
-                                           &field_readers[i]));
-      fields.push_back(field_readers[i]->field());
-    }
-    out->reset(new RowGroupRecordBatchReader(std::move(field_readers),
-                                             ::arrow::schema(fields), batch_size));
-    return Status::OK();
+    using ::arrow::RecordBatchIterator;
+
+    auto row_group_index_to_batch_iterator =
+        [=](const int* i) -> ::arrow::Result<RecordBatchIterator> {

Review comment:
       It is intentional, will add a comment




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453148755



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -2877,11 +2877,22 @@ TEST(TestArrowReaderAdHoc, LARGE_MEMORY_TEST(LargeStringColumn)) {
   array.reset();
 
   auto reader = ParquetFileReader::Open(std::make_shared<BufferReader>(tables_buffer));
-
   std::unique_ptr<FileReader> arrow_reader;
   ASSERT_OK(FileReader::Make(default_memory_pool(), std::move(reader), &arrow_reader));
   ASSERT_OK_NO_THROW(arrow_reader->ReadTable(&table));
   ASSERT_OK(table->ValidateFull());
+
+  // ARROW-9297: ensure RecordBatchReader also works
+  reader = ParquetFileReader::Open(std::make_shared<BufferReader>(tables_buffer));
+  ASSERT_OK(FileReader::Make(default_memory_pool(), std::move(reader), &arrow_reader));
+  std::shared_ptr<::arrow::RecordBatchReader> batch_reader;
+  auto all_row_groups = ::arrow::internal::Iota(reader->metadata()->num_row_groups());
+  ASSERT_OK_NO_THROW(arrow_reader->GetRecordBatchReader(all_row_groups, &batch_reader));
+  ASSERT_OK_AND_ASSIGN(auto batched_table,

Review comment:
       type here I think is still useful (i.e. not auto).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453148949



##########
File path: cpp/src/parquet/arrow/arrow_schema_test.cc
##########
@@ -644,6 +649,78 @@ TEST_F(TestConvertParquetSchema, ParquetRepeatedNestedSchema) {
   ASSERT_NO_FATAL_FAILURE(CheckFlatSchema(arrow_schema));
 }
 
+TEST_F(TestConvertParquetSchema, ColumnSubselection) {
+  std::vector<NodePtr> parquet_fields;
+  std::vector<std::shared_ptr<Field>> arrow_fields;
+  {
+    //   optional int32 leaf1;
+    //   repeated group outerGroup {
+    //     optional int32 leaf2;
+    //     repeated group innerGroup {
+    //       optional int32 leaf3;
+    //     }
+    //   }
+    parquet_fields.push_back(

Review comment:
       are these intentionally not using logical annotations (i.e. LIST?).  Maybe ad a comment if so.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453228008



##########
File path: cpp/src/parquet/arrow/schema.h
##########
@@ -163,24 +165,28 @@ struct PARQUET_EXPORT SchemaManifest {
     return it->second;
   }
 
-  bool GetFieldIndices(const std::vector<int>& column_indices, std::vector<int>* out) {
+  ::arrow::Result<std::vector<int>> GetFieldIndices(
+      const std::vector<int>& column_indices) {
     // Coalesce a list of schema field indices which are the roots of the
     // columns referred to by a list of column indices
     const schema::GroupNode* group = descr->group_node();
     std::unordered_set<int> already_added;
-    out->clear();
-    for (auto& column_idx : column_indices) {
+
+    std::vector<int> out;
+    for (int column_idx : column_indices) {
+      if (column_idx < 0 || column_idx >= descr->num_columns()) {
+        return ::arrow::Status::IndexError("Column index ", column_idx, " is not valid");
+      }
       auto field_node = descr->GetColumnRoot(column_idx);
       auto field_idx = group->FieldIndex(*field_node);
       if (field_idx < 0) {

Review comment:
       Worth noting: replacing this check with an assertion locally broke no tests in C++ or python. I'll leave it as an assertion for now and see what CI thinks




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453224969



##########
File path: cpp/src/parquet/arrow/reader.cc
##########
@@ -781,6 +770,9 @@ Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_group_in
                                             const std::vector<int>& column_indices,
                                             std::unique_ptr<RecordBatchReader>* out) {
   // column indices check
+  ARROW_ASSIGN_OR_RAISE(std::ignore, manifest_.GetFieldIndices(column_indices));

Review comment:
       alright




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453150835



##########
File path: cpp/src/parquet/arrow/schema.h
##########
@@ -163,24 +165,28 @@ struct PARQUET_EXPORT SchemaManifest {
     return it->second;
   }
 
-  bool GetFieldIndices(const std::vector<int>& column_indices, std::vector<int>* out) {
+  ::arrow::Result<std::vector<int>> GetFieldIndices(
+      const std::vector<int>& column_indices) {
     // Coalesce a list of schema field indices which are the roots of the

Review comment:
       could we move the comment to a docstring and give an example?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453239004



##########
File path: cpp/src/parquet/arrow/reader.cc
##########
@@ -301,62 +302,50 @@ class FileReaderImpl : public FileReader {
 
 class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
  public:
-  RowGroupRecordBatchReader(std::vector<std::unique_ptr<ColumnReaderImpl>> field_readers,
-                            std::shared_ptr<::arrow::Schema> schema, int64_t batch_size)
-      : field_readers_(std::move(field_readers)),
-        schema_(std::move(schema)),
-        batch_size_(batch_size) {}
-
   ~RowGroupRecordBatchReader() override {}
 
   std::shared_ptr<::arrow::Schema> schema() const override { return schema_; }
 
-  static Status Make(const std::vector<int>& row_groups,
-                     const std::vector<int>& column_indices, FileReaderImpl* reader,
-                     int64_t batch_size,
-                     std::unique_ptr<::arrow::RecordBatchReader>* out) {
-    std::vector<int> field_indices;
-    if (!reader->manifest_.GetFieldIndices(column_indices, &field_indices)) {
-      return Status::Invalid("Invalid column index");
-    }
+  static ::arrow::Result<std::unique_ptr<RowGroupRecordBatchReader>> Make(
+      const std::vector<int>& row_group_indices, const std::vector<int>& column_indices,

Review comment:
       It seems to me the more logical place for doc would be the public method `parquet::arrow::FileReader::GetRecordBatchReader()`. I'll add comments about lifetime and memory usage there.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7704: ARROW-9297: [C++][Parquet] Support chunked row groups in RowGroupRecordBatchReader

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #7704:
URL: https://github.com/apache/arrow/pull/7704#discussion_r453148683



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -2877,11 +2877,22 @@ TEST(TestArrowReaderAdHoc, LARGE_MEMORY_TEST(LargeStringColumn)) {
   array.reset();
 
   auto reader = ParquetFileReader::Open(std::make_shared<BufferReader>(tables_buffer));
-
   std::unique_ptr<FileReader> arrow_reader;
   ASSERT_OK(FileReader::Make(default_memory_pool(), std::move(reader), &arrow_reader));
   ASSERT_OK_NO_THROW(arrow_reader->ReadTable(&table));
   ASSERT_OK(table->ValidateFull());
+
+  // ARROW-9297: ensure RecordBatchReader also works
+  reader = ParquetFileReader::Open(std::make_shared<BufferReader>(tables_buffer));
+  ASSERT_OK(FileReader::Make(default_memory_pool(), std::move(reader), &arrow_reader));
+  std::shared_ptr<::arrow::RecordBatchReader> batch_reader;
+  auto all_row_groups = ::arrow::internal::Iota(reader->metadata()->num_row_groups());

Review comment:
       Could this be simplified to:
   `::arrow::internal::Iota all_row_groups(...);
   
   if so please do it.  If not please spell out the type instead of auto.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org