
Posted to issues@arrow.apache.org by "agoncharuk (via GitHub)" <gi...@apache.org> on 2023/05/15 17:22:15 UTC

[GitHub] [arrow] agoncharuk opened a new issue, #35595: [C++] Dataset: Invalid output schema when both projected schema and filter are set

agoncharuk opened a new issue, #35595:
URL: https://github.com/apache/arrow/issues/35595

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hello, Arrow community!
   I am facing what appears to be a bug in Arrow 12 when using `ParquetFileFormat`. The issue can be demonstrated with the following test (I am using a Parquet file that was generated with the DuckDB TPC-H extension):
   ```
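   #include <cassert>
   #include <memory>
   #include <string>
   #include <vector>
   
   #include <gtest/gtest.h>
   
   #include <arrow/api.h>
   #include <arrow/dataset/api.h>
   #include <arrow/dataset/file_parquet.h>
   #include <arrow/filesystem/api.h>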
   
   std::shared_ptr<arrow::Schema> makeTestSchema(
       const std::vector<std::string>& colNames, const std::vector<std::shared_ptr<arrow::DataType>>& colTypes) {
       assert(colNames.size() == colTypes.size());
       arrow::FieldVector fields;
       for (size_t f = 0; f < colNames.size(); ++f) { fields.emplace_back(arrow::field(colNames[f], colTypes[f])); }
       return std::make_shared<arrow::Schema>(std::move(fields));
   }
   
   TEST(FileFormatTest, TestProjectionAndFilter) {
       auto schema = makeTestSchema(
           {
               "c_custkey",
               "c_name",
               "c_address",
               "c_nationkey",
               "c_phone",
               "c_acctbal",
               "c_mktsegment",
               "c_comment",
           },
           {
               arrow::int32(),
               arrow::utf8(),
               arrow::utf8(),
               arrow::int32(),
               arrow::utf8(),
               arrow::decimal128(15, 2),
               arrow::utf8(),
               arrow::utf8(),
           });
   
       // Project four of the eight columns, in a different order than the schema.
       auto descr = arrow::dataset::ProjectionDescr::FromNames(
           {"c_custkey", "c_nationkey", "c_name", "c_address"}, *schema).ValueOrDie();
   
       auto scanOpts = std::make_shared<arrow::dataset::ScanOptions>();
       scanOpts->projected_schema = descr.schema;
       scanOpts->projection = descr.expression;
   
       auto unbound = arrow::compute::call(
           "equal", 
           {arrow::compute::field_ref(arrow::FieldRef{"c_name"}), arrow::compute::literal("Customer#000001186")});
       scanOpts->filter = unbound.Bind(*schema).ValueOrDie();
   
       auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
       auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
       auto file = "testing/tpch/customer/part.0.parquet";
   
       arrow::dataset::FileSource source(file, fs);
       auto fragment = format->MakeFragment(source, schema).ValueOrDie();
   
       auto batchGenerator = fragment->ScanBatchesAsync(scanOpts).ValueOrDie();
       auto batch = batchGenerator().result().ValueOrDie();
       ASSERT_TRUE(batch != nullptr);
       EXPECT_EQ(4, batch->columns().size());
       EXPECT_TRUE(arrow::int32()->Equals(*batch->column(0)->type()));
       EXPECT_TRUE(arrow::int32()->Equals(*batch->column(1)->type()));
       EXPECT_TRUE(arrow::utf8()->Equals(*batch->column(2)->type()));
       EXPECT_TRUE(arrow::utf8()->Equals(*batch->column(3)->type()));
   } 
   ```
   The test fails because the returned batch has types `{string, int32, int32, string}` instead of the expected `{int32, int32, string, string}`.
   After a quick debugging session, I see that `InferColumnProjection` in `file_parquet.cc` returns duplicated projected columns because it does not handle duplicates from `ScanOptions::MaterializedFields()`, which in turn returns a union of the fields used in a filter and a projection, in that order (this is expected behavior according to the documentation).
   Another thing that is not clear to me is that `InferColumnProjection` returns indices for 5 fields, while the resulting batch generator produces batches with 4 columns: I did not find where the extra column is dropped.
   
   A few questions: 
    * Is this indeed a bug and my use of the API is correct, are there any workarounds for this?
    * Where is the logic that truncates 5 fields of inferred schema to 4 fields returned from the batch generator?
    * If this is a bug, what would be a correct fix (I do not mind contributing one)? I assume that `InferColumnProjection` should take into account duplicated column refs, and also `ScanOptions::MaterializedFields()` should return projected columns first, and filtered columns last.
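
To make the last question concrete, here is a minimal sketch of the ordering being proposed: projected columns first, filter columns last, with duplicates dropped. It uses only the standard library; `UnionPreservingOrder` and `CheckReportExample` are hypothetical names for illustration, not Arrow APIs, and this is not how `ScanOptions::MaterializedFields()` or `InferColumnProjection` are implemented.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

using Names = std::vector<std::string>;

// Hypothetical helper (not an Arrow API): append the names of `first`, then
// `second`, keeping only the first occurrence of each name.
Names UnionPreservingOrder(const Names& first, const Names& second) {
  Names out;
  std::unordered_set<std::string> seen;
  auto append = [&](const Names& names) {
    for (const auto& name : names) {
      // insert().second is true only the first time a name is seen.
      if (seen.insert(name).second) out.push_back(name);
    }
  };
  append(first);
  append(second);
  return out;
}

// Column names taken from the test in this report.
void CheckReportExample() {
  const Names projection = {"c_custkey", "c_nationkey", "c_name", "c_address"};
  const Names filter = {"c_name"};

  // Plain concatenation in filter-then-projection order has 5 entries,
  // because c_name appears twice.
  Names concatenated = filter;
  concatenated.insert(concatenated.end(), projection.begin(), projection.end());
  assert(concatenated.size() == 5);

  // Projection first, duplicates dropped: exactly the projected columns,
  // in the projected order.
  assert(UnionPreservingOrder(projection, filter) == projection);
}
```

With the columns from the test, the plain concatenation yields the 5 entries mentioned above, while the deduplicated, projection-first union yields the 4 projected columns in the requested order. Whether that ordering is appropriate inside Arrow itself is exactly what the question asks.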
   
   ### Component(s)
   
   C++, Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] agoncharuk commented on issue #35595: [C++] Dataset: Invalid output schema when both projected schema and filter are set

Posted by "agoncharuk (via GitHub)" <gi...@apache.org>.

agoncharuk commented on issue #35595:
URL: https://github.com/apache/arrow/issues/35595#issuecomment-1554300875

   I am sure that `makeTestSchema()` creates a correct schema because `ParquetFileFragment::EnsureCompleteMetadata` validates the schema and returns an error status when the schemas do not match.
   
   However, I see your point regarding the API usage and will try to reproduce/debug the issue using a recommended flow.



[GitHub] [arrow] westonpace commented on issue #35595: [C++] Dataset: Invalid output schema when both projected schema and filter are set

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #35595:
URL: https://github.com/apache/arrow/issues/35595#issuecomment-1552998924

   > Is this indeed a bug and my use of the API is correct, are there any workarounds for this?
   
   Hmm, the suspicious part to me here is the call to `format->MakeFragment`.  This function is primarily intended for internal use.  The normal flow is:
   
    * Create a dataset
    * Scan the dataset
   
   Scanning a fragment directly should be technically possible.  However, the call to `MakeFragment` expects to receive the "physical schema".  This must be the schema of the file itself.  My best guess is that your definition of `makeTestSchema` is not matching the column order stored in the parquet file.
   
   The schema provided to `scanOpts` is the dataset schema (not the physical schema), and is free to be in whatever order you want.  The method `InferColumnProjection` is attempting to map between the two.  Since it thinks the physical schema is identical, it is not doing any reordering.
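
As a rough illustration of what "mapping between the two" means, the sketch below resolves columns requested in dataset-schema order to positions in a physical, file-order column list by name. It is a standard-library model only: `ResolveByName` and `CheckReportMapping` are hypothetical names, the column list is the one declared in the test from the report (the actual order inside the Parquet file is not known here), and this is not Arrow's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not an Arrow API): for each requested column name,
// return its position in the physical (file-order) column list, or -1 if the
// name does not exist there.
std::vector<int> ResolveByName(const std::vector<std::string>& physical,
                               const std::vector<std::string>& requested) {
  std::unordered_map<std::string, int> position;
  for (std::size_t i = 0; i < physical.size(); ++i) {
    position.emplace(physical[i], static_cast<int>(i));
  }
  std::vector<int> indices;
  indices.reserve(requested.size());
  for (const auto& name : requested) {
    auto it = position.find(name);
    indices.push_back(it == position.end() ? -1 : it->second);
  }
  return indices;
}

// Uses the schema and projection declared in the test from the report.
void CheckReportMapping() {
  const std::vector<std::string> physical = {
      "c_custkey", "c_name",    "c_address",    "c_nationkey",
      "c_phone",   "c_acctbal", "c_mktsegment", "c_comment"};
  const std::vector<std::string> requested = {"c_custkey", "c_nationkey",
                                              "c_name", "c_address"};
  // The requested order differs from the physical order, so the resolved
  // indices are not ascending.
  const std::vector<int> expected = {0, 3, 1, 2};
  assert(ResolveByName(physical, requested) == expected);
}
```

The point of the model is that the resolved indices are only meaningful if the `physical` list really is the file's column order. If the schema passed to `MakeFragment` does not match the file, any mapping derived from it can point at the wrong columns.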



[GitHub] [arrow] agoncharuk commented on issue #35595: [C++] Dataset: Invalid output schema when both projected schema and filter are set

Posted by "agoncharuk (via GitHub)" <gi...@apache.org>.

agoncharuk commented on issue #35595:
URL: https://github.com/apache/arrow/issues/35595#issuecomment-1570181158

   I was not able to reproduce this using the recommended API. Closing for now.



[GitHub] [arrow] agoncharuk closed issue #35595: [C++] Dataset: Invalid output schema when both projected schema and filter are set

Posted by "agoncharuk (via GitHub)" <gi...@apache.org>.

agoncharuk closed issue #35595: [C++] Dataset: Invalid output schema when both projected schema and filter are set
URL: https://github.com/apache/arrow/issues/35595

