You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/09 12:34:01 UTC

[GitHub] [arrow] lidavidm commented on a diff in pull request #13088: ARROW-16085: [C++][R] InMemoryDataset::ReplaceSchema does not alter scan output

lidavidm commented on code in PR #13088:
URL: https://github.com/apache/arrow/pull/13088#discussion_r867960074


##########
cpp/src/arrow/dataset/dataset_test.cc:
##########
@@ -107,6 +111,38 @@ TEST_F(TestInMemoryDataset, InMemoryFragment) {
   AssertSchemaEqual(batch->schema(), schema);
 }
 
+TEST_F(TestInMemoryDataset, HandlesDifferingSchemas) {
+  constexpr int64_t kBatchSize = 1024;
+
+  // These schemas can be merged
+  SetSchema({field("i32", int32()), field("f64", float64())});
+  auto batch1 = ConstantArrayGenerator::Zeroes(kBatchSize, schema_);
+  SetSchema({field("i32", int64())});
+  auto batch2 = ConstantArrayGenerator::Zeroes(kBatchSize, schema_);
+  RecordBatchVector batches{batch1, batch2};
+
+  auto dataset = std::make_shared<InMemoryDataset>(schema_, batches);
+
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+  ASSERT_OK_AND_ASSIGN(auto table, scanner->ToTable());
+  ASSERT_EQ(*table->schema(), *schema_);
+  ASSERT_EQ(table->num_rows(), 2 * kBatchSize);
+
+  // These cannot be merged
+  SetSchema({field("i32", int32()), field("f64", float64())});
+  batch1 = ConstantArrayGenerator::Zeroes(kBatchSize, schema_);
+  SetSchema({field("i32", struct_({field("x", date32())}))});
+  batch2 = ConstantArrayGenerator::Zeroes(kBatchSize, schema_);
+  batches = RecordBatchVector({batch1, batch2});
+
+  dataset = std::make_shared<InMemoryDataset>(schema_, batches);
+
+  ASSERT_OK_AND_ASSIGN(scanner_builder, dataset->NewScan());
+  ASSERT_OK_AND_ASSIGN(scanner, scanner_builder->Finish());
+  ASSERT_NOT_OK(scanner->ToTable());

Review Comment:
   Can we have a more explicit check, e.g. with `EXPECT_RAISES_WITH_MESSAGE_THAT`?



##########
cpp/src/arrow/dataset/dataset_test.cc:
##########
@@ -107,6 +111,38 @@ TEST_F(TestInMemoryDataset, InMemoryFragment) {
   AssertSchemaEqual(batch->schema(), schema);
 }
 
+TEST_F(TestInMemoryDataset, HandlesDifferingSchemas) {
+  constexpr int64_t kBatchSize = 1024;
+
+  // These schemas can be merged
+  SetSchema({field("i32", int32()), field("f64", float64())});
+  auto batch1 = ConstantArrayGenerator::Zeroes(kBatchSize, schema_);
+  SetSchema({field("i32", int64())});
+  auto batch2 = ConstantArrayGenerator::Zeroes(kBatchSize, schema_);
+  RecordBatchVector batches{batch1, batch2};
+
+  auto dataset = std::make_shared<InMemoryDataset>(schema_, batches);

Review Comment:
   Do we actually want this to be valid though? I would expect the batches of a dataset to have a consistent schema



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org