You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/05/31 21:18:11 UTC

[GitHub] [arrow] westonpace opened a new pull request, #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

westonpace opened a new pull request, #35860:
URL: https://github.com/apache/arrow/pull/35860

   ### Rationale for this change
   
   The dataset write node previously allowed you to specify custom key/value metadata on a write node.  This was added to support saving schema metadata.  However, it doesn't capture field metadata or field nullability.  This PR replaces that capability with the ability to specify a custom schema instead.  The custom schema must have the same number of fields as the input to the write node and each field must have the same type.
   
   ### What changes are included in this PR?
   
   Added `custom_schema` to `WriteNodeOptions` and removed `custom_metadata`.
   
   ### Are these changes tested?
   
   Yes, I added a new C++ unit test to verify that the custom info is applied to written files.
   
   ### Are there any user-facing changes?
   
   BREAKING CHANGE: `WriteNodeOptions::custom_metadata` is now `WriteNodeOptions::custom_schema`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] anjakefala commented on a diff in pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.

anjakefala commented on code in PR #35860:
URL: https://github.com/apache/arrow/pull/35860#discussion_r1213642872


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5192,13 +5188,45 @@ def test_preserve_nullability_parquet(tempdir):
     # nullability of field is preserved
     assert dataset.to_table().schema.equals(schema_nullable)
 
-    table_no_null = pa.Table.from_arrays(array, schema=schema)
-
-    # we can specify the nullability of a field through the schema
-    pa.dataset.write_dataset(table_no_null, tempdir/"nulltest2", schema=schema_nullable)
+    pa.dataset.write_dataset(table, tempdir/"nulltest2", format="parquet")
     dataset = ds.dataset(tempdir/"nulltest2", format="parquet")
     assert dataset.to_table().schema.equals(schema_nullable)
 
+    pa.dataset.write_dataset([table, table], tempdir/"nulltest3", format="parquet")
+    dataset = ds.dataset(tempdir/"nulltest3", format="parquet")
+    assert dataset.to_table().schema.equals(schema_nullable)
+
+
+def test_preserve_field_metadata(tempdir):
+    schema_metadata = pa.schema([
+        pa.field("x", pa.int64(), metadata={b'foo': b'bar'}),
+        pa.field("y", pa.int64())])
+
+    schema_no_meta = pa.schema([
+        pa.field("x", pa.int64()),
+        pa.field("y", pa.int64())])
+
+    array = [[1, 2, 3], [None, 5, None]]
+    table = pa.Table.from_arrays(array, schema=schema_metadata)
+    table_no_meta = pa.Table.from_arrays(array, schema=schema_no_meta)
+
+    # If no schema is provided the schema of the first table will be used
+    pa.dataset.write_dataset([table, table_no_meta],

Review Comment:
   ```
   ds.write_dataset([table, table_no_meta],
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on PR #35860:
URL: https://github.com/apache/arrow/pull/35860#issuecomment-1572742013

   I'm going to go ahead and merge if green.  I think it's quite late for Joris and I believe Raul will be starting the patch build soon.  If Joris finds an issue later then we can still scrap the patch release candidate for a fix (though hopefully we don't need to).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a diff in pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "pitrou (via GitHub)" <gi...@apache.org>.

pitrou commented on code in PR #35860:
URL: https://github.com/apache/arrow/pull/35860#discussion_r1213355251


##########
cpp/src/arrow/dataset/file_base.cc:
##########
@@ -475,16 +475,38 @@ Result<acero::ExecNode*> MakeWriteNode(acero::ExecPlan* plan,
 
   const WriteNodeOptions write_node_options =
       checked_cast<const WriteNodeOptions&>(options);
-  const std::shared_ptr<const KeyValueMetadata>& custom_metadata =
-      write_node_options.custom_metadata;
+  const std::shared_ptr<Schema>& custom_schema = write_node_options.custom_schema;
   const FileSystemDatasetWriteOptions& write_options = write_node_options.write_options;
 
+  const std::shared_ptr<Schema>& input_schema = inputs[0]->output_schema();
+
+  if (custom_schema != nullptr) {
+    if (custom_schema->num_fields() != input_schema->num_fields()) {
+      return Status::Invalid(

Review Comment:
   A schema error is a type error IMHO.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a diff in pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "pitrou (via GitHub)" <gi...@apache.org>.

pitrou commented on code in PR #35860:
URL: https://github.com/apache/arrow/pull/35860#discussion_r1212851930


##########
cpp/src/arrow/dataset/file_base.cc:
##########
@@ -475,16 +475,38 @@ Result<acero::ExecNode*> MakeWriteNode(acero::ExecPlan* plan,
 
   const WriteNodeOptions write_node_options =
       checked_cast<const WriteNodeOptions&>(options);
-  const std::shared_ptr<const KeyValueMetadata>& custom_metadata =
-      write_node_options.custom_metadata;
+  const std::shared_ptr<Schema>& custom_schema = write_node_options.custom_schema;
   const FileSystemDatasetWriteOptions& write_options = write_node_options.write_options;
 
+  const std::shared_ptr<Schema>& input_schema = inputs[0]->output_schema();
+
+  if (custom_schema != nullptr) {
+    if (custom_schema->num_fields() != input_schema->num_fields()) {
+      return Status::Invalid(

Review Comment:
   `Status::TypeError` here?



##########
cpp/src/arrow/dataset/file_base.cc:
##########
@@ -475,16 +475,38 @@ Result<acero::ExecNode*> MakeWriteNode(acero::ExecPlan* plan,
 
   const WriteNodeOptions write_node_options =
       checked_cast<const WriteNodeOptions&>(options);
-  const std::shared_ptr<const KeyValueMetadata>& custom_metadata =
-      write_node_options.custom_metadata;
+  const std::shared_ptr<Schema>& custom_schema = write_node_options.custom_schema;
   const FileSystemDatasetWriteOptions& write_options = write_node_options.write_options;
 
+  const std::shared_ptr<Schema>& input_schema = inputs[0]->output_schema();
+
+  if (custom_schema != nullptr) {
+    if (custom_schema->num_fields() != input_schema->num_fields()) {
+      return Status::Invalid(
+          "The provided custom_schema did not have the same number of fields as the "
+          "data.  The custom schema can only be used to add metadata / nullability to "
+          "fields and cannot change the type or number of fields.");
+    }
+    for (int field_idx = 0; field_idx < input_schema->num_fields(); field_idx++) {
+      if (!input_schema->field(field_idx)->type()->Equals(
+              custom_schema->field(field_idx)->type())) {
+        return Status::Invalid("The provided custom_schema specified type ",

Review Comment:
   Same here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] anjakefala commented on pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.

anjakefala commented on PR #35860:
URL: https://github.com/apache/arrow/pull/35860#issuecomment-1571033248

   I have confirmed that
   
   ```
    pa.dataset.write_dataset(table, tempdir/"nulltest2", schema=schema_nullable, format="parquet") 
   ```
   has a field with nullability.
   
   The following do not:
   
   ```
   pa.dataset.write_dataset([table_no_null, table], tempdir/"nulltest2", schema=schema_nullable, format="parquet")  
   ```
   
   or 
   
   ```
   pa.dataset.write_dataset([table, table_no_null], tempdir/"nulltest2", schema=schema_nullable, format="parquet") 
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on code in PR #35860:
URL: https://github.com/apache/arrow/pull/35860#discussion_r1212891709


##########
cpp/src/arrow/dataset/file_base.h:
##########
@@ -463,15 +463,21 @@ struct ARROW_DS_EXPORT FileSystemDatasetWriteOptions {
 /// \brief Wraps FileSystemDatasetWriteOptions for consumption as compute::ExecNodeOptions
 class ARROW_DS_EXPORT WriteNodeOptions : public acero::ExecNodeOptions {
  public:
-  explicit WriteNodeOptions(
-      FileSystemDatasetWriteOptions options,
-      std::shared_ptr<const KeyValueMetadata> custom_metadata = NULLPTR)
-      : write_options(std::move(options)), custom_metadata(std::move(custom_metadata)) {}
+  explicit WriteNodeOptions(FileSystemDatasetWriteOptions options,
+                            std::shared_ptr<Schema> custom_schema = NULLPTR)
+      : write_options(std::move(options)), custom_schema(std::move(custom_schema)) {}
 
   /// \brief Options to control how to write the dataset
   FileSystemDatasetWriteOptions write_options;
-  /// \brief Optional metadata to attach to written batches
-  std::shared_ptr<const KeyValueMetadata> custom_metadata;

Review Comment:
   Instead of removing this option (as a breaking change), we could in theory still allow the user to specify one of both?
   
   (I am not using the C++ API for this, so I don't know how useful this would be / how cumbersome it is to specify the schema if you only want to specify metadata. From the DatasetWriter point of view, this is a fine change of course since there we already have the full schema)
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on PR #35860:
URL: https://github.com/apache/arrow/pull/35860#issuecomment-1572741334

   > Do you want to take a look at the Python test changes?
   
   There were no changes to this since my previous review (and push), so all good!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on code in PR #35860:
URL: https://github.com/apache/arrow/pull/35860#discussion_r1213336641


##########
cpp/src/arrow/dataset/file_base.cc:
##########
@@ -475,16 +475,38 @@ Result<acero::ExecNode*> MakeWriteNode(acero::ExecPlan* plan,
 
   const WriteNodeOptions write_node_options =
       checked_cast<const WriteNodeOptions&>(options);
-  const std::shared_ptr<const KeyValueMetadata>& custom_metadata =
-      write_node_options.custom_metadata;
+  const std::shared_ptr<Schema>& custom_schema = write_node_options.custom_schema;
   const FileSystemDatasetWriteOptions& write_options = write_node_options.write_options;
 
+  const std::shared_ptr<Schema>& input_schema = inputs[0]->output_schema();
+
+  if (custom_schema != nullptr) {
+    if (custom_schema->num_fields() != input_schema->num_fields()) {
+      return Status::Invalid(

Review Comment:
   Yes, to be honest for having a wrong number of fields, I would also raise a ValueError/Invalid (it's a wrong value, not a wrong type)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] raulcd commented on pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "raulcd (via GitHub)" <gi...@apache.org>.

raulcd commented on PR #35860:
URL: https://github.com/apache/arrow/pull/35860#issuecomment-1573534518

   @github-actions crossbow submit test-r-ubuntu-22.04


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #35860:
URL: https://github.com/apache/arrow/pull/35860#issuecomment-1570971176

   :warning: GitHub issue #35730 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] anjakefala commented on pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.

anjakefala commented on PR #35860:
URL: https://github.com/apache/arrow/pull/35860#issuecomment-1571438510

   > If this is the same error you were getting then I think we can call this an invalid scenario and we don't have to support it (at least for this PR. Arguably, you could evolve a table into the correct schema if adding it to an InMemoryDataset but that's a different feature).
   
   Good to know!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on code in PR #35860:
URL: https://github.com/apache/arrow/pull/35860#discussion_r1212752418


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5172,6 +5172,62 @@ def test_dataset_partition_with_slash(tmpdir):
     assert encoded_paths == file_paths
 
 
+def test_preserve_nullability_parquet(tempdir):
+    # GH-35730
+    schema_nullable = pa.schema([
+        pa.field("x", pa.int64(), nullable=False),
+        pa.field("y", pa.int64(), nullable=True)])
+
+    array = [[1, 2, 3], [None, 5, None]]
+
+    table = pa.Table.from_arrays(array,
+                                 schema=schema_nullable)
+
+    pq.write_to_dataset(table, tempdir/"nulltest1")
+    dataset = ds.dataset(tempdir/"nulltest1", format="parquet")
+    # nullability of field is preserved
+    assert dataset.to_table().schema.equals(schema_nullable)
+
+    pa.dataset.write_dataset(table, tempdir/"nulltest2", format="parquet")

Review Comment:
   ```suggestion
       ds.write_dataset(table, tempdir/"nulltest2", format="parquet")
   ```



##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5172,6 +5172,62 @@ def test_dataset_partition_with_slash(tmpdir):
     assert encoded_paths == file_paths
 
 
+def test_preserve_nullability_parquet(tempdir):

Review Comment:
   ```suggestion
   def test_write_dataset_preserve_nullability_parquet(tempdir):
   ```



##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5172,6 +5172,62 @@ def test_dataset_partition_with_slash(tmpdir):
     assert encoded_paths == file_paths
 
 
+def test_preserve_nullability_parquet(tempdir):
+    # GH-35730
+    schema_nullable = pa.schema([
+        pa.field("x", pa.int64(), nullable=False),
+        pa.field("y", pa.int64(), nullable=True)])
+
+    array = [[1, 2, 3], [None, 5, None]]
+
+    table = pa.Table.from_arrays(array,
+                                 schema=schema_nullable)
+
+    pq.write_to_dataset(table, tempdir/"nulltest1")
+    dataset = ds.dataset(tempdir/"nulltest1", format="parquet")
+    # nullability of field is preserved
+    assert dataset.to_table().schema.equals(schema_nullable)
+
+    pa.dataset.write_dataset(table, tempdir/"nulltest2", format="parquet")
+    dataset = ds.dataset(tempdir/"nulltest2", format="parquet")
+    assert dataset.to_table().schema.equals(schema_nullable)
+
+    pa.dataset.write_dataset([table, table], tempdir/"nulltest3", format="parquet")

Review Comment:
   ```suggestion
       ds.write_dataset([table, table], tempdir/"nulltest3", format="parquet")
   ```



##########
cpp/src/arrow/dataset/file_base.cc:
##########
@@ -475,16 +475,38 @@ Result<acero::ExecNode*> MakeWriteNode(acero::ExecPlan* plan,
 
   const WriteNodeOptions write_node_options =
       checked_cast<const WriteNodeOptions&>(options);
-  const std::shared_ptr<const KeyValueMetadata>& custom_metadata =
-      write_node_options.custom_metadata;
+  const std::shared_ptr<Schema>& custom_schema = write_node_options.custom_schema;
   const FileSystemDatasetWriteOptions& write_options = write_node_options.write_options;
 
+  const std::shared_ptr<Schema>& input_schema = inputs[0]->output_schema();
+
+  if (custom_schema != nullptr) {
+    if (custom_schema->num_fields() != input_schema->num_fields()) {
+      return Status::Invalid(
+          "The provided custom_schema did not have the same number of fields as the "
+          "data.  The custom schema can only be used to add metadata / nullability to "
+          "fields and cannot change the type or number of fields.");
+    }
+    for (int field_idx = 0; field_idx < input_schema->num_fields(); field_idx++) {
+      if (!input_schema->field(field_idx)->type()->Equals(
+              custom_schema->field(field_idx)->type())) {

Review Comment:
   Should we also test that the names of the fields are equal?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on a diff in pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on code in PR #35860:
URL: https://github.com/apache/arrow/pull/35860#discussion_r1213340455


##########
cpp/src/arrow/dataset/file_base.cc:
##########
@@ -475,16 +475,38 @@ Result<acero::ExecNode*> MakeWriteNode(acero::ExecPlan* plan,
 
   const WriteNodeOptions write_node_options =
       checked_cast<const WriteNodeOptions&>(options);
-  const std::shared_ptr<const KeyValueMetadata>& custom_metadata =
-      write_node_options.custom_metadata;
+  const std::shared_ptr<Schema>& custom_schema = write_node_options.custom_schema;
   const FileSystemDatasetWriteOptions& write_options = write_node_options.write_options;
 
+  const std::shared_ptr<Schema>& input_schema = inputs[0]->output_schema();
+
+  if (custom_schema != nullptr) {
+    if (custom_schema->num_fields() != input_schema->num_fields()) {
+      return Status::Invalid(
+          "The provided custom_schema did not have the same number of fields as the "
+          "data.  The custom schema can only be used to add metadata / nullability to "
+          "fields and cannot change the type or number of fields.");
+    }
+    for (int field_idx = 0; field_idx < input_schema->num_fields(); field_idx++) {
+      if (!input_schema->field(field_idx)->type()->Equals(
+              custom_schema->field(field_idx)->type())) {

Review Comment:
   Changing the names should be safe.  Admittedly, a user could also do this name change by inserting a project node before the write node.
   
   I could be convinced otherwise but I don't think this does any harm and I think, as a user, I would expect this behavior, so it wouldn't be surprising that the names changed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on code in PR #35860:
URL: https://github.com/apache/arrow/pull/35860#discussion_r1213335241


##########
cpp/src/arrow/dataset/file_base.h:
##########
@@ -463,13 +463,21 @@ struct ARROW_DS_EXPORT FileSystemDatasetWriteOptions {
 /// \brief Wraps FileSystemDatasetWriteOptions for consumption as compute::ExecNodeOptions
 class ARROW_DS_EXPORT WriteNodeOptions : public acero::ExecNodeOptions {
  public:
-  explicit WriteNodeOptions(
-      FileSystemDatasetWriteOptions options,
-      std::shared_ptr<const KeyValueMetadata> custom_metadata = NULLPTR)
-      : write_options(std::move(options)), custom_metadata(std::move(custom_metadata)) {}
+  explicit WriteNodeOptions(FileSystemDatasetWriteOptions options,
+                            std::shared_ptr<Schema> custom_schema = NULLPTR)

Review Comment:
   Should still constructor also have custom_metadata be added back?  
   Or do you only use this in attribute-setting style (`write_options.custom_metadata = ..`) ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot commented on pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "ursabot (via GitHub)" <gi...@apache.org>.

ursabot commented on PR #35860:
URL: https://github.com/apache/arrow/pull/35860#issuecomment-1575171108

   Benchmark runs are scheduled for baseline = 3fe4a315633c2d959a1e1c069e15688db4d8e2d1 and contender = 018e7d3f9c4bcacce716dd607994486a31ee71bb. 018e7d3f9c4bcacce716dd607994486a31ee71bb is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/420203d0e26342f6b2a62b31fecb7be4...c55f25d0633943619cc10858e299e6bb/)
   [Finished :arrow_down:0.8% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/9c4c857ce0aa47efb8b80570c797612d...3725d6d47fd14ebab787cdb21d184e1a/)
   [Finished :arrow_down:0.33% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/6646aacbc3714a1a8be7d1730f4009a8...e6953777bb48473ea97ab0c354582baf/)
   [Finished :arrow_down:0.6% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/c08006b7b9f14025a186edd3b6e73cf8...bfcf868aa0094e8e97d0f6b0ff3800e6/)
   Buildkite builds:
   [Finished] [`018e7d3f` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2972)
   [Finished] [`018e7d3f` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/3008)
   [Finished] [`018e7d3f` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2973)
   [Finished] [`018e7d3f` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2998)
   [Finished] [`3fe4a315` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2971)
   [Finished] [`3fe4a315` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/3007)
   [Finished] [`3fe4a315` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2972)
   [Finished] [`3fe4a315` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2997)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on PR #35860:
URL: https://github.com/apache/arrow/pull/35860#issuecomment-1570971208

   Starting in draft stage as I believe we should get some python unit tests in here and @anjakefala volunteered to do some.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a diff in pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "pitrou (via GitHub)" <gi...@apache.org>.

pitrou commented on code in PR #35860:
URL: https://github.com/apache/arrow/pull/35860#discussion_r1213536961


##########
cpp/src/arrow/dataset/file_base.h:
##########
@@ -470,6 +471,15 @@ class ARROW_DS_EXPORT WriteNodeOptions : public acero::ExecNodeOptions {
 
   /// \brief Options to control how to write the dataset
   FileSystemDatasetWriteOptions write_options;
+  /// \brief Optional schema to attach to all written batches
+  ///
+  /// By default, we will use the output schema of the input.
+  ///
+  /// This can be used to alter schema metadata, field nullability, or field metadata

Review Comment:
   Nit
   ```suggestion
     /// This can be used to alter schema metadata, field nullability, or field metadata.
   ```



##########
cpp/src/arrow/dataset/file_base.h:
##########
@@ -34,6 +34,7 @@
 #include "arrow/filesystem/filesystem.h"
 #include "arrow/io/file.h"
 #include "arrow/util/compression.h"
+#include "arrow/util/key_value_metadata.h"

Review Comment:
   This shouldn't be necessary if `arrow/type_fwd.h` is included.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace merged pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace merged PR #35860:
URL: https://github.com/apache/arrow/pull/35860


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on PR #35860:
URL: https://github.com/apache/arrow/pull/35860#issuecomment-1572586017

   The failing test on AMD64 is unrelated (flight ucx test).  There is still one R test running but I think this is good to go now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] anjakefala commented on pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.

anjakefala commented on PR #35860:
URL: https://github.com/apache/arrow/pull/35860#issuecomment-1571027387

   So far, I added 2 basic tests, based on my understanding of the feature!
   
   The basic case where you write a single table, which contains a field with nullability specified, passes.
   
   Note that this one:
   
   ```
     # we can specify the nullability of a field through the schema                                                                                   
       pa.dataset.write_dataset(table_no_null, tempdir/"nulltest2", schema=schema_nullable)                                                             
       dataset = ds.dataset(tempdir/"nulltest2", format="parquet")                                                                                      
       assert dataset.to_table().schema.equals(schema_nullable) 
   ```
   
   is failing for now. I did not specify nullability in the table's schema, but then specified it in `write_dataset(schema=)`.
   
   Is it expected that the returned dataset would have a field with nullability?
   
   In this example `table` has a field specified with nullability, while `table_no_null` does not:
   
   ```
   pa.dataset.write_dataset([table_no_null, table], tempdir/"nulltest2", schema=schema_nullable) 
   ```
   
   The resulting schema also does not have nullability.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on a diff in pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on code in PR #35860:
URL: https://github.com/apache/arrow/pull/35860#discussion_r1213346867


##########
cpp/src/arrow/dataset/file_base.h:
##########
@@ -463,13 +463,21 @@ struct ARROW_DS_EXPORT FileSystemDatasetWriteOptions {
 /// \brief Wraps FileSystemDatasetWriteOptions for consumption as compute::ExecNodeOptions
 class ARROW_DS_EXPORT WriteNodeOptions : public acero::ExecNodeOptions {
  public:
-  explicit WriteNodeOptions(
-      FileSystemDatasetWriteOptions options,
-      std::shared_ptr<const KeyValueMetadata> custom_metadata = NULLPTR)
-      : write_options(std::move(options)), custom_metadata(std::move(custom_metadata)) {}
+  explicit WriteNodeOptions(FileSystemDatasetWriteOptions options,
+                            std::shared_ptr<Schema> custom_schema = NULLPTR)

Review Comment:
   Good catch.  Going forwards, the rule I would like to follow is:
   
    * There is only one constructor and it contains each required field.
   
   Trying to include optional fields gets a bit crazy (look at `HashJoinNodeOptions`).
   
   However, you are absolutely right.  For backwards compatibility purposes this legacy constructor should remain as-is.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on PR #35860:
URL: https://github.com/apache/arrow/pull/35860#issuecomment-1571343515

   Oh, and thank you for adding the test case :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #35860:
URL: https://github.com/apache/arrow/pull/35860#issuecomment-1573536836

   Revision: 875dfed1427e124bd73f3f3f630b8e6dd3c285fc
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-42981f190e](https://github.com/ursacomputing/crossbow/branches/all?query=actions-42981f190e)
   
   |Task|Status|
   |----|------|
   |test-r-ubuntu-22.04|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-42981f190e-github-test-r-ubuntu-22.04)](https://github.com/ursacomputing/crossbow/actions/runs/5154711065/jobs/9283486371)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "pitrou (via GitHub)" <gi...@apache.org>.

pitrou commented on PR #35860:
URL: https://github.com/apache/arrow/pull/35860#issuecomment-1572594662

   @jorisvandenbossche Do you want to take a look at the Python test changes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on a diff in pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on code in PR #35860:
URL: https://github.com/apache/arrow/pull/35860#discussion_r1213594152


##########
cpp/src/arrow/dataset/file_base.h:
##########
@@ -34,6 +34,7 @@
 #include "arrow/filesystem/filesystem.h"
 #include "arrow/io/file.h"
 #include "arrow/util/compression.h"
+#include "arrow/util/key_value_metadata.h"

Review Comment:
   I removed this include and explicitly included `arrow/type_fwd.h`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] anjakefala commented on a diff in pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.

anjakefala commented on code in PR #35860:
URL: https://github.com/apache/arrow/pull/35860#discussion_r1213642872


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5192,13 +5188,45 @@ def test_preserve_nullability_parquet(tempdir):
     # nullability of field is preserved
     assert dataset.to_table().schema.equals(schema_nullable)
 
-    table_no_null = pa.Table.from_arrays(array, schema=schema)
-
-    # we can specify the nullability of a field through the schema
-    pa.dataset.write_dataset(table_no_null, tempdir/"nulltest2", schema=schema_nullable)
+    pa.dataset.write_dataset(table, tempdir/"nulltest2", format="parquet")
     dataset = ds.dataset(tempdir/"nulltest2", format="parquet")
     assert dataset.to_table().schema.equals(schema_nullable)
 
+    pa.dataset.write_dataset([table, table], tempdir/"nulltest3", format="parquet")
+    dataset = ds.dataset(tempdir/"nulltest3", format="parquet")
+    assert dataset.to_table().schema.equals(schema_nullable)
+
+
+def test_preserve_field_metadata(tempdir):
+    schema_metadata = pa.schema([
+        pa.field("x", pa.int64(), metadata={b'foo': b'bar'}),
+        pa.field("y", pa.int64())])
+
+    schema_no_meta = pa.schema([
+        pa.field("x", pa.int64()),
+        pa.field("y", pa.int64())])
+
+    array = [[1, 2, 3], [None, 5, None]]
+    table = pa.Table.from_arrays(array, schema=schema_metadata)
+    table_no_meta = pa.Table.from_arrays(array, schema=schema_no_meta)
+
+    # If no schema is provided the schema of the first table will be used
+    pa.dataset.write_dataset([table, table_no_meta],

Review Comment:
   ```
   ds.write_dataset([table, table_no_meta],
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] anjakefala commented on a diff in pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.

anjakefala commented on code in PR #35860:
URL: https://github.com/apache/arrow/pull/35860#discussion_r1213643484


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5192,13 +5188,45 @@ def test_preserve_nullability_parquet(tempdir):
     # nullability of field is preserved
     assert dataset.to_table().schema.equals(schema_nullable)
 
-    table_no_null = pa.Table.from_arrays(array, schema=schema)
-
-    # we can specify the nullability of a field through the schema
-    pa.dataset.write_dataset(table_no_null, tempdir/"nulltest2", schema=schema_nullable)
+    pa.dataset.write_dataset(table, tempdir/"nulltest2", format="parquet")
     dataset = ds.dataset(tempdir/"nulltest2", format="parquet")
     assert dataset.to_table().schema.equals(schema_nullable)
 
+    pa.dataset.write_dataset([table, table], tempdir/"nulltest3", format="parquet")
+    dataset = ds.dataset(tempdir/"nulltest3", format="parquet")
+    assert dataset.to_table().schema.equals(schema_nullable)
+
+
+def test_preserve_field_metadata(tempdir):
+    schema_metadata = pa.schema([
+        pa.field("x", pa.int64(), metadata={b'foo': b'bar'}),
+        pa.field("y", pa.int64())])
+
+    schema_no_meta = pa.schema([
+        pa.field("x", pa.int64()),
+        pa.field("y", pa.int64())])
+
+    array = [[1, 2, 3], [None, 5, None]]
+    table = pa.Table.from_arrays(array, schema=schema_metadata)
+    table_no_meta = pa.Table.from_arrays(array, schema=schema_no_meta)
+
+    # If no schema is provided the schema of the first table will be used
+    pa.dataset.write_dataset([table, table_no_meta],
+                             tempdir / "field_metatest", format="parquet")
+    dataset = ds.dataset(tempdir / "field_metatest", format="parquet")
+    assert dataset.to_table().schema.equals(schema_metadata, check_metadata=True)
+
+    pa.dataset.write_dataset([table_no_meta, table],
+                             tempdir / "field_metatest2", format="parquet")
+    dataset = ds.dataset(tempdir / "field_metatest2", format="parquet")
+    assert not dataset.to_table().schema.equals(schema_metadata, check_metadata=True)
+
+    pa.dataset.write_dataset([table_no_meta, table],

Review Comment:
   ```
   ds.write_dataset([table_no_meta, table],
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] anjakefala commented on a diff in pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.

anjakefala commented on code in PR #35860:
URL: https://github.com/apache/arrow/pull/35860#discussion_r1213643209


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5192,13 +5188,45 @@ def test_preserve_nullability_parquet(tempdir):
     # nullability of field is preserved
     assert dataset.to_table().schema.equals(schema_nullable)
 
-    table_no_null = pa.Table.from_arrays(array, schema=schema)
-
-    # we can specify the nullability of a field through the schema
-    pa.dataset.write_dataset(table_no_null, tempdir/"nulltest2", schema=schema_nullable)
+    pa.dataset.write_dataset(table, tempdir/"nulltest2", format="parquet")
     dataset = ds.dataset(tempdir/"nulltest2", format="parquet")
     assert dataset.to_table().schema.equals(schema_nullable)
 
+    pa.dataset.write_dataset([table, table], tempdir/"nulltest3", format="parquet")
+    dataset = ds.dataset(tempdir/"nulltest3", format="parquet")
+    assert dataset.to_table().schema.equals(schema_nullable)
+
+
+def test_preserve_field_metadata(tempdir):
+    schema_metadata = pa.schema([
+        pa.field("x", pa.int64(), metadata={b'foo': b'bar'}),
+        pa.field("y", pa.int64())])
+
+    schema_no_meta = pa.schema([
+        pa.field("x", pa.int64()),
+        pa.field("y", pa.int64())])
+
+    array = [[1, 2, 3], [None, 5, None]]
+    table = pa.Table.from_arrays(array, schema=schema_metadata)
+    table_no_meta = pa.Table.from_arrays(array, schema=schema_no_meta)
+
+    # If no schema is provided the schema of the first table will be used
+    pa.dataset.write_dataset([table, table_no_meta],
+                             tempdir / "field_metatest", format="parquet")
+    dataset = ds.dataset(tempdir / "field_metatest", format="parquet")
+    assert dataset.to_table().schema.equals(schema_metadata, check_metadata=True)
+
+    pa.dataset.write_dataset([table_no_meta, table],

Review Comment:
   ```
   ds.write_dataset([table_no_meta, table],
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on PR #35860:
URL: https://github.com/apache/arrow/pull/35860#issuecomment-1571343210

   > The following do not:
   > 
   > pa.dataset.write_dataset([table_no_null, table], tempdir/"nulltest2", schema=schema_nullable, format="parquet")  
   > 
   > or
   > 
   > pa.dataset.write_dataset([table, table_no_null], tempdir/"nulltest2", schema=schema_nullable, format="parquet") 
   > 
   
   These lines failed for me with the following error:
   
   ```
   pyarrow/dataset.py:936: in write_dataset
       data = InMemoryDataset(data, schema=schema)
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
   >   raise ArrowTypeError(
   E   pyarrow.lib.ArrowTypeError: Item has schema
   E   x: int64
   E   y: int64
   E   which does not match expected schema
   E   x: int64 not null
   E   y: int64
   ```
   
   I thought this was supported and it took me a moment to track down what was going on.  The error is actually being raised before the C++ call to write the dataset.  Pyarrow is taking the two inputs (`table`, `table_no_null`) and trying to put them in an `InMemoryDataset` and specifying the schema.  The constructor for `InMemoryDataset` is verifying that all the tables it has been given have the same schema and throwing an error because it was given a table whose schema does not match the dataset's schema.
   
   If this is the same error you were getting then I think we can call this an invalid scenario and we don't have to support it (at least for this PR.  Arguably, you could evolve a table into the correct schema if adding it to an InMemoryDataset but that's a different feature).
   
   This is kind of confusing because @anjakefala and I were testing earlier and you are allowed to create an `InMemoryDataset` with tables / batches who have the same types / nullability but different field metadata.  So I created an additional python test case for field metadata which does verify the "two tables but mixed metadata can be overridden by an explicit schema" call.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on a diff in pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on code in PR #35860:
URL: https://github.com/apache/arrow/pull/35860#discussion_r1213329783


##########
cpp/src/arrow/dataset/file_base.h:
##########
@@ -463,15 +463,21 @@ struct ARROW_DS_EXPORT FileSystemDatasetWriteOptions {
 /// \brief Wraps FileSystemDatasetWriteOptions for consumption as compute::ExecNodeOptions
 class ARROW_DS_EXPORT WriteNodeOptions : public acero::ExecNodeOptions {
  public:
-  explicit WriteNodeOptions(
-      FileSystemDatasetWriteOptions options,
-      std::shared_ptr<const KeyValueMetadata> custom_metadata = NULLPTR)
-      : write_options(std::move(options)), custom_metadata(std::move(custom_metadata)) {}
+  explicit WriteNodeOptions(FileSystemDatasetWriteOptions options,
+                            std::shared_ptr<Schema> custom_schema = NULLPTR)
+      : write_options(std::move(options)), custom_schema(std::move(custom_schema)) {}
 
   /// \brief Options to control how to write the dataset
   FileSystemDatasetWriteOptions write_options;
-  /// \brief Optional metadata to attach to written batches
-  std::shared_ptr<const KeyValueMetadata> custom_metadata;

Review Comment:
   If it were a new feature I would argue it's not worth it (A user could technically use `DeclarationToSchema` to get the output schema of the plan leading up to the write and then attach custom metadata to that).  However, given we have already released `custom_metadata`, and I would like Acero's API to start being stable, I suppose I should set an example.  Thanks for the nudge.  I have restored `custom_metadata`



##########
cpp/src/arrow/dataset/file_base.cc:
##########
@@ -475,16 +475,38 @@ Result<acero::ExecNode*> MakeWriteNode(acero::ExecPlan* plan,
 
   const WriteNodeOptions write_node_options =
       checked_cast<const WriteNodeOptions&>(options);
-  const std::shared_ptr<const KeyValueMetadata>& custom_metadata =
-      write_node_options.custom_metadata;
+  const std::shared_ptr<Schema>& custom_schema = write_node_options.custom_schema;
   const FileSystemDatasetWriteOptions& write_options = write_node_options.write_options;
 
+  const std::shared_ptr<Schema>& input_schema = inputs[0]->output_schema();
+
+  if (custom_schema != nullptr) {
+    if (custom_schema->num_fields() != input_schema->num_fields()) {
+      return Status::Invalid(
+          "The provided custom_schema did not have the same number of fields as the "
+          "data.  The custom schema can only be used to add metadata / nullability to "
+          "fields and cannot change the type or number of fields.");
+    }
+    for (int field_idx = 0; field_idx < input_schema->num_fields(); field_idx++) {
+      if (!input_schema->field(field_idx)->type()->Equals(
+              custom_schema->field(field_idx)->type())) {
+        return Status::Invalid("The provided custom_schema specified type ",

Review Comment:
   Switched.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on PR #35860:
URL: https://github.com/apache/arrow/pull/35860#issuecomment-1572270978

   > I looked at the R bits and the change seems reasonable; however, I would second that we make this a non-source-breaking change.
   
   Thank you.  I have restored the old field so this is no longer a breaking change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on a diff in pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on code in PR #35860:
URL: https://github.com/apache/arrow/pull/35860#discussion_r1213350552


##########
cpp/src/arrow/dataset/file_base.h:
##########
@@ -463,13 +463,21 @@ struct ARROW_DS_EXPORT FileSystemDatasetWriteOptions {
 /// \brief Wraps FileSystemDatasetWriteOptions for consumption as compute::ExecNodeOptions
 class ARROW_DS_EXPORT WriteNodeOptions : public acero::ExecNodeOptions {
  public:
-  explicit WriteNodeOptions(
-      FileSystemDatasetWriteOptions options,
-      std::shared_ptr<const KeyValueMetadata> custom_metadata = NULLPTR)
-      : write_options(std::move(options)), custom_metadata(std::move(custom_metadata)) {}
+  explicit WriteNodeOptions(FileSystemDatasetWriteOptions options,
+                            std::shared_ptr<Schema> custom_schema = NULLPTR)

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org