Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/10/24 18:18:00 UTC

[jira] [Commented] (ARROW-18140) The metadata info will lost in parquet file schema after writing the parquet file by calling the FileSystemDataset::Write() method.

    [ https://issues.apache.org/jira/browse/ARROW-18140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623326#comment-17623326 ] 

Weston Pace commented on ARROW-18140:
-------------------------------------

This could definitely be improved.  The write node in Acero takes a single {{std::shared_ptr<const KeyValueMetadata> custom_metadata}} field, which is attached to every written file.  At the moment, {{FileSystemDataset::Write}} fills this field with the metadata from the scanner's projected schema:

{noformat}
  // The projected_schema is currently used by pyarrow to preserve the custom metadata
  // when reading from a single input file.
  const auto& custom_metadata = scanner->options()->projected_schema->metadata();

  RETURN_NOT_OK(
      compute::Declaration::Sequence(
          {
              {"scan", ScanNodeOptions{dataset, scanner->options()}},
              {"filter", compute::FilterNodeOptions{scanner->options()->filter}},
              {"project",
               compute::ProjectNodeOptions{std::move(exprs), std::move(names)}},
              {"write", WriteNodeOptions{write_options, custom_metadata}},
          })
          .AddToPlan(plan.get()));
{noformat}

This is not very user friendly.  It works this way only because the migration from the old capabilities has been slow, and because this happens to be how pyarrow invokes the datasets API.  I think it is possible to use this today, but you would have to create the scan options without the ScannerBuilder, because the ScannerBuilder doesn't allow you to set the projected schema directly.

That being said, it should be fairly simple to add a "custom_metadata" argument to {{FileSystemDataset::Write}}.  If that argument isn't null, we should use it instead of the projected schema's metadata (and probably migrate pyarrow to this call too).
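
To make the proposed fallback concrete, here is a minimal sketch of the selection rule only.  It is not Arrow code: {{Metadata}} is a stand-in for {{KeyValueMetadata}} so the snippet compiles without Arrow, and {{ResolveWriteMetadata}} is a hypothetical helper name, not an existing function.

```cpp
#include <map>
#include <memory>
#include <string>

// Stand-in for arrow::KeyValueMetadata (assumption, keeps the sketch Arrow-free).
using Metadata = std::map<std::string, std::string>;

// Hypothetical helper: chooses the metadata that would be handed to the write node.
// An explicit custom_metadata wins; otherwise fall back to the projected schema's
// metadata, which is what FileSystemDataset::Write uses today.
std::shared_ptr<const Metadata> ResolveWriteMetadata(
    std::shared_ptr<const Metadata> custom_metadata,
    std::shared_ptr<const Metadata> projected_schema_metadata) {
  if (custom_metadata != nullptr) {
    return custom_metadata;
  }
  return projected_schema_metadata;
}
```

With this rule, existing callers that pass nothing keep the current behaviour, while a caller such as the one in the issue could pass the {{foo}}/{{bar}} metadata explicitly.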

> The metadata info will lost in parquet file schema after writing the parquet file by calling the FileSystemDataset::Write() method.
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-18140
>                 URL: https://issues.apache.org/jira/browse/ARROW-18140
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Ke Jia
>            Priority: Major
>
> This issue can be reproduced by the following code.
> auto format = std::make_shared<ParquetFileFormat>();
> auto fs = std::make_shared<fs::internal::MockFileSystem>(fs::kNoTime);
> FileSystemDatasetWriteOptions write_options;
> write_options.file_write_options = format->DefaultWriteOptions();
> write_options.filesystem = fs;
> write_options.base_dir = "root";
> write_options.partitioning = std::make_shared<HivePartitioning>(schema({}));
> write_options.basename_template = "{i}.parquet";
> auto metadata =
>     std::shared_ptr<KeyValueMetadata>(new KeyValueMetadata({"foo"}, {"bar"}));
> auto dataset_schema = schema({field("a", int64())}, metadata);
> RecordBatchVector batches{
>     ConstantArrayGenerator::Zeroes(kRowsPerBatch, dataset_schema)};
> ASSERT_EQ(0, batches[0]->column(0)->null_count());
> auto dataset = std::make_shared<InMemoryDataset>(dataset_schema, batches);
> ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());
> ASSERT_OK(scanner_builder->Project(
>     {compute::call("add", {compute::field_ref("a"), compute::literal(1)})},
>     {"a_plus_one"}));
> ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
> // Before write the schema has the metadata info.
> ASSERT_EQ(1, dataset_schema->HasMetadata());
> ASSERT_OK(FileSystemDataset::Write(write_options, scanner));
> ASSERT_OK_AND_ASSIGN(auto dataset_factory, FileSystemDatasetFactory::Make(
>                                                fs, {"root/0.parquet"}, format, {}));
> ASSERT_OK_AND_ASSIGN(auto written_dataset, dataset_factory->Finish(FinishOptions{}));
> // After write the schema does not have the metadata info.
> ASSERT_EQ(0, written_dataset->schema()->HasMetadata());


