Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/06/01 05:03:15 UTC

[GitHub] [arrow] westonpace commented on pull request #35860: GH-35730: [C++] Add the ability to specify custom schema on a dataset write

westonpace commented on PR #35860:
URL: https://github.com/apache/arrow/pull/35860#issuecomment-1571343210

   > The following do not:
   > 
   > pa.dataset.write_dataset([table_no_null, table], tempdir/"nulltest2", schema=schema_nullable, format="parquet")  
   > 
   > or
   > 
   > pa.dataset.write_dataset([table, table_no_null], tempdir/"nulltest2", schema=schema_nullable, format="parquet") 
   > 
   
   These lines failed for me with the following error:
   
   ```
   pyarrow/dataset.py:936: in write_dataset
       data = InMemoryDataset(data, schema=schema)
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
   >   raise ArrowTypeError(
   E   pyarrow.lib.ArrowTypeError: Item has schema
   E   x: int64
   E   y: int64
   E   which does not match expected schema
   E   x: int64 not null
   E   y: int64
   ```
   
   I thought this was supported, and it took me a moment to track down what was going on.  The error is actually raised before the C++ call to write the dataset.  PyArrow takes the two inputs (`table`, `table_no_null`), puts them in an `InMemoryDataset`, and specifies the schema.  The `InMemoryDataset` constructor verifies that every table it is given has the same schema as the dataset, and it throws because one table's schema does not match.
   
   If this is the same error you were getting, then I think we can call this an invalid scenario and we don't have to support it (at least for this PR).  Arguably, you could evolve a table into the correct schema when adding it to an `InMemoryDataset`, but that's a different feature.
   
   This is kind of confusing because @anjakefala and I were testing earlier and found that you are allowed to create an `InMemoryDataset` from tables/batches that have the same types and nullability but different field metadata.  So I added a python test case for field metadata which verifies the "two tables with mixed metadata can be overridden by an explicit schema" call.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org