Posted to issues@arrow.apache.org by "ildipo (via GitHub)" <gi...@apache.org> on 2023/05/24 03:12:29 UTC

[GitHub] [arrow] ildipo opened a new issue, #35730: [Python] write_dataset does not preserve non-nullable columns in schema

ildipo opened a new issue, #35730:
URL: https://github.com/apache/arrow/issues/35730

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
When writing a table whose schema has non-nullable columns using write_dataset, the nullability info is not preserved
   
To reproduce:
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   import datetime as dt
   import pyarrow.dataset as ds
   
   table = pa.Table.from_arrays(
       [[1, 2, 3],
        [None, 5, None],
        [dt.date(2023, 1, 1), dt.date(2023, 1, 2), dt.date(2023, 1, 3)]],
       schema=pa.schema([pa.field("x", pa.int64(), nullable=False),
                         pa.field("y", pa.int64(), nullable=True),
                         pa.field("date", pa.date32(), nullable=True)]))
   print(table.schema)
   # schema shows column 'x' as not null
   
   pq.write_to_dataset(table, "parquet_test1", partitioning=['date'], partitioning_flavor='hive')
   dataset = ds.dataset("parquet_test1", format="parquet", partitioning="hive")
   print(dataset.to_table().schema)
   # column 'x' is nullable: the not-null flag was lost
   
   pa.dataset.write_dataset(table, "parquet_test2", partitioning=['date'], partitioning_flavor='hive', format='parquet')
   dataset = ds.dataset("parquet_test2", format="parquet", partitioning="hive")
   print(dataset.to_table().schema)
   # column 'x' is nullable here as well
   
   ```
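   
   As a possible workaround until this is fixed, the expected schema can be passed explicitly when reading the dataset back. This is an untested sketch relying only on the documented `schema` argument of `pyarrow.dataset.dataset`; it restores the not-null flag on the reader side rather than fixing the written files, and the partitioning column must be part of the schema.
   
   ```python
   import pyarrow.dataset as ds
   
   # Hedged workaround sketch: override the inferred schema on read.
   # "parquet_test1" and table.schema refer to the reproducer above.
   dataset = ds.dataset("parquet_test1", format="parquet", partitioning="hive",
                        schema=table.schema)
   print(dataset.to_table().schema)  # 'x' should be int64 not null again
   ```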
   
   
   
   ### Component(s)
   
   Python




[GitHub] [arrow] jorisvandenbossche commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1561372323

   @ildipo Thanks for the report!
   
   The Parquet format doesn't have such a flag directly, but it stores nulls via repetition levels, and you can mark a field as "required". It seems that when writing individual tables to Parquet files we translate "not null" into required Parquet types, and when reading we convert a required field back to "not null":
   
   ```python
   >>> pq.write_table(table, "test_nullability.parquet")
   >>> pq.read_metadata("test_nullability.parquet").schema
   <pyarrow._parquet.ParquetSchema object at 0x7f21b778fec0>
   required group field_id=-1 schema {
     required int64 field_id=-1 x;
     optional int64 field_id=-1 y;
     optional int32 field_id=-1 date (Date);
   }
   >>> pq.read_table("test_nullability.parquet").schema
   x: int64 not null
   y: int64
   date: date32[day]
   ```
   
   So it seems this is supported in the Parquet module itself, and it must be something in the dataset API that loses this information. A quick guess is that it has to do with partitioning:
   
   ```python
   >>> pq.write_to_dataset(table, "test_dataset_nullability")
   # reading directory -> lost "not null"
   >>> ds.dataset("test_dataset_nullability/", format="parquet").schema
   x: int64
   y: int64
   date: date32[day]
   
   # reading single file -> preserved "not null"
   >>> ds.dataset("test_nullability.parquet", format="parquet").schema
   x: int64 not null
   y: int64
   date: date32[day]
   ```




[GitHub] [arrow] raulcd commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "raulcd (via GitHub)" <gi...@apache.org>.
raulcd commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1572889015

   @github-actions crossbow submit test-r-ubuntu-22.04




[GitHub] [arrow] anjakefala commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1572913429

   @thisisnic Would you be able to take a look?




[GitHub] [arrow] westonpace commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1573147552

   @raulcd 
   
   The error is a bit of a red herring.  It is not building Arrow-C++.  Instead it is downloading Arrow-C++.  If you look at a passing build (e.g. from the nightly tests) you can see:
   
   ```
   2023-05-30T01:07:19.3429074Z * installing *source* package ‘arrow’ ...
   2023-05-30T01:07:19.3429654Z ** using staged installation
   2023-05-30T01:07:19.3429994Z *** Found libcurl and OpenSSL >= 1.1
   2023-05-30T01:07:19.3430691Z trying URL 'https://nightlies.apache.org/arrow/r/libarrow/bin/linux-openssl-1.1/arrow-12.0.0.9000.zip'
   2023-05-30T01:07:19.3431226Z Error in download.file(from_url, to_file, quiet = hush) : 
   2023-05-30T01:07:19.3431942Z   cannot open URL 'https://nightlies.apache.org/arrow/r/libarrow/bin/linux-openssl-1.1/arrow-12.0.0.9000.zip'
   2023-05-30T01:07:19.3432612Z *** Downloading libarrow binary failed for version 12.0.0.9000 (linux-openssl-1.1)
   2023-05-30T01:07:19.3433276Z     at https://nightlies.apache.org/arrow/r/libarrow/bin/linux-openssl-1.1/arrow-12.0.0.9000.zip
   2023-05-30T01:07:19.3433789Z *** Found local C++ source: '/arrow/cpp'
   2023-05-30T01:07:19.3434126Z *** Building libarrow from source
   2023-05-30T01:07:19.3434552Z     For build options and troubleshooting, see the install guide:
   2023-05-30T01:07:19.3435014Z     https://arrow.apache.org/docs/r/articles/install.html
   ```
   
   On the other hand, if you look at these failing builds, you see:
   
   ```
   2023-06-01T22:45:52.2820480Z * installing *source* package ‘arrow’ ...
   2023-06-01T22:45:52.2820835Z ** using staged installation
   2023-06-01T22:45:52.2826960Z **** pkg-config not installed, setting ARROW_DEPENDENCY_SOURCE=BUNDLED
   2023-06-01T22:45:52.2827523Z *** Found libcurl and OpenSSL >= 1.1
   2023-06-01T22:45:52.2830096Z trying URL 'https://apache.jfrog.io/artifactory/arrow/r/12.0.0/libarrow/bin/linux-openssl-1.1/arrow-12.0.0.zip'
   2023-06-01T22:45:52.2830790Z Content type 'application/zip' length 40016664 bytes (38.2 MB)
   2023-06-01T22:45:52.2831184Z ==================================================
   2023-06-01T22:45:52.2835569Z downloaded 38.2 MB
   2023-06-01T22:45:52.2835774Z 
   2023-06-01T22:45:52.2836129Z *** Successfully retrieved C++ binaries (linux-openssl-1.1)
   ```
   
   So the nightly test looks for `12.0.0.9000` which, of course, doesn't exist.  Then it falls back to building from source.  This is what we want.
   
   The test build you've shared is looking for `12.0.0` (shouldn't this be `12.0.1`?). It finds it, and so it doesn't build Arrow-C++ from source.




[GitHub] [arrow] daniel-shields commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "daniel-shields (via GitHub)" <gi...@apache.org>.
daniel-shields commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1561645015

   Note that this is also broken when the 'schema' parameter is passed explicitly.
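   
   For concreteness, a sketch of the call meant here, using the documented `schema` argument of `pyarrow.dataset.write_dataset` (the output path is illustrative):
   
   ```python
   import pyarrow.dataset as ds
   
   # Even with the schema passed explicitly, the written files lose the
   # not-null flag (the bug reported here); "parquet_test3" is made up.
   ds.write_dataset(table, "parquet_test3", format="parquet",
                    schema=table.schema)
   ```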




[GitHub] [arrow] westonpace closed issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace closed issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema
URL: https://github.com/apache/arrow/issues/35730




[GitHub] [arrow] jorisvandenbossche commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1562446055

   Digging a bit further, this nullable field information is lost in Acero's ProjectNode (the `FileSystemDataset::Write` call is essentially a combination of source+project+filter+write nodes). 
   
   Small reproducer in python:
   
   ```python
   import pyarrow as pa
   from pyarrow.acero import Declaration, TableSourceNodeOptions, ProjectNodeOptions, field
   
   schema = pa.schema([pa.field("col1", "int64", nullable=True), pa.field("col2", "int64", nullable=False)])
   table = pa.table({"col1": [1, 2, 3], "col2": [2, 3, 4]}, schema=schema)
   table_source = Declaration("table_source", options=TableSourceNodeOptions(table))
   project = Declaration("project", ProjectNodeOptions([field("col1"), field("col2")]))
   decl = Declaration.from_sequence([table_source, project])
   
   >>> table.schema
   col1: int64
   col2: int64 not null
   >>> decl.to_table().schema
   col1: int64
   col2: int64
   ```
   
   This happens because the ProjectNode naively recreates the schema from the names/exprs, ignoring the field information of the original input schema:
   
   https://github.com/apache/arrow/blob/6bd31f37ae66bd35594b077cb2f830be57e08acd/cpp/src/arrow/acero/project_node.cc#L64-L75
   
   So this only preserves the types of the original input schema, but ignores any nullable flag or per-field metadata (and then we only have some special code to preserve the custom metadata of the full schema).
   
   @westonpace rereading your original comment, while your explanation first focused on the schema metadata, you actually also already said essentially the above:
   
   > That being said, `custom_metadata` may not be sufficient here. It only allows you to specify the key/value metadata for the schema, and not individual field metadata.
   
   But for what we need to do about this: shouldn't the ProjectNode just try to preserve this information for trivial field ref expressions?
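   
   A minimal Python sketch of what that could look like, modelling the idea with pyarrow types only (this is not Acero's actual C++ API; the real change would live in project_node.cc):
   
   ```python
   import pyarrow as pa
   
   def projected_schema(input_schema: pa.Schema, names: list[str]) -> pa.Schema:
       """For a projection made only of trivial field references, reuse the
       input fields wholesale so nullability and field metadata survive."""
       return pa.schema([input_schema.field(name) for name in names],
                        metadata=input_schema.metadata)
   
   schema = pa.schema([pa.field("col1", pa.int64(), nullable=True),
                       pa.field("col2", pa.int64(), nullable=False)])
   print(projected_schema(schema, ["col1", "col2"]))
   # col1: int64
   # col2: int64 not null
   ```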
   




[GitHub] [arrow] westonpace commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1561496846

   Yes, write_dataset is a bit tricky when it comes to schema information.  If the input is multiple tables, then write_dataset is probably going to be combining them into a single output table, so which metadata do we use?  What the write node does today is allow a `custom_metadata` to be supplied, in addition to the dataset, which it will attach to all written batches.
   
   Then we have a bit of a hack in place today for "If the input is a single table then preserve the metadata".  This is in `FileSystemDataset::Write` which is what pyarrow is using today:
   
   ```cpp
     // The projected_schema is currently used by pyarrow to preserve the custom metadata
     // when reading from a single input file.
     const auto& custom_metadata = scanner->options()->projected_schema->metadata();
   ```
   
   This `custom_metadata` is not currently exposed to `pyarrow`.  So I think we probably want to add it.
   
   That being said, `custom_metadata` may not be sufficient here.  It only allows you to specify the key/value metadata for the schema, and not individual field metadata.  So we'd need to change that too.  All put together we have:
   
    * Change `WriteNodeOptions::custom_metadata` to `WriteNodeOptions::schema`
    * Do one of the following:
      * Add `custom_schema` to `FileSystemDataset::Write`
      * Change `pyarrow` to use Acero (and WriteNodeOptions) directly instead of `FileSystemDataset::Write`
    * Add pyarrow bindings for whichever approach we did in the previous step
   
   
   
   




[GitHub] [arrow] raulcd commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "raulcd (via GitHub)" <gi...@apache.org>.
raulcd commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1573534316

   ok, I've finally realised this is the issue, not the PR :)




[GitHub] [arrow] anjakefala commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1572903060

   ```
   compute-exec.cpp: In function ‘void ExecPlan_Write(const std::shared_ptr<arrow::acero::ExecPlan>&, const std::shared_ptr<arrow::acero::ExecNode>&, const std::shared_ptr<arrow::Schema>&, const std::shared_ptr<arrow::dataset::FileWriteOptions>&, const std::shared_ptr<arrow::fs::FileSystem>&, std::string, const std::shared_ptr<arrow::dataset::Partitioning>&, std::string, arrow::dataset::ExistingDataBehavior, int, uint32_t, uint64_t, uint64_t, uint64_t)’:
   compute-exec.cpp:340:11: error: ‘class arrow::dataset::WriteNodeOptions’ has no member named ‘custom_schema’
     340 |   options.custom_schema = std::move(schema);
         |           ^~~~~~~~~~~~~
   make: *** [/usr/lib/R/etc/Makeconf:200: compute-exec.o] Error 1
   make: *** Waiting for unfinished jobs....
   ERROR: compilation failed for package ‘arrow’
   * removing ‘/arrow/r/check/arrow.Rcheck/arrow’
   
   Error: R CMD check found ERRORs
   ```




[GitHub] [arrow] ildipo commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "ildipo (via GitHub)" <gi...@apache.org>.
ildipo commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1561535722

   The behavior changed somewhere between Arrow 7 and 12; it used to work with Arrow 7.




[GitHub] [arrow] ildipo commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "ildipo (via GitHub)" <gi...@apache.org>.
ildipo commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1561543222

   
   I think we want the solution that is easier to backport to Arrow 12.




[GitHub] [arrow] westonpace commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1562199897

   > I think we want the solution that is easier to backport to Arrow 12
   
   Does it work if you set `use_legacy_dataset=True`?




[GitHub] [arrow] westonpace commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1565014569

   So here is the change that introduced this: https://github.com/apache/arrow/issues/31452
   
   Before the change we used to require that the schema be specified on the write node options. This was an unnecessary burden when you didn't care about any custom field information (since we've already calculated the schema).
   
   > But for what we need to do about this: shouldn't the ProjectNode just try to preserve this information for trivial field ref expressions?
   
   I think there is still the problem that we largely ignore nullability. We can't usually assume that all batches will have the same nullability. For example, imagine a scan node where we are scanning two different parquet files. One of the parquet files marks a column as nullable and the other does not. I suppose the correct answer, if Acero were nullability-aware and once evolution is a little more robust, would be to "evolve" the schema of the file with a nullable type to a non-nullable type so that we have a common input schema.
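   
   To illustrate that scenario, a small sketch (file names made up) where two Parquet files legitimately disagree on the nullability of the same column:
   
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq
   
   # Two files whose schemas differ only in the nullability of "x".
   nullable = pa.schema([pa.field("x", pa.int64(), nullable=True)])
   required = pa.schema([pa.field("x", pa.int64(), nullable=False)])
   pq.write_table(pa.table({"x": [1, 2]}, schema=nullable), "f_nullable.parquet")
   pq.write_table(pa.table({"x": [3, 4]}, schema=required), "f_required.parquet")
   
   # By default the dataset factory infers its schema from the first
   # fragment it inspects, so the unified schema depends on file order.
   print(ds.dataset(["f_nullable.parquet", "f_required.parquet"]).schema)
   ```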
   
   In the meantime, the quickest simple fix to this regression is to allow the user to specify an output schema instead of just key / value metadata.




[GitHub] [arrow] anjakefala commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1572909025

   Failure is occurring in this context: https://github.com/apache/arrow/pull/35860/files#diff-0d1ff6f17f571f6a348848af7de9c05ed588d3339f46dd3bcf2808489f7dca92R340




[GitHub] [arrow] jorisvandenbossche commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1562399718

   @weston note that this is not (AFAIU) about custom metadata, but just about how the arrow schema gets translated to a Parquet schema (or how the arrow schema gets changed throughout dataset writing).
   
   If we write a single file (directly using the Parquet file writer, not going through datasets), then a pyarrow field with nullable=False gets translated into a "required" parquet field:
   
   ```python
   >>> schema = pa.schema([pa.field("col1", "int64", nullable=True), pa.field("col2", "int64", nullable=False)])
   >>> table = pa.table({"col1": [1, 2, 3], "col2": [2, 3, 4]}, schema=schema)
   >>> table.schema
   col1: int64
   col2: int64 not null
   
   >>> pq.write_table(table, "test_nullability.parquet")
   >>> pq.read_metadata("test_nullability.parquet").schema
   <pyarrow._parquet.ParquetSchema object at 0x7f21957c9700>
   required group field_id=-1 schema {
     optional int64 field_id=-1 col1;
     required int64 field_id=-1 col2;       # <--- this is "required" instead of "optional"
   }
   ```
   
   But if we write this as a single file (in a directory) through the dataset API (so not even using a partitioning column), the non-nullable column is no longer "required" in the parquet field:
   
   ```python
   >>> ds.write_dataset(table, "test_dataset_nullability/", format="parquet")
   >>> pq.read_metadata("test_dataset_nullability/part-0.parquet").schema
   <pyarrow._parquet.ParquetSchema object at 0x7f219d16cfc0>
   required group field_id=-1 schema {
     optional int64 field_id=-1 col1;
     optional int64 field_id=-1 col2;        # <--- no longer "required" !
   }
   ```
   
   So I suppose that somewhere in the dataset writing code path, the schema loses the field nullability information.
   
   > The behavior changed sometime between arrow 7 and 12 since it used to work with arrow 7
   
   I suppose this is because `pq.write_to_dataset` now uses `pyarrow.dataset.write_dataset` under the hood, i.e. it goes through the dataset API, while the "legacy" implementation of `pq.write_to_dataset` used the direct Parquet file writer (and then it comes down to the difference between those two as illustrated above).




[GitHub] [arrow] westonpace commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1562200611

   Nvm, I see this is `write_dataset` and not `write_to_dataset`.  Were you using `write_dataset` in 7.0.0?




[GitHub] [arrow] raulcd commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "raulcd (via GitHub)" <gi...@apache.org>.
raulcd commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1573531379

   @github-actions crossbow submit test-r-ubuntu-22.04




[GitHub] [arrow] anjakefala commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1572911995

   I do see `custom_schema` being declared in the header for WriteNodeOptions: https://github.com/apache/arrow/pull/35860/files#diff-16fb00f643139995febaccf5315f70d5173f2c378b883cb3c2d25b614f6f4493R482




[GitHub] [arrow] ildipo commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "ildipo (via GitHub)" <gi...@apache.org>.
ildipo commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1562779504

   > But for what we need to do about this: shouldn't the ProjectNode just try to preserve this information for trivial field ref expressions?
   
   If this is enough it should be pretty quick.




[GitHub] [arrow] ildipo commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "ildipo (via GitHub)" <gi...@apache.org>.
ildipo commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1562778495

   In 7.0 we were using `write_dataset` and it was working. With 12 we tried `write_dataset` and `write_to_dataset`, as well as `use_legacy_dataset=True`, and none of them works.




[GitHub] [arrow] raulcd commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "raulcd (via GitHub)" <gi...@apache.org>.
raulcd commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1572889933

   The two jobs above have failed on the maintenance branch (https://github.com/ursacomputing/crossbow/actions/runs/5149898329/jobs/9273436127 and https://github.com/ursacomputing/crossbow/actions/runs/5149898179/jobs/9273436054). I am validating the status here, as they seem related to this change.




[GitHub] [arrow] raulcd commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

Posted by "raulcd (via GitHub)" <gi...@apache.org>.
raulcd commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1572889185

   @github-actions crossbow submit test-r-versions

