Posted to user@arrow.apache.org by Arthur Andres <ar...@gmail.com> on 2022/07/11 21:20:42 UTC

[Python] pa.Field.nullable

Hi all,

Is the behaviour of pa.Field.nullable documented somewhere?

I had some expectations of what it does. For example, I expected it to
ensure that a column declared with nullable=False cannot contain
null/missing values. But that doesn't seem to be the case.

```
import pyarrow as pa

schema = pa.schema(
    [
        pa.field("nullable_true", pa.string(), nullable=True),
        pa.field("nullable_false", pa.string(), nullable=False),
    ]
)

table = pa.Table.from_arrays(
    [
        pa.array(["", "foo", None], pa.string()),
        pa.array(["", "foo", None], pa.string()),
    ],
    schema=schema,
)

assert table.schema == schema
assert table['nullable_true'].null_count == 1
assert table['nullable_false'].null_count == 1
assert table.validate() is None
assert table.validate(full=True) is None
```

The only place where I've seen the nullable flag being used is when casting
a nested column from nullable to non-nullable:

```
import pyarrow as pa

struct_array = pa.StructArray.from_arrays(
    [
        pa.array(["", "foo", None], pa.string()),
    ],
    names=["nested_col_level_1"],
)
nested_table = pa.Table.from_arrays([struct_array], names=["nested_col_level_0"])
assert nested_table.validate(full=True) is None
assert nested_table.validate() is None

nested_table.cast(
    pa.schema(
        [
            pa.field(
                "nested_col_level_0",
                pa.struct(
                    [pa.field("nested_col_level_1", pa.string(), nullable=False)]
                ),
            )
        ]
    )
)
```

Thanks for your help!

Re: [Python] pa.Field.nullable

Posted by Arthur Andres <ar...@gmail.com>.
Hi David,

Thanks for your reply, I'll keep an eye on that PR.

On Wed, 13 Jul 2022 at 17:43, David Li <li...@apache.org> wrote:

> At the moment I think it's mostly metadata, but there is a PR that
> validates non-nullable fields indeed do not contain nulls. [1]
>
> There are places in compute kernels that optimize based on the
> presence/absence of nulls but they do so mostly by looking at the physical
> data and not the type (so the optimization will still apply if there just
> happen to not be nulls).
>
> [1]: https://github.com/apache/arrow/pull/12706

Re: [Python] pa.Field.nullable

Posted by David Li <li...@apache.org>.
At the moment I think it's mostly metadata, but there is a PR that validates that non-nullable fields do not in fact contain nulls. [1]

There are places in compute kernels that optimize based on the presence/absence of nulls, but they do so mostly by looking at the physical data and not the type (so the optimization will still apply if there just happen to be no nulls).

[1]: https://github.com/apache/arrow/pull/12706
