Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2022/03/10 11:37:00 UTC

[jira] [Commented] (ARROW-15899) [C++] Writes broken file or incorrect data when nullable=False

    [ https://issues.apache.org/jira/browse/ARROW-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504201#comment-17504201 ] 

Antoine Pitrou commented on ARROW-15899:
----------------------------------------

Hmm, we should probably check the {{nullable}} flag when validating (currently we don't).
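
For reference, a table-level check along those lines could look roughly like the sketch below (a minimal Python sketch of the idea, not the eventual C++ validation; the helper name {{validate_non_nullable}} is made up here):

{code:python}
import pyarrow as pa

def validate_non_nullable(table: pa.Table) -> None:
    # Hypothetical helper: raise if a field declared nullable=False
    # actually contains null values.
    for i, field in enumerate(table.schema):
        null_count = table.column(i).null_count
        if not field.nullable and null_count > 0:
            raise pa.lib.ArrowInvalid(
                f"Column '{field.name}' is declared non-nullable "
                f"but contains {null_count} null value(s)"
            )

schema = pa.schema([pa.field('a_string', pa.string(), nullable=False)])
t = pa.table([['0', '1', None, '3', '4']], schema=schema)
validate_non_nullable(t)  # would raise ArrowInvalid here
{code}

Something equivalent on the C++ side would presumably belong in the full-validation path.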

> [C++] Writes broken file or incorrect data when nullable=False
> --------------------------------------------------------------
>
>                 Key: ARROW-15899
>                 URL: https://issues.apache.org/jira/browse/ARROW-15899
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 6.0.0, 6.0.1, 7.0.0, 7.0.1
>            Reporter: Rácz Dániel
>            Priority: Major
>
> When writing a pyarrow table to Parquet with a provided schema, if the schema contains a field with `nullable=false` but the data holds an actual null value, the resulting Parquet file either
>  * cannot be read back, or
>  * the columns get `pushed up` and the whole table becomes inconsistent. The affected column seemingly drops the null value and packs the remaining data together according to the provided row_group_size (wrapping around to the start of the column when it runs out of values), so different row group sizes lead to different results. This off-by-one problem persists within a single row group; the next row group can be perfectly fine if it contains no null values.
>  
> I believe none of these behaviours are intentional, but they are easily overlooked by the user, as one might expect that providing a schema with constraints would lead to at least a warning or (better) an exception when writing the file. The provided validation methods also report no problem with this data.
> You can find a snippet below demonstrating this behaviour.
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> field_name = 'a_string'
> schema = pa.schema([
>     pa.field(name=field_name, type=pa.string(), nullable=False) # not nullable
> ])
> # Arrow Columnar Format doesn't care if a non-nullable field holds a null
> t_out = pa.table([['0', '1', None, '3', '4']], schema=schema) # OK
> t_out.validate(full=True) # OK
> t_out.cast(schema, safe=True) # OK
> # Parquet writing does not raise, but silently kills the null string
> # because of the REQUIRED-ness of the field in the schema.
> # Then you either cannot read the parquet back, or the data read back
> # is incorrect, depending on the written row_group_size.
> pq.write_table(t_out, where='pq_1', row_group_size=1)
> pq.read_table('pq_1')
> # -> OSError: Unexpected end of stream
> pq.write_table(t_out, where='pq_2', row_group_size=2)
> pq.read_table('pq_2')
> # -> OSError: Unexpected end of stream
> # -> or sometimes: pyarrow.lib.ArrowInvalid: Index not in dictionary bounds
> pq.write_table(t_out, where='pq_3', row_group_size=3)
> print(pq.read_table('pq_3')[field_name])
> # -> [["0","1","0"],["3","4"]]
> pq.write_table(t_out, where='pq_4', row_group_size=4)
> print(pq.read_table('pq_4')[field_name])
> # -> [["0","1","3","0"],["4"]]
> pq.write_table(t_out, where='pq_5', row_group_size=5)
> print(pq.read_table('pq_5')[field_name])
> # -> [["0","1","3","4","0"]]{code}


