Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/12/10 10:43:00 UTC

[jira] [Commented] (ARROW-10862) Overriding Parquet schema when loading to SageMaker to inspect bad data

    [ https://issues.apache.org/jira/browse/ARROW-10862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247158#comment-17247158 ] 

Joris Van den Bossche commented on ARROW-10862:
-----------------------------------------------

Hi [~jimmy_ds], thanks for the report. Would you be able to provide a reproducible example? For example, could you share a (subset of a) parquet file that shows the issue, or some code that generates a dummy parquet file with the same behaviour?

bq. Can you either allow schemas to be overridden or, even better, suggest a way to load a Parquet file where some of the dates in a date column are '0001-01-01'?

Currently, when passing a {{schema}} to ParquetDataset, this schema is (I think) only used to _validate_ that the parquet files indeed have that schema, i.e. as a safety check, and not to "override" the schema (this safety check can be disabled with {{validate_schema=False}}).
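To illustrate, a minimal sketch of that behaviour (the dataset path and the expected schema here are hypothetical, not taken from your report):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# hypothetical schema we expect the files to have
expected_schema = pa.schema([("a", pa.date32())])

# with the default validate_schema=True, a file whose schema differs from
# expected_schema raises an error; with validate_schema=False the check is
# simply skipped. The files are still read with their own embedded schema,
# nothing is overridden.
dataset = pq.ParquetDataset("my_dataset/", schema=expected_schema,
                            validate_schema=False)
table = dataset.read()
{code}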

Now, I'm wondering, for your specific case, what the schema incompatibility is precisely. You mention some dates being "0001-01-01"? But generally speaking, both Parquet and pyarrow should be able to handle such dates:

{code:python}
# workaround to create an array of dates from a string
>>> import pyarrow as pa
>>> arr = pa.array(["0001-01-01"]).cast(pa.timestamp('s')).cast(pa.date32())
>>> table = pa.table({"a": arr})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, "test_date.parquet")
>>> pq.read_metadata("test_date.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7ff3f3e90e00>
required group field_id=0 schema {
  optional int32 field_id=1 a (Date);
}
>>> pq.read_table("test_date.parquet")
pyarrow.Table
a: date32[day]
{code}
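Also, in case the error only occurs when converting to pandas (as in your {{read_pandas().to_pandas()}} call): dates like 0001-01-01 are out of bounds for pandas' nanosecond-resolution timestamps, but {{to_pandas}} returns date columns as {{datetime.date}} objects by default in recent pyarrow versions ({{date_as_object=True}}), which sidesteps that limit. Continuing the example above:

{code:python}
>>> df = pq.read_table("test_date.parquet").to_pandas(date_as_object=True)
>>> df["a"][0]
datetime.date(1, 1, 1)
{code}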



> Overriding Parquet schema when loading to SageMaker to inspect bad data
> -----------------------------------------------------------------------
>
>                 Key: ARROW-10862
>                 URL: https://issues.apache.org/jira/browse/ARROW-10862
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>         Environment: AWS SageMaker / S3
>            Reporter: Mr James Kelly
>            Priority: Major
>
> Following SO post: [https://stackoverflow.com/questions/53725691/pyarrow-lib-schema-vs-pyarrow-parquet-schema]
> I am attempting to find a way to override a parquet schema for a parquet file stored in S3. One date column has some bad dates, which cause the load of the entire parquet file to fail.
> I have tried defining a schema, but get AttributeError: 'pyarrow.lib.schema' object has no attribute 'to_arrow_schema', the same error as in the SO post above.
> I have attempted to use the workaround suggested by Wes McKinney above by creating a dummy df, saving that to parquet, reading the schema from it and replacing the embedded schema in the parquet file with my replacement:
> pq.ParquetDataset(my_filepath, filesystem=s3, schema=dummy_schema).read_pandas().to_pandas()
> I get an error message telling me that my schema is different! (It was supposed to be!)
>  
> Can you either allow schemas to be overridden or, even better, suggest a way to load a Parquet file where some of the dates in a date column are '0001-01-01'?
>  
> Thanks,
> James Kelly
>  


