Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/12/10 10:43:00 UTC

[jira] [Updated] (ARROW-10862) [Python] Overriding Parquet schema when loading to SageMaker to inspect bad data

     [ https://issues.apache.org/jira/browse/ARROW-10862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-10862:
------------------------------------------
    Summary: [Python] Overriding Parquet schema when loading to SageMaker to inspect bad data  (was: Overriding Parquet schema when loading to SageMaker to inspect bad data)

> [Python] Overriding Parquet schema when loading to SageMaker to inspect bad data
> --------------------------------------------------------------------------------
>
>                 Key: ARROW-10862
>                 URL: https://issues.apache.org/jira/browse/ARROW-10862
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>         Environment: AWS SageMaker / S3
>            Reporter: Mr James Kelly
>            Priority: Major
>
> Following SO post: [https://stackoverflow.com/questions/53725691/pyarrow-lib-schema-vs-pyarrow-parquet-schema]
> I am trying to override the schema of a Parquet file stored in S3. One date column contains some bad dates, which cause the load of the entire Parquet file to fail.
> I have tried defining a schema, but get AttributeError: 'pyarrow.lib.schema' object has no attribute 'to_arrow_schema'. This is the same error as in the SO post above.
> I have also tried the workaround suggested by Wes McKinney in that post: creating a dummy df, saving it to Parquet, reading the schema from that file, and passing it in place of the schema embedded in my Parquet file:
> pq.ParquetDataset(my_filepath, filesystem=s3, schema=dummy_schema).read_pandas().to_pandas()
> I get an error message telling me that my schema is different! (It was supposed to be!)
>  
> Can you either allow schemas to be overridden or, even better, suggest a way to load a Parquet file where some of the dates in a date column are '0001-01-01'?
>  
> Thanks,
> James Kelly
>  
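
The report does not include the traceback, so the following is only a sketch of one likely cause, assuming the failure happens when the date column is converted to a pandas nanosecond-resolution timestamp (which is stored as a signed 64-bit count of nanoseconds since 1970-01-01). The helper name fits_in_ns_timestamp is hypothetical and uses only the Python standard library; it checks whether a date survives that conversion.

```python
from datetime import date

# Assumption: the target type stores nanoseconds since 1970-01-01 in a signed 64-bit integer.
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
NS_PER_DAY = 86_400 * 10**9
EPOCH = date(1970, 1, 1)


def fits_in_ns_timestamp(d: date) -> bool:
    """Return True if d, expressed as nanoseconds since the epoch, fits in int64."""
    ns = (d - EPOCH).days * NS_PER_DAY
    return INT64_MIN <= ns <= INT64_MAX


print(fits_in_ns_timestamp(date(2020, 12, 10)))  # True
print(fits_in_ns_timestamp(date(1, 1, 1)))       # False: far below the int64 range
```

If this is indeed the cause, overriding the Parquet schema would not be needed. Reading the file into a pyarrow Table and converting with Table.to_pandas(timestamp_as_object=True), or keeping date columns as Python objects with date_as_object=True, may avoid the overflow, though this has not been checked against the reporter's file.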



--
This message was sent by Atlassian Jira
(v8.3.4#803005)