Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/12/10 10:43:00 UTC
[jira] [Updated] (ARROW-10862) [Python] Overriding Parquet schema when loading to SageMaker to inspect bad data
[ https://issues.apache.org/jira/browse/ARROW-10862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-10862:
------------------------------------------
Summary: [Python] Overriding Parquet schema when loading to SageMaker to inspect bad data (was: Overriding Parquet schema when loading to SageMaker to inspect bad data)
> [Python] Overriding Parquet schema when loading to SageMaker to inspect bad data
> --------------------------------------------------------------------------------
>
> Key: ARROW-10862
> URL: https://issues.apache.org/jira/browse/ARROW-10862
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Python
> Environment: AWS SageMaker / S3
> Reporter: Mr James Kelly
> Priority: Major
>
> Following SO post: [https://stackoverflow.com/questions/53725691/pyarrow-lib-schema-vs-pyarrow-parquet-schema]
> I am attempting to find a way to override the schema of a Parquet file stored in S3. One date column has some bad dates, which cause the load of the entire Parquet file to fail.
> I have tried defining a schema, but get AttributeError: 'pyarrow.lib.schema' object has no attribute 'to_arrow_schema'. This is the same error as in the SO post above.
> I have attempted to use the workaround suggested by Wes McKinney above by creating a dummy df, saving that to parquet, reading the schema from it and replacing the embedded schema in the parquet file with my replacement:
> pq.ParquetDataset(my_filepath,filesystem = s3, schema=dummy_schema).read_pandas().to_pandas()
> I get an error message telling me that my schema is different! (It was supposed to be!)
>
> Can you either allow schemas to be overridden or, even better, suggest a way to load a Parquet file where some of the dates in a date column are '0001-01-01'?
>
> Thanks,
> James Kelly
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)