Posted to issues@arrow.apache.org by "David Lee (JIRA)" <ji...@apache.org> on 2018/12/03 05:13:00 UTC
[jira] [Commented] (ARROW-3918) [Python] ParquetWriter.write_table doesn't support coerce_timestamps or allow_truncated_timestamps

    [ https://issues.apache.org/jira/browse/ARROW-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706664#comment-16706664 ] 

David Lee commented on ARROW-3918:
----------------------------------

Passed them into ParquetWriter and it still gives the same error:

File "../python3.6/site-packages/pyarrow/parquet.py", line 374, in write_table
 raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
table:
Id: string
modified: timestamp[ms]
converter: string
records: int32
metadata
--------
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
 b' "Id", "field_name": "Id", "'
 b'pandas_type": "unicode", "numpy_type": "object", "metadata": nul'
 b'l}, {"name": "modified", "field_name": "modified", "pandas_type"'
 b': "datetime", "numpy_type": "datetime64[ns]", "metadata": null},'
 b' {"name": "converter", "field_name": "converter", "pandas_type":'
 b' "unicode", "numpy_type": "object", "metadata": null}, {"name": '
 b'"records", "field_name": "records", "pandas_type": "int32", "num'
 b'py_type": "int64", "metadata": null}], "pandas_version": "0.23.4'
 b'"}'} vs.
file:
Id: string
modified: timestamp[ms]
converter: string
records: int32
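
For reference, the pandas metadata in the error above can be decoded with the standard library to list what it records for each column. This is only a sketch: the bytes are copied from the error message, and column_types is a hypothetical helper name, not part of pyarrow.

```python
import json

# Value of the b'pandas' schema metadata key, copied from the error message.
PANDAS_METADATA = (
    b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
    b' "Id", "field_name": "Id", "'
    b'pandas_type": "unicode", "numpy_type": "object", "metadata": nul'
    b'l}, {"name": "modified", "field_name": "modified", "pandas_type"'
    b': "datetime", "numpy_type": "datetime64[ns]", "metadata": null},'
    b' {"name": "converter", "field_name": "converter", "pandas_type":'
    b' "unicode", "numpy_type": "object", "metadata": null}, {"name": '
    b'"records", "field_name": "records", "pandas_type": "int32", "num'
    b'py_type": "int64", "metadata": null}], "pandas_version": "0.23.4'
    b'"}'
)


def column_types(raw):
    """Map each column name to its (pandas_type, numpy_type) pair."""
    meta = json.loads(raw.decode("utf-8"))
    return {
        col["name"]: (col["pandas_type"], col["numpy_type"])
        for col in meta["columns"]
    }


types = column_types(PANDAS_METADATA)
for name, (pandas_type, numpy_type) in types.items():
    print(f"{name}: pandas_type={pandas_type}, numpy_type={numpy_type}")
```

The metadata records modified as datetime / datetime64[ns] and records as int32 / int64, while the Arrow fields themselves print as timestamp[ms] and int32 in both schemas.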

Code:

 
{code:python}
import os
import pyarrow as pa
import pyarrow.parquet as pq

processed_schema = pa.schema([
    pa.field('Id', pa.string()),
    pa.field('modified', pa.timestamp('ms')),
    pa.field('converter', pa.string()),
    pa.field('records', pa.int32())
])

if len(arrow_tables) > 0:
    # The timestamp options are passed to the ParquetWriter constructor.
    writer = pq.ParquetWriter(
        os.path.join(self.conf['work_dir'], processed_file),
        schema=processed_schema,
        use_dictionary=True,
        compression='snappy',
        coerce_timestamps='ms',
        allow_truncated_timestamps=True,
    )

    for v in arrow_tables:
        writer.write_table(v)
    writer.close()
{code}
 

 

> [Python] ParquetWriter.write_table doesn't support coerce_timestamps or allow_truncated_timestamps
> --------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-3918
>                 URL: https://issues.apache.org/jira/browse/ARROW-3918
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.11.1
>            Reporter: David Lee
>            Priority: Major
>
> Error: Table Schema does not match schema used to create file.
> The 0.11.1 release added these parameters to pyarrow.parquet.write_table(), but they are missing from pyarrow.parquet.ParquetWriter.write_table(). The error reports a mismatch between the table schema and the file schema, yet the two look identical in the error message: the modified column is timestamp[ms] in both. The only thing that looks odd is the pandas metadata, which lists the modified column with a pandas type of datetime and a numpy type of datetime64[ns].
>  


