Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/22 16:44:32 UTC

[GitHub] [arrow] losze1cj opened a new issue #8025: ParquetWriter creates bad files when passed pyarrow schema from pyarrow table?

losze1cj opened a new issue #8025:
URL: https://github.com/apache/arrow/issues/8025


   It seems that the ParquetWriter doesn't behave as expected when I pass it a pyarrow schema taken from a pyarrow table.  Approaching the problem in two ways, I notice unexpected behavior.
   
   If I construct a pyarrow schema from the datatypes, I get a schema that has no metadata attached:
   ```
   print(pyarrow_schema)
   ---
   sample_column1: string
   sample_column2: date32[day]
   sample_column3: float
   ``` 
   Binding that to the ParquetWriter and a pyarrow table and writing it out:
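   ```
   import pyarrow as pa
   import pyarrow.parquet as pq
   # out_io is the output sink; df starts out as a pandas DataFrame
   pqwriter = pq.ParquetWriter(out_io, schema=pyarrow_schema, compression='snappy')
   df = pa.Table.from_pandas(df, schema=pyarrow_schema)
   pqwriter.write_table(table=df)
   ```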
   I get the expected result: a queryable, well-formed Parquet file.  I'm adding an external schema on top of the file to query it through Redshift Spectrum.
   
   However, if I create the schema, bind it to the table, and then bind the table's schema to the ParquetWriter, the result is a bad Parquet file.
   ```
   df = pa.Table.from_pandas(df, schema=pyarrow_schema)
   pqwriter = pq.ParquetWriter(out_io, schema=df.schema, compression='snappy')
   pqwriter.write_table(table=df)
   ```
   
   What I notice is that the schema coming from the pyarrow table comes with attached metadata, but removing the metadata does not seem to solve the issue.
   ```
   df = pa.Table.from_pandas(df, schema=pyarrow_schema)
   pqwriter = pq.ParquetWriter(out_io, schema=df.schema.remove_metadata(), compression='snappy')
   pqwriter.write_table(table=df)
   ```
   
   Should I report a bug?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] losze1cj commented on issue #8025: ParquetWriter creates bad files when passed pyarrow schema from pyarrow table?

Posted by GitBox <gi...@apache.org>.
losze1cj commented on issue #8025:
URL: https://github.com/apache/arrow/issues/8025#issuecomment-678663628


   Using pyarrow==0.15.1
   pandas==0.25.3
   





[GitHub] [arrow] wesm closed issue #8025: ParquetWriter creates bad files when passed pyarrow schema from pyarrow table?

Posted by GitBox <gi...@apache.org>.
wesm closed issue #8025:
URL: https://github.com/apache/arrow/issues/8025


   





[GitHub] [arrow] wesm commented on issue #8025: ParquetWriter creates bad files when passed pyarrow schema from pyarrow table?

Posted by GitBox <gi...@apache.org>.
wesm commented on issue #8025:
URL: https://github.com/apache/arrow/issues/8025#issuecomment-688901876


   Ping. Closing in the meantime 





[GitHub] [arrow] emkornfield commented on issue #8025: ParquetWriter creates bad files when passed pyarrow schema from pyarrow table?

Posted by GitBox <gi...@apache.org>.
emkornfield commented on issue #8025:
URL: https://github.com/apache/arrow/issues/8025#issuecomment-678675831


   Thank you for the report.  A few questions:  Does parquet tools indicate whether the file is readable?  Can you read it back with pyarrow?  Do you still see the issue with Arrow 1.0.1?  If the issue still exists in Arrow 1.0.1, please open a Jira (in general we track all issues in Jira).  Also, please provide example data that can reproduce the issue.





[GitHub] [arrow] emkornfield commented on issue #8025: ParquetWriter creates bad files when passed pyarrow schema from pyarrow table?

Posted by GitBox <gi...@apache.org>.
emkornfield commented on issue #8025:
URL: https://github.com/apache/arrow/issues/8025#issuecomment-681330368


   @losze1cj were you able to test out the suggestions?

