Posted to issues@arrow.apache.org by "tooptoop4 (via GitHub)" <gi...@apache.org> on 2023/05/12 02:05:15 UTC

[GitHub] [arrow] tooptoop4 opened a new issue, #35569: python - read multiple parquets that have different schema?

tooptoop4 opened a new issue, #35569:
URL: https://github.com/apache/arrow/issues/35569

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   Because of schema evolution, some parquet files have more columns than others.
   I try to read them all in one go (there are over 20,000 small files under different partition folders) because I want to write their data out into a single big file:
   
   ```
   original_files = pq.ParquetDataset("bucket/folder", filesystem=s3_src, partitioning="hive")
   ```
   
   i get error:
   ```
   ValueError: Schema in partition[year=0, month=0, day=14, hour=9] bucket/folder/year=2023/month=02/day=22/hour=14/redact.snappy.parquet was different.
   ```
   
   How can I achieve this in pyarrow?
   PySpark has an automatic `mergeSchema` option: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging
   
   ### Component(s)
   
   Parquet, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tooptoop4 commented on issue #35569: python - read multiple parquets that have different schema?

Posted by "tooptoop4 (via GitHub)" <gi...@apache.org>.
tooptoop4 commented on issue #35569:
URL: https://github.com/apache/arrow/issues/35569#issuecomment-1548762282

   I'm on v7; I get `AttributeError: module 'pyarrow' has no attribute 'dataset'`. Do you think `validate_schema=False` with `pq.ParquetDataset` can achieve it?




[GitHub] [arrow] westonpace commented on issue #35569: python - read multiple parquets that have different schema?

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35569:
URL: https://github.com/apache/arrow/issues/35569#issuecomment-1548361801

   Which version of pyarrow is this?  Any schema evolution is going to be provided by the new datasets feature (`pyarrow.dataset`) and probably not added to `parquet.ParquetDataset`.
   
   Do you get an error with:
   
   ```
   pyarrow.dataset.dataset("bucket/folder", filesystem=s3_src, partitioning="hive")
   ```
   
   Is it possible to manually specify a schema that includes all of the fields (even if some of those fields are missing in some files)?  Do you still get this error?
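   A minimal local sketch of this suggestion (the directory, file, and column names below are invented for illustration, not the reporter's actual data): pass a `schema=` covering every column to `pyarrow.dataset.dataset`, and files that are missing one of those columns come back with nulls in it.

   ```python
   import os
   import tempfile

   import pyarrow as pa
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq

   tmp = tempfile.mkdtemp()
   # Two files with evolved schemas: the second adds a column "b".
   pq.write_table(pa.table({"a": [1, 2]}), os.path.join(tmp, "old.parquet"))
   pq.write_table(pa.table({"a": [3], "b": ["x"]}), os.path.join(tmp, "new.parquet"))

   # Explicitly specify a schema that includes all of the fields.
   full_schema = pa.schema([("a", pa.int64()), ("b", pa.string())])
   dataset = ds.dataset(tmp, format="parquet", schema=full_schema)

   # Files missing "b" are read with "b" filled with nulls.
   table = dataset.to_table()
   ```

   With an S3 source the call would look the same, just with `filesystem=s3_src` and `partitioning="hive"` added.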




[GitHub] [arrow] westonpace commented on issue #35569: python - read multiple parquets that have different schema?

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35569:
URL: https://github.com/apache/arrow/issues/35569#issuecomment-1553005949

   You may need to manually specify the schema when you call `ds.dataset("redact/month=02/day=22", filesystem=s3_src, partitioning="hive")`.  Otherwise I believe it will use the schema from the first parquet file it happens to read.




[GitHub] [arrow] PoojaRavi1105 commented on issue #35569: python - read multiple parquets that have different schema?

Posted by "PoojaRavi1105 (via GitHub)" <gi...@apache.org>.
PoojaRavi1105 commented on issue #35569:
URL: https://github.com/apache/arrow/issues/35569#issuecomment-1736826834

   Pyarrow has a function called [unify_schemas()](https://arrow.apache.org/docs/python/generated/pyarrow.unify_schemas.html) to help merge schemas from multiple files, similar to `mergeSchema` in PySpark. Have you given this a shot?
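   For illustration, `unify_schemas()` can be combined with `pq.read_schema()` to build a merged schema from the files themselves and then feed it to the dataset reader (local paths and column names below are made up for the sketch):

   ```python
   import glob
   import os
   import tempfile

   import pyarrow as pa
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq

   tmp = tempfile.mkdtemp()
   pq.write_table(pa.table({"a": [1]}), os.path.join(tmp, "f1.parquet"))
   pq.write_table(pa.table({"a": [2], "b": [1.5]}), os.path.join(tmp, "f2.parquet"))

   # Collect each file's schema (footer only, no data read), then merge them.
   schemas = [pq.read_schema(p) for p in glob.glob(os.path.join(tmp, "*.parquet"))]
   merged = pa.unify_schemas(schemas)

   # Read everything with the merged schema; missing columns become nulls.
   table = ds.dataset(tmp, format="parquet", schema=merged).to_table()
   ```

   With 20,000+ files on S3 this schema-collection pass is itself a lot of round trips, so it may be cheaper to just write the full schema out by hand.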




[GitHub] [arrow] mapleFU commented on issue #35569: python - read multiple parquets that have different schema?

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35569:
URL: https://github.com/apache/arrow/issues/35569#issuecomment-1736854406

   Currently the arrow dataset's `unify_schemas` can only merge "null" and non-null fields; it cannot handle type promotion. I guess the better way for now is to set an explicit schema.
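   A small sketch of that distinction (all names invented): by default `unify_schemas` bridges a null-typed column with a concrete type, while an int32/int64 promotion has to be handled explicitly, e.g. by widening each table with `Table.cast`:

   ```python
   import pyarrow as pa

   # Merging a "null" typed field with a concrete type works by default...
   merged = pa.unify_schemas([
       pa.schema([("a", pa.null())]),
       pa.schema([("a", pa.int64())]),
   ])

   # ...but int32 vs int64 is not promoted, so widen explicitly instead.
   narrow = pa.table({"a": pa.array([1, 2], type=pa.int32())})
   widened = narrow.cast(pa.schema([("a", pa.int64())]))
   ```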

