Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2023/01/02 15:03:06 UTC

[GitHub] [arrow] MMCMA opened a new issue, #15153: OSError: Couldn't deserialize thrift: TProtocolException

MMCMA opened a new issue, #15153:
URL: https://github.com/apache/arrow/issues/15153

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I run a daily data process in Python 3.10 inside Docker. The same process creates around 500 parquet files a day. However, about once a week (or 1 in 2500 files), a random file ends up corrupted, and reading it with pyarrow 10.0.1 fails with the following error:
   
   ```
   OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
   Deserializing page header failed.
   ```
   
   When I try to open the file with pandas (using engine='fastparquet'), I get the following error:
   
   ```
     File "fastparquet\cencoding.pyx", line 336, in fastparquet.cencoding.NumpyIO.read
   TypeError: an integer is required
   ```
   
   I am not sure where to start and what could be the root cause. 
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] MMCMA closed issue #15153: [Python] OSError: Couldn't deserialize thrift: TProtocolException

Posted by GitBox <gi...@apache.org>.
MMCMA closed issue #15153: [Python] OSError: Couldn't deserialize thrift: TProtocolException
URL: https://github.com/apache/arrow/issues/15153




[GitHub] [arrow] MMCMA commented on issue #15153: [Python] OSError: Couldn't deserialize thrift: TProtocolException

Posted by GitBox <gi...@apache.org>.
MMCMA commented on issue #15153:
URL: https://github.com/apache/arrow/issues/15153#issuecomment-1396725954

   I can close the issue. I just discovered by chance that, in very rare circumstances, two processes were writing to the same file at the same time. Sorry about this.
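
For readers who hit the same symptom: one common way to keep concurrent writers from interleaving bytes in a single file is to write to a unique temporary file and rename it over the target once the write has finished. The sketch below is only an illustration of that idea, not something taken from this thread. The helper name `write_atomically` and its `write_fn` callback are hypothetical, and it assumes the temporary file and the target are on the same filesystem.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def write_atomically(target: str, write_fn: Callable[[str], None]) -> None:
    """Write via a unique temporary file, then rename it over `target`.

    Hypothetical helper: `write_fn` receives the temporary path and is
    expected to write the complete file there.
    """
    target_path = Path(target)
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=target_path.parent, prefix=target_path.name + ".", suffix=".tmp"
    )
    os.close(fd)  # write_fn opens the path itself
    try:
        write_fn(tmp_name)
        # os.replace swaps the finished file into place in one step (atomic on POSIX).
        os.replace(tmp_name, target_path)
    except BaseException:
        # Remove the partial temp file; an existing target is left untouched.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


# Possible usage with pyarrow (assumes `table` is a pyarrow.Table):
#   import pyarrow.parquet as pq
#   write_atomically("out.parquet", lambda tmp: pq.write_table(table, tmp))
```

With this pattern, two processes targeting the same name each produce a complete file and the last rename wins, so a reader should never see a mix of both writes. It does not stop the duplicate work itself; that would still need a lock or distinct file names per process.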




[GitHub] [arrow] jorisvandenbossche commented on issue #15153: OSError: Couldn't deserialize thrift: TProtocolException

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on issue #15153:
URL: https://github.com/apache/arrow/issues/15153#issuecomment-1387397032

   > I am not sure where to start and what could be the root cause.
   
   Some questions that might help you get to a reproducible example, or give some pointers on where to look:
   
   - Can you identify the file for which it fails, and does it then fail reproducibly with this file? If you can identify the file, can you also trace it back to the data that was used to create this file?
   - Can you share such a file? 
   - How do you create the parquet files? (using pyarrow?)
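
As a starting point for the first question, a small scan over the output files can show which ones fail to read. This is a rough sketch rather than an established recipe: `find_unreadable` and the `reader` parameter are hypothetical names, and the default reader assumes pyarrow is installed and uses `pyarrow.parquet.read_table`, which reads all row groups and columns of a file.

```python
from typing import Callable, Dict, Iterable, Optional


def _read_with_pyarrow(path: str) -> None:
    # Imported lazily so the scan can also be run with a different reader.
    import pyarrow.parquet as pq

    # Reading the whole table forces the data pages to be decoded,
    # which is where a "Deserializing page header failed" error would surface.
    pq.read_table(path)


def find_unreadable(
    paths: Iterable[str],
    reader: Optional[Callable[[str], None]] = None,
) -> Dict[str, str]:
    """Return {path: error message} for every file the reader fails on."""
    reader = reader or _read_with_pyarrow
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            reader(str(path))
        except Exception as exc:  # keep scanning after a failure
            failures[str(path)] = f"{type(exc).__name__}: {exc}"
    return failures


# Possible usage (the "output" directory name is an assumption):
#   from pathlib import Path
#   for name, err in find_unreadable(Path("output").glob("*.parquet")).items():
#       print(name, "->", err)
```

Running this more than once on the same files also answers whether a given file fails reproducibly.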

