Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/10 08:52:15 UTC
[GitHub] [arrow-rs] garyanaplan commented on issue #349: parquet reading hangs when row_group contains more than 2048 rows of data
garyanaplan commented on issue #349:
URL: https://github.com/apache/arrow-rs/issues/349#issuecomment-858440550
Yep. If I update my test to remove BOOLEAN from the schema, the problem goes away. I've done some digging today, and it looks like the problem may lie in the generation of the file.
I previously reported that `parquet-tools dump <file>` would happily process the file. However, when I trimmed the example down to just a BOOLEAN field in the schema and increased the number of rows in the row group, I noticed the following in the dump:
```
value 2039: R:0 D:0 V:true
value 2040: R:0 D:0 V:false
value 2041: R:0 D:0 V:true
value 2042: R:0 D:0 V:false
value 2043: R:0 D:0 V:true
value 2044: R:0 D:0 V:false
value 2045: R:0 D:0 V:true
value 2046: R:0 D:0 V:false
value 2047: R:0 D:0 V:true
value 2048: R:0 D:0 V:false
value 2049: R:0 D:0 V:false
value 2050: R:0 D:0 V:false
value 2051: R:0 D:0 V:false
value 2052: R:0 D:0 V:false
value 2053: R:0 D:0 V:false
value 2054: R:0 D:0 V:false
value 2055: R:0 D:0 V:false
```
Every value after 2048 is false, and they stay false all the way to the end of the file.
It looks like the generated input file is invalid, so I'm going to poke around there next.
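For context, PLAIN-encoded BOOLEAN values in Parquet are bit-packed, so 2048 values occupy exactly 256 bytes. A writer that caps its bit buffer at 256 bytes and silently drops out-of-range bits would produce exactly this symptom: everything after value 2048 decodes as false. This is a hedged, minimal sketch of that failure mode in plain Rust (the `BitWriter` here is a hypothetical illustration, not the actual parquet crate code):

```rust
/// Toy bit-packed boolean writer with a fixed-size buffer, illustrating
/// how a silent 256-byte cap would corrupt every value after index 2047.
struct BitWriter {
    buf: Vec<u8>,
    capacity_bits: usize,
}

impl BitWriter {
    fn new(capacity_bytes: usize) -> Self {
        BitWriter {
            buf: vec![0u8; capacity_bytes],
            capacity_bits: capacity_bytes * 8,
        }
    }

    /// Buggy write: bits past the buffer's capacity are silently dropped
    /// instead of growing the buffer or returning an error.
    fn write_bool(&mut self, index: usize, value: bool) {
        if index >= self.capacity_bits {
            return; // bug: silent truncation
        }
        if value {
            self.buf[index / 8] |= 1 << (index % 8);
        }
    }

    /// Bits that were never set read back as false.
    fn read_bool(&self, index: usize) -> bool {
        if index >= self.capacity_bits {
            return false;
        }
        self.buf[index / 8] & (1 << (index % 8)) != 0
    }
}

fn main() {
    // 256 bytes = 2048 bits, matching the boundary seen in the dump.
    let mut w = BitWriter::new(256);
    for i in 0..2056 {
        // Alternate true/false, like the test data in the dump above.
        w.write_bool(i, i % 2 == 0);
    }
    // Values up to index 2047 round-trip correctly...
    assert!(w.read_bool(2046));
    // ...but every value from 2048 onward reads back as false.
    assert!(!w.read_bool(2048));
    assert!(!w.read_bool(2050));
    println!("values past bit 2047 decode as false");
}
```

If the file writer's boolean encoder has a fixed-capacity bit buffer like this, that would explain both the clean 2048 boundary and why the reader hangs or misbehaves on the malformed data.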
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org