Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/10 08:52:15 UTC
[GitHub] [arrow-rs] garyanaplan commented on issue #349: parquet reading hangs when row_group contains more than 2048 rows of data
garyanaplan commented on issue #349:
URL: https://github.com/apache/arrow-rs/issues/349#issuecomment-858440550
Yep. If I update my test to remove BOOLEAN from the schema, the problem goes away. I've done some digging today, and it looks like the problem may lie in the generation of the file.
I previously reported that `parquet-tools dump <file>` would happily process the file. However, when I trimmed the example down to just a BOOLEAN field in the schema and increased the number of rows in the row group, I noticed the following in the dump:
```
value 2039: R:0 D:0 V:true
value 2040: R:0 D:0 V:false
value 2041: R:0 D:0 V:true
value 2042: R:0 D:0 V:false
value 2043: R:0 D:0 V:true
value 2044: R:0 D:0 V:false
value 2045: R:0 D:0 V:true
value 2046: R:0 D:0 V:false
value 2047: R:0 D:0 V:true
value 2048: R:0 D:0 V:false
value 2049: R:0 D:0 V:false
value 2050: R:0 D:0 V:false
value 2051: R:0 D:0 V:false
value 2052: R:0 D:0 V:false
value 2053: R:0 D:0 V:false
value 2054: R:0 D:0 V:false
value 2055: R:0 D:0 V:false
```
Every value after 2048 is false, and they stay false all the way to the end of the file.
It looks like the generated input file is invalid, so I'm going to poke around there next.
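For context, PLAIN-encoded BOOLEAN values in Parquet are bit-packed, so 2048 values occupy exactly 256 bytes. A writer that caps its bit buffer at 256 bytes and silently drops out-of-range bits would produce exactly this symptom: everything after value 2048 decodes as false. This is a hedged, minimal sketch of that failure mode in plain Rust (the `BitWriter` here is a hypothetical illustration, not the actual parquet crate code):

```rust
/// Toy bit-packed boolean writer with a fixed-size buffer, illustrating
/// how a silent 256-byte cap would corrupt every value after index 2047.
struct BitWriter {
    buf: Vec<u8>,
    capacity_bits: usize,
}

impl BitWriter {
    fn new(capacity_bytes: usize) -> Self {
        BitWriter {
            buf: vec![0u8; capacity_bytes],
            capacity_bits: capacity_bytes * 8,
        }
    }

    /// Buggy write: bits past the buffer's capacity are silently dropped
    /// instead of growing the buffer or returning an error.
    fn write_bool(&mut self, index: usize, value: bool) {
        if index >= self.capacity_bits {
            return; // bug: silent truncation
        }
        if value {
            self.buf[index / 8] |= 1 << (index % 8);
        }
    }

    /// Bits that were never set read back as false.
    fn read_bool(&self, index: usize) -> bool {
        if index >= self.capacity_bits {
            return false;
        }
        self.buf[index / 8] & (1 << (index % 8)) != 0
    }
}

fn main() {
    // 256 bytes = 2048 bits, matching the boundary seen in the dump.
    let mut w = BitWriter::new(256);
    for i in 0..2056 {
        // Alternate true/false, like the test data in the dump above.
        w.write_bool(i, i % 2 == 0);
    }
    // Values up to index 2047 round-trip correctly...
    assert!(w.read_bool(2046));
    // ...but every value from 2048 onward reads back as false.
    assert!(!w.read_bool(2048));
    assert!(!w.read_bool(2050));
    println!("values past bit 2047 decode as false");
}
```

If the file writer's boolean encoder has a fixed-capacity bit buffer like this, that would explain both the clean 2048 boundary and why the reader hangs or misbehaves on the malformed data.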
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org