
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/26 11:24:52 UTC

[GitHub] [arrow-rs] alamb opened a new issue #58: [Parquet] Reading UTF-8/JSON/ENUM field results in a lot of vec allocation

alamb opened a new issue #58:
URL: https://github.com/apache/arrow-rs/issues/58


   *Note*: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-7252
   
   Reading a very large Parquet file (430MB gzipped) consisting almost entirely of string fields was very slow. After profiling with OS X Instruments, I noticed that a lot of time is spent in "convert_byte_array", in particular in reserving and allocating via Vec::with_capacity, which is done before String::from_utf8_unchecked.
   
   It seems that using String as the underlying storage is causing this (String uses Vec<u8> for its underlying storage); it also requires copying from the slice into the Vec.
   
   "Field::Str(String)" is a variant of a pub enum, so I am not sure how refactorable the String part is, for example by converting it into a &str (we could then perhaps defer the conversion from &[u8] to Vec<u8> until the user really needs a String).
   
   But of course, changing it to &str could require quite a few interface changes, so I am wondering whether there are already plans or a solution on the way to improve the handling of the "Field::Str" case?
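
To make the cost concrete, here is a minimal sketch of the two approaches. It is not the parquet crate's code: `copy_into_strings` and `borrow_as_strs` are hypothetical names, and the input is simply a list of byte slices standing in for values decoded from a page. The first function allocates and copies once per value, which is the pattern the profile pointed at; the second only validates UTF-8 and returns views into the original buffer.

```rust
use std::str::Utf8Error;
use std::string::FromUtf8Error;

/// Eager path (hypothetical helper): every value gets its own heap
/// allocation plus a copy of its bytes before it becomes a `String`.
pub fn copy_into_strings(values: &[&[u8]]) -> Result<Vec<String>, FromUtf8Error> {
    values
        .iter()
        .map(|&bytes| {
            let mut buf = Vec::with_capacity(bytes.len()); // one allocation per value
            buf.extend_from_slice(bytes); // one copy per value
            String::from_utf8(buf)
        })
        .collect()
}

/// Borrowing path (hypothetical helper): no per-value allocation or copy.
/// Each `&str` points into the caller's buffer, so the result cannot
/// outlive that buffer (lifetime `'a`).
pub fn borrow_as_strs<'a>(values: &[&'a [u8]]) -> Result<Vec<&'a str>, Utf8Error> {
    values
        .iter()
        .map(|&bytes| std::str::from_utf8(bytes))
        .collect()
}
```

The lifetime parameter on `borrow_as_strs` is also where the interface change shows up: anything holding the borrowed values has to carry that lifetime and keep the source buffer alive.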
   
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] alamb commented on issue #58: [Parquet] Reading UTF-8/JSON/ENUM field results in a lot of vec allocation

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #58:
URL: https://github.com/apache/arrow-rs/issues/58#issuecomment-826757724


   Comment from Wong Shek Hei(shekhei) @ 2019-11-27T08:05:42.844+0000:
   <pre>I have modified Field::Str locally to hold the ByteArray instead, which removes the copying. Reading a 1.5MM, 1000-column gz.parquet file (440MB) on a MacBook Pro (15-inch, 2019) improved from 3m20s to 2m20s.
   
    
   
   But the problem is that this will modify the signature of the Field::Str variant.</pre>
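
A rough sketch of what such a change could look like, under stated assumptions: `SharedBytes` is a hypothetical stand-in for a reference-counted byte buffer (it is not the crate's ByteArray API), and `LazyField` is a hypothetical one-variant enum rather than the real Field. The variant keeps a cheap handle to the decoded buffer, and a String is only allocated when the caller asks for one.

```rust
use std::str::Utf8Error;
use std::sync::Arc;

/// Hypothetical reference-counted view into a shared buffer.
/// Cloning it bumps a counter instead of copying bytes.
#[derive(Clone, Debug, PartialEq)]
pub struct SharedBytes {
    data: Arc<[u8]>,
    start: usize,
    len: usize,
}

impl SharedBytes {
    pub fn new(data: Arc<[u8]>, start: usize, len: usize) -> Self {
        assert!(start + len <= data.len(), "range out of bounds");
        Self { data, start, len }
    }

    pub fn as_bytes(&self) -> &[u8] {
        &self.data[self.start..self.start + self.len]
    }

    /// Validates UTF-8 and returns a borrowed view; no allocation.
    pub fn as_str(&self) -> Result<&str, Utf8Error> {
        std::str::from_utf8(self.as_bytes())
    }
}

/// Hypothetical field enum whose string variant holds the buffer handle.
#[derive(Clone, Debug, PartialEq)]
pub enum LazyField {
    Str(SharedBytes),
}

impl LazyField {
    /// The copy into an owned String happens only here, on demand.
    pub fn to_string_owned(&self) -> Result<String, Utf8Error> {
        match self {
            LazyField::Str(bytes) => bytes.as_str().map(|s| s.to_owned()),
        }
    }
}
```

Because the variant's payload type changes from String to a buffer handle, existing code that matches on the variant and expects a String would need updating, which is the signature concern raised above.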

