Posted to dev@parquet.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/04/04 00:09:00 UTC

[jira] [Commented] (PARQUET-1261) Parquet-format interns strings when reading filemetadata

    [ https://issues.apache.org/jira/browse/PARQUET-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16424791#comment-16424791 ] 

ASF GitHub Bot commented on PARQUET-1261:
-----------------------------------------

julienledem commented on issue #92: PARQUET-1261 - Remove string interning
URL: https://github.com/apache/parquet-format/pull/92#issuecomment-378438017
 
 
   FYI, this was done to save memory, since we refer to columns by name in the metadata, which can become quite big when loading a lot of files. If interning is causing problems, we should replace it with a different mechanism that serves the same purpose of deduplicating strings.
   Ideally we would change the metadata to refer to columns by their index instead, but that's a breaking change.
   I replied on the mailing list as well.
   Thank you 
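
As an illustration of the "different mechanism" idea above, here is a minimal sketch of deduplicating strings without String.intern(). It is not taken from the parquet-format code base or from the pull request; the class name StringCanonicalizer and its methods are hypothetical. It assumes a pool scoped to one reader (or one batch of files) that can be dropped when no longer needed, so the canonical strings stay ordinary heap objects rather than entries in the JVM-wide intern table.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of parquet-format: a scoped string pool
// that returns one shared instance per distinct value.
final class StringCanonicalizer {
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    // Returns the first instance seen for this value, or s itself if it is new.
    String canonicalize(String s) {
        if (s == null) {
            return null;
        }
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    // Number of distinct values currently held by the pool.
    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        StringCanonicalizer canon = new StringCanonicalizer();
        // Two equal but distinct String objects, as a deserializer would produce.
        String a = new String("column_a");
        String b = new String("column_a");
        System.out.println(a == b);
        System.out.println(canon.canonicalize(a) == canon.canonicalize(b));
    }
}
```

Once the pool object becomes unreachable, the strings it holds can be collected like any other object, which is the main difference from interning in this sketch.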

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Parquet-format interns strings when reading filemetadata
> --------------------------------------------------------
>
>                 Key: PARQUET-1261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1261
>             Project: Parquet
>          Issue Type: Bug
>    Affects Versions: 1.9.0
>            Reporter: Robert Kruszewski
>            Assignee: Robert Kruszewski
>            Priority: Major
>
> When deserializing metadata, parquet-format interns strings. References I could find suggest that this was done early on to reduce memory pressure. Java (and the JVM in particular) has come a long way since then, and interning is generally discouraged; see [https://shipilev.net/jvm-anatomy-park/10-string-intern/] for a good explanation. What is more, since Java 8 there is string deduplication implemented at the GC level, per [http://openjdk.java.net/jeps/192]. In our usage and testing we found that interning causes significant GC pressure in long-running applications because of the larger GC root set.
> This issue proposes removing interning, given that it is questionable whether it should be used on modern JVMs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)