Posted to dev@parquet.apache.org by "Gabor Szadovszky (Jira)" <ji...@apache.org> on 2020/05/07 12:34:00 UTC

[jira] [Resolved] (PARQUET-1808) SimpleGroup.toString() uses String += and so has poor performance

     [ https://issues.apache.org/jira/browse/PARQUET-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky resolved PARQUET-1808.
---------------------------------------
    Resolution: Fixed

> SimpleGroup.toString() uses String += and so has poor performance
> -----------------------------------------------------------------
>
>                 Key: PARQUET-1808
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1808
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.11.0
>            Reporter: Randy Tidd
>            Assignee: Shankar Koirala
>            Priority: Minor
>              Labels: pull-request-available
>
> This method in SimpleGroup uses `+=` for String concatenation, which is a known performance problem in Java. Each `+=` copies the entire string built so far, so the total cost grows quadratically with the number of strings appended.
> [https://github.com/apache/parquet-mr/blob/d69192809d0d5ec36c0d8c126c8bed09ee3cee35/parquet-column/src/main/java/org/apache/parquet/example/data/simple/SimpleGroup.java#L50]
> We ran into a performance problem whereby a single column in a Parquet file was defined as a group:
> {code:java}
>     optional group customer_ids (LIST) {
>       repeated group list {
>         optional binary element (STRING);
>       }
>     }{code}
>  
> and had over 31,000 values. Reading this single column took over 8 minutes because of the time spent in the `toString()` method. Switching to an implementation based on `StringBuffer`, like this:
> {code:java}
>  // Accumulate into one buffer instead of creating a new String on every +=.
>  StringBuffer result = new StringBuffer();
>  int i = 0;
>  for (Type field : schema.getFields()) {
>    String name = field.getName();
>    List<Object> values = data[i];
>    ++i;
>    if (values != null && !values.isEmpty()) {
>      for (Object value : values) {
>        result.append(indent);
>        result.append(name);
>        if (value == null) {
>          result.append(": NULL\n");
>        } else if (value instanceof Group) {
>          // Nested group: recurse with one extra level of indentation.
>          result.append("\n");
>          result.append(betterToString((SimpleGroup) value, indent + " "));
>        } else {
>          result.append(": ");
>          result.append(value.toString());
>          result.append("\n");
>        }
>      }
>    }
>  }
>  return result.toString();{code}
> reduced that time to less than 500 milliseconds. 
> The existing implementation exhibits the well-known Java string concatenation performance issue and should be fixed.
> This was a significant problem for us, but we were able to work around it, so I am marking this issue as "Minor".
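
For illustration only, the difference described in the report can be reproduced with a small standalone sketch. It does not use parquet-mr: the `ToStringSketch` and `Item` names are hypothetical stand-ins for a group of named values, the element count in `main` is arbitrary, and `StringBuilder` is used in place of the report's `StringBuffer` (same append API, without synchronization). One shared builder is passed through the recursion so that nested groups do not create intermediate strings.

```java
import java.util.ArrayList;
import java.util.List;

public class ToStringSketch {

  /** Hypothetical stand-in for a named value; not a parquet-mr class. */
  static final class Item {
    final String name;
    final Object value; // a String leaf, a nested List<Item>, or null

    Item(String name, Object value) {
      this.name = name;
      this.value = value;
    }
  }

  /** Slow variant: every += copies all of the text accumulated so far. */
  static String renderWithConcat(List<Item> items, String indent) {
    String result = "";
    for (Item item : items) {
      result += indent + item.name;
      if (item.value == null) {
        result += ": NULL\n";
      } else if (item.value instanceof List) {
        @SuppressWarnings("unchecked")
        List<Item> nested = (List<Item>) item.value;
        result += "\n" + renderWithConcat(nested, indent + "  ");
      } else {
        result += ": " + item.value + "\n";
      }
    }
    return result;
  }

  /** Faster variant: a single StringBuilder is threaded through the recursion. */
  static void renderInto(StringBuilder out, List<Item> items, String indent) {
    for (Item item : items) {
      out.append(indent).append(item.name);
      if (item.value == null) {
        out.append(": NULL\n");
      } else if (item.value instanceof List) {
        @SuppressWarnings("unchecked")
        List<Item> nested = (List<Item>) item.value;
        out.append('\n');
        renderInto(out, nested, indent + "  ");
      } else {
        out.append(": ").append(item.value).append('\n');
      }
    }
  }

  static String renderWithBuilder(List<Item> items) {
    StringBuilder out = new StringBuilder();
    renderInto(out, items, "");
    return out.toString();
  }

  public static void main(String[] args) {
    // Arbitrary size chosen to keep the slow variant tolerable; raise it to see the gap widen.
    List<Item> elements = new ArrayList<>();
    for (int i = 0; i < 5_000; i++) {
      elements.add(new Item("element", "id-" + i));
    }
    List<Item> root = new ArrayList<>();
    root.add(new Item("customer_ids", elements));

    long t0 = System.nanoTime();
    String slow = renderWithConcat(root, "");
    long t1 = System.nanoTime();
    String fast = renderWithBuilder(root);
    long t2 = System.nanoTime();

    System.out.println("same output: " + slow.equals(fast));
    System.out.printf("+= took %d ms, StringBuilder took %d ms%n",
        (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
  }
}
```

Both variants produce identical text; only the amount of copying differs. Timings depend on the JVM and the input size, so the numbers printed by `main` are indicative only and are not comparable to the figures in the report.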



--
This message was sent by Atlassian Jira
(v8.3.4#803005)