Posted to dev@parquet.apache.org by "Randy Tidd (Jira)" <ji...@apache.org> on 2020/03/02 18:25:00 UTC

[jira] [Created] (PARQUET-1808) SimpleGroup.toString() uses String += and so has poor performance

Randy Tidd created PARQUET-1808:
-----------------------------------

             Summary: SimpleGroup.toString() uses String += and so has poor performance
                 Key: PARQUET-1808
                 URL: https://issues.apache.org/jira/browse/PARQUET-1808
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.11.0
            Reporter: Randy Tidd


This method in SimpleGroup uses `+=` for String concatenation, which is a known performance problem in Java. Each `+=` builds a new String and copies everything accumulated so far, so the total cost grows quadratically with the number of strings appended.
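
To make the cost concrete, here is a minimal sketch (not the Parquet code) that compares the two approaches. The class `ConcatCost` and its method names are made up for illustration, and the cost formula assumes every appended piece has the same length.

```java
final class ConcatCost {
    /** Joins lines with +=; every iteration copies the whole string built so far. */
    static String joinWithPlusEquals(String[] lines) {
        String result = "";
        for (String line : lines) {
            result += line + "\n";
        }
        return result;
    }

    /** Joins lines with StringBuilder; each line is appended to one growable buffer. */
    static String joinWithBuilder(String[] lines) {
        StringBuilder result = new StringBuilder();
        for (String line : lines) {
            result.append(line).append('\n');
        }
        return result.toString();
    }

    /**
     * Characters written by the += loop for n pieces of len characters each:
     * iteration k produces a string of k * len characters, so the total is
     * len * (1 + 2 + ... + n) = len * n * (n + 1) / 2.
     */
    static long charsCopiedByPlusEquals(long n, long len) {
        return len * n * (n + 1) / 2;
    }

    public static void main(String[] args) {
        String[] lines = new String[10_000];
        java.util.Arrays.fill(lines, "element: some-customer-id");

        long t0 = System.nanoTime();
        String a = joinWithPlusEquals(lines);
        long t1 = System.nanoTime();
        String b = joinWithBuilder(lines);
        long t2 = System.nanoTime();

        System.out.println("same output: " + a.equals(b));
        System.out.println("+= ms:            " + (t1 - t0) / 1_000_000);
        System.out.println("StringBuilder ms: " + (t2 - t1) / 1_000_000);
    }
}
```

Both methods return the same text; only the amount of copying differs. Doubling the number of lines roughly quadruples the work done by the `+=` version, while the builder version grows roughly linearly.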

[https://github.com/apache/parquet-mr/blob/d69192809d0d5ec36c0d8c126c8bed09ee3cee35/parquet-column/src/main/java/org/apache/parquet/example/data/simple/SimpleGroup.java#L50]

We ran into a performance problem whereby a single column in a Parquet file was defined as a group:
{quote}    optional group customer_ids (LIST) {
      repeated group list {
        optional binary element (STRING);
      }
    }
{quote}
and had over 31,000 values. Reading this single column took over 8 minutes due to time spent in the `toString()` method. Using a different implementation based on `StringBuffer`, like this:
    StringBuffer result = new StringBuffer();
    int i = 0;
    for (Type field : schema.getFields()) {
      String name = field.getName();
      List<Object> values = data[i];
      ++i;
      if (values != null) {
        if (values.size() > 0) {
          for (Object value : values) {
            result.append(indent);
            result.append(name);
            if (value == null) {
              result.append(": NULL\n");
            } else if (value instanceof Group) {
              result.append("\n");
              result.append(betterToString((SimpleGroup)value, indent+"  "));
            } else {
              result.append(": ");
              result.append(value.toString());
              result.append("\n");
            }
          }
        }
      }
    }
    return result.toString();
reduced that time to less than 500 milliseconds.
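
A possible refinement, sketched here under assumptions rather than taken from parquet-mr: pass one shared builder through the recursion so that nested groups do not produce an intermediate String per level, and use `StringBuilder`, which is generally preferred over `StringBuffer` when the buffer is not shared between threads. The `GroupPrinter` and `Node` types below are hypothetical stand-ins for `SimpleGroup`, kept minimal so the example is self-contained.

```java
final class GroupPrinter {
    /** Hypothetical stand-in for a group: field names, each with repeated values. */
    static final class Node {
        final java.util.List<String> names = new java.util.ArrayList<>();
        final java.util.List<java.util.List<Object>> data = new java.util.ArrayList<>();

        Node add(String name, Object... values) {
            names.add(name);
            data.add(java.util.Arrays.asList(values));
            return this;
        }
    }

    /** Appends the node's text to a shared builder instead of returning a new String. */
    static void appendTo(StringBuilder out, Node node, String indent) {
        for (int i = 0; i < node.names.size(); i++) {
            String name = node.names.get(i);
            for (Object value : node.data.get(i)) {
                out.append(indent).append(name);
                if (value == null) {
                    out.append(": NULL\n");
                } else if (value instanceof Node) {
                    out.append('\n');
                    // Recurse into the nested group, writing to the same builder.
                    appendTo(out, (Node) value, indent + "  ");
                } else {
                    out.append(": ").append(value).append('\n');
                }
            }
        }
    }

    /** Entry point: renders the whole tree with a single builder. */
    static String render(Node node) {
        StringBuilder out = new StringBuilder();
        appendTo(out, node, "");
        return out.toString();
    }
}
```

The output format mirrors the snippet above: a scalar prints as `name: value`, and a nested group prints its name on one line followed by its fields indented by two spaces.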

 

The existing implementation exhibits a well-known Java string concatenation performance issue and should be fixed.

This was a significant problem for us, but we were able to work around it, so I am marking this issue as "Minor".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)