Posted to dev@parquet.apache.org by "Randy Tidd (Jira)" <ji...@apache.org> on 2020/03/02 19:26:00 UTC

[jira] [Updated] (PARQUET-1808) SimpleGroup.toString() uses String += and so has poor performance

     [ https://issues.apache.org/jira/browse/PARQUET-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Randy Tidd updated PARQUET-1808:
--------------------------------
    Description: 
This method in SimpleGroup uses `+=` for String concatenation, which is a known performance problem in Java: each `+=` copies the entire string built so far, so the total cost grows quadratically with the number of strings appended.

[https://github.com/apache/parquet-mr/blob/d69192809d0d5ec36c0d8c126c8bed09ee3cee35/parquet-column/src/main/java/org/apache/parquet/example/data/simple/SimpleGroup.java#L50]
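
To illustrate the cost pattern in isolation, here is a minimal standalone sketch. It is not parquet-mr code: the class name `ConcatDemo`, its method names, and the comma-separated output are made up for the example. Both methods build the same string; the first copies the whole accumulated string on every iteration, while the second appends into one growing buffer.

```java
// Hypothetical demo class, unrelated to parquet-mr; names are illustrative only.
final class ConcatDemo {

    // Repeated +=: every iteration allocates a new String and copies
    // everything accumulated so far, so total copying grows roughly as n^2.
    static String concatWithPlus(int n) {
        String s = "";
        for (int i = 0; i < n; i++) {
            s += i + ",";
        }
        return s;
    }

    // StringBuilder: appends into a single resizable buffer,
    // so total copying grows roughly linearly with n.
    static String concatWithBuilder(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(i).append(',');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int n = 31_000; // same order of magnitude as the value count reported here
        long t0 = System.nanoTime();
        String a = concatWithPlus(n);
        long t1 = System.nanoTime();
        String b = concatWithBuilder(n);
        long t2 = System.nanoTime();
        System.out.println("same output: " + a.equals(b));
        System.out.printf("+=: %d ms, StringBuilder: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```

Timings will vary by JVM and machine, and this flat loop does not reproduce the nested-group recursion in `SimpleGroup`, so it should be read only as a picture of the growth pattern, not as a benchmark of the issue itself.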

We ran into a performance problem whereby a single column in a Parquet file was defined as a group:
{code}
optional group customer_ids (LIST) {
  repeated group list {
    optional binary element (STRING);
  }
}
{code}
and had over 31,000 values. Reading this single column took over 8 minutes due to time spent in the `toString()` method. Using a different implementation that uses `StringBuffer` like this:
{code:java}
 // Accumulate output in one buffer instead of creating a new String on every +=.
 StringBuffer result = new StringBuffer();
 int i = 0;
 for (Type field : schema.getFields()) {
   String name = field.getName();
   List<Object> values = data[i];
   ++i;
   if (values != null && !values.isEmpty()) {
     for (Object value : values) {
       result.append(indent).append(name);
       if (value == null) {
         result.append(": NULL\n");
       } else if (value instanceof Group) {
         // Nested group: recurse with two extra spaces of indentation.
         result.append("\n");
         result.append(betterToString((SimpleGroup) value, indent + "  "));
       } else {
         result.append(": ").append(value.toString()).append("\n");
       }
     }
   }
 }
 return result.toString();{code}
reduced that time to less than 500 milliseconds. 
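
A possible further refinement, sketched under assumptions: rather than having each nested call return its own String that is then appended, the recursion can write into one shared `StringBuilder`. `StringBuilder` is the unsynchronized counterpart of `StringBuffer` and is usually preferred when the buffer is confined to one thread. The `MiniGroup` class below is a hypothetical stand-in written for this example; it is not the parquet-mr `SimpleGroup` API, and its `add` and `appendTo` methods are invented here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a group: field name -> repeated values,
// where a value is a scalar, null, or a nested MiniGroup.
final class MiniGroup {
    private final Map<String, List<Object>> fields = new LinkedHashMap<>();

    // Adds one value under the given field name; returns this for chaining.
    MiniGroup add(String name, Object value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        return this;
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        appendTo(sb, "");
        return sb.toString();
    }

    // Writes this group into the shared builder; nested groups recurse
    // into the same builder, so no intermediate Strings are created per level.
    private void appendTo(StringBuilder sb, String indent) {
        for (Map.Entry<String, List<Object>> entry : fields.entrySet()) {
            for (Object value : entry.getValue()) {
                sb.append(indent).append(entry.getKey());
                if (value == null) {
                    sb.append(": NULL\n");
                } else if (value instanceof MiniGroup) {
                    sb.append('\n');
                    ((MiniGroup) value).appendTo(sb, indent + "  ");
                } else {
                    sb.append(": ").append(value).append('\n');
                }
            }
        }
    }
}
```

The output format mirrors the snippet above (name, then `: value`, `: NULL`, or a newline followed by the indented nested group). Whether the same shape fits `SimpleGroup` depends on details of that class not shown in this report.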

The existing implementation exhibits the well-known Java string concatenation performance problem and should be fixed.

This was a significant problem for us, but we were able to work around it, so I am marking this issue as "Minor".


> SimpleGroup.toString() uses String += and so has poor performance
> -----------------------------------------------------------------
>
>                 Key: PARQUET-1808
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1808
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.11.0
>            Reporter: Randy Tidd
>            Priority: Minor


