Posted to dev@parquet.apache.org by "Randy Tidd (Jira)" <ji...@apache.org> on 2020/03/02 18:25:00 UTC
[jira] [Created] (PARQUET-1808) SimpleGroup.toString() uses String += and so has poor performance
Randy Tidd created PARQUET-1808:
-----------------------------------
Summary: SimpleGroup.toString() uses String += and so has poor performance
Key: PARQUET-1808
URL: https://issues.apache.org/jira/browse/PARQUET-1808
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Affects Versions: 1.11.0
Reporter: Randy Tidd
This method in SimpleGroup uses `+=` for String concatenation, which is a known performance problem in Java: each `+=` copies the entire string built so far, so the total cost grows quadratically with the number of strings appended.
[https://github.com/apache/parquet-mr/blob/d69192809d0d5ec36c0d8c126c8bed09ee3cee35/parquet-column/src/main/java/org/apache/parquet/example/data/simple/SimpleGroup.java#L50]
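To illustrate why repeated `+=` is costly, here is a minimal, self-contained sketch that is not taken from the Parquet code base. The class name `ConcatDemo` and both method names are made up for this example. It builds the same output twice, once with `+=` and once with `StringBuilder`, and prints rough timings. A flat loop like this will not reproduce the multi-minute times reported in this issue, since those involve nested groups, and actual numbers depend on the JVM and hardware.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of parquet-mr.
class ConcatDemo {
    // Each += allocates a new String and copies everything accumulated so far,
    // so appending n parts copies on the order of n^2 characters in total.
    static String joinWithPlusEquals(List<String> parts) {
        String result = "";
        for (String part : parts) {
            result += part + "\n";
        }
        return result;
    }

    // StringBuilder appends into a growable buffer, so the total work is
    // roughly proportional to the length of the final string.
    static String joinWithBuilder(List<String> parts) {
        StringBuilder result = new StringBuilder();
        for (String part : parts) {
            result.append(part).append('\n');
        }
        return result.toString();
    }

    public static void main(String[] args) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < 31_000; i++) {
            parts.add("customer-" + i);
        }
        long t0 = System.nanoTime();
        String a = joinWithPlusEquals(parts);
        long t1 = System.nanoTime();
        String b = joinWithBuilder(parts);
        long t2 = System.nanoTime();
        System.out.println("same output: " + a.equals(b));
        System.out.printf("+=: %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("StringBuilder: %d ms%n", (t2 - t1) / 1_000_000);
    }
}
```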
We ran into a performance problem whereby a single column in a Parquet file was defined as a group:
{quote}
optional group customer_ids (LIST) {
  repeated group list {
    optional binary element (STRING);
  }
}
{quote}
and had over 31,000 values. Reading this single column took over 8 minutes due to time spent in the `toString()` method. Using a different implementation that uses `StringBuffer` like this:
    StringBuffer result = new StringBuffer();
    int i = 0;
    for (Type field : schema.getFields()) {
      String name = field.getName();
      List<Object> values = data[i];
      ++i;
      if (values != null) {
        if (values.size() > 0) {
          for (Object value : values) {
            result.append(indent);
            result.append(name);
            if (value == null) {
              result.append(": NULL\n");
            } else if (value instanceof Group) {
              result.append("\n");
              result.append(betterToString((SimpleGroup)value, indent+" "));
            } else {
              result.append(": ");
              result.append(value.toString());
              result.append("\n");
            }
          }
        }
      }
    }
    return result.toString();
reduced that time to less than 500 milliseconds.
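One possible further refinement, sketched here only as an assumption and not as the eventual Parquet fix, is to pass a single `StringBuilder` down through the recursion so that nested groups never build intermediate strings. The `Node` class below is a hypothetical stand-in for a nested record and does not use the real `SimpleGroup` API. `StringBuilder` is used in place of `StringBuffer` because the buffer is not shared between threads, so synchronization is not needed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a nested record; NOT the Parquet SimpleGroup API.
class Node {
    // Field name -> values, kept in insertion order.
    final Map<String, List<Object>> fields = new LinkedHashMap<>();

    Node add(String name, Object value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        return this;
    }

    @Override
    public String toString() {
        StringBuilder out = new StringBuilder();
        appendTo(out, "");
        return out.toString();
    }

    // Every nesting level appends to the same builder, so no per-level
    // String is created and later copied into its parent's output.
    void appendTo(StringBuilder out, String indent) {
        for (Map.Entry<String, List<Object>> field : fields.entrySet()) {
            for (Object value : field.getValue()) {
                out.append(indent).append(field.getKey());
                if (value == null) {
                    out.append(": NULL\n");
                } else if (value instanceof Node) {
                    out.append('\n');
                    ((Node) value).appendTo(out, indent + "  ");
                } else {
                    out.append(": ").append(value).append('\n');
                }
            }
        }
    }
}
```

For example, `new Node().add("list", new Node().add("element", "abc")).toString()` produces a `list` line followed by an indented `element: abc` line.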
The existing implementation exhibits the well-known Java string concatenation performance problem and should be fixed.
This was a significant problem for us but we were able to work around it so I am marking this issue as "Minor".
--
This message was sent by Atlassian Jira
(v8.3.4#803005)