Posted to dev@parquet.apache.org by "Gabor Szadovszky (Jira)" <ji...@apache.org> on 2020/05/07 12:34:00 UTC
[jira] [Resolved] (PARQUET-1808) SimpleGroup.toString() uses String += and so has poor performance
[ https://issues.apache.org/jira/browse/PARQUET-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gabor Szadovszky resolved PARQUET-1808.
---------------------------------------
Resolution: Fixed
> SimpleGroup.toString() uses String += and so has poor performance
> -----------------------------------------------------------------
>
> Key: PARQUET-1808
> URL: https://issues.apache.org/jira/browse/PARQUET-1808
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.11.0
> Reporter: Randy Tidd
> Assignee: Shankar Koirala
> Priority: Minor
> Labels: pull-request-available
>
> This method in SimpleGroup uses `+=` for String concatenation, which is a known performance problem in Java: each `+=` copies the entire accumulated string into a new one, so the total cost grows quadratically with the number of strings appended.
> [https://github.com/apache/parquet-mr/blob/d69192809d0d5ec36c0d8c126c8bed09ee3cee35/parquet-column/src/main/java/org/apache/parquet/example/data/simple/SimpleGroup.java#L50]
> We ran into a performance problem whereby a single column in a Parquet file was defined as a group:
> {code:java}
> optional group customer_ids (LIST) {
>   repeated group list {
>     optional binary element (STRING);
>   }
> }{code}
>
> and had over 31,000 values. Reading this single column took over 8 minutes due to time spent in the `toString()` method. Using a different implementation that uses `StringBuffer` like this:
> {code:java}
> StringBuffer result = new StringBuffer();
> int i = 0;
> for (Type field : schema.getFields()) {
>   String name = field.getName();
>   List<Object> values = data[i];
>   ++i;
>   if (values != null) {
>     if (values.size() > 0) {
>       for (Object value : values) {
>         result.append(indent);
>         result.append(name);
>         if (value == null) {
>           result.append(": NULL\n");
>         } else if (value instanceof Group) {
>           result.append("\n");
>           result.append(betterToString((SimpleGroup)value, indent+" "));
>         } else {
>           result.append(": ");
>           result.append(value.toString());
>           result.append("\n");
>         }
>       }
>     }
>   }
> }
> return result.toString();{code}
> reduced that time to less than 500 milliseconds.
> The existing implementation exhibits a well-known Java string concatenation performance problem and should be fixed.
> This was a significant problem for us but we were able to work around it so I am marking this issue as "Minor".
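
The difference described in the report can be reproduced outside Parquet with a small, self-contained sketch. This is not the SimpleGroup code: the class name `ConcatSketch`, both method names, the field name "element", and the value count are hypothetical and chosen only for illustration. It also uses `StringBuilder`, the unsynchronized counterpart of the `StringBuffer` used in the report, which is the usual choice when the buffer is not shared between threads.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; none of these names come from parquet-mr.
class ConcatSketch {

    // Each += allocates a new String and copies everything accumulated so far,
    // so the total work grows quadratically with the number of values.
    static String joinWithPlusEquals(List<String> values, String name) {
        String result = "";
        for (String value : values) {
            result += name + ": " + value + "\n";
        }
        return result;
    }

    // StringBuilder appends into a growable buffer, so the total work is
    // roughly linear in the length of the output.
    static String joinWithBuilder(List<String> values, String name) {
        StringBuilder result = new StringBuilder();
        for (String value : values) {
            result.append(name).append(": ").append(value).append('\n');
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // 10,000 is an arbitrary size, large enough for the gap to be visible.
        List<String> values = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            values.add("id-" + i);
        }

        long t0 = System.nanoTime();
        String slow = joinWithPlusEquals(values, "element");
        long t1 = System.nanoTime();
        String fast = joinWithBuilder(values, "element");
        long t2 = System.nanoTime();

        System.out.println("same output: " + slow.equals(fast));
        System.out.println("+= took ms: " + (t1 - t0) / 1_000_000);
        System.out.println("StringBuilder took ms: " + (t2 - t1) / 1_000_000);
    }
}
```

Both methods produce identical output; only the time spent differs, and the gap widens as the number of values grows. Actual timings depend on the JVM and hardware, so none are quoted here.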