Posted to dev@parquet.apache.org by "Yan Zhou.sc" <Ya...@huawei.com> on 2014/10/21 18:43:01 UTC

A performance irregularity?

Hi,

We have a Parquet file with more than 1000 columns of nested types, and the columns are sparse: in each row, most columns are null.
Writing the file is very slow and CPU-bound. The profiler shows that MessageColumnIORecordConsumer.writeNull is called
recursively, and each level of recursion sees roughly 35 times as many invocations as the level above it.

The following code in MessageColumnIO.java shows where the problem could be:


    private void writeNull(ColumnIO undefinedField, int r, int d) {
      if (undefinedField.getType().isPrimitive()) {
        columnWriter[((PrimitiveColumnIO)undefinedField).getId()].writeNull(r, d);
      } else {
        GroupColumnIO groupColumnIO = (GroupColumnIO)undefinedField;
        int childrenCount = groupColumnIO.getChildrenCount();
        for (int i = 0; i < childrenCount; i++) {
          writeNull(groupColumnIO.getChild(i), r, d);
        }
      }
    }


The recursive call inside the loop (highlighted in red in the original message) seems to be what causes the explosion in the number of invocations.

My question is: since writeNull is only called for a field that is missing at some level, and all of its descendants are therefore known to be missing and their count is known from the schema, is there a more efficient way to store this information than writing a missing indicator for every descendant, as is done now?

Or is there a workaround to avoid this "trap" for now?


Thanks for your help!

Re: A performance irregularity?

Posted by Julien Le Dem <ju...@twitter.com.INVALID>.
It is certainly possible to avoid the recursion and improve this.
As you mentioned, the schema is known in advance.
Pull requests are welcome if you want to take a stab at it.
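
One way to act on the point that the schema is known in advance is to walk the schema once, cache the leaf column ids under every group, and turn the per-record writeNull into a flat loop. The following is only a minimal sketch of that idea, not the parquet-mr implementation: LeafCacheSketch, Node, NullSink and precomputeLeaves are hypothetical stand-ins for ColumnIO, GroupColumnIO and the column writers. It removes the recursion and the type checks from the hot path, but each leaf column still receives one null entry per record.

```java
import java.util.Arrays;

/**
 * Hypothetical, simplified model of a schema tree. These are NOT the
 * parquet-mr classes, only stand-ins to illustrate the idea.
 */
final class LeafCacheSketch {

  /** Receives one null entry for a leaf column (stand-in for a column writer). */
  interface NullSink {
    void writeNull(int columnId, int r, int d);
  }

  /** A schema node: a primitive leaf with a column id, or a group of children. */
  static final class Node {
    final boolean primitive;
    final int columnId;     // meaningful only when primitive is true
    final Node[] children;  // empty for leaves
    int[] leafIds;          // filled in once by precomputeLeaves

    private Node(boolean primitive, int columnId, Node[] children) {
      this.primitive = primitive;
      this.columnId = columnId;
      this.children = children;
    }

    static Node leaf(int columnId) {
      return new Node(true, columnId, new Node[0]);
    }

    static Node group(Node... children) {
      return new Node(false, -1, children);
    }
  }

  /** Walks the schema once and caches, on every node, the ids of all leaves below it. */
  static int[] precomputeLeaves(Node node) {
    if (node.primitive) {
      node.leafIds = new int[] { node.columnId };
      return node.leafIds;
    }
    int[] all = new int[0];
    for (Node child : node.children) {
      int[] part = precomputeLeaves(child);
      int oldLength = all.length;
      all = Arrays.copyOf(all, oldLength + part.length);
      System.arraycopy(part, 0, all, oldLength, part.length);
    }
    node.leafIds = all;
    return all;
  }

  /** Per-record path: a flat loop over the cached leaf ids, with no recursion. */
  static void writeNull(Node undefinedField, int r, int d, NullSink sink) {
    for (int id : undefinedField.leafIds) {
      sink.writeNull(id, r, d);
    }
  }
}
```

In this sketch precomputeLeaves would be called once when the schema is set up, and writeNull would then be called per missing field with the same r and d arguments as before.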
