Posted to issues@spark.apache.org by "Catalin Toda (Jira)" <ji...@apache.org> on 2021/10/13 22:31:00 UTC
[jira] [Commented] (SPARK-35640) Refactor Parquet vectorized reader to remove duplicated code paths

    [ https://issues.apache.org/jira/browse/SPARK-35640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428520#comment-17428520 ] 

Catalin Toda commented on SPARK-35640:
--------------------------------------

I opened https://issues.apache.org/jira/browse/SPARK-36990, which seems to be related to this change as well.
In my environment, logicalTypeAnnotation appears to be null. This PR relies on logicalTypeAnnotation, so any check based on it would then always return false.

> Refactor Parquet vectorized reader to remove duplicated code paths
> ------------------------------------------------------------------
>
>                 Key: SPARK-35640
>                 URL: https://issues.apache.org/jira/browse/SPARK-35640
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>            Priority: Major
>             Fix For: 3.2.0
>
>
> Currently, the Parquet vectorized code path contains a lot of duplicated code, such as the following:
> {code:java}
>   public void readIntegers(
>       int total,
>       WritableColumnVector c,
>       int rowId,
>       int level,
>       VectorizedValuesReader data) throws IOException {
>     int left = total;
>     while (left > 0) {
>       if (this.currentCount == 0) this.readNextGroup();
>       int n = Math.min(left, this.currentCount);
>       switch (mode) {
>         case RLE:
>           if (currentValue == level) {
>             data.readIntegers(n, c, rowId);
>           } else {
>             c.putNulls(rowId, n);
>           }
>           break;
>         case PACKED:
>           for (int i = 0; i < n; ++i) {
>             if (currentBuffer[currentBufferIdx++] == level) {
>               c.putInt(rowId + i, data.readInteger());
>             } else {
>               c.putNull(rowId + i);
>             }
>           }
>           break;
>       }
>       rowId += n;
>       left -= n;
>       currentCount -= n;
>     }
>   }
> {code}
> This makes the code hard to maintain, since any change has to be replicated in 20+ places. The issue will become more serious once we implement column index and complex type support for the vectorized path.
> The duplication was originally introduced for performance. However, nowadays JIT compilers tend to handle this well and will inline virtual calls as much as possible.
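
One way to picture the refactoring described above is to keep a single copy of the RLE/PACKED definition-level loop and move the type-specific work behind a small interface. The sketch below is only an illustration under simplifying assumptions: `Updater`, `IntColumn`, `Run`, and `readBatch` are hypothetical names invented here, not Spark's actual classes, and the `total`/`currentCount` bookkeeping from the quoted method is replaced by a pre-decoded array of runs.

```java
import java.util.Arrays;

final class DefLevelSketch {
  /** Hypothetical type-specific hook: the only part that differs per value type. */
  interface Updater<V> {
    void readValues(int n, int rowId, V vector); // copy n non-null values starting at rowId
    void readValue(int rowId, V vector);         // copy a single non-null value
    void putNulls(int n, int rowId, V vector);   // mark n consecutive slots as null
  }

  /** Simplified stand-in for an int column vector (not WritableColumnVector). */
  static final class IntColumn {
    final int[] values;
    final boolean[] isNull;
    IntColumn(int capacity) {
      values = new int[capacity];
      isNull = new boolean[capacity];
    }
  }

  /** One run of definition levels: RLE (one repeated level) or PACKED (one level per row). */
  static final class Run {
    final boolean rle;
    final int[] levels;
    final int count;
    private Run(boolean rle, int[] levels, int count) {
      this.rle = rle;
      this.levels = levels;
      this.count = count;
    }
    static Run rle(int level, int count) { return new Run(true, new int[] {level}, count); }
    static Run packed(int... levels) { return new Run(false, levels, levels.length); }
  }

  /** The shared loop: written once, reused for every value type through the updater. */
  static <V> void readBatch(Run[] runs, int maxLevel, V vector, Updater<V> updater) {
    int rowId = 0;
    for (Run run : runs) {
      if (run.rle) {
        if (run.levels[0] == maxLevel) {
          updater.readValues(run.count, rowId, vector);
        } else {
          updater.putNulls(run.count, rowId, vector);
        }
      } else {
        for (int i = 0; i < run.count; i++) {
          if (run.levels[i] == maxLevel) {
            updater.readValue(rowId + i, vector);
          } else {
            updater.putNulls(1, rowId + i, vector);
          }
        }
      }
      rowId += run.count;
    }
  }

  /** Int-specific updater that pulls values from an in-memory array (illustrative only). */
  static final class IntUpdater implements Updater<IntColumn> {
    private final int[] data;
    private int pos = 0;
    IntUpdater(int[] data) { this.data = data; }
    @Override public void readValues(int n, int rowId, IntColumn c) {
      System.arraycopy(data, pos, c.values, rowId, n);
      pos += n;
    }
    @Override public void readValue(int rowId, IntColumn c) { c.values[rowId] = data[pos++]; }
    @Override public void putNulls(int n, int rowId, IntColumn c) {
      Arrays.fill(c.isNull, rowId, rowId + n, true);
    }
  }

  /** Decodes six rows: two RLE non-nulls, a PACKED run (value, null, value), one RLE null. */
  static IntColumn demo() {
    Run[] runs = { Run.rle(1, 2), Run.packed(1, 0, 1), Run.rle(0, 1) };
    IntColumn column = new IntColumn(6);
    readBatch(runs, 1, column, new IntUpdater(new int[] {10, 20, 30, 40}));
    return column;
  }

  public static void main(String[] args) {
    IntColumn column = demo();
    System.out.println(Arrays.toString(column.values));
    System.out.println(Arrays.toString(column.isNull));
  }
}
```

With this shape, supporting another value type would mean adding another small `Updater` implementation rather than another copy of the loop. Whether the JIT inlines the interface calls well enough to match the hand-duplicated version is something that would need to be measured.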


