Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2016/06/20 03:05:07 UTC

[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/13775

    [SPARK-16060][SQL] Vectorized Orc reader

    ## What changes were proposed in this pull request?
    
    Currently, the Orc reader in Spark SQL doesn't support vectorized reading. Since Hive's Orc reader already supports vectorization, we can add this support to improve Orc read performance.
    
    ### Benchmark
    
    Benchmark code:
    
        test("Benchmark for Orc") {
          val N = 500 << 12
          withOrcTable((0 until N).map(i => (i, i.toString, i.toLong, i.toDouble)), "t") {
            val benchmark = new Benchmark("Orc reader", N)
            benchmark.addCase("reading Orc file", 10) { iter =>
              sql("SELECT * FROM t").collect()
            }
            benchmark.run()
          }
        }
    
    Before this patch:
    
        Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
        Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
        Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        reading Orc file                              4750 / 5266          0.4        2319.1       1.0X
    
    After this patch:
    
        Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
        Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
        Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        reading Orc file                              3550 / 3824          0.6        1733.2       1.0X
    
    
    
    ## How was this patch tested?
    Existing tests.
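
The derived columns in the two benchmark tables above can be cross-checked by hand. The sketch below assumes that `Per Row(ns)` and `Rate(M/s)` are computed from the best time over `N = 500 << 12` rows; that assumption is consistent with the reported figures but is not stated in the thread, and the class and method names are made up for illustration. On the best times the patch comes out at about 1.34x (4750 / 3550). The `1.0X` in both `Relative` columns is presumably relative to the first case of the same run, so it does not express the before/after gain.

```java
// Sketch only: recomputes the derived benchmark columns from the reported best times.
final class OrcBenchmarkMath {
    // Row count used by the benchmark code: 500 << 12 = 2,048,000.
    static final long ROWS = 500L << 12;

    /** Nanoseconds per row for a run that took bestMs milliseconds (assumed formula). */
    static double perRowNs(double bestMs) {
        return bestMs * 1_000_000.0 / ROWS;
    }

    /** Millions of rows read per second for a run that took bestMs milliseconds. */
    static double rateMillionsPerSec(double bestMs) {
        return ROWS / (bestMs / 1000.0) / 1_000_000.0;
    }

    /** Ratio of the time before the patch to the time after it. */
    static double speedup(double beforeMs, double afterMs) {
        return beforeMs / afterMs;
    }

    public static void main(String[] args) {
        System.out.printf("before: %.1f ns/row, %.2f M rows/s%n",
            perRowNs(4750), rateMillionsPerSec(4750));
        System.out.printf("after:  %.1f ns/row, %.2f M rows/s%n",
            perRowNs(3550), rateMillionsPerSec(3550));
        System.out.printf("speedup on best time: %.2fx%n", speedup(4750, 3550));
        System.out.printf("speedup on avg time:  %.2fx%n", speedup(5266, 3824));
    }
}
```

The recomputed per-row values differ from the table in the last digit or two, which would be expected if the reported best times are rounded to whole milliseconds.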
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 vectorized-orc-reader3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13775.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13775
    
----
commit 2861ac2a5136c065ec38cfc24bf9f979d5b7ae07
Author: Liang-Chi Hsieh <si...@tw.ibm.com>
Date:   2016-06-16T02:31:23Z

    Add vectorized Orc reader support.

commit eee8eca70920d624becb43c8510d217ce4d9820b
Author: Liang-Chi Hsieh <si...@tw.ibm.com>
Date:   2016-06-17T09:44:11Z

    import.

commit b753d09e3e369fc91a17d9632123dbe40d7d9dfb
Author: Liang-Chi Hsieh <si...@tw.ibm.com>
Date:   2016-06-18T10:00:00Z

    If column is repeating, always using row id 0.

commit 7d26f5ed785269299b324df8bfc1c64c2d4a2b48
Author: Liang-Chi Hsieh <si...@tw.ibm.com>
Date:   2016-06-19T04:16:49Z

    Fix bugs of getBinary and numFields.

commit 74fe936e522a827384461e445b9ba44f96ce29fe
Author: Liang-Chi Hsieh <si...@tw.ibm.com>
Date:   2016-06-20T02:44:07Z

    Remove unnecessary change.

commit 7e7bb6c57860187f391f66ca82cdd715d0b2be43
Author: Liang-Chi Hsieh <si...@tw.ibm.com>
Date:   2016-06-20T02:48:11Z

    Remove unnecessary change.

----
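
The commit "If column is repeating, always using row id 0" refers to a convention of Hive's column vectors: when `isRepeating` is set, only element 0 is meaningful and it stands for every row in the batch. A minimal sketch of that lookup rule follows. `LongVec` is a hypothetical stand-in written for this note, not Hive's `LongColumnVector`, and it is not the code from the pull request.

```java
// Sketch only: a stand-in column vector that honours the "repeating" convention.
final class LongVec {
    final long[] vector;
    final boolean[] isNull;
    boolean noNulls = true;      // true when no entry in the batch is null
    boolean isRepeating = false; // true when element 0 stands for every row

    LongVec(int size) {
        this.vector = new long[size];
        this.isNull = new boolean[size];
    }

    /** Maps a logical row id to the physical slot that actually holds its value. */
    private int slot(int rowId) {
        return isRepeating ? 0 : rowId;
    }

    boolean isNullAt(int rowId) {
        return !noNulls && isNull[slot(rowId)];
    }

    long getLong(int rowId) {
        return vector[slot(rowId)];
    }

    public static void main(String[] args) {
        LongVec col = new LongVec(1024);
        col.vector[0] = 7L;
        col.isRepeating = true;
        // Every row id resolves to slot 0 while the vector is repeating.
        System.out.println(col.getLong(0) + " " + col.getLong(500));
    }
}
```

Reading `vector[rowId]` directly on a repeating vector would return whatever stale value happens to be in that slot, which is the kind of bug the commit message suggests was fixed.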


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #69156 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69156/consoleFull)** for PR 13775 at commit [`0ac61b7`](https://github.com/apache/spark/commit/0ac61b794146634887d184076aababfd25a22ff5).


[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    @hvanhovell @rxin I've updated the benchmark. Please let me know whether it is appropriate this time. Thanks!


[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    ping @yhuai @liancheng @hvanhovell @cloud-fan Can you take a look at this? Thanks.


[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    This is an interesting pull request (very similar to the Parquet approach). Two high-level comments:
    
    1. Can we just convert Hive's vector format into our own column batch, to avoid turning the data into rows? I'd imagine you will get a bigger speedup here.
    
    2. We need more tests. This patch has almost no tests.
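
The first comment above contrasts two ways of consuming a Hive vector batch: building one row object per record, or copying each column wholesale into another columnar structure. The sketch below illustrates only the second idea, under the assumption that both sides store a long column as a plain `long[]`. The class and method names are hypothetical, and neither Hive's nor Spark's real batch classes are used.

```java
import java.util.Arrays;

// Sketch only: column-at-a-time copy between two columnar buffers.
final class ColumnCopySketch {
    /**
     * Copies the first `size` values of a source column into `dest`.
     * A repeating source column is expanded from its single value in slot 0.
     */
    static void copyLongColumn(long[] src, boolean isRepeating, long[] dest, int size) {
        if (isRepeating) {
            Arrays.fill(dest, 0, size, src[0]);
        } else {
            System.arraycopy(src, 0, dest, 0, size);
        }
    }

    public static void main(String[] args) {
        long[] src = {1L, 2L, 3L, 4L};
        long[] dest = new long[4];
        copyLongColumn(src, false, dest, 3);
        System.out.println(Arrays.toString(dest));
    }
}
```

A copy like this touches each column once per batch instead of once per row. Whether it would deliver the larger speedup suggested in the comment is not measured anywhere in this thread.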


[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #69156 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69156/consoleFull)** for PR 13775 at commit [`0ac61b7`](https://github.com/apache/spark/commit/0ac61b794146634887d184076aababfd25a22ff5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66436/
    Test PASSed.


[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60827/
    Test FAILed.


[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #61037 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61037/consoleFull)** for PR 13775 at commit [`855bcfd`](https://github.com/apache/spark/commit/855bcfde2067af4bd88d95a6365f976ecf891de9).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61037/
    Test FAILed.


[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #69082 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69082/consoleFull)** for PR 13775 at commit [`c297678`](https://github.com/apache/spark/commit/c2976788255588d66ad2527646e0719e32bdf182).


[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #69133 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69133/consoleFull)** for PR 13775 at commit [`55bb19f`](https://github.com/apache/spark/commit/55bb19f91658767acf08e06ee7e64db27a7222aa).


[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    @yhuai You mean just using `sql("SELECT * FROM t").count()`?


[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    retest this please.


[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Merged build finished. Test FAILed.


[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #61088 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61088/consoleFull)** for PR 13775 at commit [`66ab632`](https://github.com/apache/spark/commit/66ab632274674ae5b38c84bac8801feab3c9d2e0).


[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Merged build finished. Test PASSed.


[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83796409
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
    @@ -118,6 +120,11 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
         val broadcastedHadoopConf =
           sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
     
    +    val enableVectorizedReader: Boolean =
    +      sparkSession.sessionState.conf.orcVectorizedReaderEnabled &&
    +      dataSchema.forall(f => f.dataType.isInstanceOf[AtomicType] &&
    --- End diff --
    
    This is similar to what ParquetFileFormat does. We are unlikely to add new `AtomicType`s frequently, and the current set of `AtomicType`s should be relatively stable. If we do add one, tests should quickly reveal that the data type is not supported by the reader code.
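
The gating idea in the quoted diff (use the vectorized path only when the configuration flag is on and every field in the schema is atomic) can be sketched in isolation. The quoted condition is cut off after its second clause, so the sketch covers only the part that is visible. `Kind` and `enableVectorizedReader` below are hypothetical stand-ins, not Spark's `DataType` hierarchy or the code under review.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: decide whether a schema qualifies for the vectorized read path.
final class VectorizedGateSketch {
    /** Hypothetical stand-in for a field's type: atomic (int, string, ...) or nested. */
    enum Kind { ATOMIC, STRUCT, ARRAY, MAP }

    /** True only when the feature flag is on and no field has a nested type. */
    static boolean enableVectorizedReader(boolean confEnabled, List<Kind> schema) {
        return confEnabled && schema.stream().allMatch(k -> k == Kind.ATOMIC);
    }

    public static void main(String[] args) {
        List<Kind> flat = Arrays.asList(Kind.ATOMIC, Kind.ATOMIC);
        List<Kind> nested = Arrays.asList(Kind.ATOMIC, Kind.ARRAY);
        System.out.println(enableVectorizedReader(true, flat));
        System.out.println(enableVectorizedReader(true, nested));
    }
}
```

With a check of this shape, a newly introduced atomic type would pass the gate automatically, which is why the comment relies on tests to catch a type the reader cannot actually decode.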
    



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    @dongjoon-hyun No problem.


[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #61384 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61384/consoleFull)** for PR 13775 at commit [`4c14278`](https://github.com/apache/spark/commit/4c14278d067e37cba569de73d24ba8f23c6eb450).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #69124 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69124/consoleFull)** for PR 13775 at commit [`8638a0e`](https://github.com/apache/spark/commit/8638a0e2b98719770bff50804dcc0fc0e83674ad).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r89127052
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/SparkVectorizedOrcRecordReader.java ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    +import org.apache.hadoop.hive.serde2.typeinfo.DecimalTypeInfo;
    +import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.FileSplit;
    +import org.apache.hadoop.mapred.RecordReader;
    +
    +/**
    + * A mapred.RecordReader that returns VectorizedRowBatch.
    + */
    +public class SparkVectorizedOrcRecordReader
    +      implements RecordReader<NullWritable, VectorizedRowBatch> {
    +    private final org.apache.hadoop.hive.ql.io.orc.RecordReader reader;
    +    private final long offset;
    +    private final long length;
    +    private float progress = 0.0f;
    +    private ObjectInspector objectInspector;
    +
    +    SparkVectorizedOrcRecordReader(Reader file, Configuration conf,
    +        FileSplit fileSplit) throws IOException {
    +      this.offset = fileSplit.getStart();
    +      this.length = fileSplit.getLength();
    +      this.objectInspector = file.getObjectInspector();
    +      this.reader = OrcInputFormat.createReaderFromFile(file, conf, this.offset,
    +        this.length);
    +      this.progress = reader.getProgress();
    +    }
    +
    +    /**
    +     * Create a ColumnVector based on given ObjectInspector's type info.
    +     *
    +     * @param inspector ObjectInspector
    +     */
    +    private ColumnVector createColumnVector(ObjectInspector inspector) {
    +      switch(inspector.getCategory()) {
    +        case PRIMITIVE:
    +          {
    +            PrimitiveTypeInfo primitiveTypeInfo =
    +              (PrimitiveTypeInfo) ((PrimitiveObjectInspector)inspector).getTypeInfo();
    +            switch(primitiveTypeInfo.getPrimitiveCategory()) {
    +              case BOOLEAN:
    +              case BYTE:
    +              case SHORT:
    +              case INT:
    +              case LONG:
    +              case DATE:
    +              case INTERVAL_YEAR_MONTH:
    +                return new LongColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case FLOAT:
    +              case DOUBLE:
    +                return new DoubleColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case BINARY:
    +              case STRING:
    +              case CHAR:
    +              case VARCHAR:
    +                BytesColumnVector column = new BytesColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +                column.initBuffer();
    +                return column;
    +              case DECIMAL:
    +                DecimalTypeInfo tInfo = (DecimalTypeInfo) primitiveTypeInfo;
    +                return new DecimalColumnVector(VectorizedRowBatch.DEFAULT_SIZE,
    +                    tInfo.precision(), tInfo.scale());
    +              default:
    +                throw new RuntimeException("Vectorization is not supported for datatype:"
    +                    + primitiveTypeInfo.getPrimitiveCategory());
    +            }
    +          }
    +        default:
    +          throw new RuntimeException("Vectorization is not supported for datatype:"
    +              + inspector.getCategory());
    +      }
    +    }
    +
    +    /**
    +     * Walk through the object inspector and add column vectors
    +     *
    +     * @param oi StructObjectInspector
    +     * @param cvList ColumnVectors are populated in this list
    +     */
    +    private void allocateColumnVector(StructObjectInspector oi,
    +        List<ColumnVector> cvList) {
    +      if (cvList == null) {
    +        throw new RuntimeException("Null columnvector list");
    +      }
    +      if (oi == null) {
    +        return;
    +      }
    +      final List<? extends StructField> fields = oi.getAllStructFieldRefs();
    +      for(StructField field : fields) {
    +        ObjectInspector fieldObjectInspector = field.getFieldObjectInspector();
    +        cvList.add(createColumnVector(fieldObjectInspector));
    +      }
    +    }
    +
    +    /**
    +     * Create VectorizedRowBatch from ObjectInspector
    +     *
    +     * @param oi StructObjectInspector
    +     * @return VectorizedRowBatch
    +     */
    +    private VectorizedRowBatch constructVectorizedRowBatch(
    +        StructObjectInspector oi) {
    +      final List<ColumnVector> cvList = new LinkedList<ColumnVector>();
    +      allocateColumnVector(oi, cvList);
    +      final VectorizedRowBatch result = new VectorizedRowBatch(cvList.size());
    +      int i = 0;
    +      for(ColumnVector cv : cvList) {
    +        result.cols[i++] = cv;
    +      }
    +      return result;
    +    }
    +
    +    @Override
    +    public boolean next(NullWritable key, VectorizedRowBatch value) throws IOException {
    --- End diff --
    
    Do you mean the case where a batch is empty but the next batch has some rows? I don't think that is possible.
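
The reply above appears to concern what a consumer should do on an empty batch. The body of `next` is not shown in the quoted diff, so the sketch below models only the two possible consumer policies, with batches reduced to their row counts. All names are hypothetical.

```java
// Sketch only: two ways a consumer might react to an empty batch.
final class EmptyBatchSketch {
    /** Treats the first empty batch as the end of input. */
    static long rowsStoppingAtEmpty(int[] batchSizes) {
        long rows = 0;
        for (int size : batchSizes) {
            if (size == 0) {
                break; // assume nothing follows an empty batch
            }
            rows += size;
        }
        return rows;
    }

    /** Skips empty batches and keeps reading until the source is exhausted. */
    static long rowsSkippingEmpty(int[] batchSizes) {
        long rows = 0;
        for (int size : batchSizes) {
            if (size == 0) {
                continue; // tolerate an empty batch in the middle of the stream
            }
            rows += size;
        }
        return rows;
    }

    public static void main(String[] args) {
        int[] sizes = {2, 0, 3}; // an empty batch followed by a non-empty one
        System.out.println(rowsStoppingAtEmpty(sizes));
        System.out.println(rowsSkippingEmpty(sizes));
    }
}
```

The two policies agree whenever an empty batch can only appear at the end of the stream, which is the situation the reply assumes. They diverge only if the underlying reader can emit an empty batch mid-stream.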


[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68995/
    Test PASSed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #63498 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63498/consoleFull)** for PR 13775 at commit [`b067658`](https://github.com/apache/spark/commit/b067658c53a3252f0a8a288e09b07feaf0ace8d4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83763831
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
    @@ -118,6 +120,11 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
         val broadcastedHadoopConf =
           sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
     
    +    val enableVectorizedReader: Boolean =
    +      sparkSession.sessionState.conf.orcVectorizedReaderEnabled &&
    +      dataSchema.forall(f => f.dataType.isInstanceOf[AtomicType] &&
    --- End diff --
    
    This is not reliable. If a new `AtomicType` is introduced that vectorized reads do not support, this check will not guard us against it. It's easy for anyone to forget to add a new atomic type to the exclusion list in this check.
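
One way to address the concern above is to invert the check: list the supported types explicitly, so that any newly introduced type is rejected until someone deliberately adds it. The following is only a minimal sketch of that idea. `SketchType` and `VectorizedSupport` are hypothetical stand-ins, not Spark's `DataType` classes, and the contents of the allow-list are illustrative.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for Spark SQL data types (not the real API).
enum SketchType {
    BOOLEAN, BYTE, SHORT, INT, LONG, FLOAT, DOUBLE,
    STRING, BINARY, DATE, DECIMAL, TIMESTAMP, ARRAY, MAP, STRUCT
}

class VectorizedSupport {
    // Explicit allow-list (illustrative contents): a type is eligible only
    // if it was deliberately added here, so new types are excluded by default.
    static final Set<SketchType> SUPPORTED = EnumSet.of(
        SketchType.BOOLEAN, SketchType.BYTE, SketchType.SHORT,
        SketchType.INT, SketchType.LONG, SketchType.FLOAT,
        SketchType.DOUBLE, SketchType.STRING, SketchType.BINARY,
        SketchType.DATE, SketchType.DECIMAL);

    // True only when the feature flag is on and every column type is allow-listed.
    static boolean canUseVectorizedReader(boolean enabledInConf, List<SketchType> schema) {
        return enabledInConf && SUPPORTED.containsAll(schema);
    }
}
```

With an exclusion list, forgetting to update the check silently enables the vectorized path for an unsupported type. With an allow-list, the same omission only disables an optimization.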



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #69131 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69131/consoleFull)** for PR 13775 at commit [`8638a0e`](https://github.com/apache/spark/commit/8638a0e2b98719770bff50804dcc0fc0e83674ad).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60828/
    Test PASSed.



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83752300
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/SparkVectorizedOrcRecordReader.java ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    +import org.apache.hadoop.hive.serde2.typeinfo.DecimalTypeInfo;
    +import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.FileSplit;
    +import org.apache.hadoop.mapred.RecordReader;
    +
    +/**
    + * A mapred.RecordReader that returns VectorizedRowBatch.
    + */
    +public class SparkVectorizedOrcRecordReader
    +      implements RecordReader<NullWritable, VectorizedRowBatch> {
    +    private final org.apache.hadoop.hive.ql.io.orc.RecordReader reader;
    +    private final long offset;
    +    private final long length;
    +    private float progress = 0.0f;
    +    private ObjectInspector objectInspector;
    +
    +    SparkVectorizedOrcRecordReader(Reader file, Configuration conf,
    +        FileSplit fileSplit) throws IOException {
    +      this.offset = fileSplit.getStart();
    +      this.length = fileSplit.getLength();
    +      this.objectInspector = file.getObjectInspector();
    +      this.reader = OrcInputFormat.createReaderFromFile(file, conf, this.offset,
    +        this.length);
    +      this.progress = reader.getProgress();
    +    }
    +
    +    /**
    +     * Create a ColumnVector based on given ObjectInspector's type info.
    +     *
    +     * @param inspector ObjectInspector
    +     */
    +    private ColumnVector createColumnVector(ObjectInspector inspector) {
    +      switch(inspector.getCategory()) {
    +        case PRIMITIVE:
    +          {
    +            PrimitiveTypeInfo primitiveTypeInfo =
    +              (PrimitiveTypeInfo) ((PrimitiveObjectInspector)inspector).getTypeInfo();
    +            switch(primitiveTypeInfo.getPrimitiveCategory()) {
    +              case BOOLEAN:
    +              case BYTE:
    +              case SHORT:
    +              case INT:
    +              case LONG:
    +              case DATE:
    +              case INTERVAL_YEAR_MONTH:
    +                return new LongColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case FLOAT:
    +              case DOUBLE:
    +                return new DoubleColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case BINARY:
    +              case STRING:
    +              case CHAR:
    +              case VARCHAR:
    +                BytesColumnVector column = new BytesColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +                column.initBuffer();
    +                return column;
    +              case DECIMAL:
    +                DecimalTypeInfo tInfo = (DecimalTypeInfo) primitiveTypeInfo;
    +                return new DecimalColumnVector(VectorizedRowBatch.DEFAULT_SIZE,
    +                    tInfo.precision(), tInfo.scale());
    +              default:
    +                throw new RuntimeException("Vectorizaton is not supported for datatype:"
    +                    + primitiveTypeInfo.getPrimitiveCategory());
    +            }
    +          }
    +        default:
    +          throw new RuntimeException("Vectorization is not supported for datatype:"
    --- End diff --
    
    The exception message is not actionable for the end user. There are two options:
    - Automatically fall back to the non-vectorized code path, which we know would work, OR
    - Suggest the config change(s) in the message so that users can apply them without having to google the solution.
    
    The latter is the easier path.
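
The second option could look roughly like the sketch below. It is an illustration only: `VectorizationErrors` is a hypothetical helper, and the configuration key is a placeholder, since the real key behind `orcVectorizedReaderEnabled` is not shown in this thread.

```java
class VectorizationErrors {
    // Placeholder key for illustration only; the real configuration name
    // is an assumption and must be taken from the actual SQLConf entry.
    static final String DISABLE_KEY = "example.orc.vectorizedReader.enabled";

    // Builds a message that names the offending type and the remedy.
    static String unsupportedTypeMessage(String typeName) {
        return "Vectorization is not supported for datatype: " + typeName
            + ". Set " + DISABLE_KEY + "=false to read this data with the"
            + " non-vectorized ORC reader.";
    }

    // Convenience factory so call sites can simply `throw unsupportedType(...)`.
    static RuntimeException unsupportedType(String typeName) {
        return new UnsupportedOperationException(unsupportedTypeMessage(typeName));
    }
}
```

Both `default` branches in `createColumnVector` could then share one helper, which would also keep the wording of the two messages consistent.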



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #69148 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69148/consoleFull)** for PR 13775 at commit [`3014834`](https://github.com/apache/spark/commit/3014834391906264797b38d2e1b2e50b8e6d5327).



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #69124 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69124/consoleFull)** for PR 13775 at commit [`8638a0e`](https://github.com/apache/spark/commit/8638a0e2b98719770bff50804dcc0fc0e83674ad).



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61088/
    Test PASSed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    > 1. Can we just convert Hive's vector format into our own column batch, to avoid turning the data into rows? I'd imagine you will get a bigger speedup here.
    
    @rxin You are right. After converting Hive's column vectors into Spark's column batch, performance improves substantially. I've posted the newer benchmark in the PR description.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #69080 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69080/consoleFull)** for PR 13775 at commit [`47bc196`](https://github.com/apache/spark/commit/47bc196a8e91f280cda235c60d0974c7c69fb0ad).



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83753744
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/SparkVectorizedOrcRecordReader.java ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    +import org.apache.hadoop.hive.serde2.typeinfo.DecimalTypeInfo;
    +import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.FileSplit;
    +import org.apache.hadoop.mapred.RecordReader;
    +
    +/**
    + * A mapred.RecordReader that returns VectorizedRowBatch.
    + */
    +public class SparkVectorizedOrcRecordReader
    +      implements RecordReader<NullWritable, VectorizedRowBatch> {
    +    private final org.apache.hadoop.hive.ql.io.orc.RecordReader reader;
    +    private final long offset;
    +    private final long length;
    +    private float progress = 0.0f;
    +    private ObjectInspector objectInspector;
    +
    +    SparkVectorizedOrcRecordReader(Reader file, Configuration conf,
    +        FileSplit fileSplit) throws IOException {
    +      this.offset = fileSplit.getStart();
    +      this.length = fileSplit.getLength();
    +      this.objectInspector = file.getObjectInspector();
    +      this.reader = OrcInputFormat.createReaderFromFile(file, conf, this.offset,
    +        this.length);
    +      this.progress = reader.getProgress();
    +    }
    +
    +    /**
    +     * Create a ColumnVector based on given ObjectInspector's type info.
    +     *
    +     * @param inspector ObjectInspector
    +     */
    +    private ColumnVector createColumnVector(ObjectInspector inspector) {
    +      switch(inspector.getCategory()) {
    +        case PRIMITIVE:
    +          {
    +            PrimitiveTypeInfo primitiveTypeInfo =
    +              (PrimitiveTypeInfo) ((PrimitiveObjectInspector)inspector).getTypeInfo();
    +            switch(primitiveTypeInfo.getPrimitiveCategory()) {
    +              case BOOLEAN:
    +              case BYTE:
    +              case SHORT:
    +              case INT:
    +              case LONG:
    +              case DATE:
    +              case INTERVAL_YEAR_MONTH:
    +                return new LongColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case FLOAT:
    +              case DOUBLE:
    +                return new DoubleColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case BINARY:
    +              case STRING:
    +              case CHAR:
    +              case VARCHAR:
    +                BytesColumnVector column = new BytesColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +                column.initBuffer();
    +                return column;
    +              case DECIMAL:
    +                DecimalTypeInfo tInfo = (DecimalTypeInfo) primitiveTypeInfo;
    +                return new DecimalColumnVector(VectorizedRowBatch.DEFAULT_SIZE,
    +                    tInfo.precision(), tInfo.scale());
    +              default:
    +                throw new RuntimeException("Vectorizaton is not supported for datatype:"
    +                    + primitiveTypeInfo.getPrimitiveCategory());
    +            }
    +          }
    +        default:
    +          throw new RuntimeException("Vectorization is not supported for datatype:"
    +              + inspector.getCategory());
    +      }
    +    }
    +
    +    /**
    +     * Walk through the object inspector and add column vectors
    +     *
    +     * @param oi StructObjectInspector
    +     * @param cvList ColumnVectors are populated in this list
    +     */
    +    private void allocateColumnVector(StructObjectInspector oi,
    +        List<ColumnVector> cvList) {
    +      if (cvList == null) {
    +        throw new RuntimeException("Null columnvector list");
    +      }
    +      if (oi == null) {
    +        return;
    +      }
    +      final List<? extends StructField> fields = oi.getAllStructFieldRefs();
    +      for(StructField field : fields) {
    +        ObjectInspector fieldObjectInspector = field.getFieldObjectInspector();
    +        cvList.add(createColumnVector(fieldObjectInspector));
    +      }
    +    }
    +
    +    /**
    +     * Create VectorizedRowBatch from ObjectInspector
    +     *
    +     * @param oi StructObjectInspector
    +     * @return VectorizedRowBatch
    +     */
    +    private VectorizedRowBatch constructVectorizedRowBatch(
    +        StructObjectInspector oi) {
    +      final List<ColumnVector> cvList = new LinkedList<ColumnVector>();
    +      allocateColumnVector(oi, cvList);
    +      final VectorizedRowBatch result = new VectorizedRowBatch(cvList.size());
    +      int i = 0;
    +      for(ColumnVector cv : cvList) {
    +        result.cols[i++] = cv;
    +      }
    +      return result;
    +    }
    +
    +    @Override
    +    public boolean next(NullWritable key, VectorizedRowBatch value) throws IOException {
    +      if (reader.hasNext()) {
    +        try {
    +          reader.nextBatch(value);
    +          if (value == null || value.endOfFile || value.size == 0) {
    +            return false;
    +          }
    +        } catch (Exception e) {
    +          throw new RuntimeException(e);
    +        }
    +        progress = reader.getProgress();
    +        return true;
    +      } else {
    --- End diff --
    
    nit: you can get rid of the else and simply `return false;` in the end
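
A simplified sketch of the suggested shape follows. It assumes the elided `else` branch only returns `false`, which the quoted diff does not show. `Batch`, `BatchSource`, `CountingSource`, and `BatchReaderSketch` are hypothetical stand-ins for the Hive types, and the original try/catch is omitted for brevity.

```java
// Hypothetical stand-in for VectorizedRowBatch.
class Batch {
    int size;
    boolean endOfFile;
}

// Hypothetical stand-in for the underlying ORC record reader.
interface BatchSource {
    boolean hasNext();
    void nextBatch(Batch batch);
    float getProgress();
}

// Toy source that yields a fixed number of non-empty batches.
class CountingSource implements BatchSource {
    private final int total;
    private int served = 0;

    CountingSource(int total) { this.total = total; }

    public boolean hasNext() { return served < total; }
    public void nextBatch(Batch batch) { batch.size = 3; served++; }
    public float getProgress() { return total == 0 ? 1.0f : (float) served / total; }
}

class BatchReaderSketch {
    private final BatchSource reader;
    private float progress = 0.0f;

    BatchReaderSketch(BatchSource reader) { this.reader = reader; }

    boolean next(Batch value) {
        if (reader.hasNext()) {
            reader.nextBatch(value);
            if (value.endOfFile || value.size == 0) {
                return false;
            }
            progress = reader.getProgress();
            return true;
        }
        // No else needed: every path through the if-block has already returned.
        return false;
    }

    float getProgress() { return progress; }
}
```

Because both exits inside the `if` return, the trailing `return false;` is reached only when `hasNext()` is false, so behavior is unchanged.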



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63498/
    Test PASSed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66424/
    Test FAILed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69156/
    Test PASSed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61075/
    Test PASSed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    For the benchmark, how about we just test the scan operation?



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #68990 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68990/consoleFull)** for PR 13775 at commit [`3895a98`](https://github.com/apache/spark/commit/3895a980a2aae2dc7dedbf0797bb8a37d089e683).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    This is still wrong, unfortunately: count(*) is going to prune all the columns ...




[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    @hvanhovell @rxin Got it. Thanks! I will re-run the benchmark.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83756504
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/SparkVectorizedOrcRecordReader.java ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    +import org.apache.hadoop.hive.serde2.typeinfo.DecimalTypeInfo;
    +import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.FileSplit;
    +import org.apache.hadoop.mapred.RecordReader;
    +
    +/**
    + * A mapred.RecordReader that returns VectorizedRowBatch.
    + */
    +public class SparkVectorizedOrcRecordReader
    +      implements RecordReader<NullWritable, VectorizedRowBatch> {
    +    private final org.apache.hadoop.hive.ql.io.orc.RecordReader reader;
    +    private final long offset;
    +    private final long length;
    +    private float progress = 0.0f;
    +    private ObjectInspector objectInspector;
    +
    +    SparkVectorizedOrcRecordReader(Reader file, Configuration conf,
    +        FileSplit fileSplit) throws IOException {
    +      this.offset = fileSplit.getStart();
    +      this.length = fileSplit.getLength();
    +      this.objectInspector = file.getObjectInspector();
    +      this.reader = OrcInputFormat.createReaderFromFile(file, conf, this.offset,
    +        this.length);
    +      this.progress = reader.getProgress();
    +    }
    +
    +    /**
    +     * Create a ColumnVector based on given ObjectInspector's type info.
    +     *
    +     * @param inspector ObjectInspector
    +     */
    +    private ColumnVector createColumnVector(ObjectInspector inspector) {
    +      switch(inspector.getCategory()) {
    +        case PRIMITIVE:
    +          {
    +            PrimitiveTypeInfo primitiveTypeInfo =
    +              (PrimitiveTypeInfo) ((PrimitiveObjectInspector)inspector).getTypeInfo();
    +            switch(primitiveTypeInfo.getPrimitiveCategory()) {
    +              case BOOLEAN:
    +              case BYTE:
    +              case SHORT:
    +              case INT:
    +              case LONG:
    +              case DATE:
    +              case INTERVAL_YEAR_MONTH:
    +                return new LongColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case FLOAT:
    +              case DOUBLE:
    +                return new DoubleColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case BINARY:
    +              case STRING:
    +              case CHAR:
    +              case VARCHAR:
    +                BytesColumnVector column = new BytesColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +                column.initBuffer();
    +                return column;
    +              case DECIMAL:
    +                DecimalTypeInfo tInfo = (DecimalTypeInfo) primitiveTypeInfo;
    +                return new DecimalColumnVector(VectorizedRowBatch.DEFAULT_SIZE,
    +                    tInfo.precision(), tInfo.scale());
    +              default:
    +                throw new RuntimeException("Vectorization is not supported for datatype:"
    +                    + primitiveTypeInfo.getPrimitiveCategory());
    +            }
    +          }
    +        default:
    +          throw new RuntimeException("Vectorization is not supported for datatype:"
    +              + inspector.getCategory());
    +      }
    +    }
    +
    +    /**
    +     * Walk through the object inspector and add column vectors
    +     *
    +     * @param oi StructObjectInspector
    +     * @param cvList ColumnVectors are populated in this list
    +     */
    +    private void allocateColumnVector(StructObjectInspector oi,
    +        List<ColumnVector> cvList) {
    +      if (cvList == null) {
    +        throw new RuntimeException("Null columnvector list");
    +      }
    +      if (oi == null) {
    +        return;
    +      }
    +      final List<? extends StructField> fields = oi.getAllStructFieldRefs();
    +      for(StructField field : fields) {
    +        ObjectInspector fieldObjectInspector = field.getFieldObjectInspector();
    +        cvList.add(createColumnVector(fieldObjectInspector));
    +      }
    +    }
    +
    +    /**
    +     * Create VectorizedRowBatch from ObjectInspector
    +     *
    +     * @param oi StructObjectInspector
    +     * @return VectorizedRowBatch
    +     */
    +    private VectorizedRowBatch constructVectorizedRowBatch(
    +        StructObjectInspector oi) {
    +      final List<ColumnVector> cvList = new LinkedList<ColumnVector>();
    --- End diff --
    
    Do you really need this list? It is used only to collect the column vectors, which are then added to the `VectorizedRowBatch` below. Why not populate the `VectorizedRowBatch` directly?
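
A minimal sketch of what populating the batch directly could look like, for illustration only. It does not use the real Hive classes: `SketchColumnVector` and `SketchRowBatch` are hypothetical stand-ins for `ColumnVector` and `VectorizedRowBatch`, and plain type-name strings stand in for the `ObjectInspector` fields, so that the example compiles on its own. The idea is to size the batch from the field count and fill `cols[]` in place, without the intermediate `LinkedList`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for Hive's ColumnVector; NOT the real class.
class SketchColumnVector {
    final String kind;
    SketchColumnVector(String kind) { this.kind = kind; }
}

// Hypothetical stand-in for Hive's VectorizedRowBatch; only the cols array is modeled.
class SketchRowBatch {
    final SketchColumnVector[] cols;
    SketchRowBatch(int numCols) { this.cols = new SketchColumnVector[numCols]; }
}

public class DirectBatchSketch {
    // Stand-in for createColumnVector(ObjectInspector): maps a type name to a vector kind.
    static SketchColumnVector createColumnVector(String typeName) {
        switch (typeName) {
            case "int":
            case "bigint":
                return new SketchColumnVector("long");
            case "double":
                return new SketchColumnVector("double");
            case "string":
                return new SketchColumnVector("bytes");
            default:
                throw new IllegalArgumentException(
                    "Vectorization is not supported for datatype: " + typeName);
        }
    }

    // Allocate the batch from the field count, then fill cols[] in place.
    // No intermediate list is built and no second copy loop is needed.
    static SketchRowBatch constructBatch(List<String> fieldTypes) {
        SketchRowBatch batch = new SketchRowBatch(fieldTypes.size());
        for (int i = 0; i < batch.cols.length; i++) {
            batch.cols[i] = createColumnVector(fieldTypes.get(i));
        }
        return batch;
    }

    public static void main(String[] args) {
        SketchRowBatch batch = constructBatch(Arrays.asList("int", "string", "bigint", "double"));
        for (SketchColumnVector cv : batch.cols) {
            System.out.println(cv.kind);
        }
    }
}
```

In the actual reader the field count would presumably come from `oi.getAllStructFieldRefs().size()`, which is an assumption about how the sketch maps back to the PR rather than something stated in this thread.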



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #69147 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69147/consoleFull)** for PR 13775 at commit [`160e924`](https://github.com/apache/spark/commit/160e92470136282ae3e94dc82ed41571a601017f).



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83753217
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/SparkVectorizedOrcRecordReader.java ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    +import org.apache.hadoop.hive.serde2.typeinfo.DecimalTypeInfo;
    +import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.FileSplit;
    +import org.apache.hadoop.mapred.RecordReader;
    +
    +/**
    + * A mapred.RecordReader that returns VectorizedRowBatch.
    + */
    +public class SparkVectorizedOrcRecordReader
    +      implements RecordReader<NullWritable, VectorizedRowBatch> {
    +    private final org.apache.hadoop.hive.ql.io.orc.RecordReader reader;
    +    private final long offset;
    +    private final long length;
    +    private float progress = 0.0f;
    +    private ObjectInspector objectInspector;
    +
    +    SparkVectorizedOrcRecordReader(Reader file, Configuration conf,
    --- End diff --
    
    nit: put each param on a separate line for readability. The same comment applies to other places in this PR.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    @viirya If possible, I'd like to benchmark this PR on my laptop.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    @dongjoon-hyun Sure. Do you need any help?



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    ping @hvanhovell @rxin @liancheng @yhuai Could you review this now? Thanks.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #66436 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66436/consoleFull)** for PR 13775 at commit [`ed780f6`](https://github.com/apache/spark/commit/ed780f66bf191eacdd2b81a2cfff4fbab71f1e4e).



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83753422
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/SparkVectorizedOrcRecordReader.java ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    +import org.apache.hadoop.hive.serde2.typeinfo.DecimalTypeInfo;
    +import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.FileSplit;
    +import org.apache.hadoop.mapred.RecordReader;
    +
    +/**
    + * A mapred.RecordReader that returns VectorizedRowBatch.
    + */
    +public class SparkVectorizedOrcRecordReader
    +      implements RecordReader<NullWritable, VectorizedRowBatch> {
    +    private final org.apache.hadoop.hive.ql.io.orc.RecordReader reader;
    +    private final long offset;
    +    private final long length;
    +    private float progress = 0.0f;
    +    private ObjectInspector objectInspector;
    +
    +    SparkVectorizedOrcRecordReader(Reader file, Configuration conf,
    +        FileSplit fileSplit) throws IOException {
    +      this.offset = fileSplit.getStart();
    +      this.length = fileSplit.getLength();
    +      this.objectInspector = file.getObjectInspector();
    +      this.reader = OrcInputFormat.createReaderFromFile(file, conf, this.offset,
    +        this.length);
    +      this.progress = reader.getProgress();
    +    }
    +
    +    /**
    +     * Create a ColumnVector based on given ObjectInspector's type info.
    +     *
    +     * @param inspector ObjectInspector
    +     */
    +    private ColumnVector createColumnVector(ObjectInspector inspector) {
    +      switch(inspector.getCategory()) {
    +        case PRIMITIVE:
    +          {
    +            PrimitiveTypeInfo primitiveTypeInfo =
    +              (PrimitiveTypeInfo) ((PrimitiveObjectInspector)inspector).getTypeInfo();
    +            switch(primitiveTypeInfo.getPrimitiveCategory()) {
    +              case BOOLEAN:
    +              case BYTE:
    +              case SHORT:
    +              case INT:
    +              case LONG:
    +              case DATE:
    +              case INTERVAL_YEAR_MONTH:
    +                return new LongColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case FLOAT:
    +              case DOUBLE:
    +                return new DoubleColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case BINARY:
    +              case STRING:
    +              case CHAR:
    +              case VARCHAR:
    +                BytesColumnVector column = new BytesColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +                column.initBuffer();
    +                return column;
    +              case DECIMAL:
    +                DecimalTypeInfo tInfo = (DecimalTypeInfo) primitiveTypeInfo;
    +                return new DecimalColumnVector(VectorizedRowBatch.DEFAULT_SIZE,
    +                    tInfo.precision(), tInfo.scale());
    +              default:
    +                throw new RuntimeException("Vectorization is not supported for datatype:"
    +                    + primitiveTypeInfo.getPrimitiveCategory());
    +            }
    +          }
    +        default:
    +          throw new RuntimeException("Vectorization is not supported for datatype:"
    +              + inspector.getCategory());
    +      }
    +    }
    +
    +    /**
    +     * Walk through the object inspector and add column vectors
    +     *
    +     * @param oi StructObjectInspector
    +     * @param cvList ColumnVectors are populated in this list
    +     */
    +    private void allocateColumnVector(StructObjectInspector oi,
    +        List<ColumnVector> cvList) {
    +      if (cvList == null) {
    +        throw new RuntimeException("Null columnvector list");
    +      }
    +      if (oi == null) {
    +        return;
    --- End diff --
    
    curious: why would the object inspector be `null`?



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by zjffdu <gi...@git.apache.org>.

Github user zjffdu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r68016908
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/SparkVectorizedOrcRecordReader.java ---
    @@ -0,0 +1,191 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    +import org.apache.hadoop.hive.serde2.typeinfo.DecimalTypeInfo;
    +import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.FileSplit;
    +import org.apache.hadoop.mapred.RecordReader;
    +
    +/**
    + * A mapred.RecordReader that returns VectorizedRowBatch.
    + */
    +public class SparkVectorizedOrcRecordReader
    +      implements RecordReader<NullWritable, VectorizedRowBatch> {
    +    private final org.apache.hadoop.hive.ql.io.orc.RecordReader reader;
    +    private final long offset;
    +    private final long length;
    +    private float progress = 0.0f;
    +    private ObjectInspector objectInspector;
    +
    +    SparkVectorizedOrcRecordReader(Reader file, Configuration conf,
    +        FileSplit fileSplit) throws IOException {
    +      this.offset = fileSplit.getStart();
    +      this.length = fileSplit.getLength();
    +      this.objectInspector = file.getObjectInspector();
    +      this.reader = OrcInputFormat.createReaderFromFile(file, conf, this.offset,
    +        this.length);
    +      this.progress = reader.getProgress();
    +    }
    +
    +    /**
    +     * Create a ColumnVector based on given ObjectInspector's type info.
    +     *
    +     * @param inspector ObjectInspector
    +     */
    +    private ColumnVector createColumnVector(ObjectInspector inspector) {
    +      switch(inspector.getCategory()) {
    +      case PRIMITIVE:
    +        {
    +          PrimitiveTypeInfo primitiveTypeInfo =
    +            (PrimitiveTypeInfo) ((PrimitiveObjectInspector)inspector).getTypeInfo();
    +          switch(primitiveTypeInfo.getPrimitiveCategory()) {
    +            case BOOLEAN:
    +            case BYTE:
    +            case SHORT:
    +            case INT:
    +            case LONG:
    +            case DATE:
    +            case INTERVAL_YEAR_MONTH:
    +              return new LongColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +            case FLOAT:
    +            case DOUBLE:
    +              return new DoubleColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +            case BINARY:
    +            case STRING:
    +            case CHAR:
    +            case VARCHAR:
    +              BytesColumnVector column = new BytesColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              column.initBuffer();
    +              return column;
    +            case DECIMAL:
    +              DecimalTypeInfo tInfo = (DecimalTypeInfo) primitiveTypeInfo;
    +              return new DecimalColumnVector(VectorizedRowBatch.DEFAULT_SIZE,
    +                  tInfo.precision(), tInfo.scale());
    +            default:
    +              throw new RuntimeException("Vectorization is not supported for datatype:"
    +                  + primitiveTypeInfo.getPrimitiveCategory());
    +          }
    +        }
    +      default:
    +        throw new RuntimeException("Vectorization is not supported for datatype:"
    +            + inspector.getCategory());
    +      }
    +    }
    +
    +    /**
    +     * Walk through the object inspector and add column vectors
    +     *
    +     * @param oi StructObjectInspector
    +     * @param cvList ColumnVectors are populated in this list
    +     */
    +    private void allocateColumnVector(StructObjectInspector oi,
    +        List<ColumnVector> cvList) throws HiveException {
    +      if (cvList == null) {
    +        throw new HiveException("Null columnvector list");
    +      }
    +      if (oi == null) {
    +        return;
    +      }
    +      final List<? extends StructField> fields = oi.getAllStructFieldRefs();
    +      for(StructField field : fields) {
    +        ObjectInspector fieldObjectInspector = field.getFieldObjectInspector();
    +        cvList.add(createColumnVector(fieldObjectInspector));
    +      }
    +    }
    +
    +    /**
    +     * Create VectorizedRowBatch from ObjectInspector
    +     *
    +     * @param oi
    +     * @return
    +     * @throws HiveException
    +     */
    +    private VectorizedRowBatch constructVectorizedRowBatch(
    +        StructObjectInspector oi) throws HiveException {
    +      final List<ColumnVector> cvList = new LinkedList<ColumnVector>();
    +      allocateColumnVector(oi, cvList);
    +      final VectorizedRowBatch result = new VectorizedRowBatch(cvList.size());
    +      int i = 0;
    +      for(ColumnVector cv : cvList) {
    +        result.cols[i++] = cv;
    +      }
    +      return result;
    +    }
    +
    +    @Override
    +    public boolean next(NullWritable key, VectorizedRowBatch value) throws IOException {
    +      try {
    +        reader.nextBatch(value);
    +        if (value == null || value.endOfFile || value.size == 0) {
    +          return false;
    +        }
    +      } catch (Exception e) {
    +        throw new RuntimeException(e);
    +      }
    +      progress = reader.getProgress();
    +      return true;
    +    }
    +
    +    @Override
    +    public NullWritable createKey() {
    +      return NullWritable.get();
    +    }
    +
    +    @Override
    +    public VectorizedRowBatch createValue() {
    +      try {
    +        return constructVectorizedRowBatch((StructObjectInspector)this.objectInspector);
    +      } catch (HiveException e) {
    +      }
    --- End diff --
    
    Should we log the error instead of ignoring it?
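    
    A minimal sketch of what that could look like, assuming `java.util.logging` and hypothetical stand-ins (`BatchAllocationException` for `HiveException`, `ValueFactory` for the reader, an `int[]` for the row batch). Rethrowing as an unchecked exception is an extra assumption beyond the comment above; returning a fallback after logging would be the other option.
    
    ```java
    import java.util.logging.Level;
    import java.util.logging.Logger;
    
    // Hypothetical stand-in for HiveException so the sketch compiles on its own.
    class BatchAllocationException extends Exception {
      BatchAllocationException(String message) { super(message); }
    }
    
    // Hypothetical reader-like class; only the error handling is of interest here.
    class ValueFactory {
      private static final Logger LOG = Logger.getLogger(ValueFactory.class.getName());
      private final boolean failAllocation;
    
      ValueFactory(boolean failAllocation) { this.failAllocation = failAllocation; }
    
      // Stand-in for constructVectorizedRowBatch(...), which throws a checked exception.
      private int[] constructBatch() throws BatchAllocationException {
        if (failAllocation) {
          throw new BatchAllocationException("Null columnvector list");
        }
        return new int[1024];
      }
    
      // createValue() cannot declare the checked exception, so the failure is
      // logged and then rethrown unchecked instead of being silently dropped.
      public int[] createValue() {
        try {
          return constructBatch();
        } catch (BatchAllocationException e) {
          LOG.log(Level.SEVERE, "Failed to create the row batch", e);
          throw new IllegalStateException("Failed to create the row batch", e);
        }
      }
    }
    ```
    
    With this shape the caller never receives a silently missing batch; the original cause stays attached to the rethrown exception.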


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya closed the pull request at:

    https://github.com/apache/spark/pull/13775



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    also cc @liancheng @yhuai 



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    @rxin Thanks for looking at this!
    
    1. My initial investigation when implementing this PR looked in this direction too. Because this PR has been waiting for a while, all I remember is that converting the Hive vector format to a column batch is not trivial. I will look at it again on a second try.
    
    2. Yeah, I will try to add more tests.



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83760016
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/VectorizedSparkOrcNewRecordReader.java ---
    @@ -0,0 +1,318 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.nio.charset.StandardCharsets;
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import org.apache.commons.lang.NotImplementedException;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    +
    +import org.apache.spark.sql.catalyst.InternalRow;
    +import org.apache.spark.sql.catalyst.util.ArrayData;
    +import org.apache.spark.sql.catalyst.util.MapData;
    +import org.apache.spark.sql.types.DataType;
    +import org.apache.spark.sql.types.Decimal;
    +import org.apache.spark.unsafe.types.CalendarInterval;
    +import org.apache.spark.unsafe.types.UTF8String;
    +
    +/**
    + * A RecordReader that returns InternalRow for Spark SQL execution.
    + * This reader uses an internal reader that returns Hive's VectorizedRowBatch. An adapter
    + * class is used to return internal row by directly accessing data in column vectors.
    + */
    +public class VectorizedSparkOrcNewRecordReader
    +    extends org.apache.hadoop.mapreduce.RecordReader<NullWritable, InternalRow> {
    +  private final org.apache.hadoop.mapred.RecordReader<NullWritable, VectorizedRowBatch> reader;
    +  private final int numColumns;
    +  private VectorizedRowBatch internalValue;
    +  private float progress = 0.0f;
    +  private List<Integer> columnIDs;
    +
    +  private long numRowsOfBatch = 0;
    +  private int indexOfRow = 0;
    +
    +  private final Row row;
    +
    +  public VectorizedSparkOrcNewRecordReader(
    +      Reader file,
    +      JobConf conf,
    +      FileSplit fileSplit,
    +      List<Integer> columnIDs) throws IOException {
    +    List<OrcProto.Type> types = file.getTypes();
    +    numColumns = (types.size() == 0) ? 0 : types.get(0).getSubtypesCount();
    +    this.reader = new SparkVectorizedOrcRecordReader(file, conf,
    +      new org.apache.hadoop.mapred.FileSplit(fileSplit));
    +
    +    this.columnIDs = new ArrayList<>(columnIDs);
    +    this.internalValue = this.reader.createValue();
    +    this.progress = reader.getProgress();
    +    this.row = new Row(this.internalValue.cols, this.columnIDs);
    +  }
    +
    +  @Override
    +  public void close() throws IOException {
    +    reader.close();
    +  }
    +
    +  @Override
    +  public NullWritable getCurrentKey() throws IOException,
    +      InterruptedException {
    +    return NullWritable.get();
    +  }
    +
    +  @Override
    +  public InternalRow getCurrentValue() throws IOException,
    +      InterruptedException {
    +    if (indexOfRow >= numRowsOfBatch) {
    +      return null;
    +    }
    +    row.rowId = indexOfRow;
    +    indexOfRow++;
    +
    +    return row;
    +  }
    +
    +  @Override
    +  public float getProgress() throws IOException, InterruptedException {
    +    return progress;
    +  }
    +
    +  @Override
    +  public void initialize(InputSplit split, TaskAttemptContext context)
    +      throws IOException, InterruptedException {
    +  }
    +
    +  @Override
    +  public boolean nextKeyValue() throws IOException, InterruptedException {
    +    if (indexOfRow == numRowsOfBatch) {
    +      if (reader.next(NullWritable.get(), internalValue)) {
    +        if (internalValue.endOfFile) {
    +          progress = 1.0f;
    +          numRowsOfBatch = 0;
    +          indexOfRow = 0;
    +          return false;
    +        } else {
    +          assert internalValue.numCols == numColumns : "Incorrect number of columns in OrcBatch";
    +          numRowsOfBatch = internalValue.count();
    +          indexOfRow = 0;
    +          progress = reader.getProgress();
    +        }
    +        return true;
    +      } else {
    +        return false;
    +      }
    +    } else {
    +      if (indexOfRow < numRowsOfBatch) {
    +        return true;
    +      } else {
    +        return false;
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Adapter class to return an internal row.
    +   */
    +  public static final class Row extends InternalRow {
    +    protected int rowId;
    +    private List<Integer> columnIDs;
    +    private final ColumnVector[] columns;
    +
    +    private Row(ColumnVector[] columns, List<Integer> columnIDs) {
    +      this.columns = columns;
    +      this.columnIDs = columnIDs;
    +    }
    +
    +    @Override
    +    public int numFields() { return columnIDs.size(); }
    +
    +    @Override
    +    public boolean anyNull() {
    +      for (int i = 0; i < columns.length; i++) {
    +        if (columnIDs.contains(i)) {
    +          if (columns[i].isRepeating && columns[i].isNull[0]) {
    +            return true;
    +          } else if (!columns[i].isRepeating && columns[i].isNull[rowId]) {
    +            return true;
    +          }
    +        }
    +      }
    +      return false;
    +    }
    +
    +    @Override
    +    public boolean isNullAt(int ordinal) {
    +      ColumnVector col = columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    --- End diff --
    
    I see this pattern here and below: the operation is inherently the same, and only the index changes between the if-else branches. You could separate the operation from the index lookup:
    
    ```
    int index = col.isRepeating ? 0 : rowId;
    return col.isNull[index];
    ```
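    
    A self-contained sketch of the same idea applied to more than one accessor. `SimpleColumn` is a hypothetical, simplified stand-in for Hive's `ColumnVector` (it keeps the `isRepeating` and `isNull` fields used in the diff and adds an assumed `values` array), and `RowAccess` is an illustrative helper name, not something from the PR.
    
    ```java
    // Hypothetical, simplified stand-in for a Hive column vector.
    class SimpleColumn {
      final boolean isRepeating;
      final boolean[] isNull;
      final long[] values;
    
      SimpleColumn(boolean isRepeating, boolean[] isNull, long[] values) {
        this.isRepeating = isRepeating;
        this.isNull = isNull;
        this.values = values;
      }
    }
    
    // Illustrative helper: the index lookup lives in one place.
    final class RowAccess {
      private RowAccess() {}
    
      // A repeating vector stores its single value (and null flag) in slot 0.
      static int indexOf(SimpleColumn col, int rowId) {
        return col.isRepeating ? 0 : rowId;
      }
    
      static boolean isNullAt(SimpleColumn col, int rowId) {
        return col.isNull[indexOf(col, rowId)];
      }
    
      static long getLong(SimpleColumn col, int rowId) {
        return col.values[indexOf(col, rowId)];
      }
    }
    ```
    
    Each accessor then reduces to a single array read, and the repeating case cannot drift out of sync between methods.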



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by dafrista <gi...@git.apache.org>.

Github user dafrista commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r74356405
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/VectorizedSparkOrcNewRecordReader.java ---
    @@ -0,0 +1,318 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.nio.charset.StandardCharsets;
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import org.apache.commons.lang.NotImplementedException;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    +
    +import org.apache.spark.sql.catalyst.InternalRow;
    +import org.apache.spark.sql.catalyst.util.ArrayData;
    +import org.apache.spark.sql.catalyst.util.MapData;
    +import org.apache.spark.sql.types.DataType;
    +import org.apache.spark.sql.types.Decimal;
    +import org.apache.spark.unsafe.types.CalendarInterval;
    +import org.apache.spark.unsafe.types.UTF8String;
    +
    +/**
    + * A RecordReader that returns InternalRow for Spark SQL execution.
    + * This reader uses an internal reader that returns Hive's VectorizedRowBatch. An adapter
    + * class is used to return internal row by directly accessing data in column vectors.
    + */
    +public class VectorizedSparkOrcNewRecordReader
    +    extends org.apache.hadoop.mapreduce.RecordReader<NullWritable, InternalRow> {
    +  private final org.apache.hadoop.mapred.RecordReader<NullWritable, VectorizedRowBatch> reader;
    +  private final int numColumns;
    +  private VectorizedRowBatch internalValue;
    +  private float progress = 0.0f;
    +  private List<Integer> columnIDs;
    +
    +  private long numRowsOfBatch = 0;
    +  private int indexOfRow = 0;
    +
    +  private final Row row;
    +
    +  public VectorizedSparkOrcNewRecordReader(
    +      Reader file,
    +      JobConf conf,
    +      FileSplit fileSplit,
    +      List<Integer> columnIDs) throws IOException {
    +    List<OrcProto.Type> types = file.getTypes();
    +    numColumns = (types.size() == 0) ? 0 : types.get(0).getSubtypesCount();
    +    this.reader = new SparkVectorizedOrcRecordReader(file, conf,
    +      new org.apache.hadoop.mapred.FileSplit(fileSplit));
    +
    +    this.columnIDs = new ArrayList<>(columnIDs);
    +    this.internalValue = this.reader.createValue();
    +    this.progress = reader.getProgress();
    +    this.row = new Row(this.internalValue.cols, this.columnIDs);
    +  }
    +
    +  @Override
    +  public void close() throws IOException {
    +    reader.close();
    +  }
    +
    +  @Override
    +  public NullWritable getCurrentKey() throws IOException,
    +      InterruptedException {
    +    return NullWritable.get();
    +  }
    +
    +  @Override
    +  public InternalRow getCurrentValue() throws IOException,
    +      InterruptedException {
    +    if (indexOfRow >= numRowsOfBatch) {
    +      return null;
    +    }
    +    row.rowId = indexOfRow;
    +    indexOfRow++;
    +
    +    return row;
    +  }
    +
    +  @Override
    +  public float getProgress() throws IOException, InterruptedException {
    +    return progress;
    +  }
    +
    +  @Override
    +  public void initialize(InputSplit split, TaskAttemptContext context)
    +      throws IOException, InterruptedException {
    +  }
    +
    +  @Override
    +  public boolean nextKeyValue() throws IOException, InterruptedException {
    +    if (indexOfRow == numRowsOfBatch && progress < 1.0f) {
    --- End diff --
    
    We can't rely on `progress` to indicate the last row -- `< 1.0f` can evaluate to true even before the last row because of imprecision. In any case, this part of the check is redundant: `reader.next` on the next line will return false when no more rows remain.
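    
    A rough sketch of the control flow without the `progress` check, using hypothetical stand-ins (`BatchSource` for the underlying batch reader, `RowCursor` for the record reader). The `while` loop that skips empty batches is an added assumption, not something taken from the PR.
    
    ```java
    import java.util.Iterator;
    import java.util.List;
    
    // Hypothetical batch reader: next() reports the row count of the next batch.
    class BatchSource {
      private final Iterator<Integer> sizes;
    
      BatchSource(List<Integer> batchSizes) { this.sizes = batchSizes.iterator(); }
    
      // Returns false once no batches remain; otherwise stores the batch size.
      boolean next(int[] sizeOut) {
        if (!sizes.hasNext()) {
          return false;
        }
        sizeOut[0] = sizes.next();
        return true;
      }
    }
    
    // Hypothetical row-level cursor over the batches.
    class RowCursor {
      private final BatchSource source;
      private final int[] batchSize = new int[1];
      private int numRowsOfBatch = 0;
      private int indexOfRow = 0;
    
      RowCursor(BatchSource source) { this.source = source; }
    
      // End of input is decided only by the reader's return value,
      // never by a floating-point progress estimate.
      boolean nextKeyValue() {
        while (indexOfRow == numRowsOfBatch) {
          if (!source.next(batchSize)) {
            return false;
          }
          numRowsOfBatch = batchSize[0];
          indexOfRow = 0;
        }
        return true;
      }
    
      // Mirrors getCurrentValue(): hands out the current row and advances.
      int currentRow() { return indexOfRow++; }
    }
    ```
    
    Progress can still be refreshed after each batch for reporting, but it no longer takes part in deciding when to stop.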



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69149/
    Test FAILed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #69147 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69147/consoleFull)** for PR 13775 at commit [`160e924`](https://github.com/apache/spark/commit/160e92470136282ae3e94dc82ed41571a601017f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r74369496
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/VectorizedSparkOrcNewRecordReader.java ---
    @@ -0,0 +1,318 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.nio.charset.StandardCharsets;
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import org.apache.commons.lang.NotImplementedException;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    +
    +import org.apache.spark.sql.catalyst.InternalRow;
    +import org.apache.spark.sql.catalyst.util.ArrayData;
    +import org.apache.spark.sql.catalyst.util.MapData;
    +import org.apache.spark.sql.types.DataType;
    +import org.apache.spark.sql.types.Decimal;
    +import org.apache.spark.unsafe.types.CalendarInterval;
    +import org.apache.spark.unsafe.types.UTF8String;
    +
    +/**
    + * A RecordReader that returns InternalRow for Spark SQL execution.
    + * This reader uses an internal reader that returns Hive's VectorizedRowBatch. An adapter
    + * class is used to return internal row by directly accessing data in column vectors.
    + */
    +public class VectorizedSparkOrcNewRecordReader
    +    extends org.apache.hadoop.mapreduce.RecordReader<NullWritable, InternalRow> {
    +  private final org.apache.hadoop.mapred.RecordReader<NullWritable, VectorizedRowBatch> reader;
    +  private final int numColumns;
    +  private VectorizedRowBatch internalValue;
    +  private float progress = 0.0f;
    +  private List<Integer> columnIDs;
    +
    +  private long numRowsOfBatch = 0;
    +  private int indexOfRow = 0;
    +
    +  private final Row row;
    +
    +  public VectorizedSparkOrcNewRecordReader(
    +      Reader file,
    +      JobConf conf,
    +      FileSplit fileSplit,
    +      List<Integer> columnIDs) throws IOException {
    +    List<OrcProto.Type> types = file.getTypes();
    +    numColumns = (types.size() == 0) ? 0 : types.get(0).getSubtypesCount();
    +    this.reader = new SparkVectorizedOrcRecordReader(file, conf,
    +      new org.apache.hadoop.mapred.FileSplit(fileSplit));
    +
    +    this.columnIDs = new ArrayList<>(columnIDs);
    +    this.internalValue = this.reader.createValue();
    +    this.progress = reader.getProgress();
    +    this.row = new Row(this.internalValue.cols, this.columnIDs);
    +  }
    +
    +  @Override
    +  public void close() throws IOException {
    +    reader.close();
    +  }
    +
    +  @Override
    +  public NullWritable getCurrentKey() throws IOException,
    +      InterruptedException {
    +    return NullWritable.get();
    +  }
    +
    +  @Override
    +  public InternalRow getCurrentValue() throws IOException,
    +      InterruptedException {
    +    if (indexOfRow >= numRowsOfBatch) {
    +      return null;
    +    }
    +    row.rowId = indexOfRow;
    +    indexOfRow++;
    +
    +    return row;
    +  }
    +
    +  @Override
    +  public float getProgress() throws IOException, InterruptedException {
    +    return progress;
    +  }
    +
    +  @Override
    +  public void initialize(InputSplit split, TaskAttemptContext context)
    +      throws IOException, InterruptedException {
    +  }
    +
    +  @Override
    +  public boolean nextKeyValue() throws IOException, InterruptedException {
    +    if (indexOfRow == numRowsOfBatch && progress < 1.0f) {
    --- End diff --
    
    Oh, we can use `reader.hasNext()`.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    @viirya Have you tried the ORC reader in Presto? I guess that one is more efficient than the one in Hive, but I'm not sure how hard it will be to make it work with Spark SQL (column batch).



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83756988
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/SparkVectorizedOrcRecordReader.java ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    +import org.apache.hadoop.hive.serde2.typeinfo.DecimalTypeInfo;
    +import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.FileSplit;
    +import org.apache.hadoop.mapred.RecordReader;
    +
    +/**
    + * A mapred.RecordReader that returns VectorizedRowBatch.
    + */
    +public class SparkVectorizedOrcRecordReader
    +      implements RecordReader<NullWritable, VectorizedRowBatch> {
    +    private final org.apache.hadoop.hive.ql.io.orc.RecordReader reader;
    +    private final long offset;
    +    private final long length;
    +    private float progress = 0.0f;
    +    private ObjectInspector objectInspector;
    +
    +    SparkVectorizedOrcRecordReader(Reader file, Configuration conf,
    +        FileSplit fileSplit) throws IOException {
    +      this.offset = fileSplit.getStart();
    +      this.length = fileSplit.getLength();
    +      this.objectInspector = file.getObjectInspector();
    +      this.reader = OrcInputFormat.createReaderFromFile(file, conf, this.offset,
    +        this.length);
    +      this.progress = reader.getProgress();
    +    }
    +
    +    /**
    +     * Create a ColumnVector based on given ObjectInspector's type info.
    +     *
    +     * @param inspector ObjectInspector
    +     */
    +    private ColumnVector createColumnVector(ObjectInspector inspector) {
    +      switch(inspector.getCategory()) {
    +        case PRIMITIVE:
    +          {
    +            PrimitiveTypeInfo primitiveTypeInfo =
    +              (PrimitiveTypeInfo) ((PrimitiveObjectInspector)inspector).getTypeInfo();
    +            switch(primitiveTypeInfo.getPrimitiveCategory()) {
    +              case BOOLEAN:
    +              case BYTE:
    +              case SHORT:
    +              case INT:
    +              case LONG:
    +              case DATE:
    +              case INTERVAL_YEAR_MONTH:
    +                return new LongColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case FLOAT:
    +              case DOUBLE:
    +                return new DoubleColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case BINARY:
    +              case STRING:
    +              case CHAR:
    +              case VARCHAR:
    +                BytesColumnVector column = new BytesColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +                column.initBuffer();
    +                return column;
    +              case DECIMAL:
    +                DecimalTypeInfo tInfo = (DecimalTypeInfo) primitiveTypeInfo;
    +                return new DecimalColumnVector(VectorizedRowBatch.DEFAULT_SIZE,
    +                    tInfo.precision(), tInfo.scale());
    +              default:
    +                throw new RuntimeException("Vectorization is not supported for datatype:"
    +                    + primitiveTypeInfo.getPrimitiveCategory());
    +            }
    +          }
    +        default:
    +          throw new RuntimeException("Vectorization is not supported for datatype:"
    +              + inspector.getCategory());
    +      }
    +    }
    +
    +    /**
    +     * Walk through the object inspector and add column vectors
    +     *
    +     * @param oi StructObjectInspector
    +     * @param cvList ColumnVectors are populated in this list
    +     */
    +    private void allocateColumnVector(StructObjectInspector oi,
    --- End diff --
    
    The method name is misleading: the list has already been allocated in the caller. I would either:
    
    - move the list creation into the method instead of the caller, OR
    - rename the method to `populateColumnVector()`
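    
    The two options can be sketched roughly as follows. This is a minimal illustration, not code from the PR: `ColumnVectorNamingSketch`, `FieldKind`, and `SketchVector` are hypothetical stand-ins for Hive's `StructField` and `ColumnVector`, used only so the example compiles on its own.
    
    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    
    // Hypothetical stand-ins for Hive's StructField/ColumnVector (assumptions for this sketch).
    class ColumnVectorNamingSketch {
      enum FieldKind { LONG, DOUBLE, BYTES }
    
      static final class SketchVector {
        final FieldKind kind;
        SketchVector(FieldKind kind) { this.kind = kind; }
      }
    
      // Option 1: the method creates and returns the list, so "allocate" is accurate.
      static List<SketchVector> allocateColumnVectors(List<FieldKind> fields) {
        List<SketchVector> vectors = new ArrayList<>();
        if (fields != null) {
          for (FieldKind field : fields) {
            vectors.add(new SketchVector(field));
          }
        }
        return vectors;
      }
    
      // Option 2: the caller owns the list; the name says the method only fills it.
      static void populateColumnVectors(List<FieldKind> fields, List<SketchVector> vectors) {
        if (vectors == null) {
          throw new IllegalArgumentException("Null column vector list");
        }
        if (fields == null) {
          return;
        }
        for (FieldKind field : fields) {
          vectors.add(new SketchVector(field));
        }
      }
    
      public static void main(String[] args) {
        List<FieldKind> fields = Arrays.asList(FieldKind.LONG, FieldKind.BYTES);
        List<SketchVector> allocated = allocateColumnVectors(fields);
        List<SketchVector> populated = new ArrayList<>();
        populateColumnVectors(fields, populated);
        System.out.println(allocated.size() + " " + populated.size());
      }
    }
    ```
    
    Option 1 also removes the need for the null check on the list, since the method controls its creation.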


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Hmm. It seems the `Merge remote-tracking branch` commit confuses rebasing. Let me think about how to compare this.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    ping @rxin @yhuai @liancheng @hvanhovell @cloud-fan Can you take a look? This has been waiting for review for a while. Thanks!



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #66424 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66424/consoleFull)** for PR 13775 at commit [`ed780f6`](https://github.com/apache/spark/commit/ed780f66bf191eacdd2b81a2cfff4fbab71f1e4e).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69133/
    Test PASSed.



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83752360
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/SparkVectorizedOrcRecordReader.java ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    +import org.apache.hadoop.hive.serde2.typeinfo.DecimalTypeInfo;
    +import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.FileSplit;
    +import org.apache.hadoop.mapred.RecordReader;
    +
    +/**
    + * A mapred.RecordReader that returns VectorizedRowBatch.
    + */
    +public class SparkVectorizedOrcRecordReader
    +      implements RecordReader<NullWritable, VectorizedRowBatch> {
    +    private final org.apache.hadoop.hive.ql.io.orc.RecordReader reader;
    +    private final long offset;
    +    private final long length;
    +    private float progress = 0.0f;
    +    private ObjectInspector objectInspector;
    +
    +    SparkVectorizedOrcRecordReader(Reader file, Configuration conf,
    +        FileSplit fileSplit) throws IOException {
    +      this.offset = fileSplit.getStart();
    +      this.length = fileSplit.getLength();
    +      this.objectInspector = file.getObjectInspector();
    +      this.reader = OrcInputFormat.createReaderFromFile(file, conf, this.offset,
    +        this.length);
    +      this.progress = reader.getProgress();
    +    }
    +
    +    /**
    +     * Create a ColumnVector based on given ObjectInspector's type info.
    +     *
    +     * @param inspector ObjectInspector
    +     */
    +    private ColumnVector createColumnVector(ObjectInspector inspector) {
    +      switch(inspector.getCategory()) {
    +        case PRIMITIVE:
    +          {
    +            PrimitiveTypeInfo primitiveTypeInfo =
    +              (PrimitiveTypeInfo) ((PrimitiveObjectInspector)inspector).getTypeInfo();
    +            switch(primitiveTypeInfo.getPrimitiveCategory()) {
    +              case BOOLEAN:
    +              case BYTE:
    +              case SHORT:
    +              case INT:
    +              case LONG:
    +              case DATE:
    +              case INTERVAL_YEAR_MONTH:
    +                return new LongColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case FLOAT:
    +              case DOUBLE:
    +                return new DoubleColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case BINARY:
    +              case STRING:
    +              case CHAR:
    +              case VARCHAR:
    +                BytesColumnVector column = new BytesColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +                column.initBuffer();
    +                return column;
    +              case DECIMAL:
    +                DecimalTypeInfo tInfo = (DecimalTypeInfo) primitiveTypeInfo;
    +                return new DecimalColumnVector(VectorizedRowBatch.DEFAULT_SIZE,
    +                    tInfo.precision(), tInfo.scale());
    +              default:
    +                throw new RuntimeException("Vectorization is not supported for datatype:"
    +                    + primitiveTypeInfo.getPrimitiveCategory());
    +            }
    +          }
    +        default:
    +          throw new RuntimeException("Vectorization is not supported for datatype:"
    +              + inspector.getCategory());
    +      }
    +    }
    +
    +    /**
    +     * Walk through the object inspector and add column vectors
    +     *
    +     * @param oi StructObjectInspector
    +     * @param cvList ColumnVectors are populated in this list
    +     */
    +    private void allocateColumnVector(StructObjectInspector oi,
    +        List<ColumnVector> cvList) {
    +      if (cvList == null) {
    +        throw new RuntimeException("Null columnvector list");
    +      }
    +      if (oi == null) {
    +        return;
    +      }
    +      final List<? extends StructField> fields = oi.getAllStructFieldRefs();
    +      for(StructField field : fields) {
    --- End diff --
    
    nit: add a space after `for`. The same comment applies to other places in this PR.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #61075 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61075/consoleFull)** for PR 13775 at commit [`855bcfd`](https://github.com/apache/spark/commit/855bcfde2067af4bd88d95a6365f976ecf891de9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #61075 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61075/consoleFull)** for PR 13775 at commit [`855bcfd`](https://github.com/apache/spark/commit/855bcfde2067af4bd88d95a6365f976ecf891de9).



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    ping @yhuai Any chance you can review this? Thank you.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68990/
    Test FAILed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61384/
    Test PASSed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #60827 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60827/consoleFull)** for PR 13775 at commit [`7e7bb6c`](https://github.com/apache/spark/commit/7e7bb6c57860187f391f66ca82cdd715d0b2be43).
     * This patch **fails RAT tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #68995 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68995/consoleFull)** for PR 13775 at commit [`c24169d`](https://github.com/apache/spark/commit/c24169d513c53eb9887f53749a4a6a4e51351667).



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Otherwise, may I implement it this way in my PR, following viirya's approach?



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69080/
    Test FAILed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    @rxin oh, right...I will update this later. Thanks!



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #61088 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61088/consoleFull)** for PR 13775 at commit [`66ab632`](https://github.com/apache/spark/commit/66ab632274674ae5b38c84bac8801feab3c9d2e0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69082/
    Test PASSed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #60827 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60827/consoleFull)** for PR 13775 at commit [`7e7bb6c`](https://github.com/apache/spark/commit/7e7bb6c57860187f391f66ca82cdd715d0b2be43).



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #60828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60828/consoleFull)** for PR 13775 at commit [`20b832e`](https://github.com/apache/spark/commit/20b832ee4e5ed4e794cc1bc8f2f67cce973759e0).



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    @yhuai I ran another benchmark for the scan operation. The results are updated in the PR description.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    @rxin @davies @hvanhovell @yhuai @tejasapatil @zjffdu @dafrista I've addressed the review comments and added tests. Please help review this if you can. Thanks.



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r74183317
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/VectorizedSparkOrcNewRecordReader.java ---
    @@ -0,0 +1,317 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.nio.charset.StandardCharsets;
    +import java.util.List;
    +
    +import org.apache.commons.lang.NotImplementedException;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    +
    +import org.apache.spark.sql.catalyst.InternalRow;
    +import org.apache.spark.sql.catalyst.util.ArrayData;
    +import org.apache.spark.sql.catalyst.util.MapData;
    +import org.apache.spark.sql.types.DataType;
    +import org.apache.spark.sql.types.Decimal;
    +import org.apache.spark.unsafe.types.CalendarInterval;
    +import org.apache.spark.unsafe.types.UTF8String;
    +
    +/**
    + * A RecordReader that returns InternalRow for Spark SQL execution.
    + * This reader uses an internal reader that returns Hive's VectorizedRowBatch. An adapter
    + * class is used to return internal row by directly accessing data in column vectors.
    + */
    +public class VectorizedSparkOrcNewRecordReader
    +    extends org.apache.hadoop.mapreduce.RecordReader<NullWritable, InternalRow> {
    +  private final org.apache.hadoop.mapred.RecordReader<NullWritable, VectorizedRowBatch> reader;
    +  private final int numColumns;
    +  private VectorizedRowBatch internalValue;
    +  private float progress = 0.0f;
    +  private List<Integer> columnIDs;
    +
    +  private long numRowsOfBatch = 0;
    +  private int indexOfRow = 0;
    +
    +  private final Row row;
    +
    +  public VectorizedSparkOrcNewRecordReader(
    +      Reader file,
    +      JobConf conf,
    +      FileSplit fileSplit,
    +      List<Integer> columnIDs) throws IOException {
    +    List<OrcProto.Type> types = file.getTypes();
    +    numColumns = (types.size() == 0) ? 0 : types.get(0).getSubtypesCount();
    +    this.reader = new SparkVectorizedOrcRecordReader(file, conf,
    +      new org.apache.hadoop.mapred.FileSplit(fileSplit));
    +
    +    this.columnIDs = columnIDs;
    +    this.internalValue = this.reader.createValue();
    +    this.progress = reader.getProgress();
    +    this.row = new Row(this.internalValue.cols, columnIDs);
    +  }
    +
    +  @Override
    +  public void close() throws IOException {
    +    reader.close();
    +  }
    +
    +  @Override
    +  public NullWritable getCurrentKey() throws IOException,
    +      InterruptedException {
    +    return NullWritable.get();
    +  }
    +
    +  @Override
    +  public InternalRow getCurrentValue() throws IOException,
    +      InterruptedException {
    +    if (indexOfRow >= numRowsOfBatch) {
    +      return null;
    +    }
    +    row.rowId = indexOfRow;
    +    indexOfRow++;
    +
    +    return row;
    +  }
    +
    +  @Override
    +  public float getProgress() throws IOException, InterruptedException {
    +    return progress;
    +  }
    +
    +  @Override
    +  public void initialize(InputSplit split, TaskAttemptContext context)
    +      throws IOException, InterruptedException {
    +  }
    +
    +  @Override
    +  public boolean nextKeyValue() throws IOException, InterruptedException {
    +    if (indexOfRow == numRowsOfBatch && progress < 1.0f) {
    +      if (reader.next(NullWritable.get(), internalValue)) {
    +        if (internalValue.endOfFile) {
    +          progress = 1.0f;
    +          numRowsOfBatch = 0;
    +          indexOfRow = 0;
    +          return false;
    +        } else {
    +          assert internalValue.numCols == numColumns : "Incorrect number of columns in OrcBatch";
    +          numRowsOfBatch = internalValue.count();
    +          indexOfRow = 0;
    +          progress = reader.getProgress();
    +        }
    +        return true;
    +      } else {
    +        return false;
    +      }
    +    } else {
    +      if (indexOfRow < numRowsOfBatch) {
    +        return true;
    +      } else {
    +        return false;
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Adapter class to return an internal row.
    +   */
    +  public static final class Row extends InternalRow {
    +    protected int rowId;
    +    private List<Integer> columnIDs;
    +    private final ColumnVector[] columns;
    +
    +    private Row(ColumnVector[] columns, List<Integer> columnIDs) {
    +      this.columns = columns;
    +      this.columnIDs = columnIDs;
    +    }
    +
    +    @Override
    +    public int numFields() { return columnIDs.size(); }
    --- End diff --
    
    Good catch! Thanks.



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by dafrista <gi...@git.apache.org>.

Github user dafrista commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r74180849
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/VectorizedSparkOrcNewRecordReader.java ---
    @@ -0,0 +1,317 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.nio.charset.StandardCharsets;
    +import java.util.List;
    +
    +import org.apache.commons.lang.NotImplementedException;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    +
    +import org.apache.spark.sql.catalyst.InternalRow;
    +import org.apache.spark.sql.catalyst.util.ArrayData;
    +import org.apache.spark.sql.catalyst.util.MapData;
    +import org.apache.spark.sql.types.DataType;
    +import org.apache.spark.sql.types.Decimal;
    +import org.apache.spark.unsafe.types.CalendarInterval;
    +import org.apache.spark.unsafe.types.UTF8String;
    +
    +/**
    + * A RecordReader that returns InternalRow for Spark SQL execution.
    + * This reader uses an internal reader that returns Hive's VectorizedRowBatch. An adapter
    + * class is used to return internal row by directly accessing data in column vectors.
    + */
    +public class VectorizedSparkOrcNewRecordReader
    +    extends org.apache.hadoop.mapreduce.RecordReader<NullWritable, InternalRow> {
    +  private final org.apache.hadoop.mapred.RecordReader<NullWritable, VectorizedRowBatch> reader;
    +  private final int numColumns;
    +  private VectorizedRowBatch internalValue;
    +  private float progress = 0.0f;
    +  private List<Integer> columnIDs;
    +
    +  private long numRowsOfBatch = 0;
    +  private int indexOfRow = 0;
    +
    +  private final Row row;
    +
    +  public VectorizedSparkOrcNewRecordReader(
    +      Reader file,
    +      JobConf conf,
    +      FileSplit fileSplit,
    +      List<Integer> columnIDs) throws IOException {
    +    List<OrcProto.Type> types = file.getTypes();
    +    numColumns = (types.size() == 0) ? 0 : types.get(0).getSubtypesCount();
    +    this.reader = new SparkVectorizedOrcRecordReader(file, conf,
    +      new org.apache.hadoop.mapred.FileSplit(fileSplit));
    +
    +    this.columnIDs = columnIDs;
    +    this.internalValue = this.reader.createValue();
    +    this.progress = reader.getProgress();
    +    this.row = new Row(this.internalValue.cols, columnIDs);
    +  }
    +
    +  @Override
    +  public void close() throws IOException {
    +    reader.close();
    +  }
    +
    +  @Override
    +  public NullWritable getCurrentKey() throws IOException,
    +      InterruptedException {
    +    return NullWritable.get();
    +  }
    +
    +  @Override
    +  public InternalRow getCurrentValue() throws IOException,
    +      InterruptedException {
    +    if (indexOfRow >= numRowsOfBatch) {
    +      return null;
    +    }
    +    row.rowId = indexOfRow;
    +    indexOfRow++;
    +
    +    return row;
    +  }
    +
    +  @Override
    +  public float getProgress() throws IOException, InterruptedException {
    +    return progress;
    +  }
    +
    +  @Override
    +  public void initialize(InputSplit split, TaskAttemptContext context)
    +      throws IOException, InterruptedException {
    +  }
    +
    +  @Override
    +  public boolean nextKeyValue() throws IOException, InterruptedException {
    +    if (indexOfRow == numRowsOfBatch && progress < 1.0f) {
    +      if (reader.next(NullWritable.get(), internalValue)) {
    +        if (internalValue.endOfFile) {
    +          progress = 1.0f;
    +          numRowsOfBatch = 0;
    +          indexOfRow = 0;
    +          return false;
    +        } else {
    +          assert internalValue.numCols == numColumns : "Incorrect number of columns in OrcBatch";
    +          numRowsOfBatch = internalValue.count();
    +          indexOfRow = 0;
    +          progress = reader.getProgress();
    +        }
    +        return true;
    +      } else {
    +        return false;
    +      }
    +    } else {
    +      if (indexOfRow < numRowsOfBatch) {
    +        return true;
    +      } else {
    +        return false;
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Adapter class to return an internal row.
    +   */
    +  public static final class Row extends InternalRow {
    +    protected int rowId;
    +    private List<Integer> columnIDs;
    +    private final ColumnVector[] columns;
    +
    +    private Row(ColumnVector[] columns, List<Integer> columnIDs) {
    +      this.columns = columns;
    +      this.columnIDs = columnIDs;
    +    }
    +
    +    @Override
    +    public int numFields() { return columnIDs.size(); }
    --- End diff --
    
    I profiled this code by taking repeated jstack traces of a job that scans the table, and was surprised to see this call taking around 30% of CPU time. Each time `.size()` is called, the Scala sequence to Java list conversion has to take place. Other calls that involve `columnIDs` also took about 9% of CPU time.
    
    I made a change to avoid this conversion and saw the CPU time spent in these calls drop to effectively zero. The change makes a Java copy of `columnIDs` in the constructor of `VectorizedSparkOrcNewRecordReader`: I replaced `this.columnIDs = columnIDs` at L75 with `this.columnIDs = new ArrayList<>(columnIDs)`, and `columnIDs` at L78 with `this.columnIDs`.
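
To illustrate the idea of copying once in the constructor, here is a minimal, self-contained sketch. It is not the PR's code: `CountingListView` and `CopyingReader` are hypothetical names, and the counting wrapper only stands in for whatever list view is passed in from the Scala side. It assumes nothing about how the actual Scala-to-Java conversion is implemented; it only shows that once the IDs are copied into a plain `java.util.ArrayList`, later `size()` and `get()` calls no longer reach the wrapper.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a wrapper list: every call is forwarded
// to a backing list, and the number of forwarded calls is counted.
final class CountingListView extends AbstractList<Integer> {
  private final List<Integer> backing;
  int calls = 0;

  CountingListView(List<Integer> backing) {
    this.backing = backing;
  }

  @Override
  public Integer get(int index) {
    calls++;
    return backing.get(index);
  }

  @Override
  public int size() {
    calls++;
    return backing.size();
  }
}

// Hypothetical reader that copies the column IDs once, in its constructor.
final class CopyingReader {
  private final List<Integer> columnIDs;

  CopyingReader(List<Integer> columnIDs) {
    // One-time copy into a plain java.util.ArrayList.
    this.columnIDs = new ArrayList<>(columnIDs);
  }

  int numFields() {
    return columnIDs.size();
  }

  int columnIdAt(int ordinal) {
    return columnIDs.get(ordinal);
  }
}

final class ColumnIdCopyDemo {
  public static void main(String[] args) {
    List<Integer> backing = new ArrayList<>();
    backing.add(0);
    backing.add(2);
    backing.add(3);

    CountingListView view = new CountingListView(backing);
    CopyingReader reader = new CopyingReader(view);

    // The copy itself touches the wrapper; record the count afterwards.
    int callsAfterCopy = view.calls;
    for (int i = 0; i < 1000; i++) {
      reader.numFields();
      reader.columnIdAt(1);
    }
    // The per-row calls above never reach the wrapper again.
    System.out.println(view.calls == callsAfterCopy);
  }
}
```

The trade-off in this sketch is one extra allocation per reader in exchange for cheap per-row access, which seems reasonable when the list of column IDs is short and fixed for the lifetime of the reader.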



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #61037 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61037/consoleFull)** for PR 13775 at commit [`855bcfd`](https://github.com/apache/spark/commit/855bcfde2067af4bd88d95a6365f976ecf891de9).



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #69131 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69131/consoleFull)** for PR 13775 at commit [`8638a0e`](https://github.com/apache/spark/commit/8638a0e2b98719770bff50804dcc0fc0e83674ad).



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    @tejasapatil Thanks for the review comment! I will update this later.



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83761287
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/VectorizedSparkOrcNewRecordReader.java ---
    @@ -0,0 +1,318 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.nio.charset.StandardCharsets;
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import org.apache.commons.lang.NotImplementedException;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    +
    +import org.apache.spark.sql.catalyst.InternalRow;
    +import org.apache.spark.sql.catalyst.util.ArrayData;
    +import org.apache.spark.sql.catalyst.util.MapData;
    +import org.apache.spark.sql.types.DataType;
    +import org.apache.spark.sql.types.Decimal;
    +import org.apache.spark.unsafe.types.CalendarInterval;
    +import org.apache.spark.unsafe.types.UTF8String;
    +
    +/**
    + * A RecordReader that returns InternalRow for Spark SQL execution.
    + * This reader uses an internal reader that returns Hive's VectorizedRowBatch. An adapter
    + * class is used to return internal row by directly accessing data in column vectors.
    + */
    +public class VectorizedSparkOrcNewRecordReader
    +    extends org.apache.hadoop.mapreduce.RecordReader<NullWritable, InternalRow> {
    +  private final org.apache.hadoop.mapred.RecordReader<NullWritable, VectorizedRowBatch> reader;
    +  private final int numColumns;
    +  private VectorizedRowBatch internalValue;
    +  private float progress = 0.0f;
    +  private List<Integer> columnIDs;
    +
    +  private long numRowsOfBatch = 0;
    +  private int indexOfRow = 0;
    +
    +  private final Row row;
    +
    +  public VectorizedSparkOrcNewRecordReader(
    +      Reader file,
    +      JobConf conf,
    +      FileSplit fileSplit,
    +      List<Integer> columnIDs) throws IOException {
    +    List<OrcProto.Type> types = file.getTypes();
    +    numColumns = (types.size() == 0) ? 0 : types.get(0).getSubtypesCount();
    +    this.reader = new SparkVectorizedOrcRecordReader(file, conf,
    +      new org.apache.hadoop.mapred.FileSplit(fileSplit));
    +
    +    this.columnIDs = new ArrayList<>(columnIDs);
    +    this.internalValue = this.reader.createValue();
    +    this.progress = reader.getProgress();
    +    this.row = new Row(this.internalValue.cols, this.columnIDs);
    +  }
    +
    +  @Override
    +  public void close() throws IOException {
    +    reader.close();
    +  }
    +
    +  @Override
    +  public NullWritable getCurrentKey() throws IOException,
    +      InterruptedException {
    +    return NullWritable.get();
    +  }
    +
    +  @Override
    +  public InternalRow getCurrentValue() throws IOException,
    +      InterruptedException {
    +    if (indexOfRow >= numRowsOfBatch) {
    +      return null;
    +    }
    +    row.rowId = indexOfRow;
    +    indexOfRow++;
    +
    +    return row;
    +  }
    +
    +  @Override
    +  public float getProgress() throws IOException, InterruptedException {
    +    return progress;
    +  }
    +
    +  @Override
    +  public void initialize(InputSplit split, TaskAttemptContext context)
    +      throws IOException, InterruptedException {
    +  }
    +
    +  @Override
    +  public boolean nextKeyValue() throws IOException, InterruptedException {
    +    if (indexOfRow == numRowsOfBatch) {
    +      if (reader.next(NullWritable.get(), internalValue)) {
    +        if (internalValue.endOfFile) {
    +          progress = 1.0f;
    +          numRowsOfBatch = 0;
    +          indexOfRow = 0;
    +          return false;
    +        } else {
    +          assert internalValue.numCols == numColumns : "Incorrect number of columns in OrcBatch";
    +          numRowsOfBatch = internalValue.count();
    +          indexOfRow = 0;
    +          progress = reader.getProgress();
    +        }
    +        return true;
    +      } else {
    +        return false;
    +      }
    +    } else {
    +      if (indexOfRow < numRowsOfBatch) {
    +        return true;
    +      } else {
    +        return false;
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Adapter class to return an internal row.
    +   */
    +  public static final class Row extends InternalRow {
    +    protected int rowId;
    +    private List<Integer> columnIDs;
    +    private final ColumnVector[] columns;
    +
    +    private Row(ColumnVector[] columns, List<Integer> columnIDs) {
    +      this.columns = columns;
    +      this.columnIDs = columnIDs;
    +    }
    +
    +    @Override
    +    public int numFields() { return columnIDs.size(); }
    +
    +    @Override
    +    public boolean anyNull() {
    +      for (int i = 0; i < columns.length; i++) {
    +        if (columnIDs.contains(i)) {
    +          if (columns[i].isRepeating && columns[i].isNull[0]) {
    +            return true;
    +          } else if (!columns[i].isRepeating && columns[i].isNull[rowId]) {
    +            return true;
    +          }
    +        }
    +      }
    +      return false;
    +    }
    +
    +    @Override
    +    public boolean isNullAt(int ordinal) {
    +      ColumnVector col = columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return col.isNull[0];
    +      } else {
    +        return col.isNull[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public boolean getBoolean(int ordinal) {
    +      LongColumnVector col = (LongColumnVector)columns[columnIDs.get(ordinal)];
    --- End diff --
    
    nit: add a space after the `(LongColumnVector)` cast. This may also apply to other places in the PR, but I am not pointing out each instance.



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83761710
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/VectorizedSparkOrcNewRecordReader.java ---
    @@ -0,0 +1,318 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.nio.charset.StandardCharsets;
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import org.apache.commons.lang.NotImplementedException;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    +
    +import org.apache.spark.sql.catalyst.InternalRow;
    +import org.apache.spark.sql.catalyst.util.ArrayData;
    +import org.apache.spark.sql.catalyst.util.MapData;
    +import org.apache.spark.sql.types.DataType;
    +import org.apache.spark.sql.types.Decimal;
    +import org.apache.spark.unsafe.types.CalendarInterval;
    +import org.apache.spark.unsafe.types.UTF8String;
    +
    +/**
    + * A RecordReader that returns InternalRow for Spark SQL execution.
    + * This reader uses an internal reader that returns Hive's VectorizedRowBatch. An adapter
    + * class is used to return internal row by directly accessing data in column vectors.
    + */
    +public class VectorizedSparkOrcNewRecordReader
    +    extends org.apache.hadoop.mapreduce.RecordReader<NullWritable, InternalRow> {
    +  private final org.apache.hadoop.mapred.RecordReader<NullWritable, VectorizedRowBatch> reader;
    +  private final int numColumns;
    +  private VectorizedRowBatch internalValue;
    +  private float progress = 0.0f;
    +  private List<Integer> columnIDs;
    +
    +  private long numRowsOfBatch = 0;
    +  private int indexOfRow = 0;
    +
    +  private final Row row;
    +
    +  public VectorizedSparkOrcNewRecordReader(
    +      Reader file,
    +      JobConf conf,
    +      FileSplit fileSplit,
    +      List<Integer> columnIDs) throws IOException {
    +    List<OrcProto.Type> types = file.getTypes();
    +    numColumns = (types.size() == 0) ? 0 : types.get(0).getSubtypesCount();
    +    this.reader = new SparkVectorizedOrcRecordReader(file, conf,
    +      new org.apache.hadoop.mapred.FileSplit(fileSplit));
    +
    +    this.columnIDs = new ArrayList<>(columnIDs);
    +    this.internalValue = this.reader.createValue();
    +    this.progress = reader.getProgress();
    +    this.row = new Row(this.internalValue.cols, this.columnIDs);
    +  }
    +
    +  @Override
    +  public void close() throws IOException {
    +    reader.close();
    +  }
    +
    +  @Override
    +  public NullWritable getCurrentKey() throws IOException,
    +      InterruptedException {
    +    return NullWritable.get();
    +  }
    +
    +  @Override
    +  public InternalRow getCurrentValue() throws IOException,
    +      InterruptedException {
    +    if (indexOfRow >= numRowsOfBatch) {
    +      return null;
    +    }
    +    row.rowId = indexOfRow;
    +    indexOfRow++;
    +
    +    return row;
    +  }
    +
    +  @Override
    +  public float getProgress() throws IOException, InterruptedException {
    +    return progress;
    +  }
    +
    +  @Override
    +  public void initialize(InputSplit split, TaskAttemptContext context)
    +      throws IOException, InterruptedException {
    +  }
    +
    +  @Override
    +  public boolean nextKeyValue() throws IOException, InterruptedException {
    +    if (indexOfRow == numRowsOfBatch) {
    +      if (reader.next(NullWritable.get(), internalValue)) {
    +        if (internalValue.endOfFile) {
    +          progress = 1.0f;
    +          numRowsOfBatch = 0;
    +          indexOfRow = 0;
    +          return false;
    +        } else {
    +          assert internalValue.numCols == numColumns : "Incorrect number of columns in OrcBatch";
    +          numRowsOfBatch = internalValue.count();
    +          indexOfRow = 0;
    +          progress = reader.getProgress();
    +        }
    +        return true;
    +      } else {
    +        return false;
    +      }
    +    } else {
    +      if (indexOfRow < numRowsOfBatch) {
    +        return true;
    +      } else {
    +        return false;
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Adapter class to return an internal row.
    +   */
    +  public static final class Row extends InternalRow {
    +    protected int rowId;
    +    private List<Integer> columnIDs;
    +    private final ColumnVector[] columns;
    +
    +    private Row(ColumnVector[] columns, List<Integer> columnIDs) {
    +      this.columns = columns;
    +      this.columnIDs = columnIDs;
    +    }
    +
    +    @Override
    +    public int numFields() { return columnIDs.size(); }
    +
    +    @Override
    +    public boolean anyNull() {
    +      for (int i = 0; i < columns.length; i++) {
    +        if (columnIDs.contains(i)) {
    +          if (columns[i].isRepeating && columns[i].isNull[0]) {
    +            return true;
    +          } else if (!columns[i].isRepeating && columns[i].isNull[rowId]) {
    +            return true;
    +          }
    +        }
    +      }
    +      return false;
    +    }
    +
    +    @Override
    +    public boolean isNullAt(int ordinal) {
    +      ColumnVector col = columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return col.isNull[0];
    +      } else {
    +        return col.isNull[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public boolean getBoolean(int ordinal) {
    +      LongColumnVector col = (LongColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return col.vector[0] > 0;
    +      } else {
    +        return col.vector[rowId] > 0;
    +      }
    +    }
    +
    +    @Override
    +    public byte getByte(int ordinal) {
    +      LongColumnVector col = (LongColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return (byte)col.vector[0];
    +      } else {
    +        return (byte)col.vector[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public short getShort(int ordinal) {
    +      LongColumnVector col = (LongColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return (short)col.vector[0];
    +      } else {
    +        return (short)col.vector[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public int getInt(int ordinal) {
    +      LongColumnVector col = (LongColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return (int)col.vector[0];
    +      } else {
    +        return (int)col.vector[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public long getLong(int ordinal) {
    +      LongColumnVector col = (LongColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return (long)col.vector[0];
    +      } else {
    +        return (long)col.vector[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public float getFloat(int ordinal) {
    +      DoubleColumnVector col = (DoubleColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return (float)col.vector[0];
    +      } else {
    +        return (float)col.vector[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public double getDouble(int ordinal) {
    +      DoubleColumnVector col = (DoubleColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return (double)col.vector[0];
    +      } else {
    +        return (double)col.vector[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public Decimal getDecimal(int ordinal, int precision, int scale) {
    +      DecimalColumnVector col = (DecimalColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return Decimal.apply(col.vector[0].getHiveDecimal().bigDecimalValue(), precision, scale);
    +      } else {
    +        return Decimal.apply(col.vector[rowId].getHiveDecimal().bigDecimalValue(),
    +          precision, scale);
    +      }
    +    }
    +
    +    @Override
    +    public UTF8String getUTF8String(int ordinal) {
    +      BytesColumnVector bv = ((BytesColumnVector)columns[columnIDs.get(ordinal)]);
    +      if (bv.isRepeating) {
    +        return UTF8String.fromBytes(bv.vector[0], bv.start[0], bv.length[0]);
    +      } else {
    +        return UTF8String.fromBytes(bv.vector[rowId], bv.start[rowId], bv.length[rowId]);
    +      }
    +    }
    +
    +    @Override
    +    public byte[] getBinary(int ordinal) {
    +      BytesColumnVector col = (BytesColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        byte[] binary = new byte[col.length[0]];
    +        System.arraycopy(col.vector[0], col.start[0], binary, 0, binary.length);
    +        return binary;
    +      } else {
    +        byte[] binary = new byte[col.length[rowId]];
    +        System.arraycopy(col.vector[rowId], col.start[rowId], binary, 0, binary.length);
    +        return binary;
    +      }
    +    }
    +
    +    @Override
    +    public CalendarInterval getInterval(int ordinal) {
    +      throw new NotImplementedException();
    --- End diff --
    
    I think the Hive artifact used by Spark does not support vectorization for all data types, and that's why this is not implemented. In the future, upgrading Hive in Spark might make it possible to add it. Can you add a comment about this?
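
One way the requested comment might look is sketched below. This is a minimal, self-contained illustration and not the PR's code: `UnsupportedIntervalSketch`, `getInterval` returning `Object`, and the message text are all hypothetical stand-ins, and the stated reason is the reviewer's guess rather than a confirmed fact.

```java
public class UnsupportedIntervalSketch {
  /**
   * Hypothetical stand-in for the interval accessor of the Row adapter.
   *
   * Interval columns are not readable through the vectorized path in this
   * sketch. The assumed reason (taken from the review comment, not verified)
   * is that the Hive artifact bundled with Spark does not vectorize every
   * data type; this could be revisited after a Hive upgrade.
   */
  public static Object getInterval(int ordinal) {
    throw new UnsupportedOperationException(
        "Vectorized ORC reader does not support interval columns (ordinal " + ordinal + ")");
  }

  public static void main(String[] args) {
    try {
      getInterval(3);
    } catch (UnsupportedOperationException e) {
      // The message names the unsupported type, so the failure is self-explanatory.
      System.out.println(e.getMessage());
    }
  }
}
```

Throwing an exception whose message names the unsupported type, next to a comment recording why, keeps the limitation visible to both readers of the code and users who hit it.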


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by dafrista <gi...@git.apache.org>.

Github user dafrista commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r74368871
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/VectorizedSparkOrcNewRecordReader.java ---
    @@ -0,0 +1,318 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.nio.charset.StandardCharsets;
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import org.apache.commons.lang.NotImplementedException;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    +
    +import org.apache.spark.sql.catalyst.InternalRow;
    +import org.apache.spark.sql.catalyst.util.ArrayData;
    +import org.apache.spark.sql.catalyst.util.MapData;
    +import org.apache.spark.sql.types.DataType;
    +import org.apache.spark.sql.types.Decimal;
    +import org.apache.spark.unsafe.types.CalendarInterval;
    +import org.apache.spark.unsafe.types.UTF8String;
    +
    +/**
    + * A RecordReader that returns InternalRow for Spark SQL execution.
    + * This reader uses an internal reader that returns Hive's VectorizedRowBatch. An adapter
    + * class is used to return internal row by directly accessing data in column vectors.
    + */
    +public class VectorizedSparkOrcNewRecordReader
    +    extends org.apache.hadoop.mapreduce.RecordReader<NullWritable, InternalRow> {
    +  private final org.apache.hadoop.mapred.RecordReader<NullWritable, VectorizedRowBatch> reader;
    +  private final int numColumns;
    +  private VectorizedRowBatch internalValue;
    +  private float progress = 0.0f;
    +  private List<Integer> columnIDs;
    +
    +  private long numRowsOfBatch = 0;
    +  private int indexOfRow = 0;
    +
    +  private final Row row;
    +
    +  public VectorizedSparkOrcNewRecordReader(
    +      Reader file,
    +      JobConf conf,
    +      FileSplit fileSplit,
    +      List<Integer> columnIDs) throws IOException {
    +    List<OrcProto.Type> types = file.getTypes();
    +    numColumns = (types.size() == 0) ? 0 : types.get(0).getSubtypesCount();
    +    this.reader = new SparkVectorizedOrcRecordReader(file, conf,
    +      new org.apache.hadoop.mapred.FileSplit(fileSplit));
    +
    +    this.columnIDs = new ArrayList<>(columnIDs);
    +    this.internalValue = this.reader.createValue();
    +    this.progress = reader.getProgress();
    +    this.row = new Row(this.internalValue.cols, this.columnIDs);
    +  }
    +
    +  @Override
    +  public void close() throws IOException {
    +    reader.close();
    +  }
    +
    +  @Override
    +  public NullWritable getCurrentKey() throws IOException,
    +      InterruptedException {
    +    return NullWritable.get();
    +  }
    +
    +  @Override
    +  public InternalRow getCurrentValue() throws IOException,
    +      InterruptedException {
    +    if (indexOfRow >= numRowsOfBatch) {
    +      return null;
    +    }
    +    row.rowId = indexOfRow;
    +    indexOfRow++;
    +
    +    return row;
    +  }
    +
    +  @Override
    +  public float getProgress() throws IOException, InterruptedException {
    +    return progress;
    +  }
    +
    +  @Override
    +  public void initialize(InputSplit split, TaskAttemptContext context)
    +      throws IOException, InterruptedException {
    +  }
    +
    +  @Override
    +  public boolean nextKeyValue() throws IOException, InterruptedException {
    +    if (indexOfRow == numRowsOfBatch && progress < 1.0f) {
    --- End diff --
    
    Ah, I see. The new commit should work 👍
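
For readers following the `nextKeyValue()` discussion, the batch-to-row cursor pattern in the quoted diff can be sketched without Hive or Hadoop. This is only an illustration under simplifying assumptions: `BatchCursorSketch` is a hypothetical name, batches are plain `long[]` arrays rather than `VectorizedRowBatch`, and every batch is assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class BatchCursorSketch {
  private final Iterator<long[]> batches;  // hypothetical stand-in for the batch reader
  private long[] batch = new long[0];      // current batch; starts empty
  private int indexOfRow = 0;              // next row to hand out from the batch

  public BatchCursorSketch(List<long[]> source) {
    this.batches = source.iterator();
  }

  /** True if a row is ready; refills only once the current batch is exhausted. */
  public boolean nextKeyValue() {
    if (indexOfRow < batch.length) {
      return true;
    }
    if (!batches.hasNext()) {
      return false;                        // no more batches: end of input
    }
    batch = batches.next();                // assumed non-empty in this sketch
    indexOfRow = 0;
    return batch.length > 0;
  }

  /** Returns the current row and advances, as getCurrentValue() does in the diff. */
  public long getCurrentValue() {
    return batch[indexOfRow++];
  }

  /** Drains the cursor into a list, for demonstration. */
  public static List<Long> readAll(List<long[]> source) {
    BatchCursorSketch cursor = new BatchCursorSketch(source);
    List<Long> rows = new ArrayList<>();
    while (cursor.nextKeyValue()) {
      rows.add(cursor.getCurrentValue());
    }
    return rows;
  }

  public static void main(String[] args) {
    List<long[]> source = Arrays.asList(new long[] {1, 2, 3}, new long[] {4, 5});
    System.out.println(readAll(source));   // prints [1, 2, 3, 4, 5]
  }
}
```

As in the diff, `nextKeyValue()` does not advance the cursor itself, so calling it repeatedly without `getCurrentValue()` is harmless.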




[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    @hvanhovell @rxin I've re-run the benchmark and updated the results.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #69149 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69149/consoleFull)** for PR 13775 at commit [`bd15842`](https://github.com/apache/spark/commit/bd15842e7b146cf292d9c29b896362412b22b8c1).



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    If we do want to maintain one in Spark, we should probably start by adding Presto's. Data formats don't change super frequently, so I think we would be able to maintain it, especially when a large company like Facebook has resources devoted to making it work :)




[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    retest this please.



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83757105
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/SparkVectorizedOrcRecordReader.java ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    +import org.apache.hadoop.hive.serde2.typeinfo.DecimalTypeInfo;
    +import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.FileSplit;
    +import org.apache.hadoop.mapred.RecordReader;
    +
    +/**
    + * A mapred.RecordReader that returns VectorizedRowBatch.
    + */
    +public class SparkVectorizedOrcRecordReader
    +      implements RecordReader<NullWritable, VectorizedRowBatch> {
    +    private final org.apache.hadoop.hive.ql.io.orc.RecordReader reader;
    +    private final long offset;
    +    private final long length;
    +    private float progress = 0.0f;
    +    private ObjectInspector objectInspector;
    +
    +    SparkVectorizedOrcRecordReader(Reader file, Configuration conf,
    +        FileSplit fileSplit) throws IOException {
    +      this.offset = fileSplit.getStart();
    +      this.length = fileSplit.getLength();
    +      this.objectInspector = file.getObjectInspector();
    +      this.reader = OrcInputFormat.createReaderFromFile(file, conf, this.offset,
    +        this.length);
    +      this.progress = reader.getProgress();
    +    }
    +
    +    /**
    +     * Create a ColumnVector based on given ObjectInspector's type info.
    +     *
    +     * @param inspector ObjectInspector
    +     */
    +    private ColumnVector createColumnVector(ObjectInspector inspector) {
    +      switch(inspector.getCategory()) {
    +        case PRIMITIVE:
    +          {
    +            PrimitiveTypeInfo primitiveTypeInfo =
    +              (PrimitiveTypeInfo) ((PrimitiveObjectInspector)inspector).getTypeInfo();
    +            switch(primitiveTypeInfo.getPrimitiveCategory()) {
    +              case BOOLEAN:
    +              case BYTE:
    +              case SHORT:
    +              case INT:
    +              case LONG:
    +              case DATE:
    +              case INTERVAL_YEAR_MONTH:
    +                return new LongColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case FLOAT:
    +              case DOUBLE:
    +                return new DoubleColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case BINARY:
    +              case STRING:
    +              case CHAR:
    +              case VARCHAR:
    +                BytesColumnVector column = new BytesColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +                column.initBuffer();
    +                return column;
    +              case DECIMAL:
    +                DecimalTypeInfo tInfo = (DecimalTypeInfo) primitiveTypeInfo;
    +                return new DecimalColumnVector(VectorizedRowBatch.DEFAULT_SIZE,
    +                    tInfo.precision(), tInfo.scale());
    +              default:
    +                throw new RuntimeException("Vectorizaton is not supported for datatype:"
    +                    + primitiveTypeInfo.getPrimitiveCategory());
    +            }
    +          }
    +        default:
    +          throw new RuntimeException("Vectorization is not supported for datatype:"
    +              + inspector.getCategory());
    +      }
    +    }
    +
    +    /**
    +     * Walk through the object inspector and add column vectors
    +     *
    +     * @param oi StructObjectInspector
    +     * @param cvList ColumnVectors are populated in this list
    +     */
    +    private void allocateColumnVector(StructObjectInspector oi,
    +        List<ColumnVector> cvList) {
    +      if (cvList == null) {
    +        throw new RuntimeException("Null columnvector list");
    +      }
    +      if (oi == null) {
    +        return;
    +      }
    +      final List<? extends StructField> fields = oi.getAllStructFieldRefs();
    +      for(StructField field : fields) {
    +        ObjectInspector fieldObjectInspector = field.getFieldObjectInspector();
    +        cvList.add(createColumnVector(fieldObjectInspector));
    +      }
    +    }
    +
    +    /**
    +     * Create VectorizedRowBatch from ObjectInspector
    +     *
    +     * @param oi StructObjectInspector
    +     * @return VectorizedRowBatch
    +     */
    +    private VectorizedRowBatch constructVectorizedRowBatch(
    +        StructObjectInspector oi) {
    +      final List<ColumnVector> cvList = new LinkedList<ColumnVector>();
    +      allocateColumnVector(oi, cvList);
    +      final VectorizedRowBatch result = new VectorizedRowBatch(cvList.size());
    +      int i = 0;
    +      for(ColumnVector cv : cvList) {
    +        result.cols[i++] = cv;
    +      }
    +      return result;
    +    }
    +
    +    @Override
    +    public boolean next(NullWritable key, VectorizedRowBatch value) throws IOException {
    --- End diff --
    
    Curiosity: can a `VectorizedRowBatch` ever be empty while the next batch has some rows?
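
The thread does not answer whether Hive's reader can actually return an empty batch mid-file. If it could, a `next()` that returns false on `size == 0` would drop the rows that follow. A defensive alternative is to skip empty batches and stop only at true end of input. The sketch below shows that idea with plain `int[]` arrays; `SkipEmptyBatchSketch` and its methods are hypothetical names, not part of the PR or of Hive.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SkipEmptyBatchSketch {
  /**
   * Returns the next non-empty batch, or null at end of input.
   * An empty batch is skipped rather than treated as end of input.
   */
  public static int[] nextNonEmpty(Iterator<int[]> batches) {
    while (batches.hasNext()) {
      int[] batch = batches.next();
      if (batch.length > 0) {
        return batch;
      }
    }
    return null;
  }

  /** Counts all rows, including those that follow an empty batch. */
  public static int countRows(List<int[]> source) {
    Iterator<int[]> it = source.iterator();
    int rows = 0;
    for (int[] b = nextNonEmpty(it); b != null; b = nextNonEmpty(it)) {
      rows += b.length;
    }
    return rows;
  }

  public static void main(String[] args) {
    // The empty batch in the middle does not end the scan.
    List<int[]> source = Arrays.asList(new int[] {1, 2}, new int[0], new int[] {3});
    System.out.println(countRows(source));  // prints 3
  }
}
```

Whether this extra loop is needed depends on the reader's contract; if empty batches only ever occur at end of file, the simpler check in the diff is sufficient.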



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    ping @yhuai @liancheng @hvanhovell @cloud-fan again...




[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69147/
    Test PASSed.



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83751452
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -251,6 +251,12 @@ object SQLConf {
           .booleanConf
           .createWithDefault(true)
     
    +  val ORC_VECTORIZED_READER_ENABLED =
    +    SQLConfigBuilder("spark.sql.orc.enableVectorizedReader")
    +      .doc("Enables vectorized orc reader.")
    +      .booleanConf
    +      .createWithDefault(true)
    --- End diff --
    
    Please turn it off by default. Until there is more testing, this might be a risky thing to do. If you have productionized this code and it has been running smoothly for some time, then I would be comfortable launching with the default turned on.



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83756710
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/SparkVectorizedOrcRecordReader.java ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    +import org.apache.hadoop.hive.serde2.typeinfo.DecimalTypeInfo;
    +import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.FileSplit;
    +import org.apache.hadoop.mapred.RecordReader;
    +
    +/**
    + * A mapred.RecordReader that returns VectorizedRowBatch.
    + */
    +public class SparkVectorizedOrcRecordReader
    +      implements RecordReader<NullWritable, VectorizedRowBatch> {
    +    private final org.apache.hadoop.hive.ql.io.orc.RecordReader reader;
    +    private final long offset;
    +    private final long length;
    +    private float progress = 0.0f;
    +    private ObjectInspector objectInspector;
    +
    +    SparkVectorizedOrcRecordReader(Reader file, Configuration conf,
    +        FileSplit fileSplit) throws IOException {
    +      this.offset = fileSplit.getStart();
    +      this.length = fileSplit.getLength();
    +      this.objectInspector = file.getObjectInspector();
    +      this.reader = OrcInputFormat.createReaderFromFile(file, conf, this.offset,
    +        this.length);
    +      this.progress = reader.getProgress();
    +    }
    +
    +    /**
    +     * Create a ColumnVector based on given ObjectInspector's type info.
    +     *
    +     * @param inspector ObjectInspector
    +     */
    +    private ColumnVector createColumnVector(ObjectInspector inspector) {
    +      switch(inspector.getCategory()) {
    +        case PRIMITIVE:
    +          {
    +            PrimitiveTypeInfo primitiveTypeInfo =
    +              (PrimitiveTypeInfo) ((PrimitiveObjectInspector)inspector).getTypeInfo();
    +            switch(primitiveTypeInfo.getPrimitiveCategory()) {
    +              case BOOLEAN:
    +              case BYTE:
    +              case SHORT:
    +              case INT:
    +              case LONG:
    +              case DATE:
    +              case INTERVAL_YEAR_MONTH:
    +                return new LongColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case FLOAT:
    +              case DOUBLE:
    +                return new DoubleColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case BINARY:
    +              case STRING:
    +              case CHAR:
    +              case VARCHAR:
    +                BytesColumnVector column = new BytesColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +                column.initBuffer();
    +                return column;
    +              case DECIMAL:
    +                DecimalTypeInfo tInfo = (DecimalTypeInfo) primitiveTypeInfo;
    +                return new DecimalColumnVector(VectorizedRowBatch.DEFAULT_SIZE,
    +                    tInfo.precision(), tInfo.scale());
    +              default:
    +                throw new RuntimeException("Vectorizaton is not supported for datatype:"
    +                    + primitiveTypeInfo.getPrimitiveCategory());
    +            }
    +          }
    +        default:
    +          throw new RuntimeException("Vectorization is not supported for datatype:"
    +              + inspector.getCategory());
    +      }
    +    }
    +
    +    /**
    +     * Walk through the object inspector and add column vectors
    +     *
    +     * @param oi StructObjectInspector
    +     * @param cvList ColumnVectors are populated in this list
    +     */
    +    private void allocateColumnVector(StructObjectInspector oi,
    +        List<ColumnVector> cvList) {
    +      if (cvList == null) {
    +        throw new RuntimeException("Null columnvector list");
    +      }
    +      if (oi == null) {
    +        return;
    +      }
    +      final List<? extends StructField> fields = oi.getAllStructFieldRefs();
    +      for(StructField field : fields) {
    +        ObjectInspector fieldObjectInspector = field.getFieldObjectInspector();
    +        cvList.add(createColumnVector(fieldObjectInspector));
    +      }
    +    }
    +
    +    /**
    +     * Create VectorizedRowBatch from ObjectInspector
    +     *
    +     * @param oi StructObjectInspector
    +     * @return VectorizedRowBatch
    +     */
    +    private VectorizedRowBatch constructVectorizedRowBatch(
    +        StructObjectInspector oi) {
    +      final List<ColumnVector> cvList = new LinkedList<ColumnVector>();
    +      allocateColumnVector(oi, cvList);
    +      final VectorizedRowBatch result = new VectorizedRowBatch(cvList.size());
    +      int i = 0;
    +      for(ColumnVector cv : cvList) {
    +        result.cols[i++] = cv;
    +      }
    +      return result;
    +    }
    +
    +    @Override
    +    public boolean next(NullWritable key, VectorizedRowBatch value) throws IOException {
    +      if (reader.hasNext()) {
    +        try {
    +          reader.nextBatch(value);
    +          if (value == null || value.endOfFile || value.size == 0) {
    +            return false;
    +          }
    +        } catch (Exception e) {
    +          throw new RuntimeException(e);
    +        }
    +        progress = reader.getProgress();
    +        return true;
    +      } else {
    +        return false;
    +      }
    +    }
    +
    +    @Override
    +    public NullWritable createKey() {
    +      return NullWritable.get();
    +    }
    +
    +    @Override
    +    public VectorizedRowBatch createValue() {
    +      return constructVectorizedRowBatch((StructObjectInspector)this.objectInspector);
    --- End diff --
    
    nit: add a space after the `(StructObjectInspector)` cast


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #69080 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69080/consoleFull)** for PR 13775 at commit [`47bc196`](https://github.com/apache/spark/commit/47bc196a8e91f280cda235c60d0974c7c69fb0ad).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r74367827
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/VectorizedSparkOrcNewRecordReader.java ---
    @@ -0,0 +1,318 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.nio.charset.StandardCharsets;
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import org.apache.commons.lang.NotImplementedException;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    +
    +import org.apache.spark.sql.catalyst.InternalRow;
    +import org.apache.spark.sql.catalyst.util.ArrayData;
    +import org.apache.spark.sql.catalyst.util.MapData;
    +import org.apache.spark.sql.types.DataType;
    +import org.apache.spark.sql.types.Decimal;
    +import org.apache.spark.unsafe.types.CalendarInterval;
    +import org.apache.spark.unsafe.types.UTF8String;
    +
    +/**
    + * A RecordReader that returns InternalRow for Spark SQL execution.
    + * This reader uses an internal reader that returns Hive's VectorizedRowBatch. An adapter
    + * class is used to return internal row by directly accessing data in column vectors.
    + */
    +public class VectorizedSparkOrcNewRecordReader
    +    extends org.apache.hadoop.mapreduce.RecordReader<NullWritable, InternalRow> {
    +  private final org.apache.hadoop.mapred.RecordReader<NullWritable, VectorizedRowBatch> reader;
    +  private final int numColumns;
    +  private VectorizedRowBatch internalValue;
    +  private float progress = 0.0f;
    +  private List<Integer> columnIDs;
    +
    +  private long numRowsOfBatch = 0;
    +  private int indexOfRow = 0;
    +
    +  private final Row row;
    +
    +  public VectorizedSparkOrcNewRecordReader(
    +      Reader file,
    +      JobConf conf,
    +      FileSplit fileSplit,
    +      List<Integer> columnIDs) throws IOException {
    +    List<OrcProto.Type> types = file.getTypes();
    +    numColumns = (types.size() == 0) ? 0 : types.get(0).getSubtypesCount();
    +    this.reader = new SparkVectorizedOrcRecordReader(file, conf,
    +      new org.apache.hadoop.mapred.FileSplit(fileSplit));
    +
    +    this.columnIDs = new ArrayList<>(columnIDs);
    +    this.internalValue = this.reader.createValue();
    +    this.progress = reader.getProgress();
    +    this.row = new Row(this.internalValue.cols, this.columnIDs);
    +  }
    +
    +  @Override
    +  public void close() throws IOException {
    +    reader.close();
    +  }
    +
    +  @Override
    +  public NullWritable getCurrentKey() throws IOException,
    +      InterruptedException {
    +    return NullWritable.get();
    +  }
    +
    +  @Override
    +  public InternalRow getCurrentValue() throws IOException,
    +      InterruptedException {
    +    if (indexOfRow >= numRowsOfBatch) {
    +      return null;
    +    }
    +    row.rowId = indexOfRow;
    +    indexOfRow++;
    +
    +    return row;
    +  }
    +
    +  @Override
    +  public float getProgress() throws IOException, InterruptedException {
    +    return progress;
    +  }
    +
    +  @Override
    +  public void initialize(InputSplit split, TaskAttemptContext context)
    +      throws IOException, InterruptedException {
    +  }
    +
    +  @Override
    +  public boolean nextKeyValue() throws IOException, InterruptedException {
    +    if (indexOfRow == numRowsOfBatch && progress < 1.0f) {
    --- End diff --
    
    `reader.next` does not return false when there are no more batches; it throws an exception instead.
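
    The guard discussed here can be sketched in isolation. This is a minimal illustration and not the PR's code: `ThrowingBatchSource` and `RowCursorSketch` are hypothetical names, rows are plain `int`s, and the refill uses a `while` loop (my choice, so that empty batches are skipped) where the quoted diff uses an `if`. The point is only that when the underlying source throws at end of input, the caller has to consult `progress` before asking for another batch.

```java
import java.util.NoSuchElementException;

class RowCursorSketch {
    // Hypothetical batch source that throws, rather than returning false, once exhausted.
    static final class ThrowingBatchSource {
        private final int[][] batches;
        private int next = 0;

        ThrowingBatchSource(int[][] batches) { this.batches = batches; }

        // Fraction of batches handed out so far; exactly 1.0f when none remain.
        float progress() {
            return batches.length == 0 ? 1.0f : (float) next / batches.length;
        }

        int[] nextBatch() {
            if (next >= batches.length) {
                throw new NoSuchElementException("no more batches");
            }
            return batches[next++];
        }
    }

    private final ThrowingBatchSource source;
    private int[] batch = new int[0];
    private int indexOfRow = 0;
    private int current;

    RowCursorSketch(ThrowingBatchSource source) { this.source = source; }

    boolean nextKeyValue() {
        // Refill only when the current batch is consumed AND the source still has data;
        // without the progress check, nextBatch() would throw at end of input.
        while (indexOfRow == batch.length && source.progress() < 1.0f) {
            batch = source.nextBatch();
            indexOfRow = 0;
        }
        if (indexOfRow >= batch.length) {
            return false;
        }
        current = batch[indexOfRow++];
        return true;
    }

    int getCurrentValue() { return current; }

    // Drains the cursor and sums every row it yields.
    static int sum(int[][] batches) {
        RowCursorSketch cursor = new RowCursorSketch(new ThrowingBatchSource(batches));
        int total = 0;
        while (cursor.nextKeyValue()) {
            total += cursor.getCurrentValue();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[][] {{1, 2}, {3}}));
    }
}
```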



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83756912
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/SparkVectorizedOrcRecordReader.java ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    +import org.apache.hadoop.hive.serde2.typeinfo.DecimalTypeInfo;
    +import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.FileSplit;
    +import org.apache.hadoop.mapred.RecordReader;
    +
    +/**
    + * A mapred.RecordReader that returns VectorizedRowBatch.
    + */
    +public class SparkVectorizedOrcRecordReader
    +      implements RecordReader<NullWritable, VectorizedRowBatch> {
    +    private final org.apache.hadoop.hive.ql.io.orc.RecordReader reader;
    +    private final long offset;
    +    private final long length;
    +    private float progress = 0.0f;
    +    private ObjectInspector objectInspector;
    +
    +    SparkVectorizedOrcRecordReader(Reader file, Configuration conf,
    +        FileSplit fileSplit) throws IOException {
    +      this.offset = fileSplit.getStart();
    +      this.length = fileSplit.getLength();
    +      this.objectInspector = file.getObjectInspector();
    +      this.reader = OrcInputFormat.createReaderFromFile(file, conf, this.offset,
    +        this.length);
    +      this.progress = reader.getProgress();
    +    }
    +
    +    /**
    +     * Create a ColumnVector based on given ObjectInspector's type info.
    +     *
    +     * @param inspector ObjectInspector
    +     */
    +    private ColumnVector createColumnVector(ObjectInspector inspector) {
    +      switch(inspector.getCategory()) {
    +        case PRIMITIVE:
    +          {
    +            PrimitiveTypeInfo primitiveTypeInfo =
    +              (PrimitiveTypeInfo) ((PrimitiveObjectInspector)inspector).getTypeInfo();
    +            switch(primitiveTypeInfo.getPrimitiveCategory()) {
    +              case BOOLEAN:
    +              case BYTE:
    +              case SHORT:
    +              case INT:
    +              case LONG:
    +              case DATE:
    +              case INTERVAL_YEAR_MONTH:
    +                return new LongColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case FLOAT:
    +              case DOUBLE:
    +                return new DoubleColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case BINARY:
    +              case STRING:
    +              case CHAR:
    +              case VARCHAR:
    +                BytesColumnVector column = new BytesColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +                column.initBuffer();
    +                return column;
    +              case DECIMAL:
    +                DecimalTypeInfo tInfo = (DecimalTypeInfo) primitiveTypeInfo;
    +                return new DecimalColumnVector(VectorizedRowBatch.DEFAULT_SIZE,
    +                    tInfo.precision(), tInfo.scale());
    +              default:
    +                throw new RuntimeException("Vectorizaton is not supported for datatype:"
    +                    + primitiveTypeInfo.getPrimitiveCategory());
    +            }
    +          }
    +        default:
    +          throw new RuntimeException("Vectorization is not supported for datatype:"
    +              + inspector.getCategory());
    +      }
    +    }
    +
    +    /**
    +     * Walk through the object inspector and add column vectors
    +     *
    +     * @param oi StructObjectInspector
    +     * @param cvList ColumnVectors are populated in this list
    +     */
    +    private void allocateColumnVector(StructObjectInspector oi,
    +        List<ColumnVector> cvList) {
    +      if (cvList == null) {
    +        throw new RuntimeException("Null columnvector list");
    +      }
    +      if (oi == null) {
    +        return;
    +      }
    +      final List<? extends StructField> fields = oi.getAllStructFieldRefs();
    +      for(StructField field : fields) {
    +        ObjectInspector fieldObjectInspector = field.getFieldObjectInspector();
    +        cvList.add(createColumnVector(fieldObjectInspector));
    +      }
    +    }
    +
    +    /**
    +     * Create VectorizedRowBatch from ObjectInspector
    +     *
    +     * @param oi StructObjectInspector
    +     * @return VectorizedRowBatch
    +     */
    +    private VectorizedRowBatch constructVectorizedRowBatch(
    +        StructObjectInspector oi) {
    +      final List<ColumnVector> cvList = new LinkedList<ColumnVector>();
    +      allocateColumnVector(oi, cvList);
    +      final VectorizedRowBatch result = new VectorizedRowBatch(cvList.size());
    +      int i = 0;
    +      for(ColumnVector cv : cvList) {
    +        result.cols[i++] = cv;
    +      }
    +      return result;
    +    }
    +
    +    @Override
    +    public boolean next(NullWritable key, VectorizedRowBatch value) throws IOException {
    +      if (reader.hasNext()) {
    +        try {
    +          reader.nextBatch(value);
    +          if (value == null || value.endOfFile || value.size == 0) {
    --- End diff --
    
    You can simplify this to: `return (value != null && !value.endOfFile && value.size > 0)`
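
    A minimal sketch of the equivalence behind this suggestion, using a hypothetical `Batch` class in place of Hive's `VectorizedRowBatch` (only `endOfFile` and `size` are modelled). It assumes `size` is never negative. Note also that in the quoted `next` method the original `if` falls through to a `progress` update before returning true, so a direct `return` would need that update moved ahead of it.

```java
class BatchCheck {
    // Hypothetical stand-in for VectorizedRowBatch; only the two fields used by the check.
    static final class Batch {
        final boolean endOfFile;
        final int size;

        Batch(boolean endOfFile, int size) {
            this.endOfFile = endOfFile;
            this.size = size;
        }
    }

    // Shape of the check as written in the diff: bail out with false, otherwise true.
    static boolean hasRowsOriginal(Batch value) {
        if (value == null || value.endOfFile || value.size == 0) {
            return false;
        }
        return true;
    }

    // Reviewer's suggested form: the negated condition returned directly (De Morgan).
    static boolean hasRowsSimplified(Batch value) {
        return value != null && !value.endOfFile && value.size > 0;
    }

    public static void main(String[] args) {
        Batch[] samples = {
            null, new Batch(false, 3), new Batch(true, 3), new Batch(false, 0)
        };
        for (Batch b : samples) {
            System.out.println(hasRowsOriginal(b) + " " + hasRowsSimplified(b));
        }
    }
}
```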



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #68995 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68995/consoleFull)** for PR 13775 at commit [`c24169d`](https://github.com/apache/spark/commit/c24169d513c53eb9887f53749a4a6a4e51351667).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #66436 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66436/consoleFull)** for PR 13775 at commit [`ed780f6`](https://github.com/apache/spark/commit/ed780f66bf191eacdd2b81a2cfff4fbab71f1e4e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83757435
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/VectorizedSparkOrcNewRecordReader.java ---
    @@ -0,0 +1,318 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.nio.charset.StandardCharsets;
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import org.apache.commons.lang.NotImplementedException;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    +
    +import org.apache.spark.sql.catalyst.InternalRow;
    +import org.apache.spark.sql.catalyst.util.ArrayData;
    +import org.apache.spark.sql.catalyst.util.MapData;
    +import org.apache.spark.sql.types.DataType;
    +import org.apache.spark.sql.types.Decimal;
    +import org.apache.spark.unsafe.types.CalendarInterval;
    +import org.apache.spark.unsafe.types.UTF8String;
    +
    +/**
    + * A RecordReader that returns InternalRow for Spark SQL execution.
    + * This reader uses an internal reader that returns Hive's VectorizedRowBatch. An adapter
    + * class is used to return internal row by directly accessing data in column vectors.
    + */
    +public class VectorizedSparkOrcNewRecordReader
    +    extends org.apache.hadoop.mapreduce.RecordReader<NullWritable, InternalRow> {
    +  private final org.apache.hadoop.mapred.RecordReader<NullWritable, VectorizedRowBatch> reader;
    +  private final int numColumns;
    +  private VectorizedRowBatch internalValue;
    +  private float progress = 0.0f;
    +  private List<Integer> columnIDs;
    +
    +  private long numRowsOfBatch = 0;
    +  private int indexOfRow = 0;
    +
    +  private final Row row;
    +
    +  public VectorizedSparkOrcNewRecordReader(
    +      Reader file,
    +      JobConf conf,
    +      FileSplit fileSplit,
    +      List<Integer> columnIDs) throws IOException {
    +    List<OrcProto.Type> types = file.getTypes();
    +    numColumns = (types.size() == 0) ? 0 : types.get(0).getSubtypesCount();
    +    this.reader = new SparkVectorizedOrcRecordReader(file, conf,
    +      new org.apache.hadoop.mapred.FileSplit(fileSplit));
    +
    +    this.columnIDs = new ArrayList<>(columnIDs);
    +    this.internalValue = this.reader.createValue();
    +    this.progress = reader.getProgress();
    +    this.row = new Row(this.internalValue.cols, this.columnIDs);
    +  }
    +
    +  @Override
    +  public void close() throws IOException {
    +    reader.close();
    +  }
    +
    +  @Override
    +  public NullWritable getCurrentKey() throws IOException,
    +      InterruptedException {
    --- End diff --
    
    nit: move `InterruptedException` up to the previous line so the whole `throws` clause sits on one line. This may also apply to other places in the PR, but I am not pointing out each instance.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on the issue:

https://github.com/apache/spark/pull/13775

Earlier this year I spent some time trying out Presto's ORC reader with Spark.

In a standalone benchmark, Presto's ORC reader was 3x faster than the one in Hive. My experimental setup was a CLI wrapped around the ORC reader code in both Hive and Presto; it read a file from local disk and deserialized all the columns.

With that promising result, I tried hooking the Presto reader into Spark. Presto has its own notion of a "page" to manage memory and a "slice" for Unsafe access to data. When I integrated it with Spark, I had to add a shim to convert Spark objects to the corresponding Presto objects, which hurt performance.

I then decided to fork a few classes from Presto so that they work directly with Spark's internal data representations (e.g. `UTF8String`). My final numbers, for end-to-end runs of Spark jobs, ranged from no gain to a 2x improvement. Note that some things Presto's reader supports (namely vectorization and predicate pushdown) were not integrated, as that would have demanded more forking. The measurements were taken on queries that ran over a large table, read all the rows, and wrote them as-is to another table.

Having done all this, my personal take is that forking classes is bad from a maintenance standpoint. Presto's ORC reader is tightly coupled with the engine's internal constructs, and refactoring it to make it generic is non-trivial work. We could think about having a native ORC reader in Spark (just like Presto does), which would be very performant, but the downside is that one has to stay in sync with all upstream changes to the file format as it evolves.

cc @rxin (I recall you were also interested in this at some point)


[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69124/
    Test FAILed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #69133 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69133/consoleFull)** for PR 13775 at commit [`55bb19f`](https://github.com/apache/spark/commit/55bb19f91658767acf08e06ee7e64db27a7222aa).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63585/
    Test PASSed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #63585 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63585/consoleFull)** for PR 13775 at commit [`06066eb`](https://github.com/apache/spark/commit/06066eb241eb97c4cf363adff2b0160b8a423ab8).



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #63585 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63585/consoleFull)** for PR 13775 at commit [`06066eb`](https://github.com/apache/spark/commit/06066eb241eb97c4cf363adff2b0160b8a423ab8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69148/



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by zjffdu <gi...@git.apache.org>.

Github user zjffdu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r68016740
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/SparkVectorizedOrcRecordReader.java ---
    @@ -0,0 +1,191 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    +import org.apache.hadoop.hive.serde2.typeinfo.DecimalTypeInfo;
    +import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.FileSplit;
    +import org.apache.hadoop.mapred.RecordReader;
    +
    +/**
    + * A mapred.RecordReader that returns VectorizedRowBatch.
    + */
    +public class SparkVectorizedOrcRecordReader
    +      implements RecordReader<NullWritable, VectorizedRowBatch> {
    +    private final org.apache.hadoop.hive.ql.io.orc.RecordReader reader;
    +    private final long offset;
    +    private final long length;
    +    private float progress = 0.0f;
    +    private ObjectInspector objectInspector;
    +
    +    SparkVectorizedOrcRecordReader(Reader file, Configuration conf,
    +        FileSplit fileSplit) throws IOException {
    +      this.offset = fileSplit.getStart();
    +      this.length = fileSplit.getLength();
    +      this.objectInspector = file.getObjectInspector();
    +      this.reader = OrcInputFormat.createReaderFromFile(file, conf, this.offset,
    +        this.length);
    +      this.progress = reader.getProgress();
    +    }
    +
    +    /**
    +     * Create a ColumnVector based on given ObjectInspector's type info.
    +     *
    +     * @param inspector ObjectInspector
    +     */
    +    private ColumnVector createColumnVector(ObjectInspector inspector) {
    +      switch(inspector.getCategory()) {
    +      case PRIMITIVE:
    +        {
    +          PrimitiveTypeInfo primitiveTypeInfo =
    +            (PrimitiveTypeInfo) ((PrimitiveObjectInspector)inspector).getTypeInfo();
    +          switch(primitiveTypeInfo.getPrimitiveCategory()) {
    +            case BOOLEAN:
    +            case BYTE:
    +            case SHORT:
    +            case INT:
    +            case LONG:
    +            case DATE:
    +            case INTERVAL_YEAR_MONTH:
    +              return new LongColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +            case FLOAT:
    +            case DOUBLE:
    +              return new DoubleColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +            case BINARY:
    +            case STRING:
    +            case CHAR:
    +            case VARCHAR:
    +              BytesColumnVector column = new BytesColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              column.initBuffer();
    +              return column;
    +            case DECIMAL:
    +              DecimalTypeInfo tInfo = (DecimalTypeInfo) primitiveTypeInfo;
    +              return new DecimalColumnVector(VectorizedRowBatch.DEFAULT_SIZE,
    +                  tInfo.precision(), tInfo.scale());
    +            default:
    +              throw new RuntimeException("Vectorization is not supported for datatype:"
    +                  + primitiveTypeInfo.getPrimitiveCategory());
    +          }
    +        }
    +      default:
    +        throw new RuntimeException("Vectorization is not supported for datatype:"
    +            + inspector.getCategory());
    +      }
    +    }
    +
    +    /**
    +     * Walk through the object inspector and add column vectors
    +     *
    +     * @param oi StructObjectInspector
    +     * @param cvList ColumnVectors are populated in this list
    +     */
    +    private void allocateColumnVector(StructObjectInspector oi,
    +        List<ColumnVector> cvList) throws HiveException {
    +      if (cvList == null) {
    +        throw new HiveException("Null columnvector list");
    +      }
    +      if (oi == null) {
    +        return;
    +      }
    +      final List<? extends StructField> fields = oi.getAllStructFieldRefs();
    +      for(StructField field : fields) {
    +        ObjectInspector fieldObjectInspector = field.getFieldObjectInspector();
    +        cvList.add(createColumnVector(fieldObjectInspector));
    +      }
    +    }
    +
    +    /**
    +     * Create VectorizedRowBatch from ObjectInspector
    +     *
    +     * @param oi
    +     * @return
    +     * @throws HiveException
    +     */
    +    private VectorizedRowBatch constructVectorizedRowBatch(
    +        StructObjectInspector oi) throws HiveException {
    --- End diff --
    
    Should this throw IOException instead of HiveException?
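
    One way to read this suggestion is to keep the Hive-specific exception internal and surface only IOException to callers, preserving the original failure as the cause. The sketch below is a plain-Java illustration under that assumption: `HiveLikeException`, `allocate` and `allocateOrIo` are hypothetical stand-ins, not names from the patch or from Hive.

```java
import java.io.IOException;

public class ExceptionWrapSketch {
  // Hypothetical stand-in for a checked, Hive-specific exception.
  static class HiveLikeException extends Exception {
    HiveLikeException(String msg) { super(msg); }
  }

  // Mimics an internal helper that fails with the Hive-specific exception.
  static int allocate(Object list) throws HiveLikeException {
    if (list == null) {
      throw new HiveLikeException("Null columnvector list");
    }
    return 0;
  }

  // Caller-facing method: exposes only IOException and keeps the cause.
  static int allocateOrIo(Object list) throws IOException {
    try {
      return allocate(list);
    } catch (HiveLikeException e) {
      throw new IOException(e);
    }
  }

  public static void main(String[] args) {
    try {
      allocateOrIo(null);
    } catch (IOException e) {
      // The original failure is still reachable through getCause().
      System.out.println("cause: " + e.getCause().getMessage());
    }
  }
}
```

    Wrapping keeps the reader's signature in line with the `throws IOException` already used by the constructor, at the cost of one extra layer in the stack trace.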



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #63498 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63498/consoleFull)** for PR 13775 at commit [`b067658`](https://github.com/apache/spark/commit/b067658c53a3252f0a8a288e09b07feaf0ace8d4).



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    @viirya when you construct a performance benchmark, you want to minimize the overhead of anything outside the code path you are testing. In this case, a lot of the time was spent in the collect operation.
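
    To illustrate the point outside Spark, here is a minimal plain-Java sketch. It is only an analogy: `readRow` is a hypothetical stand-in for the reader, "collecting" materializes every row in a list the way `collect()` materializes rows on the driver, and "aggregating" folds each row into a running sum so that the timed region is dominated by the read itself. None of these names come from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

public class BenchmarkOverheadSketch {
  // Hypothetical stand-in for the reader: produces row i with no I/O.
  static long readRow(int i) {
    return (long) i * 31 + 7;
  }

  // Variant A: materialize every row first (analogous to collect()), then sum.
  static long consumeByCollecting(int n) {
    List<Long> rows = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      rows.add(readRow(i));
    }
    long sum = 0;
    for (long r : rows) {
      sum += r;
    }
    return sum;
  }

  // Variant B: fold each row into an aggregate as soon as it is read.
  static long consumeByAggregating(int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
      sum += readRow(i);
    }
    return sum;
  }

  // Times one run of the body and uses its result so the work is not discarded.
  static long timeNanos(LongSupplier body) {
    long start = System.nanoTime();
    long result = body.getAsLong();
    long elapsed = System.nanoTime() - start;
    if (result == Long.MIN_VALUE) {
      System.out.println(result);
    }
    return elapsed;
  }

  public static void main(String[] args) {
    int n = 500 << 12; // same row count as the benchmark in the PR description
    long collectNanos = timeNanos(() -> consumeByCollecting(n));
    long aggregateNanos = timeNanos(() -> consumeByAggregating(n));
    System.out.println("collect-style:   " + collectNanos / 1_000_000 + " ms");
    System.out.println("aggregate-style: " + aggregateNanos / 1_000_000 + " ms");
  }
}
```

    Both variants read the same rows and return the same sum, so any difference in the timings comes from materialization rather than from the read path. A single un-warmed run like this is noisy; a real benchmark would repeat each case several times.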




[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #68990 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68990/consoleFull)** for PR 13775 at commit [`3895a98`](https://github.com/apache/spark/commit/3895a980a2aae2dc7dedbf0797bb8a37d089e683).



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #60828 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60828/consoleFull)** for PR 13775 at commit [`20b832e`](https://github.com/apache/spark/commit/20b832ee4e5ed4e794cc1bc8f2f67cce973759e0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by dafrista <gi...@git.apache.org>.

Github user dafrista commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r68135229
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/VectorizedSparkOrcNewRecordReader.java ---
    @@ -0,0 +1,320 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.nio.charset.StandardCharsets;
    +import java.util.List;
    +
    +import org.apache.commons.lang.NotImplementedException;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    +
    +import org.apache.spark.sql.catalyst.InternalRow;
    +import org.apache.spark.sql.catalyst.util.ArrayData;
    +import org.apache.spark.sql.catalyst.util.MapData;
    +import org.apache.spark.sql.types.DataType;
    +import org.apache.spark.sql.types.Decimal;
    +import org.apache.spark.unsafe.types.CalendarInterval;
    +import org.apache.spark.unsafe.types.UTF8String;
    +
    +/**
    + * A RecordReader that returns InternalRow for Spark SQL execution.
    + * This reader uses an internal reader that returns Hive's VectorizedRowBatch. An adapter
    + * class is used to return internal row by directly accessing data in column vectors.
    + */
    +public class VectorizedSparkOrcNewRecordReader
    +    extends org.apache.hadoop.mapreduce.RecordReader<NullWritable, InternalRow> {
    +  private final org.apache.hadoop.mapred.RecordReader<NullWritable, VectorizedRowBatch> reader;
    +  private final int numColumns;
    +  private VectorizedRowBatch internalValue;
    +  private float progress = 0.0f;
    +  private List<Integer> columnIDs;
    +
    +  private long numRowsOfBatch = 0;
    +  private int indexOfRow = 0;
    +
    +  private final Row row;
    +
    +  public VectorizedSparkOrcNewRecordReader(
    +      Reader file,
    +      JobConf conf,
    +      FileSplit fileSplit,
    +      List<Integer> columnIDs) throws IOException {
    +    List<OrcProto.Type> types = file.getTypes();
    +    numColumns = (types.size() == 0) ? 0 : types.get(0).getSubtypesCount();
    +    this.reader = new SparkVectorizedOrcRecordReader(file, conf,
    +      new org.apache.hadoop.mapred.FileSplit(fileSplit));
    +
    +    this.columnIDs = columnIDs;
    +    this.internalValue = this.reader.createValue();
    +    this.progress = reader.getProgress();
    +    this.row = new Row(this.internalValue.cols, columnIDs);
    +  }
    +
    +  @Override
    +  public void close() throws IOException {
    +    reader.close();
    +  }
    +
    +  @Override
    +  public NullWritable getCurrentKey() throws IOException,
    +      InterruptedException {
    +    return NullWritable.get();
    +  }
    +
    +  @Override
    +  public InternalRow getCurrentValue() throws IOException,
    +      InterruptedException {
    +    if (indexOfRow >= numRowsOfBatch) {
    +      return null;
    +    }
    +    row.rowId = indexOfRow;
    +    indexOfRow++;
    +
    +    return row;
    +  }
    +
    +  @Override
    +  public float getProgress() throws IOException, InterruptedException {
    +    return progress;
    +  }
    +
    +  @Override
    +  public void initialize(InputSplit split, TaskAttemptContext context)
    +      throws IOException, InterruptedException {
    +  }
    +
    +  @Override
    +  public boolean nextKeyValue() throws IOException, InterruptedException {
    +    if (indexOfRow == numRowsOfBatch && progress < 1.0f) {
    +      if (reader.next(NullWritable.get(), internalValue)) {
    +        if (internalValue.endOfFile) {
    +          progress = 1.0f;
    +          numRowsOfBatch = 0;
    +          indexOfRow = 0;
    +          return false;
    +        } else {
    +          assert internalValue.numCols == numColumns : "Incorrect number of columns in OrcBatch";
    +          numRowsOfBatch = internalValue.count();
    +          indexOfRow = 0;
    +          progress = reader.getProgress();
    +        }
    +        return true;
    +      } else {
    +        return false;
    +      }
    +    } else {
    +      if (indexOfRow < numRowsOfBatch) {
    +        return true;
    +      } else {
    +        return false;
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Adapter class to return an internal row.
    +   */
    +  public static final class Row extends InternalRow {
    +    protected int rowId;
    +    private List<Integer> columnIDs;
    +    private final ColumnVector[] columns;
    +
    +    private Row(ColumnVector[] columns, List<Integer> columnIDs) {
    +      this.columns = columns;
    +      this.columnIDs = columnIDs;
    +    }
    +
    +    @Override
    +    public int numFields() { return columnIDs.size(); }
    +
    +    @Override
    +    public boolean anyNull() {
    +      for (int i = 0; i < columns.length; i++) {
    +        if (columnIDs.contains(i)) {
    +          if (columns[i].isRepeating && columns[i].isNull[0]) {
    +            return true;
    +          } else if (!columns[i].isRepeating && columns[i].isNull[rowId]) {
    +            return true;
    +          }
    +        }
    +      }
    +      return false;
    +    }
    +
    +    @Override
    +    public boolean isNullAt(int ordinal) {
    +      ColumnVector col = columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return col.isNull[0];
    +      } else {
    +        return col.isNull[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public boolean getBoolean(int ordinal) {
    +      LongColumnVector col = (LongColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return col.vector[0] > 0;
    +      } else {
    +        return col.vector[rowId] > 0;
    +      }
    +    }
    +
    +    @Override
    +    public byte getByte(int ordinal) {
    +      LongColumnVector col = (LongColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return (byte)col.vector[0];
    +      } else {
    +        return (byte)col.vector[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public short getShort(int ordinal) {
    +      LongColumnVector col = (LongColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return (short)col.vector[0];
    +      } else {
    +        return (short)col.vector[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public int getInt(int ordinal) {
    +      LongColumnVector col = (LongColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return (int)col.vector[0];
    +      } else {
    +        return (int)col.vector[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public long getLong(int ordinal) {
    +      LongColumnVector col = (LongColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return (long)col.vector[0];
    +      } else {
    +        return (long)col.vector[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public float getFloat(int ordinal) {
    +      DoubleColumnVector col = (DoubleColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return (float)col.vector[0];
    +      } else {
    +        return (float)col.vector[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public double getDouble(int ordinal) {
    +      DoubleColumnVector col = (DoubleColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return (double)col.vector[0];
    +      } else {
    +        return (double)col.vector[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public Decimal getDecimal(int ordinal, int precision, int scale) {
    +      DecimalColumnVector col = (DecimalColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return Decimal.apply(col.vector[0].getHiveDecimal().bigDecimalValue(), precision, scale);
    +      } else {
    +        return Decimal.apply(col.vector[rowId].getHiveDecimal().bigDecimalValue(),
    +          precision, scale);
    +      }
    +    }
    +
    +    @Override
    +    public UTF8String getUTF8String(int ordinal) {
    +      BytesColumnVector bv = ((BytesColumnVector)columns[columnIDs.get(ordinal)]);
    +      String str = null;
    +      if (bv.isRepeating) {
    +        str = new String(bv.vector[0], bv.start[0], bv.length[0], StandardCharsets.UTF_8);
    --- End diff --
    
    Can creation of a String be avoided by using `UTF8String.fromBytes`? My understanding is that the encode/decode in `new String(..)` and `UTF8String.fromString` can add up.
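
    A minimal sketch of the cost in question, using only the JDK: the round trip decodes a UTF-8 slice into a `String` and encodes it again, while the direct path just takes the byte range. `viaString` and `direct` are hypothetical helper names; in the patch the direct path would presumably be `UTF8String.fromBytes(bv.vector[0], bv.start[0], bv.length[0])`, for which the plain array copy here is only a stand-in.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8SliceSketch {
  // Round trip: decode the slice to a String, then encode it back to UTF-8.
  static byte[] viaString(byte[] buf, int start, int length) {
    String s = new String(buf, start, length, StandardCharsets.UTF_8);
    return s.getBytes(StandardCharsets.UTF_8);
  }

  // Direct: take the byte range as-is, with no decoding or encoding step.
  static byte[] direct(byte[] buf, int start, int length) {
    return Arrays.copyOfRange(buf, start, start + length);
  }

  public static void main(String[] args) {
    byte[] payload = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
    // Embed the payload in a larger buffer, as a column vector would.
    byte[] buf = new byte[payload.length + 4];
    System.arraycopy(payload, 0, buf, 2, payload.length);

    byte[] a = viaString(buf, 2, payload.length);
    byte[] b = direct(buf, 2, payload.length);
    System.out.println("same bytes: " + Arrays.equals(a, b));
  }
}
```

    For well-formed UTF-8 input both paths yield identical bytes, so the decode/encode pair is pure overhead on a per-row getter.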



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Thank you! First, I'll try to rebase and run it with my `OrcReadBenchmark` (which is similar to `ParquetReadBenchmark`).



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    I agree that from a maintenance standpoint forking the classes is bad. But if we really want to have the one in Spark, I would like to help too. :)



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Would PR https://github.com/apache/spark/pull/13676 help to improve performance?



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83752583
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/SparkVectorizedOrcRecordReader.java ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    +import org.apache.hadoop.hive.serde2.typeinfo.DecimalTypeInfo;
    +import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.FileSplit;
    +import org.apache.hadoop.mapred.RecordReader;
    +
    +/**
    + * A mapred.RecordReader that returns VectorizedRowBatch.
    + */
    +public class SparkVectorizedOrcRecordReader
    +      implements RecordReader<NullWritable, VectorizedRowBatch> {
    +    private final org.apache.hadoop.hive.ql.io.orc.RecordReader reader;
    +    private final long offset;
    +    private final long length;
    +    private float progress = 0.0f;
    +    private ObjectInspector objectInspector;
    +
    +    SparkVectorizedOrcRecordReader(Reader file, Configuration conf,
    +        FileSplit fileSplit) throws IOException {
    +      this.offset = fileSplit.getStart();
    +      this.length = fileSplit.getLength();
    +      this.objectInspector = file.getObjectInspector();
    +      this.reader = OrcInputFormat.createReaderFromFile(file, conf, this.offset,
    +        this.length);
    +      this.progress = reader.getProgress();
    +    }
    +
    +    /**
    +     * Create a ColumnVector based on given ObjectInspector's type info.
    +     *
    +     * @param inspector ObjectInspector
    +     */
    +    private ColumnVector createColumnVector(ObjectInspector inspector) {
    +      switch(inspector.getCategory()) {
    +        case PRIMITIVE:
    +          {
    +            PrimitiveTypeInfo primitiveTypeInfo =
    +              (PrimitiveTypeInfo) ((PrimitiveObjectInspector)inspector).getTypeInfo();
    +            switch(primitiveTypeInfo.getPrimitiveCategory()) {
    +              case BOOLEAN:
    +              case BYTE:
    +              case SHORT:
    +              case INT:
    +              case LONG:
    +              case DATE:
    +              case INTERVAL_YEAR_MONTH:
    +                return new LongColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case FLOAT:
    +              case DOUBLE:
    +                return new DoubleColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case BINARY:
    +              case STRING:
    +              case CHAR:
    +              case VARCHAR:
    +                BytesColumnVector column = new BytesColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +                column.initBuffer();
    +                return column;
    +              case DECIMAL:
    +                DecimalTypeInfo tInfo = (DecimalTypeInfo) primitiveTypeInfo;
    +                return new DecimalColumnVector(VectorizedRowBatch.DEFAULT_SIZE,
    +                    tInfo.precision(), tInfo.scale());
    +              default:
    +                throw new RuntimeException("Vectorization is not supported for datatype:"
    +                    + primitiveTypeInfo.getPrimitiveCategory());
    +            }
    +          }
    +        default:
    +          throw new RuntimeException("Vectorization is not supported for datatype:"
    +              + inspector.getCategory());
    +      }
    +    }
    +
    +    /**
    +     * Walk through the object inspector and add column vectors
    +     *
    +     * @param oi StructObjectInspector
    +     * @param cvList ColumnVectors are populated in this list
    +     */
    +    private void allocateColumnVector(StructObjectInspector oi,
    +        List<ColumnVector> cvList) {
    +      if (cvList == null) {
    --- End diff --
    
    Based on the code below, I don't think `cvList` can ever be null here.
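
To illustrate the point, here is a minimal sketch. It assumes (the calling code is not shown in this excerpt) that the only caller creates the list itself before passing it in. `Allocator` and `Vec` are hypothetical stand-ins for the reader class and Hive's `ColumnVector`. Under that assumption the null branch is unreachable, and a fail-fast precondition states the contract more directly than a silent fallback would.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

final class Allocator {
  /** Hypothetical stand-in for Hive's ColumnVector; it only records a type name. */
  static final class Vec {
    final String typeName;

    Vec(String typeName) {
      this.typeName = typeName;
    }
  }

  /** Adds one vector per field. A null list is treated as a caller bug. */
  static void allocateColumnVector(List<String> fieldTypes, List<Vec> cvList) {
    Objects.requireNonNull(cvList, "cvList must be created by the caller");
    for (String type : fieldTypes) {
      cvList.add(new Vec(type));
    }
  }

  /** The assumed sole caller: it always builds the list before handing it over. */
  static List<Vec> createVectors(List<String> fieldTypes) {
    List<Vec> cvList = new ArrayList<>();
    allocateColumnVector(fieldTypes, cvList);
    return cvList;
  }

  public static void main(String[] args) {
    List<Vec> vectors = createVectors(Arrays.asList("int", "string", "double"));
    System.out.println(vectors.size() + " vectors allocated");
  }
}
```

If another caller that can pass null is added later, the precondition fails immediately instead of hiding the problem.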


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83764569
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
    @@ -131,31 +138,43 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
             val physicalSchema = maybePhysicalSchema.get
             OrcRelation.setRequiredColumns(conf, physicalSchema, requiredSchema)
     
    -        val orcRecordReader = {
    -          val job = Job.getInstance(conf)
    -          FileInputFormat.setInputPaths(job, file.filePath)
    -
    -          val fileSplit = new FileSplit(
    -            new Path(new URI(file.filePath)), file.start, file.length, Array.empty
    -          )
    -          // Custom OrcRecordReader is used to get
    -          // ObjectInspector during recordReader creation itself and can
    -          // avoid NameNode call in unwrapOrcStructs per file.
    -          // Specifically would be helpful for partitioned datasets.
    -          val orcReader = OrcFile.createReader(
    -            new Path(new URI(file.filePath)), OrcFile.readerOptions(conf))
    -          new SparkOrcNewRecordReader(orcReader, conf, fileSplit.getStart, fileSplit.getLength)
    +        val job = Job.getInstance(conf)
    +        FileInputFormat.setInputPaths(job, file.filePath)
    +
    +        val fileSplit = new FileSplit(
    +          new Path(new URI(file.filePath)), file.start, file.length, Array.empty
    +        )
    +        // Custom OrcRecordReader is used to get
    +        // ObjectInspector during recordReader creation itself and can
    +        // avoid NameNode call in unwrapOrcStructs per file.
    +        // Specifically would be helpful for partitioned datasets.
    +        val orcReader = OrcFile.createReader(
    +          new Path(new URI(file.filePath)), OrcFile.readerOptions(conf))
    +
    +        if (enableVectorizedReader) {
    +          val conf = job.getConfiguration.asInstanceOf[JobConf]
    --- End diff --
    
    Why can't you reuse the `conf` defined at line 129 (https://github.com/apache/spark/pull/13775/files#diff-01999ccbf13e95a0ea2d223f69d8ae23R129)?



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83796484
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/VectorizedSparkOrcNewRecordReader.java ---
    @@ -0,0 +1,318 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.nio.charset.StandardCharsets;
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import org.apache.commons.lang.NotImplementedException;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    +
    +import org.apache.spark.sql.catalyst.InternalRow;
    +import org.apache.spark.sql.catalyst.util.ArrayData;
    +import org.apache.spark.sql.catalyst.util.MapData;
    +import org.apache.spark.sql.types.DataType;
    +import org.apache.spark.sql.types.Decimal;
    +import org.apache.spark.unsafe.types.CalendarInterval;
    +import org.apache.spark.unsafe.types.UTF8String;
    +
    +/**
    + * A RecordReader that returns InternalRow for Spark SQL execution.
    + * This reader uses an internal reader that returns Hive's VectorizedRowBatch. An adapter
    + * class is used to return internal row by directly accessing data in column vectors.
    + */
    +public class VectorizedSparkOrcNewRecordReader
    +    extends org.apache.hadoop.mapreduce.RecordReader<NullWritable, InternalRow> {
    +  private final org.apache.hadoop.mapred.RecordReader<NullWritable, VectorizedRowBatch> reader;
    +  private final int numColumns;
    +  private VectorizedRowBatch internalValue;
    +  private float progress = 0.0f;
    +  private List<Integer> columnIDs;
    +
    +  private long numRowsOfBatch = 0;
    +  private int indexOfRow = 0;
    +
    +  private final Row row;
    +
    +  public VectorizedSparkOrcNewRecordReader(
    +      Reader file,
    +      JobConf conf,
    +      FileSplit fileSplit,
    +      List<Integer> columnIDs) throws IOException {
    +    List<OrcProto.Type> types = file.getTypes();
    +    numColumns = (types.size() == 0) ? 0 : types.get(0).getSubtypesCount();
    +    this.reader = new SparkVectorizedOrcRecordReader(file, conf,
    +      new org.apache.hadoop.mapred.FileSplit(fileSplit));
    +
    +    this.columnIDs = new ArrayList<>(columnIDs);
    +    this.internalValue = this.reader.createValue();
    +    this.progress = reader.getProgress();
    +    this.row = new Row(this.internalValue.cols, this.columnIDs);
    +  }
    +
    +  @Override
    +  public void close() throws IOException {
    +    reader.close();
    +  }
    +
    +  @Override
    +  public NullWritable getCurrentKey() throws IOException,
    +      InterruptedException {
    +    return NullWritable.get();
    +  }
    +
    +  @Override
    +  public InternalRow getCurrentValue() throws IOException,
    +      InterruptedException {
    +    if (indexOfRow >= numRowsOfBatch) {
    +      return null;
    +    }
    +    row.rowId = indexOfRow;
    +    indexOfRow++;
    +
    +    return row;
    +  }
    +
    +  @Override
    +  public float getProgress() throws IOException, InterruptedException {
    +    return progress;
    +  }
    +
    +  @Override
    +  public void initialize(InputSplit split, TaskAttemptContext context)
    +      throws IOException, InterruptedException {
    +  }
    +
    +  @Override
    +  public boolean nextKeyValue() throws IOException, InterruptedException {
    +    if (indexOfRow == numRowsOfBatch) {
    +      if (reader.next(NullWritable.get(), internalValue)) {
    +        if (internalValue.endOfFile) {
    +          progress = 1.0f;
    +          numRowsOfBatch = 0;
    +          indexOfRow = 0;
    +          return false;
    +        } else {
    +          assert internalValue.numCols == numColumns : "Incorrect number of columns in OrcBatch";
    +          numRowsOfBatch = internalValue.count();
    +          indexOfRow = 0;
    +          progress = reader.getProgress();
    +        }
    +        return true;
    +      } else {
    +        return false;
    +      }
    +    } else {
    +      if (indexOfRow < numRowsOfBatch) {
    +        return true;
    +      } else {
    +        return false;
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Adapter class to return an internal row.
    +   */
    +  public static final class Row extends InternalRow {
    +    protected int rowId;
    +    private List<Integer> columnIDs;
    +    private final ColumnVector[] columns;
    +
    +    private Row(ColumnVector[] columns, List<Integer> columnIDs) {
    +      this.columns = columns;
    +      this.columnIDs = columnIDs;
    +    }
    +
    +    @Override
    +    public int numFields() { return columnIDs.size(); }
    +
    +    @Override
    +    public boolean anyNull() {
    +      for (int i = 0; i < columns.length; i++) {
    +        if (columnIDs.contains(i)) {
    +          if (columns[i].isRepeating && columns[i].isNull[0]) {
    +            return true;
    +          } else if (!columns[i].isRepeating && columns[i].isNull[rowId]) {
    +            return true;
    +          }
    +        }
    +      }
    +      return false;
    +    }
    +
    +    @Override
    +    public boolean isNullAt(int ordinal) {
    +      ColumnVector col = columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return col.isNull[0];
    +      } else {
    +        return col.isNull[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public boolean getBoolean(int ordinal) {
    +      LongColumnVector col = (LongColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return col.vector[0] > 0;
    +      } else {
    +        return col.vector[rowId] > 0;
    +      }
    +    }
    +
    +    @Override
    +    public byte getByte(int ordinal) {
    +      LongColumnVector col = (LongColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return (byte)col.vector[0];
    +      } else {
    +        return (byte)col.vector[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public short getShort(int ordinal) {
    +      LongColumnVector col = (LongColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return (short)col.vector[0];
    +      } else {
    +        return (short)col.vector[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public int getInt(int ordinal) {
    +      LongColumnVector col = (LongColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return (int)col.vector[0];
    +      } else {
    +        return (int)col.vector[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public long getLong(int ordinal) {
    +      LongColumnVector col = (LongColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return (long)col.vector[0];
    +      } else {
    +        return (long)col.vector[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public float getFloat(int ordinal) {
    +      DoubleColumnVector col = (DoubleColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return (float)col.vector[0];
    +      } else {
    +        return (float)col.vector[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public double getDouble(int ordinal) {
    +      DoubleColumnVector col = (DoubleColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return (double)col.vector[0];
    +      } else {
    +        return (double)col.vector[rowId];
    +      }
    +    }
    +
    +    @Override
    +    public Decimal getDecimal(int ordinal, int precision, int scale) {
    +      DecimalColumnVector col = (DecimalColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        return Decimal.apply(col.vector[0].getHiveDecimal().bigDecimalValue(), precision, scale);
    +      } else {
    +        return Decimal.apply(col.vector[rowId].getHiveDecimal().bigDecimalValue(),
    +          precision, scale);
    +      }
    +    }
    +
    +    @Override
    +    public UTF8String getUTF8String(int ordinal) {
    +      BytesColumnVector bv = ((BytesColumnVector)columns[columnIDs.get(ordinal)]);
    +      if (bv.isRepeating) {
    +        return UTF8String.fromBytes(bv.vector[0], bv.start[0], bv.length[0]);
    +      } else {
    +        return UTF8String.fromBytes(bv.vector[rowId], bv.start[rowId], bv.length[rowId]);
    +      }
    +    }
    +
    +    @Override
    +    public byte[] getBinary(int ordinal) {
    +      BytesColumnVector col = (BytesColumnVector)columns[columnIDs.get(ordinal)];
    +      if (col.isRepeating) {
    +        byte[] binary = new byte[col.length[0]];
    +        System.arraycopy(col.vector[0], col.start[0], binary, 0, binary.length);
    +        return binary;
    +      } else {
    +        byte[] binary = new byte[col.length[rowId]];
    +        System.arraycopy(col.vector[rowId], col.start[rowId], binary, 0, binary.length);
    +        return binary;
    +      }
    +    }
    +
    +    @Override
    +    public CalendarInterval getInterval(int ordinal) {
    +      throw new NotImplementedException();
    --- End diff --
    
    OK.
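
The adapter approach in the diff above, one reusable row object that reads directly from the batch's column vectors, can be sketched without Hive or Spark on the classpath. This is a simplified illustration, not the PR's code: `LongVec` and `RowCursor` are hypothetical stand-ins for `LongColumnVector` and the `Row` adapter. It assumes the convention the diff relies on, namely that when `isRepeating` is set, slot 0 holds the value and null flag for every row in the batch.

```java
/** Hypothetical stand-in for Hive's LongColumnVector: values plus null flags. */
final class LongVec {
  final long[] vector;
  final boolean[] isNull;
  // When true, slot 0 describes every row in the batch.
  boolean isRepeating;

  LongVec(int size) {
    this.vector = new long[size];
    this.isNull = new boolean[size];
  }
}

/** Hypothetical adapter: one reusable object that points at a row of the batch. */
final class RowCursor {
  private final LongVec[] columns;
  // Index of the current row within the batch; the reader advances it.
  int rowId;

  RowCursor(LongVec[] columns) {
    this.columns = columns;
  }

  // Resolve which slot to read, so the getters need no if/else of their own.
  private int slot(LongVec col) {
    return col.isRepeating ? 0 : rowId;
  }

  boolean isNullAt(int ordinal) {
    LongVec col = columns[ordinal];
    return col.isNull[slot(col)];
  }

  long getLong(int ordinal) {
    LongVec col = columns[ordinal];
    return col.vector[slot(col)];
  }

  public static void main(String[] args) {
    LongVec ids = new LongVec(3);
    ids.vector[0] = 10;
    ids.vector[1] = 20;
    ids.vector[2] = 30;

    LongVec constant = new LongVec(3);
    constant.vector[0] = 7;
    constant.isRepeating = true;

    RowCursor row = new RowCursor(new LongVec[] {ids, constant});
    for (row.rowId = 0; row.rowId < 3; row.rowId++) {
      System.out.println(row.getLong(0) + ", " + row.getLong(1));
    }
  }
}
```

Putting the `isRepeating` check in one helper is a design choice of this sketch; the diff repeats the branch in every getter, which behaves the same way.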



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    With support for Spark's ColumnarBatch, the benchmarks show that this vectorized Orc reader improves read performance by 2x to 3x.
    
    I will continue to add more tests.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69131/



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #61384 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61384/consoleFull)** for PR 13775 at commit [`4c14278`](https://github.com/apache/spark/commit/4c14278d067e37cba569de73d24ba8f23c6eb450).



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83753291
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/SparkVectorizedOrcRecordReader.java ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    +import org.apache.hadoop.hive.serde2.typeinfo.DecimalTypeInfo;
    +import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.FileSplit;
    +import org.apache.hadoop.mapred.RecordReader;
    +
    +/**
    + * A mapred.RecordReader that returns VectorizedRowBatch.
    + */
    +public class SparkVectorizedOrcRecordReader
    +      implements RecordReader<NullWritable, VectorizedRowBatch> {
    +    private final org.apache.hadoop.hive.ql.io.orc.RecordReader reader;
    +    private final long offset;
    +    private final long length;
    +    private float progress = 0.0f;
    +    private ObjectInspector objectInspector;
    +
    +    SparkVectorizedOrcRecordReader(Reader file, Configuration conf,
    +        FileSplit fileSplit) throws IOException {
    +      this.offset = fileSplit.getStart();
    +      this.length = fileSplit.getLength();
    +      this.objectInspector = file.getObjectInspector();
    +      this.reader = OrcInputFormat.createReaderFromFile(file, conf, this.offset,
    +        this.length);
    +      this.progress = reader.getProgress();
    +    }
    +
    +    /**
    +     * Create a ColumnVector based on given ObjectInspector's type info.
    +     *
    +     * @param inspector ObjectInspector
    +     */
    +    private ColumnVector createColumnVector(ObjectInspector inspector) {
    +      switch(inspector.getCategory()) {
    +        case PRIMITIVE:
    +          {
    +            PrimitiveTypeInfo primitiveTypeInfo =
    +              (PrimitiveTypeInfo) ((PrimitiveObjectInspector)inspector).getTypeInfo();
    +            switch(primitiveTypeInfo.getPrimitiveCategory()) {
    +              case BOOLEAN:
    +              case BYTE:
    +              case SHORT:
    +              case INT:
    +              case LONG:
    +              case DATE:
    +              case INTERVAL_YEAR_MONTH:
    +                return new LongColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case FLOAT:
    +              case DOUBLE:
    +                return new DoubleColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +              case BINARY:
    +              case STRING:
    +              case CHAR:
    +              case VARCHAR:
    +                BytesColumnVector column = new BytesColumnVector(VectorizedRowBatch.DEFAULT_SIZE);
    +                column.initBuffer();
    +                return column;
    +              case DECIMAL:
    +                DecimalTypeInfo tInfo = (DecimalTypeInfo) primitiveTypeInfo;
    --- End diff --
    
    `tInfo` -> `decimalTypeInfo`
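
The `createColumnVector` switch in the diff above groups Hive primitive categories by the physical vector that stores them. As a compact illustration of that grouping only, here is a sketch in which the hypothetical `Category` and `VectorKind` enums stand in for Hive's `PrimitiveCategory` and the concrete `ColumnVector` subclasses. The category-to-vector mapping is copied from the diff; everything else is an assumption made for the example.

```java
final class VectorKinds {
  /** Hypothetical subset of Hive's primitive categories, enough for the example. */
  enum Category {
    BOOLEAN, BYTE, SHORT, INT, LONG, DATE, INTERVAL_YEAR_MONTH,
    FLOAT, DOUBLE, BINARY, STRING, CHAR, VARCHAR, DECIMAL,
    TIMESTAMP // has no case in the diff's switch, so it reaches the default branch
  }

  /** Hypothetical labels for the four vector classes the diff instantiates. */
  enum VectorKind { LONG, DOUBLE, BYTES, DECIMAL }

  static VectorKind vectorKindFor(Category category) {
    switch (category) {
      case BOOLEAN:
      case BYTE:
      case SHORT:
      case INT:
      case LONG:
      case DATE:
      case INTERVAL_YEAR_MONTH:
        // Integer-like values all share the long-backed vector.
        return VectorKind.LONG;
      case FLOAT:
      case DOUBLE:
        return VectorKind.DOUBLE;
      case BINARY:
      case STRING:
      case CHAR:
      case VARCHAR:
        return VectorKind.BYTES;
      case DECIMAL:
        return VectorKind.DECIMAL;
      default:
        throw new UnsupportedOperationException(
            "Vectorization is not supported for datatype: " + category);
    }
  }

  public static void main(String[] args) {
    System.out.println(vectorKindFor(Category.DATE));
    System.out.println(vectorKindFor(Category.VARCHAR));
  }
}
```

The sketch leaves out what the real code must also handle: the decimal vector takes a precision and scale from the type info, and the bytes vector needs its buffer initialised before use.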



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    @rxin @hvanhovell Are you available to review this, or should it wait until after the 2.0 release?



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    @viirya could you re-run the benchmarks without calling `collect()`? Do a count or a simple aggregate instead; `collect()` spends a tonne of time serializing results from `InternalRow` to `Row`.
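
The reasoning behind this request can be shown with a small model. The sketch below is plain Java, not Spark code, and every name in it (`ScanModel`, `ReusedRow`, `ExternalRow`) is hypothetical. It assumes a scan that hands out a single reused row object, much as the reader's adapter row does. A collect-style body has to copy every row into a new external object, while a count or sum consumes rows in place, so the latter mostly measures the scan itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class ScanModel {
  /** Mutable row reused across the whole scan. */
  static final class ReusedRow {
    int id;
    String name;
  }

  /** Immutable copy of one row: what a collect-style action builds per row. */
  static final class ExternalRow {
    final int id;
    final String name;

    ExternalRow(int id, String name) {
      this.id = id;
      this.name = name;
    }
  }

  /** Hypothetical scan: feeds n rows to the consumer through one reused object. */
  static void scan(int n, Consumer<ReusedRow> consumer) {
    ReusedRow row = new ReusedRow();
    for (int i = 0; i < n; i++) {
      row.id = i;
      row.name = Integer.toString(i);
      consumer.accept(row);
    }
  }

  /** Collect-style body: one extra object per row on top of the scan. */
  static List<ExternalRow> collect(int n) {
    List<ExternalRow> out = new ArrayList<>(n);
    scan(n, r -> out.add(new ExternalRow(r.id, r.name)));
    return out;
  }

  /** Count-style body: consumes rows in place, nothing is copied. */
  static long count(int n) {
    long[] counter = {0L};
    scan(n, r -> counter[0]++);
    return counter[0];
  }

  /** Simple aggregate: touches a column, still copies nothing. */
  static long sumIds(int n) {
    long[] sum = {0L};
    scan(n, r -> sum[0] += r.id);
    return sum[0];
  }

  public static void main(String[] args) {
    int n = 1000;
    System.out.println("collected rows: " + collect(n).size());
    System.out.println("count: " + count(n));
    System.out.println("sum of ids: " + sumIds(n));
  }
}
```

In the benchmark itself this would mean replacing the `collect()` call with a count or a simple aggregate over the table; the exact query is left to the author.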



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    retest this please.



[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13775#discussion_r83759570
  
    --- Diff: sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/VectorizedSparkOrcNewRecordReader.java ---
    @@ -0,0 +1,318 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.hadoop.hive.ql.io.orc;
    +
    +import java.io.IOException;
    +import java.nio.charset.StandardCharsets;
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import org.apache.commons.lang.NotImplementedException;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    +import org.apache.hadoop.io.NullWritable;
    +import org.apache.hadoop.mapred.JobConf;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    +
    +import org.apache.spark.sql.catalyst.InternalRow;
    +import org.apache.spark.sql.catalyst.util.ArrayData;
    +import org.apache.spark.sql.catalyst.util.MapData;
    +import org.apache.spark.sql.types.DataType;
    +import org.apache.spark.sql.types.Decimal;
    +import org.apache.spark.unsafe.types.CalendarInterval;
    +import org.apache.spark.unsafe.types.UTF8String;
    +
    +/**
    + * A RecordReader that returns InternalRow for Spark SQL execution.
    + * This reader uses an internal reader that returns Hive's VectorizedRowBatch. An adapter
    + * class is used to return internal row by directly accessing data in column vectors.
    + */
    +public class VectorizedSparkOrcNewRecordReader
    +    extends org.apache.hadoop.mapreduce.RecordReader<NullWritable, InternalRow> {
    +  private final org.apache.hadoop.mapred.RecordReader<NullWritable, VectorizedRowBatch> reader;
    +  private final int numColumns;
    +  private VectorizedRowBatch internalValue;
    +  private float progress = 0.0f;
    +  private List<Integer> columnIDs;
    +
    +  private long numRowsOfBatch = 0;
    +  private int indexOfRow = 0;
    +
    +  private final Row row;
    +
    +  public VectorizedSparkOrcNewRecordReader(
    +      Reader file,
    +      JobConf conf,
    +      FileSplit fileSplit,
    +      List<Integer> columnIDs) throws IOException {
    +    List<OrcProto.Type> types = file.getTypes();
    +    numColumns = (types.size() == 0) ? 0 : types.get(0).getSubtypesCount();
    +    this.reader = new SparkVectorizedOrcRecordReader(file, conf,
    +      new org.apache.hadoop.mapred.FileSplit(fileSplit));
    +
    +    this.columnIDs = new ArrayList<>(columnIDs);
    +    this.internalValue = this.reader.createValue();
    +    this.progress = reader.getProgress();
    +    this.row = new Row(this.internalValue.cols, this.columnIDs);
    +  }
    +
    +  @Override
    +  public void close() throws IOException {
    +    reader.close();
    +  }
    +
    +  @Override
    +  public NullWritable getCurrentKey() throws IOException,
    +      InterruptedException {
    +    return NullWritable.get();
    +  }
    +
    +  @Override
    +  public InternalRow getCurrentValue() throws IOException,
    +      InterruptedException {
    +    if (indexOfRow >= numRowsOfBatch) {
    +      return null;
    +    }
    +    row.rowId = indexOfRow;
    +    indexOfRow++;
    +
    +    return row;
    +  }
    +
    +  @Override
    +  public float getProgress() throws IOException, InterruptedException {
    +    return progress;
    +  }
    +
    +  @Override
    +  public void initialize(InputSplit split, TaskAttemptContext context)
    +      throws IOException, InterruptedException {
    +  }
    +
    +  @Override
    +  public boolean nextKeyValue() throws IOException, InterruptedException {
    +    if (indexOfRow == numRowsOfBatch) {
    +      if (reader.next(NullWritable.get(), internalValue)) {
    +        if (internalValue.endOfFile) {
    +          progress = 1.0f;
    +          numRowsOfBatch = 0;
    +          indexOfRow = 0;
    +          return false;
    +        } else {
    +          assert internalValue.numCols == numColumns : "Incorrect number of columns in OrcBatch";
    +          numRowsOfBatch = internalValue.count();
    +          indexOfRow = 0;
    +          progress = reader.getProgress();
    +        }
    +        return true;
    +      } else {
    +        return false;
    +      }
    +    } else {
    +      if (indexOfRow < numRowsOfBatch) {
    +        return true;
    +      } else {
    +        return false;
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Adapter class to return an internal row.
    +   */
    +  public static final class Row extends InternalRow {
    +    protected int rowId;
    +    private List<Integer> columnIDs;
    +    private final ColumnVector[] columns;
    +
    +    private Row(ColumnVector[] columns, List<Integer> columnIDs) {
    +      this.columns = columns;
    +      this.columnIDs = columnIDs;
    +    }
    +
    +    @Override
    +    public int numFields() { return columnIDs.size(); }
    +
    +    @Override
    +    public boolean anyNull() {
    +      for (int i = 0; i < columns.length; i++) {
    +        if (columnIDs.contains(i)) {
    +          if (columns[i].isRepeating && columns[i].isNull[0]) {
    +            return true;
    +          } else if (!columns[i].isRepeating && columns[i].isNull[rowId]) {
    --- End diff --
    
    This if-else reads `columns[i].isRepeating` twice. You could either save it to a local variable or add one more level of branching.
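
A minimal sketch of the first option (save to a local variable), not the code from the PR. `Col` is a hypothetical stand-in for Hive's `ColumnVector` that models only the `isRepeating` and `isNull` fields read by the check, and `AnyNullSketch` is an illustrative name. It assumes, as the quoted diff does, that a repeating vector keeps its null flag at index 0.

```java
import java.util.Arrays;
import java.util.List;

class AnyNullSketch {
  /** Hypothetical stand-in for Hive's ColumnVector: only the two fields the check reads. */
  static final class Col {
    final boolean isRepeating;
    final boolean[] isNull;

    Col(boolean isRepeating, boolean[] isNull) {
      this.isRepeating = isRepeating;
      this.isNull = isNull;
    }
  }

  /** Returns true if any selected column is null at the given row. */
  static boolean anyNull(Col[] columns, List<Integer> columnIDs, int rowId) {
    for (int i = 0; i < columns.length; i++) {
      if (columnIDs.contains(i)) {
        Col col = columns[i];
        // Read isRepeating once: a repeating vector stores its null flag at index 0.
        int idx = col.isRepeating ? 0 : rowId;
        if (col.isNull[idx]) {
          return true;
        }
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Col[] cols = {
      new Col(false, new boolean[] {false, true}),  // per-row null flags
      new Col(true, new boolean[] {false})          // repeating, non-null
    };
    List<Integer> all = Arrays.asList(0, 1);
    System.out.println(anyNull(cols, all, 0));
    System.out.println(anyNull(cols, all, 1));
  }
}
```

Selecting the index once keeps the behaviour of the original two-branch condition while touching `isRepeating` a single time per column.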


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #69082 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69082/consoleFull)** for PR 13775 at commit [`c297678`](https://github.com/apache/spark/commit/c2976788255588d66ad2527646e0719e32bdf182).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class OrcColumnVector extends org.apache.spark.sql.execution.vectorized.ColumnVector`



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    **[Test build #66424 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66424/consoleFull)** for PR 13775 at commit [`ed780f6`](https://github.com/apache/spark/commit/ed780f66bf191eacdd2b81a2cfff4fbab71f1e4e).



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13775
  
    Merged build finished. Test FAILed.

