You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by nongli <gi...@git.apache.org> on 2016/01/05 09:20:30 UTC

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

GitHub user nongli opened a pull request:

    https://github.com/apache/spark/pull/10593

    [SPARK-12644][SQL] Update parquet reader to be vectorized.

    This inlines a few of the Parquet decoders and adds vectorized APIs to support decoding in batch.
    There are a few particulars in the Parquet encodings that make this much more efficient. In
    particular, RLE encodings are very well suited for batch decoding. The Parquet 2.0 encodings are
    also very suited for this.
    
    This is a work in progress and does not affect the current execution. In subsequent patches, we will
    support more encodings and types before enabling this.
    
    Simple benchmarks indicate this can decode single ints about > 3x faster.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nongli/spark spark-12644

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10593.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10593
    
----
commit 7eeff58298ceac076779a5cae05ca674ed0ac51a
Author: Nong <no...@gmail.com>
Date:   2015-12-31T22:45:30Z

    [SPARK-12636][SQL] Update UnsafeRowParquetRecordReader to support reading paths directly.
    
    As noted in the code, this change is to make this componenet easier to
    test in isolation.

commit 22afd1f0115b86cdb5ba661dd2c0714ff6a4243b
Author: Nong <no...@gmail.com>
Date:   2016-01-01T00:26:34Z

    [SPARK-12640][SQL] Add simple benchmarking utility class and add Parquet scan benchmarks.
    
    We've run benchmarks ad hoc to measure the scanner performance. We will continue to invest in this
    and it makes sense to get these benchmarks into code. This adds a simple benchmarking utility to do
    this.

commit 3e41ed43ebc16f4ea0f2a642dbf3a5e40a8bd0d9
Author: Nong <no...@gmail.com>
Date:   2016-01-01T05:12:44Z

    [SPARK-12635][SQL] Add ColumnarBatch, an in memory columnar format for execution.
    
    There are many potential benefits of having an efficient in memory columnar format as an alternate
    to UnsafeRow. This patch introduces ColumnarBatch/ColumnarVector which starts this effort. The
    remaining implementation can be done as follow up patches.
    
    As stated in the in the JIRA, there are useful external components that operate on memory in a
    simple columnar format. ColumnarBatch would serve that purpose and could server as a
    zero-serialization/zero-copy exchange for this use case.
    
    This patch supports running the underlying data either on heap or off heap. On heap runs a bit
    faster but we would need offheap for zero-copy exchanges. Currently, this mode is hidden behind one
    interface (ColumnVector).
    
    This differs from Parquet or the existing columnar cache because this is *not* intended to be used
    as a storage format. The focus is entirely on CPU efficiency as we expect to only have 1 of these
    batches in memory per task.

commit d99659d89a7709df8223ab86b1edd244b1e63086
Author: Nong <no...@gmail.com>
Date:   2016-01-01T07:28:06Z

    [SPARK-12644][SQL] Update parquet reader to be vectorized.
    
    This inlines a few of the Parquet decoders and adds vectorized APIs to support decoding in batch.
    There are a few particulars in the Parquet encodings that make this much more efficient. In
    particular, RLE encodings are very well suited for batch decoding. The Parquet 2.0 encodings are
    also very suited for this.
    
    This is a work in progress and does not affect the current execution. In subsequent patches, we will
    support more encodings and types before enabling this.
    
    Simple benchmarks indicate this can decode single ints about > 3x faster.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-171816015
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49413/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-171735222
  
    @maropu I agree. We should probably think about updating the data sources API to support decoding in batch. I think starting with just parquet will give us a better sense of the requirements for that API.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-168934499
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r49636533
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadBenchmark.scala ---
    @@ -87,15 +91,62 @@ object ParquetReadBenchmark {
                   if (!record.isNullAt(0)) sum += record.getInt(0)
                 }
                 reader.close()
    -        }}
    +          }
    +        }
    +
    +        // Driving the parquet reader in batch mode directly.
    +        benchmark.addCase("ParquetReader(Batched)") { num =>
    +          var sum = 0L
    +          files.map(_.asInstanceOf[String]).foreach { p =>
    +            val reader = new UnsafeRowParquetRecordReader
    +            try {
    +              reader.initialize(p, ("id" :: Nil).asJava)
    +              val batch = reader.resultBatch()
    +              val col = batch.column(0)
    +              while (reader.nextBatch()) {
    +                val numRows = batch.numRows()
    +                var i = 0
    +                while (i < numRows) {
    +                  if (!col.getIsNull(i)) sum += col.getInt(i)
    +                  i += 1
    +                }
    +              }
    +            } finally {
    +              reader.close()
    +            }
    +          }
    +        }
    +
    +        // Decoding in vectorized but having the reader return rows.
    +        benchmark.addCase("ParquetReader(Batch -> Row)") { num =>
    +          var sum = 0L
    +          files.map(_.asInstanceOf[String]).foreach { p =>
    +            val reader = new UnsafeRowParquetRecordReader
    +            try {
    +              reader.initialize(p, ("id" :: Nil).asJava)
    +              val batch = reader.resultBatch()
    +              while (reader.nextBatch()) {
    +                val it = batch.rowIterator()
    +                while (it.hasNext) {
    +                  val record = it.next()
    +                  if (!record.isNullAt(0)) sum += record.getInt(0)
    +                }
    +              }
    +            } finally {
    +              reader.close()
    +            }
    +          }
    +        }
     
             /*
    -          Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
    -          Single Int Column Scan:      Avg Time(ms)    Avg Rate(M/s)  Relative Rate
    -          -------------------------------------------------------------------------
    -          SQL Parquet Reader                 1910.0            13.72         1.00 X
    -          SQL Parquet MR                     2330.0            11.25         0.82 X
    -          ParquetReader                      1252.6            20.93         1.52 X
    +        Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
    +        Single Int Column Scan:            Avg Time(ms)    Avg Rate(M/s)  Relative Rate
    +        -------------------------------------------------------------------------------
    +        SQL Parquet Reader                       1682.6            15.58         1.00 X
    +        SQL Parquet MR                           2379.6            11.02         0.71 X
    +        ParquetReader                            1033.0            25.38         1.63 X
    --- End diff --
    
    Should we separate these into two group, one use SQL, another one does not. It's not clear to compare SQL parquet Reader than ParquetReader(Batched) (should be SQL ParquetReader(Batched))


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r49894629
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java ---
    @@ -0,0 +1,271 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.parquet;
    +
    +import org.apache.parquet.Preconditions;
    +import org.apache.parquet.bytes.BytesUtils;
    +import org.apache.parquet.column.values.ValuesReader;
    +import org.apache.parquet.column.values.bitpacking.BytePacker;
    +import org.apache.parquet.column.values.bitpacking.Packer;
    +import org.apache.parquet.io.ParquetDecodingException;
    +import org.apache.spark.sql.execution.vectorized.ColumnVector;
    +
    +/**
    + * A values reader for Parquet's run-length encoded data. This is based off of the version in
    + * parquet-mr with these changes:
    + *  - Supports the vectorized interface.
    + *  - Works on byte arrays(byte[]) instead of making byte streams.
    + *
    + * This encoding is used in multiple places:
    + *  - Definition/Repetition levels
    + *  - Dictionary ids.
    + */
    +public final class VectorizedRleValuesReader extends ValuesReader {
    +  // Current decoding mode.
    +  private enum MODE {
    +    RLE,
    +    PACKED
    +  }
    +
    +  // Encoded data.
    +  private byte[] in;
    +  private int end;
    +  private int offset;
    +
    +  // bit/byte width of decoded data and utility to batch unpack them.
    +  private int bitWidth;
    +  private int bytesWidth;
    +  private BytePacker packer;
    +
    +  // Current decoding mode and values
    +  private MODE mode;
    +  private int currentCount;
    +  private int currentValue;
    +
    +  // Buffer of decoded values if the values are PACKED.
    +  private int[] currentBuffer = new int[16];
    +  private int currentBufferIdx = 0;
    +
    +  // If true, the bit width is fixed. This decoder is used in different places and this also
    +  // controls if we need to read the bitwidth from the beginning of the data stream.
    +  private final boolean fixedWidth;
    +
    +  public VectorizedRleValuesReader() {
    +    fixedWidth = false;
    +  }
    +
    +  public VectorizedRleValuesReader(int bitWidth) {
    +    fixedWidth = true;
    +    init(bitWidth);
    +  }
    +
    +  @Override
    +  public void initFromPage(int valueCount, byte[] page, int start) {
    --- End diff --
    
    I'm not sure what it is in master. There was some effort to use ByteBuffer that was merged recently.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-169470362
  
    @velvia we don't need plan changes to enable this. As soon as we complete the implementation so this can read all parquet files, the reader will just always decode this way.
    
    There are more general ways to apply this technique but we need some more design discussions there. Those would require more planner changes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r49893792
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java ---
    @@ -0,0 +1,50 @@
    +package org.apache.spark.sql.execution.datasources.parquet;
    +
    +import java.io.IOException;
    +
    +import org.apache.spark.sql.execution.vectorized.ColumnVector;
    +import org.apache.spark.unsafe.Platform;
    +
    +import org.apache.parquet.column.values.ValuesReader;
    +
    +/**
    + * An implementation of the Parquet PLAIN decoder that supports the vectorized interface.
    + */
    +public class VectorizedPlainValuesReader extends ValuesReader implements VectorizedValuesReader {
    +  private byte[] buffer;
    +  private int offset;
    +  private final int byteSize;
    +
    +  public VectorizedPlainValuesReader(int byteSize) {
    +    this.byteSize = byteSize;
    +  }
    +
    +  @Override
    +  public void initFromPage(int valueCount, byte[] bytes, int offset) throws IOException {
    +    this.buffer = bytes;
    +    this.offset = offset + Platform.BYTE_ARRAY_OFFSET;
    +  }
    +
    +  @Override
    +  public void skip() {
    +    offset += byteSize;
    +  }
    +
    +  @Override
    +  public void skip(int n) {
    +    offset += n * byteSize;
    +  }
    +
    +  @Override
    +  public void readIntegers(int total, ColumnVector c, int rowId) {
    +    c.putIntsLittleEndian(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET);
    --- End diff --
    
    I'll update the comment in putIntsLittleEndian. It is assuming that the input byte array contains little endian encoded integers. The API does not care what the host machine's endianness is. i.e. parquet always stores it as little endian, the column vector has to figur eit out.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-169552958
  
    **[Test build #48898 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48898/consoleFull)** for PR 10593 at commit [`eda59c3`](https://github.com/apache/spark/commit/eda59c3e6a3f5f4e3c55fd8ed05430f67a1f402e).
     * This patch **fails RAT tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class VectorizedPlainValuesReader extends ValuesReader implements VectorizedValuesReader `
      * `public final class VectorizedRleValuesReader extends ValuesReader `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-168934389
  
    **[Test build #48749 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48749/consoleFull)** for PR 10593 at commit [`d99659d`](https://github.com/apache/spark/commit/d99659d89a7709df8223ab86b1edd244b1e63086).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-171815749
  
    **[Test build #49413 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49413/consoleFull)** for PR 10593 at commit [`6b160dc`](https://github.com/apache/spark/commit/6b160dcba559cc072c92942de9801f94c3286d9c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-171782602
  
    **[Test build #49413 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49413/consoleFull)** for PR 10593 at commit [`6b160dc`](https://github.com/apache/spark/commit/6b160dcba559cc072c92942de9801f94c3286d9c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r49815031
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java ---
    @@ -0,0 +1,50 @@
    +package org.apache.spark.sql.execution.datasources.parquet;
    --- End diff --
    
    we need to add license header to this


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r49035976
  
    --- Diff: core/src/test/scala/org/apache/spark/Benchmark.scala ---
    @@ -0,0 +1,102 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark
    +
    +import scala.collection.mutable
    +
    +import org.apache.commons.lang3.SystemUtils
    +import org.apache.spark.util.Utils
    +
    +/**
    + * Utility class to benchmark components. An example of how to use this is:
    + *  val benchmark = new Benchmark("My Benchmark", valuesPerIteration)
    + *   benchmark.addCase("V1", <function>")
    + *   benchmark.addCase("V2", <function>")
    + *   benchmark.run
    + * This will output the average time to run each function and the rate of each function.
    + *
    + * The benchmark function takes one argument that is the iteration that's being run
    + */
    +class Benchmark(name: String, valuesPerIteration: Long, iters: Int = 5) {
    +  val benchmarks = mutable.ArrayBuffer.empty[Benchmark.Case]
    +
    +  def addCase(name: String, f: Int => Unit): Unit = {
    +    benchmarks += Benchmark.Case(name, f)
    +  }
    +
    +  /**
    +   * Runs the benchmark and outputs the results to stdout. This should be copied and added as
    +   * a comment with the benchmark. Although the results vary from machine to machine, it should
    +   * provide some baseline.
    +   */
    +  def run(): Unit = {
    +    require(benchmarks.nonEmpty)
    +    val results = benchmarks.map { c =>
    +      Benchmark.measure(valuesPerIteration, c.fn, iters)
    +    }
    +    val firstRate = results.head.avgRate
    +    // scalastyle:off
    +    // The results are going to be processor specific so it is useful to include that.
    +    println(Benchmark.getProcessorName())
    +    printf("%-30s %16s %16s %14s\n", name + ":", "Avg Time(ms)", "Avg Rate(M/s)", "Relative Rate")
    +    println("-------------------------------------------------------------------------------")
    +    results.zip(benchmarks).foreach { r =>
    +      printf("%-30s %16s %16s %14s\n", r._2.name, r._1.avgMs.toString, "%10.2f" format r._1.avgRate,
    +        "%6.2f X" format (r._1.avgRate / firstRate))
    +    }
    +    println
    +    // scalastyle:on
    +  }
    +}
    +
    +object Benchmark {
    +  case class Case(name: String, fn: Int => Unit)
    +  case class Result(avgMs: Double, avgRate: Double)
    +
    +  /**
    +   * This should return a user helpful processor information. Getting at this depends on the OS.
    +   * This should return something like "Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz"
    +   */
    +  def getProcessorName(): String = {
    +    if (SystemUtils.IS_OS_MAC_OSX) {
    +      Utils.executeAndGetOutput(Seq("/usr/sbin/sysctl", "-n", "machdep.cpu.brand_string"))
    +    } else if (SystemUtils.IS_OS_LINUX) {
    +      Utils.executeAndGetOutput(Seq("/usr/bin/grep", "-m", "1", "\"model name\"", "/proc/cpuinfo"))
    +    } else {
    +      System.getenv("PROCESSOR_IDENTIFIER")
    +    }
    +  }
    +
    +  /**
    +   * Runs a single function `f` for iters, returning the average time the function took and
    +   * the rate of the function.
    +   */
    +  def measure(num: Long, f: Int => Unit, iters: Int): Result = {
    +    var totalTime = 0L
    +    for (i <- 0 until iters + 1) {
    +      val start = System.currentTimeMillis()
    --- End diff --
    
    JMH looks great and it would be good to add. The utility I added is super simple and gets us going and I don't think it's worth blocking other work items. It would be great if someone added JMH to spark and deleted this benchmarking harness.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r49822045
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java ---
    @@ -0,0 +1,271 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.parquet;
    +
    +import org.apache.parquet.Preconditions;
    +import org.apache.parquet.bytes.BytesUtils;
    +import org.apache.parquet.column.values.ValuesReader;
    +import org.apache.parquet.column.values.bitpacking.BytePacker;
    +import org.apache.parquet.column.values.bitpacking.Packer;
    +import org.apache.parquet.io.ParquetDecodingException;
    +import org.apache.spark.sql.execution.vectorized.ColumnVector;
    +
    +/**
    + * A values reader for Parquet's run-length encoded data. This is based off of the version in
    + * parquet-mr with these changes:
    + *  - Supports the vectorized interface.
    + *  - Works on byte arrays(byte[]) instead of making byte streams.
    + *
    + * This encoding is used in multiple places:
    + *  - Definition/Repetition levels
    + *  - Dictionary ids.
    + */
    +public final class VectorizedRleValuesReader extends ValuesReader {
    +  // Current decoding mode.
    +  private enum MODE {
    +    RLE,
    --- End diff --
    
    add some comment to explain the difference between RLE and PACKED? for people that are not as familiar with Parquet.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r49765061
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java ---
    @@ -154,6 +184,45 @@ public float getProgress() throws IOException, InterruptedException {
         return (float) rowsReturned / totalRowCount;
       }
     
    +  /**
    +   * Returns the ColumnarBatch object that will be used for all rows returned by this reader.
    +   * This object is reused. Calling this enables the vectorized reader. It is not valid to
    +   * call nextKeyValue/nextBatch before calling this.
    +   */
    +  public ColumnarBatch resultBatch() {
    +    return resultBatch(DEFAULT_OFFHEAP);
    +  }
    +  public ColumnarBatch resultBatch(boolean offHeap) {
    +    if (columnarBatch == null) {
    +      columnarBatch = ColumnarBatch.allocate(sparkSchema, offHeap);
    +    }
    +    return columnarBatch;
    +  }
    +
    +  /**
    +   * Advances to the next batch of rows. Returns false if there are no more.
    +   */
    +  public boolean nextBatch() throws IOException {
    +    assert(columnarBatch != null);
    +    columnarBatch.reset();
    +    if (rowsReturned >= totalRowCount) return false;
    +    checkEndOfRowGroup();
    +
    +    int num = (int)Math.min((long) columnarBatch.capacity(), totalRowCount - rowsReturned);
    +    for (int i = 0; i < columnReaders.length; ++i) {
    +      switch (columnReaders[i].descriptor.getType()) {
    +        case INT32:
    --- End diff --
    
    No. I plan to support binary as nested types which involves more complicated changes in the ColumnarBatch classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r49822023
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java ---
    @@ -0,0 +1,271 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.parquet;
    +
    +import org.apache.parquet.Preconditions;
    +import org.apache.parquet.bytes.BytesUtils;
    +import org.apache.parquet.column.values.ValuesReader;
    +import org.apache.parquet.column.values.bitpacking.BytePacker;
    +import org.apache.parquet.column.values.bitpacking.Packer;
    +import org.apache.parquet.io.ParquetDecodingException;
    +import org.apache.spark.sql.execution.vectorized.ColumnVector;
    +
    +/**
    + * A values reader for Parquet's run-length encoded data. This is based off of the version in
    + * parquet-mr with these changes:
    + *  - Supports the vectorized interface.
    + *  - Works on byte arrays(byte[]) instead of making byte streams.
    + *
    + * This encoding is used in multiple places:
    + *  - Definition/Repetition levels
    + *  - Dictionary ids.
    + */
    +public final class VectorizedRleValuesReader extends ValuesReader {
    +  // Current decoding mode.
    +  private enum MODE {
    +    RLE,
    +    PACKED
    +  }
    +
    +  // Encoded data.
    +  private byte[] in;
    +  private int end;
    +  private int offset;
    +
    +  // bit/byte width of decoded data and utility to batch unpack them.
    +  private int bitWidth;
    +  private int bytesWidth;
    +  private BytePacker packer;
    +
    +  // Current decoding mode and values
    +  private MODE mode;
    +  private int currentCount;
    +  private int currentValue;
    +
    +  // Buffer of decoded values if the values are PACKED.
    +  private int[] currentBuffer = new int[16];
    +  private int currentBufferIdx = 0;
    +
    +  // If true, the bit width is fixed. This decoder is used in different places and this also
    +  // controls if we need to read the bitwidth from the beginning of the data stream.
    +  private final boolean fixedWidth;
    +
    +  public VectorizedRleValuesReader() {
    +    fixedWidth = false;
    +  }
    +
    +  public VectorizedRleValuesReader(int bitWidth) {
    +    fixedWidth = true;
    +    init(bitWidth);
    +  }
    +
    +  @Override
    +  public void initFromPage(int valueCount, byte[] page, int start) {
    +    this.offset = start;
    +    this.in = page;
    +    if (fixedWidth) {
    +      int length = readIntLittleEndian();
    +      this.end = this.offset + length;
    +    } else {
    +      this.end = page.length;
    +      if (this.end != this.offset) init(page[this.offset++] & 255);
    +    }
    +    this.currentCount = 0;
    +  }
    +
    +  /**
    +   * Initializes the internal state for decoding ints of `bitWidth`.
    +   */
    +  private void init(int bitWidth) {
    +    Preconditions.checkArgument(bitWidth >= 0 && bitWidth <= 32, "bitWidth must be >= 0 and <= 32");
    +    this.bitWidth = bitWidth;
    +    this.bytesWidth = BytesUtils.paddedByteCountFromBits(bitWidth);
    +    this.packer = Packer.LITTLE_ENDIAN.newBytePacker(bitWidth);
    +  }
    +
    +  @Override
    +  public int getNextOffset() {
    +    return this.end;
    +  }
    +
    +  @Override
    +  public boolean readBoolean() {
    +    return this.readInteger() != 0;
    +  }
    +
    +  @Override
    +  public void skip() {
    +    this.readInteger();
    +  }
    +
    +  @Override
    +  public int readValueDictionaryId() {
    +    return readInteger();
    +  }
    +
    +  @Override
    +  public int readInteger() {
    +    if (this.currentCount == 0) { this.readNextGroup(); }
    +
    +    --this.currentCount;
    --- End diff --
    
    not a big deal, but this is old c++ habit :)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-171894735
  
    **[Test build #2383 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2383/consoleFull)** for PR 10593 at commit [`6b160dc`](https://github.com/apache/spark/commit/6b160dcba559cc072c92942de9801f94c3286d9c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r49764984
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java ---
    @@ -154,6 +184,45 @@ public float getProgress() throws IOException, InterruptedException {
         return (float) rowsReturned / totalRowCount;
       }
     
    +  /**
    +   * Returns the ColumnarBatch object that will be used for all rows returned by this reader.
    +   * This object is reused. Calling this enables the vectorized reader. It is not valid to
    +   * call nextKeyValue/nextBatch before calling this.
    +   */
    +  public ColumnarBatch resultBatch() {
    +    return resultBatch(DEFAULT_OFFHEAP);
    --- End diff --
    
    The default is false so doing what you are suggesting. I think in a follow up PR, I'm going to turn this into an enum, the boolean is confusing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r49892714
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java ---
    @@ -103,6 +105,25 @@
       private static final int DEFAULT_VAR_LEN_SIZE = 32;
     
       /**
    +   * columnBatch object that is used for batch decoding. This is created on first use and triggers
    +   * batched decoding. It is not valid to interleave calls to the batched interface with the row
    +   * by row RecordReader APIs.
    +   * This is only enabled with additional flags for development. This is still a work in progress
    +   * and currently unsupported cases will fail with potentially difficult to diagnose errors.
    +   * This should be only turned on for development to work on this feature.
    +   *
    +   * TODOs:
    +   *  - Implement all the encodings to support vectorized.
    +   *  - Implement v2 page formats (just make sure we create the correct decoders).
    +   */
    +  private ColumnarBatch columnarBatch;
    +
    +  /**
    +   * The default config on whether columnarBatch should be offheap.
    +   */
    +  private static final boolean DEFAULT_OFFHEAP = false;
    --- End diff --
    
    Yea, I've been wanting to change it to a enum for a while. Let me just do it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-172116685
  
    **[Test build #49476 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49476/consoleFull)** for PR 10593 at commit [`e0f427f`](https://github.com/apache/spark/commit/e0f427f9d4083068fefaed6a29719f0a7eeb22b3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by velvia <gi...@git.apache.org>.

Github user velvia commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-169412883
  
    @nongli this is awesome.  I suppose changes to the logical/physical query planner are coming also?  Is that going to be in a separate PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r49639656
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java ---
    @@ -154,6 +184,45 @@ public float getProgress() throws IOException, InterruptedException {
         return (float) rowsReturned / totalRowCount;
       }
     
    +  /**
    +   * Returns the ColumnarBatch object that will be used for all rows returned by this reader.
    +   * This object is reused. Calling this enables the vectorized reader. It is not valid to
    +   * call nextKeyValue/nextBatch before calling this.
    +   */
    +  public ColumnarBatch resultBatch() {
    +    return resultBatch(DEFAULT_OFFHEAP);
    +  }
    +  public ColumnarBatch resultBatch(boolean offHeap) {
    +    if (columnarBatch == null) {
    +      columnarBatch = ColumnarBatch.allocate(sparkSchema, offHeap);
    +    }
    +    return columnarBatch;
    +  }
    +
    +  /**
    +   * Advances to the next batch of rows. Returns false if there are no more.
    +   */
    +  public boolean nextBatch() throws IOException {
    +    assert(columnarBatch != null);
    +    columnarBatch.reset();
    +    if (rowsReturned >= totalRowCount) return false;
    +    checkEndOfRowGroup();
    +
    +    int num = (int)Math.min((long) columnarBatch.capacity(), totalRowCount - rowsReturned);
    +    for (int i = 0; i < columnReaders.length; ++i) {
    +      switch (columnReaders[i].descriptor.getType()) {
    +        case INT32:
    --- End diff --
    
    Will you include Binary in this PR to demo how to support var-length object?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-171816014
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-171141262
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r49637590
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java ---
    @@ -154,6 +184,45 @@ public float getProgress() throws IOException, InterruptedException {
         return (float) rowsReturned / totalRowCount;
       }
     
    +  /**
    +   * Returns the ColumnarBatch object that will be used for all rows returned by this reader.
    +   * This object is reused. Calling this enables the vectorized reader. It is not valid to
    +   * call nextKeyValue/nextBatch before calling this.
    +   */
    +  public ColumnarBatch resultBatch() {
    +    return resultBatch(DEFAULT_OFFHEAP);
    --- End diff --
    
    Should we use ONHEAP as default? (change it later with other offheap staff)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-172064494
  
    **[Test build #49472 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49472/consoleFull)** for PR 10593 at commit [`1f8ffe9`](https://github.com/apache/spark/commit/1f8ffe9969bcbd988b6503182f410c9502be2ff5).
     * This patch **fails Java style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-168934501
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48749/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r49765499
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedValuesReader.java ---
    @@ -0,0 +1,37 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.parquet;
    +
    +import org.apache.spark.sql.execution.vectorized.ColumnVector;
    +
    +/**
    + * Interface for value decoding that supports vectorized (aka batched) decoding.
    + * TODO: merge this into parquet-mr.
    + */
    +public interface VectorizedValuesReader {
    --- End diff --
    
    Which class are you referring to? 
    
    There is an effort in parquet-mr to add a vectorized reader that is based on this work. 
    https://issues.apache.org/jira/browse/PARQUET-424


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r49821996
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java ---
    @@ -0,0 +1,50 @@
    +package org.apache.spark.sql.execution.datasources.parquet;
    +
    +import java.io.IOException;
    +
    +import org.apache.spark.sql.execution.vectorized.ColumnVector;
    +import org.apache.spark.unsafe.Platform;
    +
    +import org.apache.parquet.column.values.ValuesReader;
    +
    +/**
    + * An implementation of the Parquet PLAIN decoder that supports the vectorized interface.
    + */
    +public class VectorizedPlainValuesReader extends ValuesReader implements VectorizedValuesReader {
    +  private byte[] buffer;
    +  private int offset;
    +  private final int byteSize;
    +
    +  public VectorizedPlainValuesReader(int byteSize) {
    +    this.byteSize = byteSize;
    +  }
    +
    +  @Override
    +  public void initFromPage(int valueCount, byte[] bytes, int offset) throws IOException {
    +    this.buffer = bytes;
    +    this.offset = offset + Platform.BYTE_ARRAY_OFFSET;
    +  }
    +
    +  @Override
    +  public void skip() {
    +    offset += byteSize;
    +  }
    +
    +  @Override
    +  public void skip(int n) {
    +    offset += n * byteSize;
    +  }
    +
    +  @Override
    +  public void readIntegers(int total, ColumnVector c, int rowId) {
    +    c.putIntsLittleEndian(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET);
    --- End diff --
    
    this is not this pr but i'm confused by putIntsLittleEndian. it just assumes it is using the native endianness, not little?
    
    is an implicit assumption here that "buffer" is encoded using the same endianness?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-172117435
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-172080212
  
    **[Test build #49476 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49476/consoleFull)** for PR 10593 at commit [`e0f427f`](https://github.com/apache/spark/commit/e0f427f9d4083068fefaed6a29719f0a7eeb22b3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-172064506
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49472/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r49822085
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java ---
    @@ -0,0 +1,271 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.parquet;
    +
    +import org.apache.parquet.Preconditions;
    +import org.apache.parquet.bytes.BytesUtils;
    +import org.apache.parquet.column.values.ValuesReader;
    +import org.apache.parquet.column.values.bitpacking.BytePacker;
    +import org.apache.parquet.column.values.bitpacking.Packer;
    +import org.apache.parquet.io.ParquetDecodingException;
    +import org.apache.spark.sql.execution.vectorized.ColumnVector;
    +
    +/**
    + * A values reader for Parquet's run-length encoded data. This is based off of the version in
    + * parquet-mr with these changes:
    + *  - Supports the vectorized interface.
    + *  - Works on byte arrays(byte[]) instead of making byte streams.
    + *
    + * This encoding is used in multiple places:
    + *  - Definition/Repetition levels
    + *  - Dictionary ids.
    + */
    +public final class VectorizedRleValuesReader extends ValuesReader {
    +  // Current decoding mode.
    +  private enum MODE {
    +    RLE,
    +    PACKED
    +  }
    +
    +  // Encoded data.
    +  private byte[] in;
    +  private int end;
    +  private int offset;
    +
    +  // bit/byte width of decoded data and utility to batch unpack them.
    +  private int bitWidth;
    +  private int bytesWidth;
    +  private BytePacker packer;
    +
    +  // Current decoding mode and values
    +  private MODE mode;
    +  private int currentCount;
    +  private int currentValue;
    +
    +  // Buffer of decoded values if the values are PACKED.
    +  private int[] currentBuffer = new int[16];
    +  private int currentBufferIdx = 0;
    +
    +  // If true, the bit width is fixed. This decoder is used in different places and this also
    +  // controls if we need to read the bitwidth from the beginning of the data stream.
    +  private final boolean fixedWidth;
    +
    +  public VectorizedRleValuesReader() {
    +    fixedWidth = false;
    +  }
    +
    +  public VectorizedRleValuesReader(int bitWidth) {
    +    fixedWidth = true;
    +    init(bitWidth);
    +  }
    +
    +  @Override
    +  public void initFromPage(int valueCount, byte[] page, int start) {
    --- End diff --
    
    question: does parquet always pass byte arrays around for pages?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-172064100
  
    **[Test build #49472 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49472/consoleFull)** for PR 10593 at commit [`1f8ffe9`](https://github.com/apache/spark/commit/1f8ffe9969bcbd988b6503182f410c9502be2ff5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-171141265
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49286/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-171666309
  
    @nongli Great work. One question; we need not have common APIs to realize batched de-serialization for other data sources like ORC? I know we currently have no specialized `RecordReader` for ORC though, I think we can apply the same optimization into the data source in future.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-172064504
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r49821596
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java ---
    @@ -154,6 +184,45 @@ public float getProgress() throws IOException, InterruptedException {
         return (float) rowsReturned / totalRowCount;
       }
     
    +  /**
    +   * Returns the ColumnarBatch object that will be used for all rows returned by this reader.
    +   * This object is reused. Calling this enables the vectorized reader. It is not valid to
    +   * call nextKeyValue/nextBatch before calling this.
    +   */
    +  public ColumnarBatch resultBatch() {
    +    return resultBatch(DEFAULT_OFFHEAP);
    +  }
    +  public ColumnarBatch resultBatch(boolean offHeap) {
    --- End diff --
    
    and add documentation on when this is valid to call?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r49821548
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java ---
    @@ -103,6 +105,25 @@
       private static final int DEFAULT_VAR_LEN_SIZE = 32;
     
       /**
    +   * columnBatch object that is used for batch decoding. This is created on first use and triggers
    +   * batched decoding. It is not valid to interleave calls to the batched interface with the row
    +   * by row RecordReader APIs.
    +   * This is only enabled with additional flags for development. This is still a work in progress
    +   * and currently unsupported cases will fail with potentially difficult to diagnose errors.
    +   * This should be only turned on for development to work on this feature.
    +   *
    +   * TODOs:
    +   *  - Implement all the encodings to support vectorized.
    +   *  - Implement v2 page formats (just make sure we create the correct decoders).
    +   */
    +  private ColumnarBatch columnarBatch;
    +
    +  /**
    +   * The default config on whether columnarBatch should be offheap.
    +   */
    +  private static final boolean DEFAULT_OFFHEAP = false;
    --- End diff --
    
    this naming is fairly confusing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10593


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-169552960
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by kiszk <gi...@git.apache.org>.

Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r48920875
  
    --- Diff: core/src/test/scala/org/apache/spark/Benchmark.scala ---
    @@ -0,0 +1,102 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark
    +
    +import scala.collection.mutable
    +
    +import org.apache.commons.lang3.SystemUtils
    +import org.apache.spark.util.Utils
    +
    +/**
    + * Utility class to benchmark components. An example of how to use this is:
    + *  val benchmark = new Benchmark("My Benchmark", valuesPerIteration)
    + *   benchmark.addCase("V1", <function>")
    + *   benchmark.addCase("V2", <function>")
    + *   benchmark.run
    + * This will output the average time to run each function and the rate of each function.
    + *
    + * The benchmark function takes one argument that is the iteration that's being run
    + */
    +class Benchmark(name: String, valuesPerIteration: Long, iters: Int = 5) {
    +  val benchmarks = mutable.ArrayBuffer.empty[Benchmark.Case]
    +
    +  def addCase(name: String, f: Int => Unit): Unit = {
    +    benchmarks += Benchmark.Case(name, f)
    +  }
    +
    +  /**
    +   * Runs the benchmark and outputs the results to stdout. This should be copied and added as
    +   * a comment with the benchmark. Although the results vary from machine to machine, it should
    +   * provide some baseline.
    +   */
    +  def run(): Unit = {
    +    require(benchmarks.nonEmpty)
    +    val results = benchmarks.map { c =>
    +      Benchmark.measure(valuesPerIteration, c.fn, iters)
    +    }
    +    val firstRate = results.head.avgRate
    +    // scalastyle:off
    +    // The results are going to be processor specific so it is useful to include that.
    +    println(Benchmark.getProcessorName())
    +    printf("%-30s %16s %16s %14s\n", name + ":", "Avg Time(ms)", "Avg Rate(M/s)", "Relative Rate")
    +    println("-------------------------------------------------------------------------------")
    +    results.zip(benchmarks).foreach { r =>
    +      printf("%-30s %16s %16s %14s\n", r._2.name, r._1.avgMs.toString, "%10.2f" format r._1.avgRate,
    +        "%6.2f X" format (r._1.avgRate / firstRate))
    +    }
    +    println
    +    // scalastyle:on
    +  }
    +}
    +
    +object Benchmark {
    +  case class Case(name: String, fn: Int => Unit)
    +  case class Result(avgMs: Double, avgRate: Double)
    +
    +  /**
    +   * This should return a user helpful processor information. Getting at this depends on the OS.
    +   * This should return something like "Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz"
    +   */
    +  def getProcessorName(): String = {
    +    if (SystemUtils.IS_OS_MAC_OSX) {
    +      Utils.executeAndGetOutput(Seq("/usr/sbin/sysctl", "-n", "machdep.cpu.brand_string"))
    +    } else if (SystemUtils.IS_OS_LINUX) {
    +      Utils.executeAndGetOutput(Seq("/usr/bin/grep", "-m", "1", "\"model name\"", "/proc/cpuinfo"))
    +    } else {
    +      System.getenv("PROCESSOR_IDENTIFIER")
    +    }
    +  }
    +
    +  /**
    +   * Runs a single function `f` for iters, returning the average time the function took and
    +   * the rate of the function.
    +   */
    +  def measure(num: Long, f: Int => Unit, iters: Int): Result = {
    +    var totalTime = 0L
    +    for (i <- 0 until iters + 1) {
    +      val start = System.currentTimeMillis()
    --- End diff --
    
    How about calling System.nanoTime() for short-running benchmarks instead of System.currentTimeMillis()?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-168934496
  
    **[Test build #48749 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48749/consoleFull)** for PR 10593 at commit [`d99659d`](https://github.com/apache/spark/commit/d99659d89a7709df8223ab86b1edd244b1e63086).
     * This patch **fails RAT tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class VectorizedPlainValuesReader extends ValuesReader implements VectorizedValuesReader `
      * `public final class VectorizedRleValuesReader extends ValuesReader `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-172117440
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49476/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by velvia <gi...@git.apache.org>.

Github user velvia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r48991290
  
    --- Diff: core/src/test/scala/org/apache/spark/Benchmark.scala ---
    @@ -0,0 +1,102 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark
    +
    +import scala.collection.mutable
    +
    +import org.apache.commons.lang3.SystemUtils
    +import org.apache.spark.util.Utils
    +
    +/**
    + * Utility class to benchmark components. An example of how to use this is:
    + *  val benchmark = new Benchmark("My Benchmark", valuesPerIteration)
    + *   benchmark.addCase("V1", <function>")
    + *   benchmark.addCase("V2", <function>")
    + *   benchmark.run
    + * This will output the average time to run each function and the rate of each function.
    + *
    + * The benchmark function takes one argument that is the iteration that's being run
    + */
    +class Benchmark(name: String, valuesPerIteration: Long, iters: Int = 5) {
    +  val benchmarks = mutable.ArrayBuffer.empty[Benchmark.Case]
    +
    +  def addCase(name: String, f: Int => Unit): Unit = {
    +    benchmarks += Benchmark.Case(name, f)
    +  }
    +
    +  /**
    +   * Runs the benchmark and outputs the results to stdout. This should be copied and added as
    +   * a comment with the benchmark. Although the results vary from machine to machine, it should
    +   * provide some baseline.
    +   */
    +  def run(): Unit = {
    +    require(benchmarks.nonEmpty)
    +    val results = benchmarks.map { c =>
    +      Benchmark.measure(valuesPerIteration, c.fn, iters)
    +    }
    +    val firstRate = results.head.avgRate
    +    // scalastyle:off
    +    // The results are going to be processor specific so it is useful to include that.
    +    println(Benchmark.getProcessorName())
    +    printf("%-30s %16s %16s %14s\n", name + ":", "Avg Time(ms)", "Avg Rate(M/s)", "Relative Rate")
    +    println("-------------------------------------------------------------------------------")
    +    results.zip(benchmarks).foreach { r =>
    +      printf("%-30s %16s %16s %14s\n", r._2.name, r._1.avgMs.toString, "%10.2f" format r._1.avgRate,
    +        "%6.2f X" format (r._1.avgRate / firstRate))
    +    }
    +    println
    +    // scalastyle:on
    +  }
    +}
    +
    +object Benchmark {
    +  case class Case(name: String, fn: Int => Unit)
    +  case class Result(avgMs: Double, avgRate: Double)
    +
    +  /**
    +   * This should return a user helpful processor information. Getting at this depends on the OS.
    +   * This should return something like "Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz"
    +   */
    +  def getProcessorName(): String = {
    +    if (SystemUtils.IS_OS_MAC_OSX) {
    +      Utils.executeAndGetOutput(Seq("/usr/sbin/sysctl", "-n", "machdep.cpu.brand_string"))
    +    } else if (SystemUtils.IS_OS_LINUX) {
    +      Utils.executeAndGetOutput(Seq("/usr/bin/grep", "-m", "1", "\"model name\"", "/proc/cpuinfo"))
    +    } else {
    +      System.getenv("PROCESSOR_IDENTIFIER")
    +    }
    +  }
    +
    +  /**
    +   * Runs a single function `f` for iters, returning the average time the function took and
    +   * the rate of the function.
    +   */
    +  def measure(num: Long, f: Int => Unit, iters: Int): Result = {
    +    var totalTime = 0L
    +    for (i <- 0 until iters + 1) {
    +      val start = System.currentTimeMillis()
    --- End diff --
    
    Or use JMH (see sbt-jmh), which takes care of a lot of common issues like JVM warmups and environment isolation when running benchmarks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-172144550
  
    Going to merge this. Thanks.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-169552961
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48898/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r49640021
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedValuesReader.java ---
    @@ -0,0 +1,37 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.parquet;
    +
    +import org.apache.spark.sql.execution.vectorized.ColumnVector;
    +
    +/**
    + * Interface for value decoding that supports vectorized (aka batched) decoding.
    + * TODO: merge this into parquet-mr.
    + */
    +public interface VectorizedValuesReader {
    --- End diff --
    
    There is also a vectorized parquet column reader, what's the difference than that?
    
    And also it will be great if we can have a list of component that we can use or need to re-implemented from parquet-mr?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-171876218
  
    **[Test build #2383 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2383/consoleFull)** for PR 10593 at commit [`6b160dc`](https://github.com/apache/spark/commit/6b160dcba559cc072c92942de9801f94c3286d9c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r49821924
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java ---
    @@ -0,0 +1,50 @@
    +package org.apache.spark.sql.execution.datasources.parquet;
    +
    +import java.io.IOException;
    +
    +import org.apache.spark.sql.execution.vectorized.ColumnVector;
    +import org.apache.spark.unsafe.Platform;
    +
    +import org.apache.parquet.column.values.ValuesReader;
    +
    +/**
    + * An implementation of the Parquet PLAIN decoder that supports the vectorized interface.
    + */
    +public class VectorizedPlainValuesReader extends ValuesReader implements VectorizedValuesReader {
    +  private byte[] buffer;
    +  private int offset;
    +  private final int byteSize;
    +
    +  public VectorizedPlainValuesReader(int byteSize) {
    +    this.byteSize = byteSize;
    +  }
    +
    +  @Override
    +  public void initFromPage(int valueCount, byte[] bytes, int offset) throws IOException {
    +    this.buffer = bytes;
    +    this.offset = offset + Platform.BYTE_ARRAY_OFFSET;
    +  }
    +
    +  @Override
    +  public void skip() {
    +    offset += byteSize;
    +  }
    +
    +  @Override
    +  public void skip(int n) {
    +    offset += n * byteSize;
    +  }
    +
    +  @Override
    +  public void readIntegers(int total, ColumnVector c, int rowId) {
    +    c.putIntsLittleEndian(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET);
    +    offset += 4 * total;
    +  }
    +
    +  @Override
    +  public int readInteger() {
    +    int v = Platform.getInt(buffer, offset);
    +    offset += 4;
    +    return v;
    +  }
    +}
    --- End diff --
    
    add a space here - otherwise this will fail the java style check now it is turned on


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10593#issuecomment-169552892
  
    **[Test build #48898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48898/consoleFull)** for PR 10593 at commit [`eda59c3`](https://github.com/apache/spark/commit/eda59c3e6a3f5f4e3c55fd8ed05430f67a1f402e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10593#discussion_r49821580
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java ---
    @@ -154,6 +184,45 @@ public float getProgress() throws IOException, InterruptedException {
         return (float) rowsReturned / totalRowCount;
       }
     
    +  /**
    +   * Returns the ColumnarBatch object that will be used for all rows returned by this reader.
    +   * This object is reused. Calling this enables the vectorized reader. It is not valid to
    +   * call nextKeyValue/nextBatch before calling this.
    +   */
    +  public ColumnarBatch resultBatch() {
    +    return resultBatch(DEFAULT_OFFHEAP);
    +  }
    +  public ColumnarBatch resultBatch(boolean offHeap) {
    --- End diff --
    
    add a blank line here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org