Posted to user@orc.apache.org by Dongjoon Hyun <do...@gmail.com> on 2018/01/10 19:14:37 UTC
Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.
Hi, All.
Vectorized ORC Reader is now supported in Apache Spark 2.3.
https://issues.apache.org/jira/browse/SPARK-16060
It has been a long journey. From now on, Spark can read ORC files faster
without a feature penalty.
Thank you for all your support, especially Wenchen Fan.
It was done in two commits.
[SPARK-16060][SQL] Support Vectorized ORC Reader
https://github.com/apache/spark/commit/f44ba910f58083458e1133502e193a9d6f2bf766
[SPARK-16060][SQL][FOLLOW-UP] add a wrapper solution for vectorized orc
reader
https://github.com/apache/spark/commit/eaac60a1e20e29084b7151ffca964cfaa5ba99d1
Please check OrcReadBenchmark for the final speed-up from `Hive built-in
ORC` to `Native ORC Vectorized`.
https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
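For a quick check outside the benchmark suite, a spark-shell sketch along these lines compares the two implementations (assuming a Spark 2.3 session bound to `spark`; the path and row count here are arbitrary, and timings will vary):

```scala
// Sketch: write a small ORC dataset, then time a scan with each implementation.
// Assumes a Spark 2.3+ session available as `spark` (e.g. in spark-shell).
val path = "/tmp/orc_vectorization_check"
spark.range(10 * 1000 * 1000).toDF("id").write.mode("overwrite").orc(path)

// New `native` implementation (vectorized for a simple schema like this one).
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
spark.time(spark.read.orc(path).count())

// Old `hive` implementation, for comparison.
spark.conf.set("spark.sql.orc.impl", "hive")
spark.time(spark.read.orc(path).count())
```

`spark.time` prints the elapsed time of the scan, so the two counts can be compared directly in the shell.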
Thank you.
Bests,
Dongjoon.
Re: Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.
Posted by Dongjoon Hyun <do...@gmail.com>.
Hi, Nicolas.
Yes. In Apache Spark 2.3, there are several new sub-improvements under SPARK-20901
(Feature parity for ORC with Parquet).
For your questions, the following three settings are relevant.
1. spark.sql.orc.impl="native"
A new `native` ORC implementation (based on the latest ORC 1.4.1) is added.
The old one is the `hive` implementation.
2. spark.sql.orc.enableVectorizedReader="true"
By default, the `native` ORC implementation uses the vectorized reader code
path when possible.
Please note that vectorization (for both Parquet and ORC) in Apache Spark is
supported only for simple data types.
3. spark.sql.hive.convertMetastoreOrc=true
Like Parquet, Hive ORC tables are by default converted into file-based
data sources so that the vectorization technique can be used.
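Putting the three settings together, a minimal sketch (assuming a Spark 2.3 session bound to `spark`; the load path is a placeholder, and the table name is taken from the question quoted below):

```scala
// Opt in to the new ORC code path explicitly.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

// Both read paths from the question then go through the vectorized reader
// (for simple data types):
val df1 = spark.read.format("orc").load("/path/to/orc/files") // file-based read
val df2 = spark.sql("SELECT * FROM my_orc_table_in_hive")     // converted Hive table
```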
Bests,
Dongjoon.
On Sun, Jan 28, 2018 at 4:15 AM, Nicolas Paris <ni...@gmail.com> wrote:
> Hi
>
> Thanks for this work.
>
> Will this affect both:
> 1) spark.read.format("orc").load("...")
> 2) spark.sql("select ... from my_orc_table_in_hive")
>
> ?
>
>
> On 10 Jan 2018 at 20:14, Dongjoon Hyun wrote:
> > Hi, All.
> >
> > Vectorized ORC Reader is now supported in Apache Spark 2.3.
> >
> > https://issues.apache.org/jira/browse/SPARK-16060
> >
> > It has been a long journey. From now, Spark can read ORC files faster
> without
> > feature penalty.
> >
> > Thank you for all your support, especially Wenchen Fan.
> >
> > It's done by two commits.
> >
> > [SPARK-16060][SQL] Support Vectorized ORC Reader
> > https://github.com/apache/spark/commit/f44ba910f58083458e1133502e193a9d6f2bf766
> >
> > [SPARK-16060][SQL][FOLLOW-UP] add a wrapper solution for vectorized
> orc
> > reader
> > https://github.com/apache/spark/commit/eaac60a1e20e29084b7151ffca964cfaa5ba99d1
> >
> > Please check OrcReadBenchmark for the final speed-up from `Hive built-in
> ORC`
> > to `Native ORC Vectorized`.
> >
> > https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
> >
> > Thank you.
> >
> > Bests,
> > Dongjoon.
>
Re: Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.
Posted by Nicolas Paris <ni...@gmail.com>.
Hi
Thanks for this work.
Will this affect both:
1) spark.read.format("orc").load("...")
2) spark.sql("select ... from my_orc_table_in_hive")
?
On 10 Jan 2018 at 20:14, Dongjoon Hyun wrote:
> Hi, All.
>
> Vectorized ORC Reader is now supported in Apache Spark 2.3.
>
> https://issues.apache.org/jira/browse/SPARK-16060
>
> It has been a long journey. From now, Spark can read ORC files faster without
> feature penalty.
>
> Thank you for all your support, especially Wenchen Fan.
>
> It's done by two commits.
>
> [SPARK-16060][SQL] Support Vectorized ORC Reader
> https://github.com/apache/spark/commit/f44ba910f58083458e1133502e193a9d6f2bf766
>
> [SPARK-16060][SQL][FOLLOW-UP] add a wrapper solution for vectorized orc
> reader
> https://github.com/apache/spark/commit/eaac60a1e20e29084b7151ffca964cfaa5ba99d1
>
> Please check OrcReadBenchmark for the final speed-up from `Hive built-in ORC`
> to `Native ORC Vectorized`.
>
> https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
>
> Thank you.
>
> Bests,
> Dongjoon.
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org