Posted to dev@spark.apache.org by Dong Joon Hyun <dh...@hortonworks.com> on 2017/07/11 22:47:05 UTC

Re: Faster Spark on ORC with Apache ORC

Hi, All.

Since the Apache Spark 2.2 vote passed successfully last week,
I think it’s a good time to ask for your opinions again on the following PR.

https://github.com/apache/spark/pull/17980  (+3,887, −86)

It addresses the following issues.


  *   SPARK-20728: Make ORCFileFormat configurable between sql/hive and sql/core
  *   SPARK-20682: Support a new faster ORC data source based on Apache ORC

Basically, the approach is to officially use the latest Apache ORC release, 1.4.0.
You can switch between the legacy ORC data source and the new ORC data source.
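
As a rough sketch of what such a switch can look like from user code: the snippet below assumes the property names that later shipped in Spark 2.3 (`spark.sql.orc.impl` with the values `native` and `hive`, plus `spark.sql.orc.enableVectorizedReader`). The names used in the PR at the time of this message may differ, and the helpers `orc_conf` and `build_session` are hypothetical, introduced only for illustration.

```python
from typing import Dict


def orc_conf(use_new_reader: bool) -> Dict[str, str]:
    """Return the Spark SQL settings that select an ORC implementation.

    Assumption: property names as released in Spark 2.3, not necessarily
    the names used in the PR discussed in this thread.
    """
    if use_new_reader:
        # New data source in sql/core, built on Apache ORC.
        return {
            "spark.sql.orc.impl": "native",
            "spark.sql.orc.enableVectorizedReader": "true",
        }
    # Legacy data source in sql/hive.
    return {"spark.sql.orc.impl": "hive"}


def build_session(use_new_reader: bool, app_name: str = "orc-impl-demo"):
    """Create a SparkSession with the chosen ORC implementation."""
    # Imported lazily so that orc_conf() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in orc_conf(use_new_reader).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With PySpark available, `build_session(True).read.orc("/path/to/orc")` would read through the new data source and `build_session(False)` would fall back to the legacy one; the path is a placeholder.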

Could you help me move this forward so that it can improve Apache Spark 2.3?

Bests,
Dongjoon.

From: Dong Joon Hyun <dh...@hortonworks.com>
Date: Tuesday, May 9, 2017 at 6:15 PM
To: "dev@spark.apache.org" <de...@spark.apache.org>
Subject: Faster Spark on ORC with Apache ORC

Hi, All.

Apache Spark has always been a fast and general engine, and
since SPARK-2883, Spark has supported Apache ORC inside the `sql/hive` module with a Hive dependency.

With Apache ORC 1.4.0 (released yesterday), we can make Spark on ORC faster and get some benefits.

    - Speed: Use Spark `ColumnarBatch` and ORC `RowBatch` together, which means full vectorization support.

    - Stability: Apache ORC 1.4.0 already has many fixes, and we can depend on the ORC community's effort in the future.

    - Usability: Users can use `ORC` data sources without the hive module (-Phive).

    - Maintainability: Reduce the Hive dependency and eventually remove some old legacy code from the `sql/hive` module.
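
To make the Speed item above concrete, here is a toy sketch in plain Python. It is not Spark or ORC code; every name in it (`read_rows`, `read_batches`, `sum_by_rows`, `sum_by_batches`) is hypothetical, and the batch size is arbitrary. It only illustrates the idea that a reader can hand whole column slices to the engine instead of building one record per row.

```python
from typing import Dict, Iterator, List

Columns = Dict[str, List[int]]


def _row_count(columns: Columns) -> int:
    """Number of rows, taken from the first column (0 if there are none)."""
    for values in columns.values():
        return len(values)
    return 0


def read_rows(columns: Columns) -> Iterator[Dict[str, int]]:
    """Row-at-a-time path: build one record object per row."""
    for i in range(_row_count(columns)):
        yield {name: values[i] for name, values in columns.items()}


def read_batches(columns: Columns, batch_size: int) -> Iterator[Columns]:
    """Batch path: yield slices of every column, batch_size rows at a time."""
    for start in range(0, _row_count(columns), batch_size):
        end = start + batch_size
        yield {name: values[start:end] for name, values in columns.items()}


def sum_by_rows(columns: Columns, name: str) -> int:
    """Aggregate by touching one record per row."""
    return sum(row[name] for row in read_rows(columns))


def sum_by_batches(columns: Columns, name: str, batch_size: int = 1024) -> int:
    """Aggregate by touching one column slice per batch."""
    return sum(sum(batch[name]) for batch in read_batches(columns, batch_size))
```

Both functions return the same total; the batch path simply does its per-call work once per batch rather than once per row, which is the property a vectorized reader relies on.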

As a first step, I made a PR adding a new ORC data source to the `sql/core` module.

https://github.com/apache/spark/pull/17924  (+ 3,691 lines, -0)

Could you give some opinions on this approach?

Bests,
Dongjoon.

Re: Faster Spark on ORC with Apache ORC

Posted by Jeff Zhang <zj...@gmail.com>.
Awesome, Dong Joon. It's a great improvement. Looking forward to its merge.




