Posted to dev@spark.apache.org by Dong Joon Hyun <dh...@hortonworks.com> on 2017/08/04 15:05:08 UTC

Use Apache ORC in Apache Spark 2.3

Hi, All.

Apache Spark has always been a fast and general engine, and it has
supported Apache ORC inside the `sql/hive` module, with a Hive dependency, since Spark 1.4.x (SPARK-2883).
However, there are many open issues under `Feature parity for ORC with Parquet (SPARK-20901)` as of today.

With the new Apache ORC 1.4 (released 8th May), Apache Spark can gain the following benefits.

    - Usability:
        * Users can use `ORC` data sources without the hive module (-Phive), just like the `Parquet` format (see the sketch after this list).

    - Stability & Maintainability:
        * ORC 1.4 already has many fixes.
        * In the future, Spark can upgrade the ORC library independently of Hive
           (similar to the Parquet library).
        * Eventually, this reduces the dependency on old Hive 1.2.1.

    - Speed:
        * Last but not least, Spark can use Spark `ColumnarBatch` and ORC `RowBatch` together,
          which means full vectorization support.
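
To make the Usability point concrete, here is a minimal sketch of the `ORC` source through the ordinary data source API, assuming a `spark-shell` session (so `spark` is in scope); the path is illustrative:

    // Write and read ORC exactly like Parquet, with no Hive dependency.
    val df = spark.range(0, 10).toDF("id")          // a tiny example DataFrame
    df.write.orc("/tmp/ids.orc")                    // cf. df.write.parquet(...)
    val loaded = spark.read.orc("/tmp/ids.orc")     // cf. spark.read.parquet(...)
    loaded.show()

This read/write API already exists today; the point of the work below is to back it by `sql/core` so it works without `-Phive`.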

First of all, I'd love to improve Apache Spark through the following steps in the time frame of Spark 2.3.

    - SPARK-21422: Depend on Apache ORC 1.4.0
    - SPARK-20682: Add a new faster ORC data source based on Apache ORC
    - SPARK-20728: Make ORCFileFormat configurable between sql/hive and sql/core (see the sketch after this list)
    - SPARK-16060: Vectorized Orc Reader
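
As an illustration of the third step (SPARK-20728), the switch could be an ordinary session configuration; the key and values below are a hypothetical sketch for illustration, not a settled API:

    // Hypothetical switch between the two ORC implementations (key name assumed).
    spark.conf.set("spark.sql.orc.impl", "native")  // new reader in sql/core
    spark.conf.set("spark.sql.orc.impl", "hive")    // existing reader in sql/hive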

I’ve been making the above PRs since 9th May, the day after the Apache ORC 1.4 release,
but the PRs seem to need more attention from the PMC since this is an important change.
Since the discussion on the Apache Spark 2.3 cadence already started this week,
I thought it was the best time to ask you about this.

Could any of you help me move the ORC improvements forward in the Apache Spark community?

Please visit the minimal PR and JIRA issue as a starting point.


  *   https://github.com/apache/spark/pull/18640
  *   https://issues.apache.org/jira/browse/SPARK-21422

Thank you in advance.

Bests,
Dongjoon Hyun.

Re: Use Apache ORC in Apache Spark 2.3

Posted by Sean Owen <so...@cloudera.com>.
-private@ list for future replies. This is not a PMC conversation.

Re: Use Apache ORC in Apache Spark 2.3

Posted by Andrew Ash <an...@andrewash.com>.
@Reynold no I don't use the HiveCatalog -- I'm using a custom
implementation of ExternalCatalog instead.

Re: Use Apache ORC in Apache Spark 2.3

Posted by Dong Joon Hyun <dh...@hortonworks.com>.
Thank you, Andrew and Reynold.

Yes, it will eventually reduce the old Hive dependency, at least for the ORC code.

And Spark without `-Phive` can use ORC just like Parquet.

This is one milestone for `Feature parity for ORC with Parquet (SPARK-20901)`.

Bests,
Dongjoon

Re: Use Apache ORC in Apache Spark 2.3

Posted by Reynold Xin <rx...@databricks.com>.
Do you not use the catalog?


Re: Use Apache ORC in Apache Spark 2.3

Posted by Andrew Ash <an...@andrewash.com>.
I would support moving ORC from sql/hive -> sql/core because it brings me
one step closer to eliminating Hive from my Spark distribution by removing
-Phive at build time.


Re: Use Apache ORC in Apache Spark 2.3

Posted by Dong Joon Hyun <dh...@hortonworks.com>.
Thank you again for coming and reviewing this PR.

So far, we have discussed the following.

1. `Why are we adding this to core? Why not just the hive module?` (@rxin)
   - The `sql/core` module gives more benefit than `sql/hive`.
   - The Apache ORC library (`no-hive` version) is a general and reasonably small library designed for non-Hive apps.

2. `Can we add smaller amount of new code to use this, too?` (@kiszk)
   - The previous #17980, #17924, and #17943 are complete examples containing this PR.
   - This PR focuses on the dependency only.

3. `Why don't we then create a separate orc module? Just copy a few of the files over?` (@rxin)
   - The Apache ORC library is used the same way as most other data sources (CSV, JDBC, JSON, Parquet, text) that live inside `sql/core`.
   - It's better to use it as a library instead of copying the ORC files, because the Apache ORC shaded jar has many files. We had better depend on the Apache ORC community's effort until an unavoidable reason for copying occurs.

4. `I do worry in the future whether ORC would bring in a lot more jars` (@rxin)
   - The ORC core library's dependency tree is aggressively kept as small as possible. I've gone through and excluded unnecessary jars from our dependencies. I also kick back pull requests that add unnecessary new dependencies. (@omalley)

5. `In the long term, Spark should move to using only the vectorized reader in ORC's core` (@omalley)
   - Of course.
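
For context on (5), here is a minimal sketch of ORC core's vectorized read loop, assuming ORC 1.4's published Java API (in the `no-hive` build, the vector classes are relocated under `org.apache.orc.storage`); the file path and the single bigint column are illustrative:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector
    import org.apache.orc.OrcFile

    val conf = new Configuration()
    val reader = OrcFile.createReader(new Path("/tmp/ids.orc"), OrcFile.readerOptions(conf))
    val batch = reader.getSchema.createRowBatch()   // ORC's vectorized row batch
    val rows = reader.rows()
    while (rows.nextBatch(batch)) {                 // one batch at a time, not one row
      val ids = batch.cols(0).asInstanceOf[LongColumnVector]
      var r = 0
      while (r < batch.size) {
        // A Spark reader would copy or wrap these vectors into `ColumnarBatch`.
        println(ids.vector(r))
        r += 1
      }
    }
    rows.close()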

I’ve been waiting for new comments and discussion since last week.
Apparently, there have been no further comments except the last one (5) from Owen this week.

Please give your opinion if you think we need some change to the current PR (as-is).
FYI, there is one LGTM on the PR (as-is) and no -1 so far.

Thank you again for supporting the new ORC improvements in Apache Spark.

Bests,
Dongjoon.

Re: Use Apache ORC in Apache Spark 2.3

Posted by Dong Joon Hyun <dh...@hortonworks.com>.
Thank you so much, Owen!

Bests,
Dongjoon.


Re: Use Apache ORC in Apache Spark 2.3

Posted by Owen O'Malley <ow...@gmail.com>.
The ORC community is really eager to get this work integrated into Spark
so that Spark users can have fast access to their ORC data. Let us know if
we can help with the integration.

Thanks,
   Owen
