You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by dongjoon-hyun <gi...@git.apache.org> on 2017/05/11 01:59:51 UTC

[GitHub] spark pull request #17943: [SPARK-20682][SQL] Implement new ORC data source ...

GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/17943

    [SPARK-20682][SQL] Implement new ORC data source based on Apache ORC

    ## What changes were proposed in this pull request?
    
    Since [SPARK-2883](https://issues.apache.org/jira/browse/SPARK-2883), Apache Spark supports Apache ORC inside `sql/hive` module with Hive dependency. This issue aims to add a new and faster ORC data source inside `sql/core` and to replace the old ORC data source eventually. In this issue, the latest [Apache ORC 1.4.0](https://orc.apache.org/news/2017/05/08/ORC-1.4.0/) (released yesterday) is used.
    
    There are four key benefits.
    
    - **Speed**: Use both Spark `ColumnarBatch` and ORC `RowBatch` together later. In this PR, only `RowBatch` is used. This is faster than the current implementation in Spark. For `ColumnarBatch`, we need to benchmark and choose the fastest way to use it later. (Please refer some discussion on #17924)
    - **Stability**: Apache ORC 1.4.0 has many fixes and we can depend on ORC community more.
    - **Usability**: User can use `ORC` data sources without hive module, i.e, `-Phive`.
    - **Maintainability**: Reduce the Hive dependency and can remove old legacy code later.
    
    The followings are two examples of comparisons in `OrcReadBenchmark.scala`.
    ```
    Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
    Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
    
    SQL Single Int Column Scan:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    SQL ORC Vectorized Reader                      278 /  320         56.5          17.7       1.0X
    SQL ORC MR Reader                              348 /  358         45.2          22.1       0.8X
    HIVE ORC MR Reader                             418 /  430         37.6          26.6       0.7X
    
    Partitioned Table:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    SQL Read data column                           273 /  283         57.6          17.4       1.0X
    SQL Read partition column                      252 /  266         62.5          16.0       1.1X
    SQL Read both columns                          283 /  293         55.5          18.0       1.0X
    HIVE Read data column                          510 /  520         30.8          32.4       0.5X
    HIVE Read partition column                     420 /  425         37.5          26.7       0.7X
    HIVE Read both columns                         527 /  538         29.9          33.5       0.5X
    ```
    
    ## How was this patch tested?
    
    Pass the Jenkins tests with newly added test suites in `sql/core`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-20682-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17943.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17943
    
----
commit 70bc00e3695ddec164ce626602a5e7f4b425f780
Author: Dongjoon Hyun <do...@apache.org>
Date:   2017-04-25T19:30:24Z

    [SPARK-20682][SQL] Implement new ORC data source based on Apache ORC

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17943: [SPARK-20682][SQL] Implement new ORC data source based o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17943
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17943: [SPARK-20682][SQL] Implement new ORC data source based o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17943
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17943: [SPARK-20682][SQL] Implement new ORC data source based o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17943
  
    **[Test build #78193 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78193/testReport)** for PR 17943 at commit [`d1417aa`](https://github.com/apache/spark/commit/d1417aa50f41b9b69f4a816a264d5ce96f201130).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17943: [SPARK-20682][SQL] Implement new ORC data source based o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17943
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17943: [SPARK-20682][SQL] Implement new ORC data source based o...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/17943
  
    Hi, @cloud-fan and @viirya .
    Could you review new ORC data source (without ColumnarBatch part) again?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17943: [SPARK-20682][SQL] Implement new ORC data source based o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17943
  
    **[Test build #76764 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76764/testReport)** for PR 17943 at commit [`70bc00e`](https://github.com/apache/spark/commit/70bc00e3695ddec164ce626602a5e7f4b425f780).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17943: [SPARK-20682][SQL] Implement new ORC data source based o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17943
  
    **[Test build #76776 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76776/testReport)** for PR 17943 at commit [`d1417aa`](https://github.com/apache/spark/commit/d1417aa50f41b9b69f4a816a264d5ce96f201130).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17943: [SPARK-20682][SQL] Implement new ORC data source based o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17943
  
    **[Test build #76764 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76764/testReport)** for PR 17943 at commit [`70bc00e`](https://github.com/apache/spark/commit/70bc00e3695ddec164ce626602a5e7f4b425f780).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17943: [SPARK-20682][SQL] Implement new ORC data source based o...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/17943
  
    Please refer the superset PR #17980 .


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17943: [SPARK-20682][SQL] Implement new ORC data source based o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17943
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76790/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17943: [SPARK-20682][SQL] Implement new ORC data source based o...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/17943
  
    Retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17943: [SPARK-20682][SQL] Implement new ORC data source based o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17943
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76764/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17943: [SPARK-20682][SQL] Implement new ORC data source based o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17943
  
    **[Test build #76790 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76790/testReport)** for PR 17943 at commit [`d1417aa`](https://github.com/apache/spark/commit/d1417aa50f41b9b69f4a816a264d5ce96f201130).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17943: [SPARK-20682][SQL] Implement new ORC data source based o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17943
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78193/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17943: [SPARK-20682][SQL] Implement new ORC data source based o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17943
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17943: [SPARK-20682][SQL] Implement new ORC data source ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun closed the pull request at:

    https://github.com/apache/spark/pull/17943


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17943: [SPARK-20682][SQL] Implement new ORC data source based o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17943
  
    **[Test build #78193 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78193/testReport)** for PR 17943 at commit [`d1417aa`](https://github.com/apache/spark/commit/d1417aa50f41b9b69f4a816a264d5ce96f201130).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17943: [SPARK-20682][SQL] Implement new ORC data source based o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17943
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76776/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17943: [SPARK-20682][SQL] Implement new ORC data source based o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17943
  
    **[Test build #76790 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76790/testReport)** for PR 17943 at commit [`d1417aa`](https://github.com/apache/spark/commit/d1417aa50f41b9b69f4a816a264d5ce96f201130).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17943: [SPARK-20682][SQL] Implement new ORC data source based o...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/17943
  
    Retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org