Posted to reviews@spark.apache.org by liancheng <gi...@git.apache.org> on 2017/02/03 19:08:10 UTC

[GitHub] spark pull request #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workar...

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/16791

    [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/hacks due to bugs of old Parquet versions

    ## What changes were proposed in this pull request?
    
    We've already upgraded parquet-mr to 1.8.2. This PR does some further cleanup by removing a workaround of PARQUET-686 and a hack due to PARQUET-363 and PARQUET-278. All three Parquet issues are fixed in parquet-mr 1.8.2.
    
    ## How was this patch tested?
    
    Existing unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark parquet-1.8.2-cleanup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16791.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16791
    
----
commit 5d97b39d2c196ba30a5eaa0c5e782f6c115668fc
Author: Cheng Lian <li...@databricks.com>
Date:   2017-01-23T21:05:44Z

    Upgrade Parquet to 1.8.2-rc1

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/16791
  
    Originally, it was reported at #16751 by @robbinspg.


[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/16791
  
    @dongjoon-hyun Ah, thanks!


[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/16791
  
    BTW, LGTM for the changes in `ParquetSchemaConverter`, as I have the exact same changes ready for a PR locally. However, do you mind if I ask whether PARQUET-686 was fixed in another JIRA? It does not seem to be resolved. If it is fixed in 1.8.2, then I believe we should re-enable the tests in `ParquetFilterSuite`.


[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/16791
  
    Yep. The failure is on #16751. It's my bad.


[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/16791
  
    @dongjoon-hyun Actually, could you please point me to the Maven build failure? I don't think this failure is caused by this PR, is it? Are you referring to some existing PR introduced by some earlier commit(s)?


[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/16791
  
    I'll create another PR and ping you.


[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/16791
  
    Merging in master.



[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/16791
  
    Hope we finally have proper Parquet filter push-down for string/binary columns (fingers crossed)!


[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16791
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/16791
  
    cc @cloud-fan @rxin @HyukjinKwon 


[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/16791
  
    LGTM


[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/16791
  
    Here,
    
    https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/2530/consoleFull
    
    ```
    ParquetAvroCompatibilitySuite:
    *** RUN ABORTED ***
      java.lang.NoClassDefFoundError: org/apache/avro/LogicalType
      at org.apache.parquet.avro.AvroParquetWriter.writeSupport(AvroParquetWriter.java:144)
      at org.apache.parquet.avro.AvroParquetWriter.access$100(AvroParquetWriter.java:35)
      at org.apache.parquet.avro.AvroParquetWriter$Builder.getWriteSupport(AvroParquetWriter.java:173)
      at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:495)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetAvroCompatibilitySuite.org$apache$spark$sql$execution$datasources$parquet$ParquetAvroCompatibilitySuite$$withWriter(ParquetAvroCompatibilitySuite.scala:46)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetAvroCompatibilitySuite$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ParquetAvroCompatibilitySuite.scala:54)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetAvroCompatibilitySuite$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ParquetAvroCompatibilitySuite.scala:51)
      at org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:122)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetCompatibilityTest.withTempPath(ParquetCompatibilityTest.scala:35)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetAvroCompatibilitySuite$$anonfun$1.apply$mcV$sp(ParquetAvroCompatibilitySuite.scala:51)
      ...
      Cause: java.lang.ClassNotFoundException: org.apache.avro.LogicalType
      at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
      at org.apache.parquet.avro.AvroParquetWriter.writeSupport(AvroParquetWriter.java:144)
      at org.apache.parquet.avro.AvroParquetWriter.access$100(AvroParquetWriter.java:35)
      at org.apache.parquet.avro.AvroParquetWriter$Builder.getWriteSupport(AvroParquetWriter.java:173)
      at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:495)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetAvroCompatibilitySuite.org$apache$spark$sql$execution$datasources$parquet$ParquetAvroCompatibilitySuite$$withWriter(ParquetAvroCompatibilitySuite.scala:46)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetAvroCompatibilitySuite$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ParquetAvroCompatibilitySuite.scala:54)
      ...
    ```
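
Note: the trace above shows `org.apache.avro.LogicalType` missing from the test classpath at run time. A minimal sketch for confirming this kind of problem follows, assuming you can run a small class on the same classpath as the failing suite. `ClasspathCheck` and `isClassPresent` are hypothetical names used only for illustration; they are not part of Spark, Parquet, or Avro.

```java
// Hypothetical helper, not part of Spark or Parquet: reports whether a class
// can be loaded from the current classpath, without initializing it.
final class ClasspathCheck {

    static boolean isClassPresent(String className) {
        try {
            // initialize = false: only locate and load the class, do not run static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError raised while resolving the class.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the class named in the stack trace; pass another name as the first argument.
        String name = args.length > 0 ? args[0] : "org.apache.avro.LogicalType";
        System.out.println(name + " present: " + isClassPresent(name));
    }
}
```

Running this with the module's test classpath would print whether the class is visible. It only diagnoses the symptom; which Avro version ends up on the classpath is still decided by the build's dependency resolution.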


[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/16791
  
    I see, thanks for the context. But I'd like to keep this Maven build failure fix in a separate PR so that people can easily cherry-pick the fix. Also, it helps to keep this PR easier to follow. Could you please send another PR to fix the Maven failure?
    
    BTW, IIUC, this failure only affects Maven builds, and that's probably why it wasn't caught by the Jenkins PR builder when testing #16751?


[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/16791
  
    Great, @liancheng .
    BTW, could you specify the test dependency like the following in this PR?
    ```
    diff --git a/sql/core/pom.xml b/sql/core/pom.xml
    index b8aa698090..f9acf815e3 100644
    --- a/sql/core/pom.xml
    +++ b/sql/core/pom.xml
    @@ -130,6 +130,12 @@
           <scope>test</scope>
         </dependency>
         <dependency>
    +      <groupId>org.apache.avro</groupId>
    +      <artifactId>avro</artifactId>
    +      <version>1.8.0</version>
    +      <scope>test</scope>
    +    </dependency>
    +    <dependency>
           <groupId>org.mockito</groupId>
           <artifactId>mockito-core</artifactId>
           <scope>test</scope>
    ```
    
    It's reported that the Maven test fails on `ParquetAvroCompatibilitySuite`.


[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16791
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72315/
    Test PASSed.


[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16791
  
    **[Test build #72315 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72315/testReport)** for PR 16791 at commit [`5d97b39`](https://github.com/apache/spark/commit/5d97b39d2c196ba30a5eaa0c5e782f6c115668fc).


[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/16791
  
    @HyukjinKwon Sorry that I didn't see your comment before this PR got merged. I believe PARQUET-686 had already been fixed by apache/parquet-mr#367 but wasn't marked as resolved in JIRA. Thanks for sending out #16817 for re-enabling the tests!


[GitHub] spark pull request #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workar...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/16791


[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16791
  
    **[Test build #72315 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72315/testReport)** for PR 16791 at commit [`5d97b39`](https://github.com/apache/spark/commit/5d97b39d2c196ba30a5eaa0c5e782f6c115668fc).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/16791
  
    I see. Sure!
    Right, it happens only in the Maven build so far.


[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/16791
  
    Ah, thank you for confirming and the information!

