Posted to dev@crunch.apache.org by "Micah Whitacre (JIRA)" <ji...@apache.org> on 2015/09/03 04:52:45 UTC
[jira] [Comment Edited] (CRUNCH-557) Fix file distribution from HDFS in Crunch-on-Spark

    [ https://issues.apache.org/jira/browse/CRUNCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14728196#comment-14728196 ] 

Micah Whitacre edited comment on CRUNCH-557 at 9/3/15 2:52 AM:
---------------------------------------------------------------

Sure, I'll try and write up some tests for this too.

Looks like we have tests for Mapside Joins already [1].  Based on the fix that Josh proposed, I'm guessing those tests didn't catch the bug because the FS scheme was the same, so it didn't matter.  I'm going to set up some tests using a MiniDFSCluster.  I think we could possibly reuse the instance in SparkHFileTargetIT that is created as a by-product of the HBaseTestingUtility.

[1] - https://github.com/apache/crunch/blob/master/crunch-spark/src/it/java/org/apache/crunch/SparkMapsideJoinIT.java


was (Author: mkwhitacre):
Sure, I'll try and write up some tests for this too.

Looks like we have tests for Mapside Joins already.  Based on the fix that Josh proposed, I'm guessing those tests didn't catch the bug because the FS scheme was the same, so it didn't matter.  I'm going to set up some tests using a MiniDFSCluster.  I think we could possibly reuse the instance in SparkHFileTargetIT that is created as a by-product of the HBaseTestingUtility.

> Fix file distribution from HDFS in Crunch-on-Spark
> --------------------------------------------------
>
>                 Key: CRUNCH-557
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-557
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Josh Wills
>         Attachments: CRUNCH-557.patch
>
>
> From the user list:
> I was trying to determine the effect of changing JoinStrategy on a Spark pipeline. I noticed that my pipeline works fine with DefaultJoinStrategy, however I could not get it to work with MapSideJoinStrategy and BloomFilterJoinStrategy. For MapSideJoinStrategy I get exceptions [1] on the driver itself, and for BloomFilterJoinStrategy I get exceptions [2] in one of the stages. I have not tried any configuration changes, but I did run tests with datasets of different sizes to ensure that my PCollection is small enough to fit in memory. I am running Spark in yarn-client mode with Crunch 0.11.0-cdh5.4.2.
> [1] https://gist.github.com/anonymous/15d6c691b743ad392d42
> [2] https://gist.github.com/anonymous/b02a82401a30a69f1cff
> The bug is in the SparkRuntime.distributeFiles method, which needs to include a scheme for the URI it's handing to Spark.
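
To illustrate what "include a scheme" means in the description above, here is a minimal sketch using only java.net.URI. It is not the CRUNCH-557 patch and does not touch SparkRuntime: the class name, the qualify helper, and the namenode address are all hypothetical, and it assumes absolute paths. It only shows how a scheme-less path differs from a fully qualified one, which is also why tests that run against a single filesystem scheme would not notice the difference.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration only; not part of Crunch or Spark.
class QualifyPathSketch {

    // Returns the path unchanged if it already has a scheme; otherwise
    // borrows the scheme and authority from the given default filesystem URI.
    static URI qualify(String path, URI defaultFs) {
        URI uri = URI.create(path);
        if (uri.getScheme() != null) {
            return uri; // already qualified, e.g. hdfs://... or file:///...
        }
        try {
            return new URI(defaultFs.getScheme(), defaultFs.getAuthority(),
                           uri.getPath(), null, null);
        } catch (URISyntaxException e) {
            // Relative paths are out of scope for this sketch.
            throw new IllegalArgumentException("Cannot qualify path: " + path, e);
        }
    }

    public static void main(String[] args) {
        // Assumed default filesystem; the host and port are made up.
        URI defaultFs = URI.create("hdfs://namenode:8020");

        // A bare path says nothing about which filesystem it lives on.
        System.out.println(qualify("/tmp/crunch/side-input", defaultFs));
        // prints hdfs://namenode:8020/tmp/crunch/side-input

        // A path that already carries a scheme is left alone.
        System.out.println(qualify("file:///tmp/local", defaultFs));
        // prints file:///tmp/local
    }
}
```

In actual Hadoop code the equivalent step would more likely go through the FileSystem API than through string handling; the sketch avoids Hadoop dependencies so that it stays self-contained.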



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)