Posted to user@crunch.apache.org by Surbhi Mungre <mu...@gmail.com> on 2015/08/12 19:45:45 UTC

HFileOutputFormatForCrunch with spark pipeline

I am converting an MRPipeline to a SparkPipeline using these[1] instructions.
My SparkPipeline fails with this[2] exception. In my pipeline I am trying
to write to HBase using HFiles. IIUC, the M/R job which creates the HFiles
uses a custom partitioner, and I am not sure how Crunch translates this to
Spark. From the exception stack trace it looks like Spark is using the M/R
partitioner. I am completely new to Spark, but I think I will have to create
a custom Spark partitioner and use it instead. When converting an MRPipeline
to a SparkPipeline, if an M/R job uses a custom partitioner, will Crunch
handle it?


[1]
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_running_crunch_with_spark.html

[2] https://gist.github.com/anonymous/920c000f20229eaa76d8

Thanks,
Surbhi
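
For illustration, here is a minimal sketch of the conversion being described,
assuming the HFile write goes through Crunch's HFileUtils in
org.apache.crunch.io.hbase (the original message only says "write to HBase
using HFiles", so that method, the table name, the paths, and the buildPuts()
helper are all hypothetical). The only intended change is swapping MRPipeline
for SparkPipeline; everything downstream stays the same:

  import org.apache.crunch.PCollection;
  import org.apache.crunch.Pipeline;
  import org.apache.crunch.impl.mr.MRPipeline;
  import org.apache.crunch.impl.spark.SparkPipeline;
  import org.apache.crunch.io.hbase.HFileUtils;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;

  public class HFileLoadJob {

    public void run(boolean onSpark) throws Exception {
      Configuration conf = HBaseConfiguration.create();

      // Before: new MRPipeline(HFileLoadJob.class, conf).
      // After, per [1]: a SparkPipeline; "yarn-client" and the app name are
      // placeholders for whatever the cluster actually uses.
      Pipeline pipeline = onSpark
          ? new SparkPipeline("yarn-client", "hfile-load")
          : new MRPipeline(HFileLoadJob.class, conf);
      pipeline.setConfiguration(conf);

      PCollection<Put> puts = buildPuts(pipeline);    // application-specific
      HTable table = new HTable(conf, "my_table");    // hypothetical table name

      // On MapReduce this step sets up HFileOutputFormatForCrunch plus a
      // total-order partition over the table's region boundaries; the
      // question above is how that partitioning carries over to Spark.
      HFileUtils.writeToHFilesForIncrementalLoad(puts, table,
          new Path("/tmp/hfiles"));

      pipeline.done();
      table.close();
    }

    private PCollection<Put> buildPuts(Pipeline pipeline) {
      throw new UnsupportedOperationException("application-specific");
    }
  }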

Re: HFileOutputFormatForCrunch with spark pipeline

Posted by Josh Wills <jw...@cloudera.com>.
Tracking here: https://issues.apache.org/jira/browse/CRUNCH-556


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: HFileOutputFormatForCrunch with spark pipeline

Posted by Josh Wills <jw...@cloudera.com>.
Hey Surbhi,

I think it's just a bug: Crunch-on-Spark should be handling the partitioner
correctly without requiring you to write your own. I think the problem is
that we set the location of the partition file (the one the code complains
it can't find in your gist) inside the GroupingOptions class, but we're not
updating the Configuration object that the Spark job is going to use with
the location of that file the way we do on MapReduce. I'll file a bug for
it and see if I can come up with a fix and a unit test tomorrow.

Thanks!
Josh
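
To make the diagnosis concrete: assuming the partitioner involved is Hadoop's
TotalOrderPartitioner (the usual choice for HFileOutputFormat-style loads),
it locates its partition file through a key in the job Configuration. The
sketch below is only an approximation of the mechanism described above, not
the actual Crunch internals:

  import org.apache.crunch.GroupingOptions;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

  public class PartitionFileSketch {

    // The partition-file location travels with the GroupingOptions as an
    // extra conf pair (an approximation of what Crunch sets up for the
    // HFile write).
    static GroupingOptions hfileGroupingOptions(String partitionFile) {
      return GroupingOptions.builder()
          .partitionerClass(TotalOrderPartitioner.class)
          .conf(TotalOrderPartitioner.PARTITIONER_PATH, partitionFile)
          .build();
    }

    // On MapReduce, those extra pairs end up in the Job's Configuration
    // before submission, so the partitioner can read its file. The suspected
    // bug is that the equivalent copy never happens for the Configuration
    // the Spark job uses, hence the "can't find partition file" failure
    // in [2].
    static void applyToJobConf(String partitionFile, Configuration jobConf) {
      jobConf.set(TotalOrderPartitioner.PARTITIONER_PATH, partitionFile);
    }
  }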


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>