
Posted to user@crunch.apache.org by Daniel Siegmann <da...@velos.io> on 2014/06/19 22:41:35 UTC

Scrunch example project with SBT?

Does anyone have a self-contained example Scrunch project that builds with
SBT? I am having some difficulty setting up an example that will compile,
even when I take the dependency list from the Crunch POMs.

Also, the Scrunch website describes it as "experimental". Is it stable
enough to try in a production system?

-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegmann@velos.io W: www.velos.io

Re: Scrunch example project with SBT?

Posted by Josh Wills <jw...@cloudera.com>.

Ha! Not the prettiest thing, but it'll do. The CrunchTool trait also has a
done() method, so you can do:

pcol.write(to.textFile(outputPath))
done()


On Fri, Jun 20, 2014 at 2:32 PM, Daniel Siegmann <da...@velos.io>
wrote:

> Got it to work like so:
>
>
> read(from.textFile(inputPath)).write(to.textFile(outputPath)).native.getPipeline().done()
>
> Is that the correct way?
>
> Thanks for the help, I have a running word count example now. :-)



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: Scrunch example project with SBT?

Posted by Daniel Siegmann <da...@velos.io>.

Got it to work like so:

read(from.textFile(inputPath)).write(to.textFile(outputPath)).native.getPipeline().done()

Is that the correct way?

Thanks for the help, I have a running word count example now. :-)



On Fri, Jun 20, 2014 at 4:34 PM, Josh Wills <jw...@cloudera.com> wrote:

> You need to manually call run() or done() to execute the pipeline if
> you're not materializing the output. The user guide will be useful for the
> basic concepts, even though it focuses on the Java API.



Re: Scrunch example project with SBT?

Posted by Josh Wills <jw...@cloudera.com>.

You need to manually call run() or done() to execute the pipeline if you're
not materializing the output. The user guide will be useful for the basic
concepts, even though it focuses on the Java API.
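
The behaviour described above (a write is only planned until run() or done() is called, while materializing forces execution) can be illustrated with a small toy model. This is a sketch under stated assumptions, not Crunch or Scrunch code: ToyPipeline, its written map, and LazyDemo are hypothetical names invented for illustration, and the real planner is far more involved.

```scala
import scala.collection.mutable

// Toy stand-in for a lazily executed pipeline. Hypothetical; not the Crunch API.
final class ToyPipeline {
  // Outputs that have been planned but not yet produced.
  private val pending = mutable.ArrayBuffer.empty[(String, Seq[String])]

  // Simulated file system: output path -> lines written there.
  val written = mutable.Map.empty[String, Seq[String]]

  // Registers an output target; nothing is written at this point.
  def write(data: Seq[String], path: String): Unit =
    pending += ((path, data))

  // Executes every planned write.
  def run(): Unit = {
    pending.foreach { case (path, data) => written(path) = data }
    pending.clear()
  }

  // Runs any outstanding work (a real pipeline would also clean up here).
  def done(): Unit = run()

  // Reading results back forces execution, so no explicit run() is needed.
  def materialize(data: Seq[String]): Seq[String] = { run(); data }
}

object LazyDemo {
  def main(args: Array[String]): Unit = {
    val p = new ToyPipeline
    p.write(Seq("a\t1", "b\t2"), "out/wordcounts")
    println(p.written.contains("out/wordcounts")) // false: the write is only planned
    p.done()
    println(p.written.contains("out/wordcounts")) // true: done() executed the plan
  }
}
```

In this model, skipping done() leaves the written map empty and raises no error, which mirrors the "no output directory, no exceptions" symptom reported in the thread.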
On Jun 20, 2014 1:27 PM, "Daniel Siegmann" <da...@velos.io> wrote:

> Thanks Josh! The thrift and protobuf defs were what I was missing. I'm
> able to compile and run the code now. I also updated to Scrunch 0.10.0.
>
> Any idea why it might not write the output? If I have
>
> countWords(args(0)).materialize.foreach(line => println(s"**** $line"))
>
> I get all my output, but
>
> countWords(args(0)).write(to.textFile(args(1)))
>
> Doesn't even create the output directory, even though I see this in my logs
>
> 14/06/20 16:17:47 INFO impl.FileTargetImpl: Will write output files to new
> path:
> /var/folders/th/7vf9rjqd1955jnwnzg3x9ym40000gn/T/1403295466563-1/wordcounts
>
> No exceptions or anything. I'm probably missing something obvious. :-(

Re: Scrunch example project with SBT?

Posted by Daniel Siegmann <da...@velos.io>.

Thanks Josh! The thrift and protobuf defs were what I was missing. I'm able
to compile and run the code now. I also updated to Scrunch 0.10.0.

Any idea why it might not write the output? If I have

countWords(args(0)).materialize.foreach(line => println(s"**** $line"))

I get all my output, but

countWords(args(0)).write(to.textFile(args(1)))

doesn't even create the output directory, even though I see this in my logs:

14/06/20 16:17:47 INFO impl.FileTargetImpl: Will write output files to new path: /var/folders/th/7vf9rjqd1955jnwnzg3x9ym40000gn/T/1403295466563-1/wordcounts

No exceptions or anything. I'm probably missing something obvious. :-(


On Thu, Jun 19, 2014 at 6:03 PM, Josh Wills <jw...@cloudera.com> wrote:

> Here you go: https://github.com/jwills/scrunch-demo
>
> Did this w/Maven; you'll have to forgive me as my SBT-fu isn't great. It
> looks like vanilla Hadoop 1.x doesn't include any thrift/protobuf
> dependencies that Scrunch expects to be present at compile-time; I added
> them as provided dependencies in this example and then verified that I
> could run the -job.jar that I built w/mvn package under Hadoop 1.0.3.
>
> J




Re: Scrunch example project with SBT?

Posted by Josh Wills <jw...@cloudera.com>.

Here you go: https://github.com/jwills/scrunch-demo

Did this w/Maven; you'll have to forgive me as my SBT-fu isn't great. It
looks like vanilla Hadoop 1.x doesn't include any thrift/protobuf
dependencies that Scrunch expects to be present at compile-time; I added
them as provided dependencies in this example and then verified that I
could run the -job.jar that I built w/mvn package under Hadoop 1.0.3.

J
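
For readers who want the SBT side of the same idea, a build.sbt along the following lines might be a starting point. It is a sketch under assumptions, not a translation of the scrunch-demo POM: the project name, the Scala version, the plain (unsuffixed) crunch-scrunch artifact name, and the thrift and protobuf versions are all guesses that should be checked against that repository. Only Scrunch 0.10.0, Hadoop 1.0.3, and the use of provided scope come from the thread.

```scala
// build.sbt: a sketch, not taken from the scrunch-demo repository.
name := "scrunch-example"

version := "0.1.0"

// Assumption: Scrunch 0.10.0 is built against Scala 2.10.
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // Scrunch itself. Plain % assumes the artifact name has no Scala suffix.
  "org.apache.crunch" % "crunch-scrunch" % "0.10.0",
  // Supplied by the cluster at run time, so kept out of the packaged jar.
  "org.apache.hadoop" % "hadoop-core" % "1.0.3" % "provided",
  // Present only so the compiler can resolve types that Scrunch references.
  // Both versions below are assumptions.
  "org.apache.thrift" % "libthrift" % "0.8.0" % "provided",
  "com.google.protobuf" % "protobuf-java" % "2.4.1" % "provided"
)
```

Marking the thrift and protobuf jars as provided follows the approach described above: they satisfy the compiler without being bundled into the job jar.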

On Thu, Jun 19, 2014 at 2:33 PM, Daniel Siegmann <da...@velos.io>
wrote:

> Hi Josh, thanks for the reply.
>
>> Which version of Hadoop are you looking to compile against?
>
> I think any 1.x version will suffice (our production cluster is MapR).
>
> The Spotify comparison is interesting. Too bad they didn't evaluate Scoobi
> as well. Thanks for the info.
>


Re: Scrunch example project with SBT?

Posted by Daniel Siegmann <da...@velos.io>.

Hi Josh, thanks for the reply.

> Which version of Hadoop are you looking to compile against?

I think any 1.x version will suffice (our production cluster is MapR).

The Spotify comparison is interesting. Too bad they didn't evaluate Scoobi
as well. Thanks for the info.

Re: Scrunch example project with SBT?

Posted by Josh Wills <jw...@cloudera.com>.

Hey Daniel,

Which version of Hadoop are you looking to compile against?

Re: stability, most of my effort lately has been on expanding the
functionality of Scrunch, doing things like adding more support for
automatic inference of types, improving the Avro reflection-based
serialization, and adding more Scala stuff like partial function support on
PCollections (CRUNCH-422) and Algebird-based aggregations (CRUNCH-424). I
have one large customer that runs it in production, and I understand that
the folks at Spotify are fans as well: http://thewit.ch/shug/

J

On Thu, Jun 19, 2014 at 1:41 PM, Daniel Siegmann <da...@velos.io>
wrote:

> Does anyone have a self-contained example Scrunch project that builds with
> SBT? I am having some difficulty setting up an example that will compile,
> even when I try to take the dependencies list from Crunch POMs.
>
> Also, the Scrunch website describes it as "experimental". Is it stable
> enough to try in a production system?
