Posted to user@crunch.apache.org by Tim van Heugten <st...@gmail.com> on 2013/01/30 10:33:49 UTC

MemPipeline and context

Hi,

Since April I've been using Crunch for a project. We're not doing only linear
executions of the pipeline, so we sometimes have issues with how
Crunch optimizes our execution graph. We need to add materializations
here and there as hints about which parts of the graph can be shared between
outputs and so on.

Recently we decided to see if 0.4.0-incubating would provide us any
improvements (I'm afraid not yet). Trying to adapt our code to the new API,
however, exposed some difficulties and issues. A few bugs have been
reported regarding those issues (CRUNCH-152 to CRUNCH-155), thank you for
picking them up.

The difficulties arise from the newly introduced tight coupling with
TaskInputOutputContext. Now in our JUnit tests we need to inject this
before the tests can run (many of our DoFns adjust counters or perform
progress() calls). So far so good, I can use
CrunchTestSupport.getTestContext(config) with a mocked config and call
setContext() on the DoFn. But some things are unclear:
* Should I call initialize() after setContext()?
* I can see initialize() is called in setContext(), but this doesn't seem
documented or guaranteed. Should setContext() be made final so it can be
documented that initialize() does not need to be called afterwards?
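
For reference, this is roughly the test setup I mean. It is only a minimal
sketch: the DoFn, the counter names and the test class are placeholders
standing in for our business code, not something from the Crunch codebase.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.test.CrunchTestSupport;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.TaskInputOutputContext;
import org.junit.Test;

public class UpperCaseFnTest {

  // Placeholder DoFn standing in for our business code; it uses counters and progress().
  static class UpperCaseFn extends DoFn<String, String> {
    @Override
    public void process(String input, Emitter<String> emitter) {
      increment("stats", "records"); // the counter use is what forces us to inject a context
      progress();
      emitter.emit(input.toUpperCase());
    }
  }

  @Test
  public void processEmitsUpperCase() {
    TaskInputOutputContext<?, ?, ?, ?> context =
        CrunchTestSupport.getTestContext(new Configuration());

    UpperCaseFn fn = new UpperCaseFn();
    fn.setContext(context); // initialize() seems to be invoked from setContext()...
    // fn.initialize();     // ...but should we also call it ourselves? That is the question.

    fn.process("tim", new Emitter<String>() {
      public void emit(String emitted) { /* assert on the output here */ }
      public void flush() { }
    });
  }
}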

In our more elaborate tests we use MemPipeline to see the combined effect
of our DoFns. But there:
* MemCollection shows ambiguous behaviour wrt initialize()/setContext().
* A parallelDo with a PCollection output makes a call to *just*
initialize(), and a parallelDo with a PTable output makes a call to
*both* initialize() and setContext(). Currently this fails some of our
tests because we use counters and progress().
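
To make the difference concrete, here is a stripped-down sketch of the kind
of MemPipeline test I mean (the class, counter names and values are made up,
not our real job):

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.types.writable.Writables;

public class MemPipelineCounterDemo {
  public static void main(String[] args) {
    PCollection<String> words = MemPipeline.collectionOf("foo", "bar");

    // PCollection output: as far as we can tell, only initialize() is called on the DoFn.
    PCollection<String> upper = words.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String input, Emitter<String> emitter) {
        increment("stats", "upper"); // this is where our tests trip over the stubbed context
        progress();
        emitter.emit(input.toUpperCase());
      }
    }, Writables.strings());

    // PTable output: here both initialize() and setContext() appear to be called.
    PTable<String, Long> lengths = words.parallelDo(new DoFn<String, Pair<String, Long>>() {
      @Override
      public void process(String input, Emitter<Pair<String, Long>> emitter) {
        emitter.emit(Pair.of(input, (long) input.length()));
      }
    }, Writables.tableOf(Writables.strings(), Writables.longs()));

    System.out.println(upper.materialize());
    System.out.println(lengths.materialize());
  }
}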

Finally, I've had to create our own implementation of MemCollection
altogether*, because the stubbed TaskInputOutputContext is too limited for
our tests: *the stubbed TaskInputOutputContext in MemCollection is unable
to handle Counters*.
I'm aware that as much is stated in the javadoc, but I can't choose *not*
to use counters when testing the business code. Because Counters were
handled (and even counted) in 0.3.0, I feel confident enough about this
to raise the issue.

I'm very happy with the API of Crunch and would love for this project to
become more reliable and widely adopted. If there is anything I can do (or
any pointers on where to begin understanding the planning component), please
let me know.

Cheers,

Tim van Heugten

* Because I use MemPipeline in test contexts only, I rely on the mocked
instance from CrunchTestSupport.getTestContext(conf). Further, replacing
MemCollection implies replacing MemTable and MemGroupedTable as well.

Re: MemPipeline and context

Posted by Josh Wills <jw...@cloudera.com>.
I created https://issues.apache.org/jira/browse/CRUNCH-157 and
https://issues.apache.org/jira/browse/CRUNCH-158 to track these.


On Wed, Jan 30, 2013 at 6:07 AM, Josh Wills <jw...@cloudera.com> wrote:

> Hey Tim,
>
> Replies inlined.
>
>
> On Wed, Jan 30, 2013 at 1:33 AM, Tim van Heugten <st...@gmail.com> wrote:
>
>> Hi,
>>
>> Since april I'm using Crunch for a project. We're not doing only linear
>> executions of the pipeline, so we're sometimes having issues with how
>> Crunch is optimizing our execution graph. We need to add materializations
>> here and there as hints to what parts of the graph can be shared for
>> outputs and so on.
>>
>> Recently we decided to see if 0.4.0-incubating would provide us any
>> improvements (I'm afraid not yet). Trying to adapt our code to the new API,
>> however, exposed some difficulties and issues. A few bugs have been
>> reported regarding those issues (CRUNCH-152 to CRUNCH-155), thank you for
>> picking them up.
>>
>> The difficulties arise from the newly introduced tight coupling with
>> TaskInputOutputContext. Now in our JUnit tests we need to inject this
>> before the tests can run (many of our DoFns adjust counters or perform
>> progress() calls). So far so good, I can use
>> CrunchTestSupport.getTestContext(config) with a mocked config and call
>> setContext() on the DoFn. But some things are unclear:
>> * Should I call initialize() after setContext()?
>> * I can see initialize() is called in setContext(), but this doesn't seem
>> documented or guaranteed. Should setContext() be made final so it can be
>> documented that initialize() does not need to be called afterwards?
>>
>
> Yes, I think so. I'm not sure of the implications of doing that, but I'll
> create a branch and see what fails and what needs fixing.
>
>
>>
>> In our more elaborate tests we use MemPipeline to see the combined effect
>> of our DoFns. But there:
>> * MemCollection shows ambiguous behaviour wrt initialize()/setContext().
>> * A parallelDo with a PCollection output makes a call to *just*
>> initialize(), and a parallelDo with a PTable output makes a call to
>> *both* initialize() and setContext(). Currently this fails some of our
>> tests because we use counters and progress().
>>
>
> Yeah, that's no good. Let's open a JIRA for that one.
>
>
>>
>> Finally, I've had to create our own implementation of MemCollection
>> altogether*, because the stubbed TaskInputOutputContext is too limited for
>> our tests.
>> *Stubbed TaskInputOutputContext in MemCollection is unable to handle
>> Counters*.
>> I'm aware that as much is stated in the javadoc, but I can't choose *not*
>> to use counters when testing the business code. Because Counters were
>> handled (and even counted) in 0.3.0 I'm feeling confident enough about this
>> to raise the issue.
>>
>
> Agreed-- I didn't realize MemCollection gave up the ability to use
> Counters in that change-- I'm a bit surprised by that. Let's create a JIRA
> to put it back in.
>
>
>> I'm very happy with the api of crunch and would love for this project to
>> become more reliable and widely adopted. If there is anything I can do (or
>> instruction on where to begin understanding the planning component) please
>> let me know.
>>
>> Cheers,
>>
>> Tim van Heugten
>>
>> * Because I use MemPipeline in test contexts only I rely on the mocked
>> instance from CrunchTestSupport.getTestContext(conf);, further, replacing
>> MemCollection implies replacing MemTable and MemGroupedTable as well.
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: MemPipeline and context

Posted by Josh Wills <jw...@cloudera.com>.
Hey Tim,

Replies inlined.


On Wed, Jan 30, 2013 at 1:33 AM, Tim van Heugten <st...@gmail.com> wrote:

> Hi,
>
> Since april I'm using Crunch for a project. We're not doing only linear
> executions of the pipeline, so we're sometimes having issues with how
> Crunch is optimizing our execution graph. We need to add materializations
> here and there as hints to what parts of the graph can be shared for
> outputs and so on.
>
> Recently we decided to see if 0.4.0-incubating would provide us any
> improvements (I'm afraid not yet). Trying to adapt our code to the new API,
> however, exposed some difficulties and issues. A few bugs have been
> reported regarding those issues (CRUNCH-152 to CRUNCH-155), thank you for
> picking them up.
>
> The difficulties arise from the newly introduced tight coupling with
> TaskInputOutputContext. Now in our JUnit tests we need to inject this
> before the tests can run (many of our DoFns adjust counters or perform
> progress() calls). So far so good, I can use
> CrunchTestSupport.getTestContext(config) with a mocked config and call
> setContext() on the DoFn. But some things are unclear:
> * Should I call initialize() after setContext()?
> * I can see initialize() is called in setContext(), but this doesn't seem
> documented or guaranteed. Should setContext() be made final so it can be
> documented that initialize() does not need to be called afterwards?
>

Yes, I think so. I'm not sure of the implications of doing that, but I'll
create a branch and see what fails and what needs fixing.


>
> In our more elaborate tests we use MemPipeline to see the combined effect
> of our DoFns. But there:
> * MemCollection shows ambiguous behaviour wrt initialize()/setContext().
> * A parallelDo with a PCollection output makes a call to *just*
> initialize(), and a parallelDo with a PTable output makes a call to
> *both* initialize() and setContext(). Currently this fails some of our
> tests because we use counters and progress().
>

Yeah, that's no good. Let's open a JIRA for that one.


>
> Finally, I've had to create our own implementation of MemCollection
> altogether*, because the stubbed TaskInputOutputContext is too limited for
> our tests.
> *Stubbed TaskInputOutputContext in MemCollection is unable to handle
> Counters*.
> I'm aware that as much is stated in the javadoc, but I can't choose *not*
> to use counters when testing the business code. Because Counters were
> handled (and even counted) in 0.3.0 I'm feeling confident enough about this
> to raise the issue.
>

Agreed-- I didn't realize MemCollection gave up the ability to use Counters
in that change-- I'm a bit surprised by that. Let's create a JIRA to put it
back in.


> I'm very happy with the api of crunch and would love for this project to
> become more reliable and widely adopted. If there is anything I can do (or
> instruction on where to begin understanding the planning component) please
> let me know.
>
> Cheers,
>
> Tim van Heugten
>
> * Because I use MemPipeline in test contexts only I rely on the mocked
> instance from CrunchTestSupport.getTestContext(conf);, further, replacing
> MemCollection implies replacing MemTable and MemGroupedTable as well.
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: MemPipeline and context

Posted by Tim van Heugten <st...@gmail.com>.
Usually just to alter the default execution plan.

In this case, where the Crunch bug counteracted our own bug, we used it to
end up with the desired(*) output (in fact triggering the Crunch bug). We
have now fixed our bug and are no longer pursuing the Crunch bug.

In general I would not expect the output to be dependent on the execution
plan.


Cheers,

Tim

*) Here we had a discrepancy between desired and (technically) correct
output.

On Wed, Feb 6, 2013 at 2:07 PM, Gabriel Reid <ga...@gmail.com> wrote:

> Thanks for all the info Tim. I've posted a bit more information on
> CRUNCH-163, and will look into it more this evening.
>
> About calling materialize within pipelines, just to clarify: are you doing
> this both to get a more efficient execution (i.e. alter the default
> execution plan) as well as to get the correct output, or just one of those
> two?
>
> Thanks,
>
> Gabriel
>
>
> On Wed, Feb 6, 2013 at 11:53 AM, Tim van Heugten <st...@gmail.com> wrote:
>
>> To summarize:
>> - When we saw data duplication, that was what we should have been
>> expecting, given our implementation. That is not the issue.
>> - Sometimes we didn't see data duplication. That is an issue:
>>    *Union sometimes ignores one of the input branches.*
>>
>> I created https://issues.apache.org/jira/browse/CRUNCH-163 for this
>> issue. The tests singleUnion and doubleUnionWithoutMaterializeInbetween
>> pass in my environment (0.4), the others fail.
>> Besides breaking a union by adding a materialize after it I could also
>> break it by performing a parallelDo after it or by just joining two read
>> pCollections.
>>
>>
>> Cheers,
>>
>> Tim
>>
>>
>>
>>  On Tue, Feb 5, 2013 at 3:38 PM, Tim van Heugten <st...@gmail.com>wrote:
>>
>>> Hmmm,
>>>
>>> So we had a mistake in our code that emitted the data in both branches
>>> before union2.
>>> *And*, the crunch union also *failed to merge the data* in some
>>> circumstance. My side-remark about not seeing the join happen was actually
>>> bang on.. :-/
>>>
>>> So the question now becomes, when does a union ignore one of its
>>> incoming branches?
>>> Apparently with materialization in the right spots we can force the
>>> correct pipeline(*).
>>>
>>> Cheers,
>>>
>>> Tim van Heugten
>>>
>>>
>>> *) Thereby exposing our bug, seemingly data duplication. But just to be
>>> clear, this is actually the *correct* behavior.
>>>
>>>
>>>
>>> On Tue, Feb 5, 2013 at 3:18 PM, Tim van Heugten <st...@gmail.com>wrote:
>>>
>>>> Hi,
>>>>
>>>> It turns out the data in the two branches that are unioned in union2 is
>>>> not mutually exclusive (counter to what I was expecting). Probably we
>>>> should expect data duplication.
>>>>
>>>> However, this does still not explain why sometimes we find data
>>>> duplication and sometimes we don't.
>>>>
>>>> Will keep you posted,
>>>>
>>>> Tim
>>>>
>>>>
>>>> On Tue, Feb 5, 2013 at 11:32 AM, Tim van Heugten <st...@gmail.com>wrote:
>>>>
>>>>> Hi Gabriel,
>>>>>
>>>>> I've been unsuccessful so far to reproduce the issue in a controlled
>>>>> environment. As said, its fragile, maybe the types involved play a role, so
>>>>> when I tried to simplify those I broke the failure condition.
>>>>> I decide it's time to try providing more information without giving an
>>>>> explicit example.
>>>>>
>>>>> The pipeline we build is illustrated here: http://yuml.me/8ef99512.
>>>>> Depending on where we materialize the data occurs twice in UP.
>>>>> The EITPI job filters the exact opposite of the filter branch. In PWR
>>>>> only data from EITPI is passed through, while the PITP data is used to
>>>>> modify it.
>>>>> Below you find the job names as executed when dataduplication occurs,
>>>>> materializations occur before BTO(*) and after UP.
>>>>>
>>>>> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch655004156/p4)"
>>>>>
>>>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch655004156/p4)]]+GBK+PWR+UnionCollectionWrapper+Avro(/tmp/crunch655004156/p2)"
>>>>>
>>>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch655004156/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch655004156/p8)"
>>>>>
>>>>> "[[Avro(target/stored/sIPhase)+S0+BTO]/[Avro(/tmp/crunch655004156/p8)]]+GBK+UP+Avro(/tmp/crunch655004156/p6)"
>>>>> "Avro(/tmp/crunch655004156/p6)+GetData+Avro(/tmp/crunch655004156/p10)"
>>>>>
>>>>> "Avro(/tmp/crunch655004156/p6)+GetTraces+Avro(target/trace-dump/traces)"
>>>>>
>>>>> Here are the jobs performed when materialization is added between BTO
>>>>> and gbk:
>>>>>
>>>>> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch-551174870/p4)"
>>>>>
>>>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch-551174870/p4)]]+GBK+PWR+UnionCollectionWrapper+Avro(/tmp/crunch-551174870/p2)"
>>>>>
>>>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch-551174870/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch-551174870/p6)"
>>>>> "Avro(/tmp/crunch-551174870/p6)+GBK+UP+Avro(/tmp/crunch-551174870/p8)"
>>>>>
>>>>> "Avro(/tmp/crunch-551174870/p8)+GetData+Avro(/tmp/crunch-551174870/p10)"
>>>>>
>>>>> "Avro(/tmp/crunch-551174870/p8)+GetTraces+Avro(target/trace-dump/traces)"
>>>>>
>>>>> Without changing anything else, the added materialization
>>>>> fixes the issue of data duplication.
>>>>>
>>>>> If you have any clues how I can extract a clean working example I'm
>>>>> happy to hear.
>>>>>
>>>>>
>>>>> *) This materialization probably explains the second job, however,
>>>>> where the filtered data is joined is lost on me. This is not the cause
>>>>> though, with just one materialize at the end, after UP, the data count
>>>>> still doubled. The jobs then look like this:
>>>>>
>>>>> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch369510677/p4)"
>>>>>
>>>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch369510677/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch369510677/p6)"
>>>>>
>>>>> "[[Avro(target/stored/sIPhase)+S0+BTO]/[Avro(/tmp/crunch369510677/p6)]]+GBK+UP+Avro(/tmp/crunch369510677/p2)"
>>>>>
>>>>> "Avro(/tmp/crunch369510677/p2)+GetTraces+Avro(target/trace-dump/traces)"
>>>>> "Avro(/tmp/crunch369510677/p2)+GetData+Avro(/tmp/crunch369510677/p8)"
>>>>>
>>>>> BR,
>>>>>
>>>>> Tim van Heugten
>>>>>
>>>>>
>>>>> On Thu, Jan 31, 2013 at 9:27 PM, Gabriel Reid <ga...@gmail.com>wrote:
>>>>>
>>>>>> Hi Tim,
>>>>>>
>>>>>> On 31 Jan 2013, at 10:45, Tim van Heugten <st...@gmail.com> wrote:
>>>>>>
>>>>>> > Hi Gabriel,
>>>>>> >
>>>>>> > For the most part it is similar to what was send around recently on
>>>>>> this mailinglist, see:
>>>>>> > From  Dave Beech <d....@paraliatech.com>
>>>>>> > Subject       Question about mapreduce job planner
>>>>>> > Date  Tue, 15 Jan 2013 11:41:42 GMT
>>>>>> >
>>>>>> > So, the common path before multiple outputs branch is executed
>>>>>> twice. Sometimes the issues seem related to unions though, i.e. multiple
>>>>>> inputs. We seem to have been troubled by a grouped table parallelDo on a
>>>>>> table-union-gbk that got its data twice (all grouped doubled in size).
>>>>>> Inserting a materialize between the union and groupByKey solved the issue.
>>>>>> >
>>>>>> > These issues seem very fragile (so they're fixed easily by changing
>>>>>> something that's irrelevant to the output), so usually we just add or
>>>>>> remove a materialization to make it run again.
>>>>>> > I'll see if I can cleanly reproduce the data duplication issue
>>>>>> later this week.
>>>>>>
>>>>>> Ok, that would be great if you could replicate it in a small test,
>>>>>> thanks!
>>>>>>
>>>>>> - Gabriel
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: MemPipeline and context

Posted by Gabriel Reid <ga...@gmail.com>.
Thanks for all the info Tim. I've posted a bit more information on
CRUNCH-163, and will look into it more this evening.

About calling materialize within pipelines, just to clarify: are you doing
this both to get a more efficient execution (i.e. to alter the default
execution plan) and to get the correct output, or just one of those
two?

Thanks,

Gabriel


On Wed, Feb 6, 2013 at 11:53 AM, Tim van Heugten <st...@gmail.com> wrote:

> To summarize:
> - When we saw data duplication, that was what we should have been
> expecting, given our implementation. That is not the issue.
> - Sometimes we didn't see data duplication. That is an issue:
>    *Union sometimes ignores one of the input branches.*
>
> I created https://issues.apache.org/jira/browse/CRUNCH-163 for this
> issue. The tests singleUnion and doubleUnionWithoutMaterializeInbetween
> pass in my environment (0.4), the others fail.
> Besides breaking a union by adding a materialize after it I could also
> break it by performing a parallelDo after it or by just joining two read
> pCollections.
>
>
> Cheers,
>
> Tim
>
>
>
>  On Tue, Feb 5, 2013 at 3:38 PM, Tim van Heugten <st...@gmail.com> wrote:
>
>> Hmmm,
>>
>> So we had a mistake in our code that emitted the data in both branches
>> before union2.
>> *And*, the crunch union also *failed to merge the data* in some
>> circumstance. My side-remark about not seeing the join happen was actually
>> bang on.. :-/
>>
>> So the question now becomes, when does a union ignore one of its incoming
>> branches?
>> Apparently with materialization in the right spots we can force the
>> correct pipeline(*).
>>
>> Cheers,
>>
>> Tim van Heugten
>>
>>
>> *) Thereby exposing our bug, seemingly data duplication. But just to be
>> clear, this is actually the *correct* behavior.
>>
>>
>>
>> On Tue, Feb 5, 2013 at 3:18 PM, Tim van Heugten <st...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> It turns out the data in the two branches that are unioned in union2 is
>>> not mutually exclusive (counter to what I was expecting). Probably we
>>> should expect data duplication.
>>>
>>> However, this does still not explain why sometimes we find data
>>> duplication and sometimes we don't.
>>>
>>> Will keep you posted,
>>>
>>> Tim
>>>
>>>
>>> On Tue, Feb 5, 2013 at 11:32 AM, Tim van Heugten <st...@gmail.com>wrote:
>>>
>>>> Hi Gabriel,
>>>>
>>>> I've been unsuccessful so far to reproduce the issue in a controlled
>>>> environment. As said, its fragile, maybe the types involved play a role, so
>>>> when I tried to simplify those I broke the failure condition.
>>>> I decide it's time to try providing more information without giving an
>>>> explicit example.
>>>>
>>>> The pipeline we build is illustrated here: http://yuml.me/8ef99512.
>>>> Depending on where we materialize the data occurs twice in UP.
>>>> The EITPI job filters the exact opposite of the filter branch. In PWR
>>>> only data from EITPI is passed through, while the PITP data is used to
>>>> modify it.
>>>> Below you find the job names as executed when dataduplication occurs,
>>>> materializations occur before BTO(*) and after UP.
>>>>
>>>> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch655004156/p4)"
>>>>
>>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch655004156/p4)]]+GBK+PWR+UnionCollectionWrapper+Avro(/tmp/crunch655004156/p2)"
>>>>
>>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch655004156/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch655004156/p8)"
>>>>
>>>> "[[Avro(target/stored/sIPhase)+S0+BTO]/[Avro(/tmp/crunch655004156/p8)]]+GBK+UP+Avro(/tmp/crunch655004156/p6)"
>>>> "Avro(/tmp/crunch655004156/p6)+GetData+Avro(/tmp/crunch655004156/p10)"
>>>> "Avro(/tmp/crunch655004156/p6)+GetTraces+Avro(target/trace-dump/traces)"
>>>>
>>>> Here are the jobs performed when materialization is added between BTO
>>>> and gbk:
>>>>
>>>> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch-551174870/p4)"
>>>>
>>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch-551174870/p4)]]+GBK+PWR+UnionCollectionWrapper+Avro(/tmp/crunch-551174870/p2)"
>>>>
>>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch-551174870/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch-551174870/p6)"
>>>> "Avro(/tmp/crunch-551174870/p6)+GBK+UP+Avro(/tmp/crunch-551174870/p8)"
>>>> "Avro(/tmp/crunch-551174870/p8)+GetData+Avro(/tmp/crunch-551174870/p10)"
>>>>
>>>> "Avro(/tmp/crunch-551174870/p8)+GetTraces+Avro(target/trace-dump/traces)"
>>>>
>>>> Without changing anything else, the added materialization fixes
>>>> fixes the issue of data duplication.
>>>>
>>>> If you have any clues how I can extract a clean working example I'm
>>>> happy to hear.
>>>>
>>>>
>>>> *) This materialization probably explains the second job, however,
>>>> where the filtered data is joined is lost on me. This is not the cause
>>>> though, with just one materialize at the end, after UP, the data count
>>>> still doubled. The jobs then look like this:
>>>>
>>>> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch369510677/p4)"
>>>>
>>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch369510677/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch369510677/p6)"
>>>>
>>>> "[[Avro(target/stored/sIPhase)+S0+BTO]/[Avro(/tmp/crunch369510677/p6)]]+GBK+UP+Avro(/tmp/crunch369510677/p2)"
>>>> "Avro(/tmp/crunch369510677/p2)+GetTraces+Avro(target/trace-dump/traces)"
>>>> "Avro(/tmp/crunch369510677/p2)+GetData+Avro(/tmp/crunch369510677/p8)"
>>>>
>>>> BR,
>>>>
>>>> Tim van Heugten
>>>>
>>>>
>>>> On Thu, Jan 31, 2013 at 9:27 PM, Gabriel Reid <ga...@gmail.com>wrote:
>>>>
>>>>> Hi Tim,
>>>>>
>>>>> On 31 Jan 2013, at 10:45, Tim van Heugten <st...@gmail.com> wrote:
>>>>>
>>>>> > Hi Gabriel,
>>>>> >
>>>>> > For the most part it is similar to what was send around recently on
>>>>> this mailinglist, see:
>>>>> > From  Dave Beech <d....@paraliatech.com>
>>>>> > Subject       Question about mapreduce job planner
>>>>> > Date  Tue, 15 Jan 2013 11:41:42 GMT
>>>>> >
>>>>> > So, the common path before multiple outputs branch is executed
>>>>> twice. Sometimes the issues seem related to unions though, i.e. multiple
>>>>> inputs. We seem to have been troubled by a grouped table parallelDo on a
>>>>> table-union-gbk that got its data twice (all grouped doubled in size).
>>>>> Inserting a materialize between the union and groupByKey solved the issue.
>>>>> >
>>>>> > These issues seem very fragile (so they're fixed easily by changing
>>>>> something that's irrelevant to the output), so usually we just add or
>>>>> remove a materialization to make it run again.
>>>>> > I'll see if I can cleanly reproduce the data duplication issue later
>>>>> this week.
>>>>>
>>>>> Ok, that would be great if you could replicate it in a small test,
>>>>> thanks!
>>>>>
>>>>> - Gabriel
>>>>
>>>>
>>>>
>>>
>>
>

Re: MemPipeline and context

Posted by Tim van Heugten <st...@gmail.com>.
To summarize:
- When we saw data duplication, that was what we should have been
expecting, given our implementation. That is not the issue.
- Sometimes we didn't see data duplication. That is an issue:
   *Union sometimes ignores one of the input branches.*

I created https://issues.apache.org/jira/browse/CRUNCH-163 for this issue.
The tests singleUnion and doubleUnionWithoutMaterializeInbetween pass in my
environment (0.4); the others fail.
Besides breaking a union by adding a materialize after it, I could also
break it by performing a parallelDo after it or by just joining two read
PCollections.
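
For what it's worth, the failing cases look roughly like the sketch below.
The paths and class name are placeholders; this is not the exact test
attached to CRUNCH-163, just the shape of it:

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;

public class UnionRepro {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(UnionRepro.class);

    // Two independently read inputs (placeholder paths).
    PCollection<String> left = pipeline.readTextFile("/tmp/input-a");
    PCollection<String> right = pipeline.readTextFile("/tmp/input-b");

    PCollection<String> union = left.union(right);

    // Adding a materialize (or a parallelDo, or reading both inputs like this)
    // after the union is what seems to make one of the branches disappear
    // from the output in our environment.
    for (String s : union.materialize()) {
      System.out.println(s);
    }
    pipeline.done();
  }
}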


Cheers,

Tim



On Tue, Feb 5, 2013 at 3:38 PM, Tim van Heugten <st...@gmail.com> wrote:

> Hmmm,
>
> So we had a mistake in our code that emitted the data in both branches
> before union2.
> *And*, the crunch union also *failed to merge the data* in some
> circumstance. My side-remark about not seeing the join happen was actually
> bang on.. :-/
>
> So the question now becomes, when does a union ignore one of its incoming
> branches?
> Apparently with materialization in the right spots we can force the
> correct pipeline(*).
>
> Cheers,
>
> Tim van Heugten
>
>
> *) Thereby exposing our bug, seemingly data duplication. But just to be
> clear, this is actually the *correct* behavior.
>
>
>
> On Tue, Feb 5, 2013 at 3:18 PM, Tim van Heugten <st...@gmail.com> wrote:
>
>> Hi,
>>
>> It turns out the data in the two branches that are unioned in union2 is
>> not mutually exclusive (counter to what I was expecting). Probably we
>> should expect data duplication.
>>
>> However, this does still not explain why sometimes we find data
>> duplication and sometimes we don't.
>>
>> Will keep you posted,
>>
>> Tim
>>
>>
>> On Tue, Feb 5, 2013 at 11:32 AM, Tim van Heugten <st...@gmail.com>wrote:
>>
>>> Hi Gabriel,
>>>
>>> I've been unsuccessful so far to reproduce the issue in a controlled
>>> environment. As said, its fragile, maybe the types involved play a role, so
>>> when I tried to simplify those I broke the failure condition.
>>> I decide it's time to try providing more information without giving an
>>> explicit example.
>>>
>>> The pipeline we build is illustrated here: http://yuml.me/8ef99512.
>>> Depending on where we materialize the data occurs twice in UP.
>>> The EITPI job filters the exact opposite of the filter branch. In PWR
>>> only data from EITPI is passed through, while the PITP data is used to
>>> modify it.
>>> Below you find the job names as executed when dataduplication occurs,
>>> materializations occur before BTO(*) and after UP.
>>>
>>> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch655004156/p4)"
>>>
>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch655004156/p4)]]+GBK+PWR+UnionCollectionWrapper+Avro(/tmp/crunch655004156/p2)"
>>>
>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch655004156/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch655004156/p8)"
>>>
>>> "[[Avro(target/stored/sIPhase)+S0+BTO]/[Avro(/tmp/crunch655004156/p8)]]+GBK+UP+Avro(/tmp/crunch655004156/p6)"
>>> "Avro(/tmp/crunch655004156/p6)+GetData+Avro(/tmp/crunch655004156/p10)"
>>> "Avro(/tmp/crunch655004156/p6)+GetTraces+Avro(target/trace-dump/traces)"
>>>
>>> Here are the jobs performed when materialization is added between BTO
>>> and gbk:
>>>
>>> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch-551174870/p4)"
>>>
>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch-551174870/p4)]]+GBK+PWR+UnionCollectionWrapper+Avro(/tmp/crunch-551174870/p2)"
>>>
>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch-551174870/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch-551174870/p6)"
>>> "Avro(/tmp/crunch-551174870/p6)+GBK+UP+Avro(/tmp/crunch-551174870/p8)"
>>> "Avro(/tmp/crunch-551174870/p8)+GetData+Avro(/tmp/crunch-551174870/p10)"
>>> "Avro(/tmp/crunch-551174870/p8)+GetTraces+Avro(target/trace-dump/traces)"
>>>
>>> Without changing anything else, the added materialization fixes
>>> the issue of data duplication.
>>>
>>> If you have any clues how I can extract a clean working example I'm
>>> happy to hear.
>>>
>>>
>>> *) This materialization probably explains the second job, however, where
>>> the filtered data is joined is lost on me. This is not the cause though,
>>> with just one materialize at the end, after UP, the data count still
>>> doubled. The jobs then look like this:
>>>
>>> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch369510677/p4)"
>>>
>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch369510677/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch369510677/p6)"
>>>
>>> "[[Avro(target/stored/sIPhase)+S0+BTO]/[Avro(/tmp/crunch369510677/p6)]]+GBK+UP+Avro(/tmp/crunch369510677/p2)"
>>> "Avro(/tmp/crunch369510677/p2)+GetTraces+Avro(target/trace-dump/traces)"
>>> "Avro(/tmp/crunch369510677/p2)+GetData+Avro(/tmp/crunch369510677/p8)"
>>>
>>> BR,
>>>
>>> Tim van Heugten
>>>
>>>
>>> On Thu, Jan 31, 2013 at 9:27 PM, Gabriel Reid <ga...@gmail.com>wrote:
>>>
>>>> Hi Tim,
>>>>
>>>> On 31 Jan 2013, at 10:45, Tim van Heugten <st...@gmail.com> wrote:
>>>>
>>>> > Hi Gabriel,
>>>> >
>>>> > For the most part it is similar to what was send around recently on
>>>> this mailinglist, see:
>>>> > From  Dave Beech <d....@paraliatech.com>
>>>> > Subject       Question about mapreduce job planner
>>>> > Date  Tue, 15 Jan 2013 11:41:42 GMT
>>>> >
>>>> > So, the common path before multiple outputs branch is executed twice.
>>>> Sometimes the issues seem related to unions though, i.e. multiple inputs.
>>>> We seem to have been troubled by a grouped table parallelDo on a
>>>> table-union-gbk that got its data twice (all grouped doubled in size).
>>>> Inserting a materialize between the union and groupByKey solved the issue.
>>>> >
>>>> > These issues seem very fragile (so they're fixed easily by changing
>>>> something that's irrelevant to the output), so usually we just add or
>>>> remove a materialization to make it run again.
>>>> > I'll see if I can cleanly reproduce the data duplication issue later
>>>> this week.
>>>>
>>>> Ok, that would be great if you could replicate it in a small test,
>>>> thanks!
>>>>
>>>> - Gabriel
>>>
>>>
>>>
>>
>

Re: MemPipeline and context

Posted by Tim van Heugten <st...@gmail.com>.
Hmmm,

So we had a mistake in our code that emitted the data in both branches
before union2.
*And*, the Crunch union also *failed to merge the data* in some
circumstances. My side remark about not seeing the join happen was actually
bang on... :-/

So the question now becomes: when does a union ignore one of its incoming
branches?
Apparently, with materialization in the right spots, we can force the correct
pipeline(*).

Cheers,

Tim van Heugten


*) Thereby exposing our bug, seemingly data duplication. But just to be
clear, this is actually the *correct* behavior.


On Tue, Feb 5, 2013 at 3:18 PM, Tim van Heugten <st...@gmail.com> wrote:

> Hi,
>
> It turns out the data in the two branches that are unioned in union2 is
> not mutually exclusive (counter to what I was expecting). Probably we
> should expect data duplication.
>
> However, this does still not explain why sometimes we find data
> duplication and sometimes we don't.
>
> Will keep you posted,
>
> Tim
>
>
> On Tue, Feb 5, 2013 at 11:32 AM, Tim van Heugten <st...@gmail.com> wrote:
>
>> Hi Gabriel,
>>
>> I've been unsuccessful so far to reproduce the issue in a controlled
>> environment. As said, its fragile, maybe the types involved play a role, so
>> when I tried to simplify those I broke the failure condition.
>> I decide it's time to try providing more information without giving an
>> explicit example.
>>
>> The pipeline we build is illustrated here: http://yuml.me/8ef99512.
>> Depending on where we materialize the data occurs twice in UP.
>> The EITPI job filters the exact opposite of the filter branch. In PWR
>> only data from EITPI is passed through, while the PITP data is used to
>> modify it.
>> Below you find the job names as executed when dataduplication occurs,
>> materializations occur before BTO(*) and after UP.
>>
>> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch655004156/p4)"
>>
>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch655004156/p4)]]+GBK+PWR+UnionCollectionWrapper+Avro(/tmp/crunch655004156/p2)"
>>
>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch655004156/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch655004156/p8)"
>>
>> "[[Avro(target/stored/sIPhase)+S0+BTO]/[Avro(/tmp/crunch655004156/p8)]]+GBK+UP+Avro(/tmp/crunch655004156/p6)"
>> "Avro(/tmp/crunch655004156/p6)+GetData+Avro(/tmp/crunch655004156/p10)"
>> "Avro(/tmp/crunch655004156/p6)+GetTraces+Avro(target/trace-dump/traces)"
>>
>> Here are the jobs performed when materialization is added between BTO and
>> gbk:
>>
>> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch-551174870/p4)"
>>
>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch-551174870/p4)]]+GBK+PWR+UnionCollectionWrapper+Avro(/tmp/crunch-551174870/p2)"
>>
>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch-551174870/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch-551174870/p6)"
>> "Avro(/tmp/crunch-551174870/p6)+GBK+UP+Avro(/tmp/crunch-551174870/p8)"
>> "Avro(/tmp/crunch-551174870/p8)+GetData+Avro(/tmp/crunch-551174870/p10)"
>> "Avro(/tmp/crunch-551174870/p8)+GetTraces+Avro(target/trace-dump/traces)"
>>
>> Without changing anything else, the added materialization fixes
>> the issue of data duplication.
>>
>> If you have any clues how I can extract a clean working example I'm happy
>> to hear.
>>
>>
>> *) This materialization probably explains the second job, however, where
>> the filtered data is joined is lost on me. This is not the cause though,
>> with just one materialize at the end, after UP, the data count still
>> doubled. The jobs then look like this:
>>
>> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch369510677/p4)"
>>
>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch369510677/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch369510677/p6)"
>>
>> "[[Avro(target/stored/sIPhase)+S0+BTO]/[Avro(/tmp/crunch369510677/p6)]]+GBK+UP+Avro(/tmp/crunch369510677/p2)"
>> "Avro(/tmp/crunch369510677/p2)+GetTraces+Avro(target/trace-dump/traces)"
>> "Avro(/tmp/crunch369510677/p2)+GetData+Avro(/tmp/crunch369510677/p8)"
>>
>> BR,
>>
>> Tim van Heugten
>>
>>
>> On Thu, Jan 31, 2013 at 9:27 PM, Gabriel Reid <ga...@gmail.com>wrote:
>>
>>> Hi Tim,
>>>
>>> On 31 Jan 2013, at 10:45, Tim van Heugten <st...@gmail.com> wrote:
>>>
>>> > Hi Gabriel,
>>> >
>>> > For the most part it is similar to what was send around recently on
>>> this mailinglist, see:
>>> > From  Dave Beech <d....@paraliatech.com>
>>> > Subject       Question about mapreduce job planner
>>> > Date  Tue, 15 Jan 2013 11:41:42 GMT
>>> >
>>> > So, the common path before multiple outputs branch is executed twice.
>>> Sometimes the issues seem related to unions though, i.e. multiple inputs.
>>> We seem to have been troubled by a grouped table parallelDo on a
>>> table-union-gbk that got its data twice (all grouped doubled in size).
>>> Inserting a materialize between the union and groupByKey solved the issue.
>>> >
>>> > These issues seem very fragile (so they're fixed easily by changing
>>> something that's irrelevant to the output), so usually we just add or
>>> remove a materialization to make it run again.
>>> > I'll see if I can cleanly reproduce the data duplication issue later
>>> this week.
>>>
>>> Ok, that would be great if you could replicate it in a small test,
>>> thanks!
>>>
>>> - Gabriel
>>
>>
>>
>

Re: MemPipeline and context

Posted by Tim van Heugten <st...@gmail.com>.
Hi,

It turns out the data in the two branches that are unioned in union2 is not
mutually exclusive (counter to what I was expecting). Probably we should
expect data duplication.

However, this still does not explain why we sometimes find data duplication
and sometimes don't.

Will keep you posted,

Tim


On Tue, Feb 5, 2013 at 11:32 AM, Tim van Heugten <st...@gmail.com> wrote:

> Hi Gabriel,
>
> I've been unsuccessful so far to reproduce the issue in a controlled
> environment. As said, its fragile, maybe the types involved play a role, so
> when I tried to simplify those I broke the failure condition.
> I decide it's time to try providing more information without giving an
> explicit example.
>
> The pipeline we build is illustrated here: http://yuml.me/8ef99512.
> Depending on where we materialize the data occurs twice in UP.
> The EITPI job filters the exact opposite of the filter branch. In PWR only
> data from EITPI is passed through, while the PITP data is used to modify it.
> Below you find the job names as executed when dataduplication occurs,
> materializations occur before BTO(*) and after UP.
> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch655004156/p4)"
>
> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch655004156/p4)]]+GBK+PWR+UnionCollectionWrapper+Avro(/tmp/crunch655004156/p2)"
>
> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch655004156/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch655004156/p8)"
>
> "[[Avro(target/stored/sIPhase)+S0+BTO]/[Avro(/tmp/crunch655004156/p8)]]+GBK+UP+Avro(/tmp/crunch655004156/p6)"
> "Avro(/tmp/crunch655004156/p6)+GetData+Avro(/tmp/crunch655004156/p10)"
> "Avro(/tmp/crunch655004156/p6)+GetTraces+Avro(target/trace-dump/traces)"
>
> Here are the jobs performed when materialization is added between BTO and
> gbk:
>
> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch-551174870/p4)"
>
> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch-551174870/p4)]]+GBK+PWR+UnionCollectionWrapper+Avro(/tmp/crunch-551174870/p2)"
>
> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch-551174870/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch-551174870/p6)"
> "Avro(/tmp/crunch-551174870/p6)+GBK+UP+Avro(/tmp/crunch-551174870/p8)"
> "Avro(/tmp/crunch-551174870/p8)+GetData+Avro(/tmp/crunch-551174870/p10)"
> "Avro(/tmp/crunch-551174870/p8)+GetTraces+Avro(target/trace-dump/traces)"
>
> Without changing anything else, the added materialization fixes
> the issue of data duplication.
>
> If you have any clues how I can extract a clean working example I'm happy
> to hear.
>
>
> *) This materialization probably explains the second job, however, where
> the filtered data is joined is lost on me. This is not the cause though,
> with just one materialize at the end, after UP, the data count still
> doubled. The jobs then look like this:
> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch369510677/p4)"
>
> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch369510677/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch369510677/p6)"
>
> "[[Avro(target/stored/sIPhase)+S0+BTO]/[Avro(/tmp/crunch369510677/p6)]]+GBK+UP+Avro(/tmp/crunch369510677/p2)"
> "Avro(/tmp/crunch369510677/p2)+GetTraces+Avro(target/trace-dump/traces)"
> "Avro(/tmp/crunch369510677/p2)+GetData+Avro(/tmp/crunch369510677/p8)"
>
> BR,
>
> Tim van Heugten
>
>
> On Thu, Jan 31, 2013 at 9:27 PM, Gabriel Reid <ga...@gmail.com>wrote:
>
>> Hi Tim,
>>
>> On 31 Jan 2013, at 10:45, Tim van Heugten <st...@gmail.com> wrote:
>>
>> > Hi Gabriel,
>> >
>> > For the most part it is similar to what was send around recently on
>> this mailinglist, see:
>> > From  Dave Beech <d....@paraliatech.com>
>> > Subject       Question about mapreduce job planner
>> > Date  Tue, 15 Jan 2013 11:41:42 GMT
>> >
>> > So, the common path before multiple outputs branch is executed twice.
>> Sometimes the issues seem related to unions though, i.e. multiple inputs.
>> We seem to have been troubled by a grouped table parallelDo on a
>> table-union-gbk that got its data twice (all grouped doubled in size).
>> Inserting a materialize between the union and groupByKey solved the issue.
>> >
>> > These issues seem very fragile (so they're fixed easily by changing
>> something that's irrelevant to the output), so usually we just add or
>> remove a materialization to make it run again.
>> > I'll see if I can cleanly reproduce the data duplication issue later
>> this week.
>>
>> Ok, that would be great if you could replicate it in a small test, thanks!
>>
>> - Gabriel
>
>
>

Re: MemPipeline and context

Posted by Tim van Heugten <st...@gmail.com>.
Hi Gabriel,

I've been unsuccessful so far in reproducing the issue in a controlled
environment. As said, it's fragile; maybe the types involved play a role, so
when I tried to simplify those I broke the failure condition.
I decided it's time to try providing more information without giving an
explicit example.

The pipeline we build is illustrated here: http://yuml.me/8ef99512.
Depending on where we materialize, the data occurs twice in UP.
The EITPI job filters the exact opposite of the filter branch. In PWR only
data from EITPI is passed through, while the PITP data is used to modify it.
Below you find the job names as executed when data duplication occurs;
materializations occur before BTO(*) and after UP.
"Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch655004156/p4)"
"[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch655004156/p4)]]+GBK+PWR+UnionCollectionWrapper+Avro(/tmp/crunch655004156/p2)"
"[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch655004156/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch655004156/p8)"
"[[Avro(target/stored/sIPhase)+S0+BTO]/[Avro(/tmp/crunch655004156/p8)]]+GBK+UP+Avro(/tmp/crunch655004156/p6)"
"Avro(/tmp/crunch655004156/p6)+GetData+Avro(/tmp/crunch655004156/p10)"
"Avro(/tmp/crunch655004156/p6)+GetTraces+Avro(target/trace-dump/traces)"

Here are the jobs performed when materialization is added between BTO and
gbk:
"Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch-551174870/p4)"
"[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch-551174870/p4)]]+GBK+PWR+UnionCollectionWrapper+Avro(/tmp/crunch-551174870/p2)"
"[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch-551174870/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch-551174870/p6)"
"Avro(/tmp/crunch-551174870/p6)+GBK+UP+Avro(/tmp/crunch-551174870/p8)"
"Avro(/tmp/crunch-551174870/p8)+GetData+Avro(/tmp/crunch-551174870/p10)"
"Avro(/tmp/crunch-551174870/p8)+GetTraces+Avro(target/trace-dump/traces)"

Without changing anything else, the added materialization fixes
the issue of data duplication.

If you have any clues as to how I can extract a clean working example, I'm
happy to hear them.


*) This materialization probably explains the second job; however, where
the filtered data is joined is lost on me. This is not the cause though:
with just one materialize at the end, after UP, the data count still
doubled. The jobs then look like this:
"Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch369510677/p4)"
"[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch369510677/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch369510677/p6)"
"[[Avro(target/stored/sIPhase)+S0+BTO]/[Avro(/tmp/crunch369510677/p6)]]+GBK+UP+Avro(/tmp/crunch369510677/p2)"
"Avro(/tmp/crunch369510677/p2)+GetTraces+Avro(target/trace-dump/traces)"
"Avro(/tmp/crunch369510677/p2)+GetData+Avro(/tmp/crunch369510677/p8)"

BR,

Tim van Heugten


On Thu, Jan 31, 2013 at 9:27 PM, Gabriel Reid <ga...@gmail.com>wrote:

> Hi Tim,
>
> On 31 Jan 2013, at 10:45, Tim van Heugten <st...@gmail.com> wrote:
>
> > Hi Gabriel,
> >
> > For the most part it is similar to what was send around recently on this
> mailinglist, see:
> > From  Dave Beech <d....@paraliatech.com>
> > Subject       Question about mapreduce job planner
> > Date  Tue, 15 Jan 2013 11:41:42 GMT
> >
> > So, the common path before multiple outputs branch is executed twice.
> Sometimes the issues seem related to unions though, i.e. multiple inputs.
> We seem to have been troubled by a grouped table parallelDo on a
> table-union-gbk that got its data twice (all grouped doubled in size).
> Inserting a materialize between the union and groupByKey solved the issue.
> >
> > These issues seem very fragile (so they're fixed easily by changing
> something that's irrelevant to the output), so usually we just add or
> remove a materialization to make it run again.
> > I'll see if I can cleanly reproduce the data duplication issue later
> this week.
>
> Ok, that would be great if you could replicate it in a small test, thanks!
>
> - Gabriel

Re: MemPipeline and context

Posted by Gabriel Reid <ga...@gmail.com>.
Hi Tim,

On 31 Jan 2013, at 10:45, Tim van Heugten <st...@gmail.com> wrote:

> Hi Gabriel,
> 
> For the most part it is similar to what was send around recently on this mailinglist, see:
> From	Dave Beech <d....@paraliatech.com>
> Subject	Question about mapreduce job planner
> Date	Tue, 15 Jan 2013 11:41:42 GMT
> 
> So, the common path before multiple outputs branch is executed twice. Sometimes the issues seem related to unions though, i.e. multiple inputs. We seem to have been troubled by a grouped table parallelDo on a table-union-gbk that got its data twice (all grouped doubled in size). Inserting a materialize between the union and groupByKey solved the issue.
> 
> These issues seem very fragile (so they're fixed easily by changing something that's irrelevant to the output), so usually we just add or remove a materialization to make it run again.
> I'll see if I can cleanly reproduce the data duplication issue later this week.

Ok, that would be great if you could replicate it in a small test, thanks!

- Gabriel

Re: MemPipeline and context

Posted by Tim van Heugten <st...@gmail.com>.
Hi Gabriel,

For the most part it is similar to what was sent around recently on this
mailing list, see:
From: Dave Beech <d....@paraliatech.com>
Subject: Question about mapreduce job planner
Date: Tue, 15 Jan 2013 11:41:42 GMT

So, the common path before the branch into multiple outputs is executed twice.
Sometimes the issues seem related to unions though, i.e. multiple inputs.
We seem to have been troubled by a grouped-table parallelDo on a
table-union-gbk that got its data twice (all groups doubled in size).
Inserting a materialize between the union and the groupByKey solved the issue.
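
Concretely, the workaround looks more or less like this (a sketch with
placeholder type parameters, not our actual job):

import org.apache.crunch.PGroupedTable;
import org.apache.crunch.PTable;

public class UnionGbkWorkaround {

  // left and right stand for the two branches that feed the union in our pipeline.
  static <K, V> PGroupedTable<K, V> unionThenGroup(PTable<K, V> left, PTable<K, V> right) {
    PTable<K, V> unioned = left.union(right);

    // The materialize() between the union and the groupByKey is only there as a
    // hint to the planner; we ignore the returned Iterable. Without it the grouped
    // output sometimes contained the data twice in our environment.
    unioned.materialize();

    return unioned.groupByKey();
  }
}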

These issues seem very fragile (they're easily fixed by changing
something that's irrelevant to the output), so usually we just add or
remove a materialization to make it run again.
I'll see if I can cleanly reproduce the data duplication issue later this
week.


Cheers,

Tim


On Wed, Jan 30, 2013 at 8:51 PM, Gabriel Reid <ga...@gmail.com>wrote:

> Hi Tim,
>
> On Wed, Jan 30, 2013 at 10:33 AM, Tim van Heugten <st...@gmail.com>wrote:
>
>>
>> Since april I'm using Crunch for a project. We're not doing only linear
>> executions of the pipeline, so we're sometimes having issues with how
>> Crunch is optimizing our execution graph. We need to add materializations
>> here and there as hints to what parts of the graph can be shared for
>> outputs and so on.
>>
>
> About the extra calls to materialize to force changes to the execution
> plan: I remember seeing this previously. We've discussed adding something
> specifically for this functionality to the API, although it hasn't yet
> happened.
>
> Could you give an example of a situation where these extra materialize
> calls get added? That would be useful for validating the addition to the
> API.
>
>
> Thanks,
>
> Gabriel
>
>

Re: MemPipeline and context

Posted by Gabriel Reid <ga...@gmail.com>.
Hi Tim,

On Wed, Jan 30, 2013 at 10:33 AM, Tim van Heugten <st...@gmail.com> wrote:

>
> Since april I'm using Crunch for a project. We're not doing only linear
> executions of the pipeline, so we're sometimes having issues with how
> Crunch is optimizing our execution graph. We need to add materializations
> here and there as hints to what parts of the graph can be shared for
> outputs and so on.
>

About the extra calls to materialize to force changes to the execution
plan: I remember seeing this previously. We've discussed adding something
specifically for this functionality to the API, although it hasn't yet
happened.

Could you give an example of a situation where these extra materialize
calls get added? That would be useful for validating the addition to the
API.


Thanks,

Gabriel