Posted to dev@crunch.apache.org by Mārtiņš Kalvāns <ma...@gmail.com> on 2015/01/28 18:26:25 UTC

.materialize() returns empty collection on pipeline error?

Hi.

When a pipeline fails on the cluster with an exception, materialize() returns
an empty collection and just logs an error message.

I'm (very, very) puzzled about this behaviour:
https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/materialize/MaterializableIterable.java#L92
Is this really intended behaviour?

If so, then some documentation of this behaviour for the materialize()
function would be really nice to have. :)


--
Mārtiņš

Re: .materialize() returns empty collection on pipeline error?

Posted by Mārtiņš Kalvāns <ma...@gmail.com>.
I would prefer CrunchRuntimeException.
I was thinking about sending a patch for that. :)
If someone makes a patch with configurable options, I don't mind.
Still, to my mind, throwing an exception makes more sense as the default. :)
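
A minimal sketch of what a configurable option with a fail-fast default might
look like. Nothing here is Crunch API: MaterializePolicy, OnFailure and the
Supplier standing in for the pipeline step are hypothetical names, used only
to illustrate the idea.

```java
import java.util.Collections;
import java.util.function.Supplier;

public class MaterializePolicy {

    // Hypothetical setting; this is an assumption, not an existing Crunch option.
    public enum OnFailure { THROW, RETURN_EMPTY }

    // Runs the supplied materialization step and applies the chosen policy on failure.
    public static <T> Iterable<T> materialize(Supplier<Iterable<T>> step, OnFailure policy) {
        try {
            return step.get();
        } catch (RuntimeException e) {
            if (policy == OnFailure.RETURN_EMPTY) {
                // Opt-in behaviour: log and hand back an empty result.
                System.err.println("materialize failed, returning empty: " + e.getMessage());
                return Collections.emptyList();
            }
            throw e; // default: surface the failure to the caller
        }
    }

    // Convenience overload: fail fast unless the caller explicitly opts out.
    public static <T> Iterable<T> materialize(Supplier<Iterable<T>> step) {
        return materialize(step, OnFailure.THROW);
    }

    public static void main(String[] args) {
        Supplier<Iterable<String>> failing = () -> {
            throw new IllegalStateException("job failed");
        };
        // Opt-in: the failure is swallowed and the result is empty.
        System.out.println(materialize(failing, OnFailure.RETURN_EMPTY).iterator().hasNext());
        // Default: the failure propagates.
        try {
            materialize(failing);
        } catch (IllegalStateException e) {
            System.out.println("propagated: " + e.getMessage());
        }
    }
}
```

With this shape, returning an empty result is something a caller has to ask
for, so a failed job cannot go unnoticed by accident.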


M.

2015-01-28 19:45 GMT+01:00 Josh Wills <jw...@cloudera.com>:

> Yeah, I think that before, we would just fail catastrophically by throwing
> a CrunchRuntimeException, which I found annoying. Do you prefer that
> behavior? It's certainly something that could be configurable.

Re: .materialize() returns empty collection on pipeline error?

Posted by Josh Wills <jw...@cloudera.com>.
Yeah, that's not good. I'll file a JIRA to revert. Sorry about that,
everybody.

J

On Wed, Jan 28, 2015 at 11:27 AM, Allan Shoup <al...@gmail.com> wrote:

> I also prefer the exception. I recently found out about this behavior and
> am now facing a task of going back through my code base to explicitly add
> special error handling to detect this case.



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: .materialize() returns empty collection on pipeline error?

Posted by Allan Shoup <al...@gmail.com>.
I also prefer the exception. I recently found out about this behavior and
am now facing a task of going back through my code base to explicitly add
special error handling to detect this case.

On Wed, Jan 28, 2015 at 1:21 PM, David Whiting <da...@gmail.com>
wrote:

> I think "fail catastrophically" is probably exactly what should happen
> here. You can always catch and use an empty iterable if it fails. A common
> use case here is to do one step, materialize it into a collection or map,
> then pass that into a DoFn to use as a small lookup table. This failure
> mode means that future steps silently continue to execute with empty lookup
> tables as part of their processing on the cluster.

Re: .materialize() returns empty collection on pipeline error?

Posted by David Whiting <da...@gmail.com>.
I think "fail catastrophically" is probably exactly what should happen
here. You can always catch and use an empty iterable if it fails. A common
use case here is to do one step, materialize it into a collection or map,
then pass that into a DoFn to use as a small lookup table. This failure
mode means that future steps silently continue to execute with empty lookup
tables as part of their processing on the cluster.
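
A rough sketch of the "catch and use an empty iterable" option described
above, assuming materialize() throws on failure. It deliberately avoids the
Crunch API: failingMaterialize() is a hypothetical stand-in for a
materialize() call whose pipeline failed, and buildLookup() is a hypothetical
helper that turns "key=value" records into the kind of small lookup table one
might pass to a DoFn.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class MaterializeFallback {

    // Stand-in for a fail-fast materialize() call. NOT the Crunch API; in Crunch
    // the exception discussed in this thread is CrunchRuntimeException.
    public static Iterable<String> failingMaterialize() {
        throw new RuntimeException("pipeline failed on cluster");
    }

    // Caller-side opt-in: catch the failure and fall back to an empty iterable.
    public static Iterable<String> materializeOrEmpty(Supplier<Iterable<String>> materialize) {
        try {
            return materialize.get();
        } catch (RuntimeException e) {
            System.err.println("materialize failed, using empty result: " + e.getMessage());
            return Collections.emptyList();
        }
    }

    // Builds a small lookup table from "key=value" records; malformed records are skipped.
    public static Map<String, String> buildLookup(Iterable<String> records) {
        Map<String, String> lookup = new HashMap<>();
        for (String record : records) {
            String[] parts = record.split("=", 2);
            if (parts.length == 2) {
                lookup.put(parts[0], parts[1]);
            }
        }
        return lookup;
    }

    public static void main(String[] args) {
        // The empty table here is a visible, deliberate choice made by the caller.
        Map<String, String> lookup =
                buildLookup(materializeOrEmpty(MaterializeFallback::failingMaterialize));
        System.out.println("lookup size after explicit fallback: " + lookup.size());
    }
}
```

The difference from the behaviour under discussion is who decides: here the
caller chooses the empty fallback at one call site, whereas a library that
swallows the error leaves later steps running against an empty lookup table
with no signal that anything went wrong.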

On 28 January 2015 at 13:45, Josh Wills <jw...@cloudera.com> wrote:

> Yeah, I think that before, we would just fail catastrophically by throwing
> a CrunchRuntimeException, which I found annoying. Do you prefer that
> behavior? It's certainly something that could be configurable.

Re: .materialize() returns empty collection on pipeline error?

Posted by Josh Wills <jw...@cloudera.com>.
Yeah, I think that before, we would just fail catastrophically by throwing
a CrunchRuntimeException, which I found annoying. Do you prefer that
behavior? It's certainly something that could be configurable.

J

On Wed, Jan 28, 2015 at 10:36 AM, Jinal Shah <ji...@gmail.com>
wrote:

> I think it was intended, judging from these commits:
>
> https://github.com/apache/crunch/commit/3711cea61bded4c90b235a01163ae5f855089917
> and
>
> https://github.com/apache/crunch/commit/ded504eb133fa0814e2d90ff2a662e72a67e04bb
> .
> Josh can elaborate on this.




Re: .materialize() returns empty collection on pipeline error?

Posted by Jinal Shah <ji...@gmail.com>.
I think it was intended, judging from these commits:
https://github.com/apache/crunch/commit/3711cea61bded4c90b235a01163ae5f855089917
and
https://github.com/apache/crunch/commit/ded504eb133fa0814e2d90ff2a662e72a67e04bb
Josh can elaborate on this.

On Wed, Jan 28, 2015 at 9:26 AM, Mārtiņš Kalvāns <ma...@gmail.com>
wrote:

> Hi.
>
> When a pipeline fails on the cluster with an exception, materialize() returns
> an empty collection and just logs an error message.
>
> I'm (very, very) puzzled about this behaviour:
>
> https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/materialize/MaterializableIterable.java#L92
> Is this really intended behaviour?
>
> If so, then some documentation of this behaviour for the materialize()
> function would be really nice to have. :)
>
>
> --
> Mārtiņš
>