You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@crunch.apache.org by Matthias Friedrich <ma...@mafr.de> on 2012/09/22 09:52:29 UTC

Logging/Debugging

Hi,

I'd like to discuss two things regarding logging and debugging:

1) Crunch currently ships a log4j.properties which can have precedence
over users' log4j.properties, depending on classpath order. Libraries
should never ship logging config as it forces users to repackage
Crunch if they want to use their own. Our Nexus at work has a nice
collection of repackaged libs.

2) Discussion about Pipeline.enableDebug() came up in CRUNCH-70. I
believe it really shouldn't mess with logging configuration. Right now
it bypasses the commons-logging facade and directly accesses log4j,
causing a compile time dependency on log4j. It changes VM-wide state
beyond Crunch as other Hadoop-related code executed afterwards will
get changed logging config, too. And, most importantly, it's the
responsibility of the operations team, not the developer to configure
logging. Admins are used to log4j.properties, we shouldn't invent
another non-standard way of doing things that overrides the usual
way.

My vote for 1) would be to remove our log4j.properties.

For 2) I think the best solution would be to offer an example
log4j.properties in our documentation (section "Debugging Your
Pipelines" or something) that has the effect Pipeline.enableDebug()
has now.

Regards,
  Matthias

Re: Logging/Debugging

Posted by Matthias Friedrich <ma...@mafr.de>.
On Monday, 2012-09-24, Josh Wills wrote:
> On Mon, Sep 24, 2012 at 11:09 AM, Matthias Friedrich <ma...@mafr.de> wrote:
[...] 
>> I'd say let's remove log4j.properties from Crunch, because users
>> can't defend themselves against it. We have local applications at
>> work that run some parts locally, without anything Hadoop-specific;
>> shipping a log4j.properties with Crunch would cause problems for us.
 
>> We could then add a log4j.properties to src/main/resources in the
>> archetype with an explanation of when exactly this configuration is
>> used (only when running from the IDE). We would keep enableDebug()
>> with its setting of "crunch.debug", but remove the log4j code, and add
>> a "provided" log4j dependency to the archetype (because log4j is
>> missing from hadoop-core).
 
>> Does this make sense? Will this give you the logging/debugging output
>> that you need?
 
> I'm on board with that plan. My one tweak would be to add support for
> hacking log4j to turn on Hadoop's WARN
> logs into the crunch-test functionality, which I think will serve my
> needs from within the IDE and won't interfere with
> any production or client log settings. Does that meet your goals as well?

Alright, let's do it then. I'll make a patch right after we're done with
the archetype (probably this weekend).
 
Regards,
  Matthias

Re: Logging/Debugging

Posted by Josh Wills <jw...@cloudera.com>.
On Mon, Sep 24, 2012 at 11:09 AM, Matthias Friedrich <ma...@mafr.de> wrote:
> On Sunday, 2012-09-23, Josh Wills wrote:
>> On Sun, Sep 23, 2012 at 2:11 AM, Matthias Friedrich <ma...@mafr.de> wrote:
>>> On Saturday, 2012-09-22, Josh Wills wrote:
>>> [...]
>
>> Ah, okay. So what we want for debugging is the Hadoop WARN logs. When
>> a hadoop job fails on the cluster, we have those logs available on the
>> JobTracker webpage (at least, I do in CDH, I assume it works the same
>> way in Hadoop 1.0.3), so enableDebug doesn't do anything for us
>> (besides altering the Configuration to force Crunch to put try-catch
>> blocks around the DoNode tasks, which I assume still works fine). I
>> use enableDebug to force the logging of Hadoop WARN statements on my
>> machine when I'm testing out pipelines, so in that case, it's only
>> effecting LocalJobRunner.
>
> Yep. I think we could remove log4j.properties, the log4j setup code in
> enableDebug(), and the log4j dependency from Crunch and the behavior
> on the cluster should still be the same. The same holds for
> LocalJobRunner when running via "hadoop jar".
>
> Running the LocalJobRunner from the IDE is the problem because then
> we need a logging backend on the classpath. If we don't have log4j,
> then java.util.logging is used, which logs everything on INFO level.
> As soon as log4j is on the classpath, however, the user really needs
> a log4j.properties or log4j will complain that it doesn't have
> configuration (and logs nothing).
>
>> Given that, what's the best approach here? Javadoc statement on the
>> function indicating its intended use, or is there a better option?
>
> I'd say let's remove log4j.properties from Crunch, because users
> can't defend themselves against it. We have local applications at
> work that run some parts locally, without anything Hadoop-specific;
> shipping a log4j.properties with Crunch would cause problems for us.
>
> We could then add a log4j.properties to src/main/resources in the
> archetype with an explanation of when exactly this configuration is
> used (only when running from the IDE). We would keep enableDebug()
> with its setting of "crunch.debug", but remove the log4j code, and add
> a "provided" log4j dependency to the archetype (because log4j is
> missing from hadoop-core).
>
> Does this make sense? Will this give you the logging/debugging output
> that you need?

I'm on board with that plan. My one tweak would be to add support for
hacking log4j to turn on Hadoop's WARN
logs into the crunch-test functionality, which I think will serve my
needs from within the IDE and won't interfere with
any production or client log settings. Does that meet your goals as well?

>
> [...]
>>> Ah, that reminds me: We haven't decided yet if we want an archetype in
>>> Crunch.
>
>> I want one. I thought you created it? I remember seeing an email-- if
>> I didn't reply, it was b/c I was in the midst of that crazy travel
>> week and my sleep schedule was off (honestly, I'm just now
>> recovering.)
>
> No worries, I'm a bit sleep-deprived myself so I can relate. With
> Gabriel we're +3 pro archetype, so I'll make a patch this weekend.
>
> Regards,
>   Matthias



-- 
Director of Data Science
Cloudera
Twitter: @josh_wills

Re: Logging/Debugging

Posted by Matthias Friedrich <ma...@mafr.de>.
On Sunday, 2012-09-23, Josh Wills wrote:
> On Sun, Sep 23, 2012 at 2:11 AM, Matthias Friedrich <ma...@mafr.de> wrote:
>> On Saturday, 2012-09-22, Josh Wills wrote:
>> [...]
 
> Ah, okay. So what we want for debugging is the Hadoop WARN logs. When
> a hadoop job fails on the cluster, we have those logs available on the
> JobTracker webpage (at least, I do in CDH, I assume it works the same
> way in Hadoop 1.0.3), so enableDebug doesn't do anything for us
> (besides altering the Configuration to force Crunch to put try-catch
> blocks around the DoNode tasks, which I assume still works fine). I
> use enableDebug to force the logging of Hadoop WARN statements on my
> machine when I'm testing out pipelines, so in that case, it's only
> effecting LocalJobRunner.

Yep. I think we could remove log4j.properties, the log4j setup code in
enableDebug(), and the log4j dependency from Crunch and the behavior
on the cluster should still be the same. The same holds for
LocalJobRunner when running via "hadoop jar".

Running the LocalJobRunner from the IDE is the problem because then
we need a logging backend on the classpath. If we don't have log4j,
then java.util.logging is used, which logs everything on INFO level.
As soon as log4j is on the classpath, however, the user really needs
a log4j.properties or log4j will complain that it doesn't have
configuration (and logs nothing).

> Given that, what's the best approach here? Javadoc statement on the
> function indicating its intended use, or is there a better option?

I'd say let's remove log4j.properties from Crunch, because users
can't defend themselves against it. We have local applications at
work that run some parts locally, without anything Hadoop-specific;
shipping a log4j.properties with Crunch would cause problems for us.
 
We could then add a log4j.properties to src/main/resources in the
archetype with an explanation of when exactly this configuration is
used (only when running from the IDE). We would keep enableDebug()
with its setting of "crunch.debug", but remove the log4j code, and add
a "provided" log4j dependency to the archetype (because log4j is
missing from hadoop-core).

Does this make sense? Will this give you the logging/debugging output
that you need?

[...] 
>> Ah, that reminds me: We haven't decided yet if we want an archetype in
>> Crunch.
 
> I want one. I thought you created it? I remember seeing an email-- if
> I didn't reply, it was b/c I was in the midst of that crazy travel
> week and my sleep schedule was off (honestly, I'm just now
> recovering.)
 
No worries, I'm a bit sleep-deprived myself so I can relate. With
Gabriel we're +3 pro archetype, so I'll make a patch this weekend.
 
Regards,
  Matthias

Re: Logging/Debugging

Posted by Gabriel Reid <ga...@gmail.com>.
On Sun, Sep 23, 2012 at 6:39 PM, Josh Wills <jw...@cloudera.com> wrote:
> On Sun, Sep 23, 2012 at 2:11 AM, Matthias Friedrich <ma...@mafr.de> wrote:
>> On Saturday, 2012-09-22, Josh Wills wrote:
[...]
>>> Assuming that is the case, perhaps we could add a default
>>> log4j.properties file to the maven archetype that generates Crunch
>>> projects?
>>
>> Ah, that reminds me: We haven't decided yet if we want an archetype in
>> Crunch.
>
> I want one. I thought you created it? I remember seeing an email-- if
> I didn't reply, it was b/c I was in the midst of that crazy travel
> week and my sleep schedule was off (honestly, I'm just now
> recovering.)

FWIW, I'm definitely in favour of adding the Crunch archetype. I think
it makes a huge difference in helping people to get started (and keep
them going).

Re: Logging/Debugging

Posted by Josh Wills <jw...@cloudera.com>.
Inlined.

On Sun, Sep 23, 2012 at 2:11 AM, Matthias Friedrich <ma...@mafr.de> wrote:
> On Saturday, 2012-09-22, Josh Wills wrote:
> [...]
>> Perhaps I'm being dense-- once we remove log4j.properties from the
>> core, won't we stop logging the Crunch status information unless the
>> developer explicitly configures it in their own log4j.properties?
>> That may of course be what the developer desires, but my thought was
>> that we typically do want that information logged, and that
>> repeating it in every log4j.properties file for every project would
>> be tedious.
>
> At second glance it seems there are a few misconceptions on how
> logging works with Hadoop. When running from the IDE with
> LocalJobRunner, log4j.properties has no effect because there's no
> log4j on the classpath; commons-logging just uses java.util.logging
> which logs on INFO level. Users who have log4j on the classpath will
> see different results, but they will have to add it themselves, as
> neither hadoop-client nor hadoop-core will provide it for them.
>
> When running jobs using "hadoop jar", our log4j.properties doesn't
> have an effect either because Hadoop's conf/log4j.properties takes
> precedence (even with HADOOP_USER_CLASSPATH_FIRST). This surprises
> me, to be honest, but Crunch's INFO messages are still printed, which
> is good. What also surprises me is that the log4j setup code in
> enableDebug() doesn't seem to have any effect; AFAICS, Hadoop INFO
> messages already appear on the console. All I get when calling
> enableDebug() is a log message "Could not find console appender 'A'".
>
> Is there something broken? What's the intention behind the log4j code
> in enableDebug()? Unless I'm overlooking something (quite possible)
> this seems like a no-op.

Ah, okay. So what we want for debugging is the Hadoop WARN logs. When
a hadoop job fails on the cluster, we have those logs available on the
JobTracker webpage (at least, I do in CDH, I assume it works the same
way in Hadoop 1.0.3), so enableDebug doesn't do anything for us
(besides altering the Configuration to force Crunch to put try-catch
blocks around the DoNode tasks, which I assume still works fine). I
use enableDebug to force the logging of Hadoop WARN statements on my
machine when I'm testing out pipelines, so in that case, it's only
effecting LocalJobRunner.

Given that, what's the best approach here? Javadoc statement on the
function indicating its intended use, or is there a better option?

>
>> Assuming that is the case, perhaps we could add a default
>> log4j.properties file to the maven archetype that generates Crunch
>> projects?
>
> Ah, that reminds me: We haven't decided yet if we want an archetype in
> Crunch.

I want one. I thought you created it? I remember seeing an email-- if
I didn't reply, it was b/c I was in the midst of that crazy travel
week and my sleep schedule was off (honestly, I'm just now
recovering.)

>
> Regards,
>   Matthias



-- 
Director of Data Science
Cloudera
Twitter: @josh_wills

Re: Logging/Debugging

Posted by Matthias Friedrich <ma...@mafr.de>.
On Saturday, 2012-09-22, Josh Wills wrote:
[...] 
> Perhaps I'm being dense-- once we remove log4j.properties from the
> core, won't we stop logging the Crunch status information unless the
> developer explicitly configures it in their own log4j.properties?
> That may of course be what the developer desires, but my thought was
> that we typically do want that information logged, and that
> repeating it in every log4j.properties file for every project would
> be tedious.

At second glance it seems there are a few misconceptions on how
logging works with Hadoop. When running from the IDE with
LocalJobRunner, log4j.properties has no effect because there's no
log4j on the classpath; commons-logging just uses java.util.logging
which logs on INFO level. Users who have log4j on the classpath will
see different results, but they will have to add it themselves, as
neither hadoop-client nor hadoop-core will provide it for them.

When running jobs using "hadoop jar", our log4j.properties doesn't
have an effect either because Hadoop's conf/log4j.properties takes
precedence (even with HADOOP_USER_CLASSPATH_FIRST). This surprises
me, to be honest, but Crunch's INFO messages are still printed, which
is good. What also surprises me is that the log4j setup code in
enableDebug() doesn't seem to have any effect; AFAICS, Hadoop INFO
messages already appear on the console. All I get when calling
enableDebug() is a log message "Could not find console appender 'A'".

Is there something broken? What's the intention behind the log4j code
in enableDebug()? Unless I'm overlooking something (quite possible)
this seems like a no-op.

> Assuming that is the case, perhaps we could add a default
> log4j.properties file to the maven archetype that generates Crunch
> projects?

Ah, that reminds me: We haven't decided yet if we want an archetype in
Crunch.
 
Regards,
  Matthias

Re: Logging/Debugging

Posted by Josh Wills <jw...@cloudera.com>.
On Sat, Sep 22, 2012 at 9:05 AM, Matthias Friedrich <ma...@mafr.de> wrote:
> On Saturday, 2012-09-22, Josh Wills wrote:
>> On Sat, Sep 22, 2012 at 12:52 AM, Matthias Friedrich <ma...@mafr.de> wrote:
> [...]
>>> 1) Crunch currently ships a log4j.properties which can have precedence
>>> over users' log4j.properties, depending on classpath order. Libraries
>>> should never ship logging config as it forces users to repackage
>>> Crunch if they want to use their own. Our Nexus at work has a nice
>>> collection of repackaged libs.
>
>> I'm on-board for this, with the caveat that we'll need to switch over
>> the logging we do now (e.g., which stage of the pipeline is currently
>> running, the URLs for job tracking, etc.) under a non-log4j based
>> control scheme.
>
> Sorry, I don't get what you mean. Crunch logs via commons-logging
> already. What do we need to change?

Perhaps I'm being dense-- once we remove log4j.properties from the
core, won't we stop logging the Crunch
status information unless the developer explicitly configures it in
their own log4j.properties? That may of course
be what the developer desires, but my thought was that we typically do
want that information logged, and that
repeating it in every log4j.properties file for every project would be tedious.

Assuming that is the case, perhaps we could add a default
log4j.properties file to the maven archetype that
generates Crunch projects?

>
>>> 2) Discussion about Pipeline.enableDebug() came up in CRUNCH-70. I
>>> believe it really shouldn't mess with logging configuration. Right now
>>> it bypasses the commons-logging facade and directly accesses log4j,
>>> causing a compile time dependency on log4j. It changes VM-wide state
>>> beyond Crunch as other Hadoop-related code executed afterwards will
>>> get changed logging config, too. And, most importantly, it's the
>>> responsibility of the operations team, not the developer to configure
>>> logging. Admins are used to log4j.properties, we shouldn't invent
>>> another non-standard way of doing things that overrides the usual
>>> way.
>
>> enableDebug is intended to be used by developers who are, well,
>> debugging their MR pipelines, which is something that every MR
>> developer spends a fair amount of time doing. I will argue against any
>> changes that make debugging more difficult-- if anything, I would like
>> to make it even easier. I don't draw a distinction between ops and
>> development when it comes to creating pipelines.
>
> Hmm. One of the things I firmly believe in is that libraries (any
> system actually) should cause the least amount of surprise in anything
> it does. enableDebug() changes VM-wide logging settings and overrides
> the well-understood configuration mechanism of log4j.properties. To me
> this is an unexpected side effect, at the very least we'll have to put
> a big fat warning in the javadocs.

I'm okay with a big fat warning (all caps, etc.) in the javadocs. But
I think the functionality is warranted, and I think it's reasonable
for developers to expect that a lot more logging will occur when they
put any library into debug mode. It is far more surprising, IMO, to
put a library into debug mode and then not actually provide any of the
information that will help debug the problem.

-- 
Director of Data Science
Cloudera
Twitter: @josh_wills

Re: Logging/Debugging

Posted by Matthias Friedrich <ma...@mafr.de>.
On Saturday, 2012-09-22, Josh Wills wrote:
> On Sat, Sep 22, 2012 at 12:52 AM, Matthias Friedrich <ma...@mafr.de> wrote:
[...] 
>> 1) Crunch currently ships a log4j.properties which can have precedence
>> over users' log4j.properties, depending on classpath order. Libraries
>> should never ship logging config as it forces users to repackage
>> Crunch if they want to use their own. Our Nexus at work has a nice
>> collection of repackaged libs.
 
> I'm on-board for this, with the caveat that we'll need to switch over
> the logging we do now (e.g., which stage of the pipeline is currently
> running, the URLs for job tracking, etc.) under a non-log4j based
> control scheme.

Sorry, I don't get what you mean. Crunch logs via commons-logging
already. What do we need to change?
 
>> 2) Discussion about Pipeline.enableDebug() came up in CRUNCH-70. I
>> believe it really shouldn't mess with logging configuration. Right now
>> it bypasses the commons-logging facade and directly accesses log4j,
>> causing a compile time dependency on log4j. It changes VM-wide state
>> beyond Crunch as other Hadoop-related code executed afterwards will
>> get changed logging config, too. And, most importantly, it's the
>> responsibility of the operations team, not the developer to configure
>> logging. Admins are used to log4j.properties, we shouldn't invent
>> another non-standard way of doing things that overrides the usual
>> way.
 
> enableDebug is intended to be used by developers who are, well,
> debugging their MR pipelines, which is something that every MR
> developer spends a fair amount of time doing. I will argue against any
> changes that make debugging more difficult-- if anything, I would like
> to make it even easier. I don't draw a distinction between ops and
> development when it comes to creating pipelines.
 
Hmm. One of the things I firmly believe in is that libraries (any
system actually) should cause the least amount of surprise in anything
it does. enableDebug() changes VM-wide logging settings and overrides
the well-understood configuration mechanism of log4j.properties. To me
this is an unexpected side effect, at the very least we'll have to put
a big fat warning in the javadocs.

Re: Logging/Debugging

Posted by Josh Wills <jw...@cloudera.com>.
Inlined.

On Sat, Sep 22, 2012 at 12:52 AM, Matthias Friedrich <ma...@mafr.de> wrote:
> Hi,
>
> I'd like to discuss two things regarding logging and debugging:
>
> 1) Crunch currently ships a log4j.properties which can have precedence
> over users' log4j.properties, depending on classpath order. Libraries
> should never ship logging config as it forces users to repackage
> Crunch if they want to use their own. Our Nexus at work has a nice
> collection of repackaged libs.

I'm on-board for this, with the caveat that we'll need to switch over
the logging we do now (e.g., which stage of the pipeline is currently
running, the URLs for job tracking, etc.) under a non-log4j based
control scheme.

>
> 2) Discussion about Pipeline.enableDebug() came up in CRUNCH-70. I
> believe it really shouldn't mess with logging configuration. Right now
> it bypasses the commons-logging facade and directly accesses log4j,
> causing a compile time dependency on log4j. It changes VM-wide state
> beyond Crunch as other Hadoop-related code executed afterwards will
> get changed logging config, too. And, most importantly, it's the
> responsibility of the operations team, not the developer to configure
> logging. Admins are used to log4j.properties, we shouldn't invent
> another non-standard way of doing things that overrides the usual
> way.

enableDebug is intended to be used by developers who are, well,
debugging their MR pipelines, which is something that every MR
developer spends a fair amount of time doing. I will argue against any
changes that make debugging more difficult-- if anything, I would like
to make it even easier. I don't draw a distinction between ops and
development when it comes to creating pipelines.

>
> My vote for 1) would be to remove our log4j.properties.
>
> For 2) I think the best solution would be to offer an example
> log4j.properties in our documentation (section "Debugging Your
> Pipelines" or something) that has the effect Pipeline.enableDebug()
> has now.
>
> Regards,
>   Matthias



-- 
Director of Data Science
Cloudera
Twitter: @josh_wills