Posted to dev@nutch.apache.org by Ferdy Galema <fe...@kalooga.com> on 2012/08/07 11:21:23 UTC

hadoop.job.history.user.location in nutch-default with CDH rendering job history useless

Hi,

There is still a property in nutch-default,
'hadoop.job.history.user.location', that redirects the creation of history
files from job output locations to a custom location. I noticed that the
current value does not work well with CDH, because ${hadoop.log.dir} is not
defined. As a result, the entire job history in the jobtracker shows empty
info, with an 'incomplete' job status.

Changing the value to /user/myname/history, for example, does work. However,
I have done some more testing, and it seems that this property can be set to
'none', because the job history is ALSO stored in the central jobtracker
location anyway. The 'hadoop.job.history.user.location' property specifies
an extra location. But if it is set to an invalid value, the history is NOT
stored in the central location either. For more details, please see:
http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html

Setting this value to 'none' keeps the central history but prevents the job
from writing history to the job output location. If users want an extra
copy of the history files, nothing prevents them from specifying another
value in nutch-site, for example. Another option is to set it to 'history',
which does work with CDH. (This writes all logs to 'history' in the user
directory in the configured filesystem, usually dfs.) The final option is to
simply remove this value and not meddle with hadoop properties at all. But
that requires all jobs to correctly ignore these files. I am not up to date
on how well this currently works with Nutch jobs. This question is most
relevant for trunk, since trunk relies heavily on the filesystem for jobs.
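
For illustration, an override along these lines in nutch-site could express
the 'none' setting. This is only a sketch: the placement inside the
<configuration> element of a nutch-site.xml file is an assumption, and it
has not been tested against trunk or any particular Hadoop distro.

```xml
<!-- Sketch of an override, assumed to live inside <configuration> in
     nutch-site.xml. 'none' disables the extra per-job history copy while
     the central jobtracker history is kept. -->
<property>
  <name>hadoop.job.history.user.location</name>
  <value>none</value>
  <!-- Alternative discussed in this mail: use the value 'history' instead
       to write the extra copy to 'history' in the user directory. -->
</property>
```

Leaving the property out entirely would fall back to the Hadoop default,
which writes the history files into the job output location.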

What do you think? It would be great if anyone could do some testing with
trunk and possibly another Hadoop distro (e.g. the official 1.0.3). Then
we would have some more input to decide what the best option is:
A) Set property to 'none'
B) Set property to 'history'
C) Remove property, see what happens, possibly fix jobs
D) ?

Ferdy.

Re: hadoop.job.history.user.location in nutch-default with CDH rendering job history useless

Posted by Ferdy Galema <fe...@kalooga.com>.

FYI I've created a Jira for followup discussion.
https://issues.apache.org/jira/browse/NUTCH-1452
