Posted to user@mahout.apache.org by Jay Vyas <ja...@gmail.com> on 2014/05/23 19:55:09 UTC

ParallelALSFactorizationJob: Long job names : not always picked up by JobHistoryServer?

Hi Mahout: I'm hitting a very hard-to-trace bug involving an NPE thrown in
the ParallelALSFactorizationJob.

1) It arises because the getCounters method fails when trying to retrieve a
job from the JobHistoryServer by name.

2) Looking into the JobHistoryServer (JHS), I can see that some Mahout jobs
with long names are indeed not "picked up" as completed jobs when I
query the JHS REST API or look at the JobHistory web UI.
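
A minimal sketch of the REST check described in (2), assuming the JHS web
app is reachable at localhost:19888 (adjust JHS_URL for your cluster). The
helper names fetch_history_jobs and missing_job_ids are made up for
illustration. It lists the jobs the JHS knows about and reports which
expected job ids are absent:

```python
import json
from urllib.request import urlopen

# Assumption: host and port of the JobHistoryServer web app on your cluster.
JHS_URL = "http://localhost:19888"


def fetch_history_jobs(base_url=JHS_URL):
    """Return the parsed JSON body of the JHS job listing endpoint."""
    with urlopen(base_url + "/ws/v1/history/mapreduce/jobs") as resp:
        return json.load(resp)


def missing_job_ids(listing, expected_ids):
    """Return the expected job ids that do not appear in a JHS listing."""
    # The listing is expected to look like {"jobs": {"job": [{"id": ...}, ...]}};
    # tolerate an empty or missing "jobs" element.
    jobs = (listing.get("jobs") or {}).get("job") or []
    known = {job["id"] for job in jobs}
    return [job_id for job_id in expected_ids if job_id not in known]
```

Calling missing_job_ids(fetch_history_jobs(), ["job_1400794299637_0010"])
would return that id if the JHS has not picked the job up.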

So the question: why might it be that some Mahout jobs, particularly
those with long job names, which completed successfully, are sitting in
mr-history/tmp/... and have SUCCEEDED in the .jhist file name, are not seen
and transferred to done/ by the JHS?

Here is an example of one such file that goes "under the radar":


    ├── job_1400794299637_0010-1400808860349-tom-ParallelALSFactorizationJob%2DItemRatingVectorsMappe-1400808889684-1-1-SUCCEEDED-default.jhist
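
To make the file name easier to inspect, here is a rough sketch that splits
it into fields. The field order (job id, submit time, user, job name, finish
time, map count, reduce count, status, queue) is inferred from the example
above and is an assumption, not something taken from the Hadoop source. The
function name parse_jhist_name is made up for illustration.

```python
from urllib.parse import unquote

# Assumed field order, inferred from the example file name.
FIELDS = ("job_id", "submit_time", "user", "job_name", "finish_time",
          "num_maps", "num_reduces", "status", "queue")


def parse_jhist_name(file_name):
    """Split a .jhist file name into a dict of its dash-separated fields."""
    suffix = ".jhist"
    stem = file_name[:-len(suffix)] if file_name.endswith(suffix) else file_name
    parts = stem.split("-")
    if len(parts) != len(FIELDS):
        raise ValueError(
            "expected %d fields, got %d: %r" % (len(FIELDS), len(parts), file_name))
    # Literal dashes inside a field show up percent-encoded (%2D), so decode
    # each field only after splitting on the separator.
    return dict(zip(FIELDS, (unquote(part) for part in parts)))


EXAMPLE = ("job_1400794299637_0010-1400808860349-tom-"
           "ParallelALSFactorizationJob%2DItemRatingVectorsMappe-"
           "1400808889684-1-1-SUCCEEDED-default.jhist")

info = parse_jhist_name(EXAMPLE)
print(info["job_name"], len(info["job_name"]), info["status"])
```

Printing the decoded job name and its length makes it easy to compare
affected and unaffected jobs and see whether name length correlates with
the files that are left behind.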


...........................

FYI

- I've also filed a JIRA in Hadoop about this, because I think better
logging in the JobHistoryServer would be nice for debugging:
https://issues.apache.org/jira/browse/MAPREDUCE-5902. It may be a bug,
maybe not, but either way, better logs would tell us more at runtime.

- I've also posted a similar error here:
http://mail-archives.apache.org/mod_mbox/mahout-user/201405.mbox/%3CCAAu13zGdPzw-J7b_SLAj5WDznYfm=J4P=B4e+w-n+4UP5OsOjQ@mail.gmail.com%3E

.........................

Summarizing: ultimately, if the JobHistoryServer doesn't properly process a
file, you get a failure in the ALS job, because when it checks the counters
from the previous job, an NPE is thrown.
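
To illustrate the failure mode (this is not Mahout's code, just a generic
sketch): a caller that treats "no counters yet" as an explicit condition can
retry and then fail with a readable message instead of an NPE. Both
wait_for_counters and the fetch_counters callable are hypothetical;
fetch_counters stands in for whatever lookup returns None when the job is
not known to the history server.

```python
import time


def wait_for_counters(fetch_counters, job_id, attempts=5, delay_seconds=2.0):
    """Poll fetch_counters(job_id) until it returns something other than None.

    fetch_counters is a caller-supplied (hypothetical) function that returns
    the job's counters, or None while the job is not yet visible.
    """
    for attempt in range(attempts):
        counters = fetch_counters(job_id)
        if counters is not None:
            return counters
        # Sleep between attempts, but not after the final one.
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    raise RuntimeError(
        "no counters available for %s after %d attempts; "
        "the history server may not have processed its .jhist file"
        % (job_id, attempts))
```

The point is only that the missing-counters case is surfaced with the job id
attached, which would make this kind of problem much easier to trace.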

I'm pretty lost on this and have been looking into it on and off for some
time, so if anyone has a thought, let me know.

-- 
Jay Vyas
http://jayunit100.blogspot.com