Posted to dev@mahout.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/03/30 07:31:25 UTC

[jira] [Commented] (MAHOUT-1634) ALS don't work when it adds new files in Distributed Cache

    [ https://issues.apache.org/jira/browse/MAHOUT-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217424#comment-15217424 ] 

ASF GitHub Bot commented on MAHOUT-1634:
----------------------------------------

GitHub user smarthi opened a pull request:

    https://github.com/apache/mahout/pull/208

    MAHOUT-1634: ALS don't work when it adds new files in Distributed Cache

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/smarthi/mahout MAHOUT-1634

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/mahout/pull/208.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #208
    
----
commit 80c3d05b64c643f1c47dc0cd88344a17fff40cd9
Author: smarthi <sm...@apache.org>
Date:   2016-03-30T05:29:59Z

    MAHOUT-1634: ALS don't work when it adds new files in Distributed Cache

----


> ALS don't work when it adds new files in Distributed Cache
> ----------------------------------------------------------
>
>                 Key: MAHOUT-1634
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1634
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.10.1
>         Environment: Cloudera 5.1 VM, eclipse, zookeeper
>            Reporter: Cristian Galán
>            Assignee: Suneel Marthi
>              Labels: ALS, legacy
>             Fix For: 0.12.0
>
>         Attachments: mahout.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The ALS algorithm uses the distributed cache for temporary files, but the distributed cache has other uses too, especially adding dependencies
> (http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/), so when a dependency library (or any other file) is added to a Hadoop job, ALS fails because it reads ALL files in the distributed cache without distinction.
> This occurs in my company's project because we need to add the Mahout dependencies (mahout, lucene, ...) to a Hadoop Configuration in order to run Mahout jobs; otherwise the Mahout job fails because it cannot find its dependencies.
> I propose two options (I think both are valid):
> 1) Exclude all .jar files from the result of HadoopUtil.getCacheFiles
> 2) Exclude every Path object that does not match /part-*
> I prefer the first because it is less aggressive, and I think it will resolve all the problems.
> P.S.: Sorry if my English is wrong.
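
For illustration only, the two filtering options proposed above could look roughly like the sketch below. It is not the code from the attached patch or from the pull request. The class name CacheFileFilter, its methods, and the sample paths are all hypothetical, and plain strings stand in for Hadoop Path objects so the example runs without Hadoop on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Mahout: shows the two proposed filters
// applied to a list of distributed cache entries given as plain strings.
class CacheFileFilter {

  // Option 1: drop every entry whose file name ends in ".jar".
  static List<String> excludeJars(List<String> cachePaths) {
    List<String> kept = new ArrayList<>();
    for (String path : cachePaths) {
      if (!fileName(path).toLowerCase(Locale.ROOT).endsWith(".jar")) {
        kept.add(path);
      }
    }
    return kept;
  }

  // Option 2: keep only entries whose file name starts with "part-".
  static List<String> keepPartFiles(List<String> cachePaths) {
    List<String> kept = new ArrayList<>();
    for (String path : cachePaths) {
      if (fileName(path).startsWith("part-")) {
        kept.add(path);
      }
    }
    return kept;
  }

  // Returns the last path segment (everything after the final '/').
  static String fileName(String path) {
    int slash = path.lastIndexOf('/');
    return slash < 0 ? path : path.substring(slash + 1);
  }

  public static void main(String[] args) {
    // Made-up cache contents: one ALS part file, two dependency jars,
    // and one unrelated side file.
    List<String> cache = Arrays.asList(
        "/tmp/als/U-1/part-m-00000",
        "/libs/mahout-mr.jar",
        "/libs/lucene-core.jar",
        "/tmp/als/lookup.txt");

    System.out.println("Option 1: " + excludeJars(cache));
    System.out.println("Option 2: " + keepPartFiles(cache));
  }
}
```

With the sample input, option 1 keeps the part file and lookup.txt, while option 2 keeps only the part file. That difference is the sense in which the reporter calls the first option less aggressive: it removes only the entries known to be dependencies, whereas the second discards anything that is not named like a part file.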



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)