Posted to dev@mahout.apache.org by "Grant Ingersoll (JIRA)" <ji...@apache.org> on 2013/06/09 17:48:20 UTC

[jira] [Comment Edited] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop

    [ https://issues.apache.org/jira/browse/MAHOUT-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679090#comment-13679090 ] 

Grant Ingersoll edited comment on MAHOUT-1247 at 6/9/13 3:47 PM:
-----------------------------------------------------------------

I think I see the issue.  The cache file is "local", but the Iterator has a Hadoop conf that expects an HDFS file, so it can't find it.

For instance, the logs show:
{quote}11:38:49,638 INFO org.apache.mahout.vectorizer.term.TFPartialVectorReducer: Cache Files: [/tmp/hadoop-grantingersoll/mapred/local/taskTracker/distcache/2677051046998143225_1262960862_697707077/localhostdicVec/dictionary.file-0]
2013{quote}

Notice that the path is missing the scheme.  I'm going to try explicitly setting the scheme to file://
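
To illustrate why a missing scheme matters, here is a minimal sketch that uses only java.net.URI, not Hadoop's Path or FileSystem classes. It shows that a bare path like the one in the log carries no scheme, so whatever resolves it is free to apply its own default (HDFS, in the case described above). The class name CachePathScheme, the helper withScheme, and the shortened sample path are hypothetical and chosen for illustration only; this is not the fix that went into Mahout.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class CachePathScheme {

    // Returns the URI unchanged if it already names a scheme; otherwise
    // attaches the given scheme (e.g. "file") so the location is explicit.
    static URI withScheme(String path, String scheme) {
        URI uri = URI.create(path);
        if (uri.getScheme() != null) {
            return uri;
        }
        try {
            return new URI(scheme, null, uri.getPath(), null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, shortened stand-in for the cache path in the log.
        String cached = "/tmp/distcache/dictionary.file-0";

        // A bare path has no scheme at all.
        System.out.println("scheme before: " + URI.create(cached).getScheme());

        // After qualification the scheme is explicit.
        System.out.println("scheme after:  " + withScheme(cached, "file").getScheme());
    }
}
```

A URI that already has a scheme (an hdfs:// location, say) passes through untouched, so the helper only affects the ambiguous case.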
                
> cluster-reuters doesn't work on Hadoop
> --------------------------------------
>
>                 Key: MAHOUT-1247
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1247
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>             Fix For: 0.8
>
>
> At least two issues:
> 1. MAHOUT-992 messed up the Distributed Cache stuff somehow
> 2. The ExtractReuters data is not being moved to HDFS.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: [jira] [Comment Edited] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop

Posted by Sebastian Schelter <ss...@googlemail.com>.
A makeQualified call should help in case the file is not found:

LocalFileSystem localFs = FileSystem.getLocal(conf);
Path localCacheFile = localFs.makeQualified(localFiles[0]);

If you run in local mode (i.e. not on a cluster), you may have to fall back to loading the file directly, as is done in
org.apache.mahout.cf.taste.hadoop.als.ALS#readMatrixByRowsFromDistributedCache
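
The fallback idea can be sketched roughly as follows. This is not the code from the ALS class and it does not use the Hadoop API; it swaps in java.nio.file so that it runs standalone. The assumption is that the list of localized cache files may be null, empty, or point at files that are not present when running in local mode, in which case the original path is used instead. The names CacheFileFallback and resolve, and the sample paths, are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

final class CacheFileFallback {

    // Prefers the first localized cache file that exists on local disk;
    // falls back to the original path when the list is null or empty,
    // or when none of its entries exists.
    static Path resolve(List<Path> localizedCacheFiles, Path originalPath) {
        if (localizedCacheFiles != null) {
            for (Path candidate : localizedCacheFiles) {
                if (Files.exists(candidate)) {
                    return candidate;
                }
            }
        }
        return originalPath;
    }

    public static void main(String[] args) {
        // Hypothetical paths, for illustration only.
        Path missing = Paths.get("no-such-distcache-dir", "dictionary.file-0");
        Path original = Paths.get("dicVec", "dictionary.file-0");

        // Nothing was localized, so the original path is chosen.
        System.out.println(resolve(Collections.singletonList(missing), original));
    }
}
```

In real Hadoop code the chosen path would still need to be opened through the matching FileSystem, which is where the makeQualified call above comes in.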

Best,
Sebastian
