Posted to dev@mahout.apache.org by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org> on 2015/06/18 21:51:04 UTC

[jira] [Updated] (MAHOUT-1629) Mahout cvb on AWS EMR: p(topic|docId) doesn't make sense when using s3 folder as --input

     [ https://issues.apache.org/jira/browse/MAHOUT-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-1629:
-------------------------------------
    Assignee: Suneel Marthi

> Mahout cvb on AWS EMR: p(topic|docId) doesn't make sense when using s3 folder as --input
> ----------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1629
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1629
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.9
>         Environment: AWS EMR with AMI 3.2.3
>            Reporter: Markus Paaso
>            Assignee: Suneel Marthi
>              Labels: legacy
>
> When running the 'mahout cvb' command on AWS EMR with the --input option set to a value like s3://mybucket/input/ or s3://mybucket/input/* (7 input files in my case), the content of the doc-topic output is nonsensical. It seems that the docIds in the doc-topic output are shuffled, while the topic model output (p(term|topic) for each topic) still looks fine.
> The workaround is to first copy the input files from s3 to the cluster's HDFS with the command:
>  {code:none}hadoop fs -cp s3://mybucket/input /input{code}
> and then run mahout cvb with the option --input /input .
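
The workaround quoted above could be scripted roughly as follows. This is a minimal sketch, not something taken from the issue: only the s3://mybucket/input and /input paths come from the report, while the DRY_RUN switch and the run and stage_and_run_cvb helper names are hypothetical. Any further cvb options are passed through unchanged rather than guessed at. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the workaround: stage the S3 input on HDFS, then point cvb at HDFS.
# The two paths below are the ones given in the report; override via environment.
S3_INPUT="${S3_INPUT:-s3://mybucket/input}"
HDFS_INPUT="${HDFS_INPUT:-/input}"
# Hypothetical safety switch: 1 prints the commands, 0 executes them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Copy the input to HDFS first, then run cvb against the HDFS copy.
# Extra arguments are forwarded to 'mahout cvb' as-is.
stage_and_run_cvb() {
  run hadoop fs -cp "$S3_INPUT" "$HDFS_INPUT" &&
  run mahout cvb --input "$HDFS_INPUT" "$@"
}

stage_and_run_cvb "$@"
```

Setting DRY_RUN=0 executes the two commands on the cluster. The cvb step runs only if the copy succeeds, so a failed copy cannot leave the job reading a partial input directory.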



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)