Posted to dev@mahout.apache.org by "Grant Ingersoll (JIRA)" <ji...@apache.org> on 2013/06/11 16:12:20 UTC

[jira] [Resolved] (MAHOUT-1233) Problem in processing datasets as a single chunk vs many chunks in HADOOP mode in almost all the clustering algos

     [ https://issues.apache.org/jira/browse/MAHOUT-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved MAHOUT-1233.
-------------------------------------

    Resolution: Incomplete

Please reopen if you have a repeatable test case, as I am not sure there is an issue here.
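
For reference, a minimal sketch of how such a repeatable test case could be set up: write the same fixed-seed vectors once as a single SequenceFile and once split across several chunk files, run the same clustering job on both inputs, and compare the record counts of the two outputs. This is not from the original thread; the paths and class name below are illustrative, and it assumes the usual Mahout clustering input of Text keys and VectorWritable values wrapping NamedVectors.

import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

/** Writes the same 1000 random vectors as one chunk and as 10 chunks. */
public class ChunkTestDataWriter {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Random rng = new Random(42);                  // fixed seed => repeatable
    int numVectors = 1000, dim = 10, numChunks = 10;

    SequenceFile.Writer single = SequenceFile.createWriter(
        fs, conf, new Path("testdata/single/part-0"), Text.class, VectorWritable.class);
    SequenceFile.Writer[] chunks = new SequenceFile.Writer[numChunks];
    for (int c = 0; c < numChunks; c++) {
      chunks[c] = SequenceFile.createWriter(
          fs, conf, new Path("testdata/chunked/part-" + c), Text.class, VectorWritable.class);
    }

    for (int i = 0; i < numVectors; i++) {
      double[] values = new double[dim];
      for (int d = 0; d < dim; d++) {
        values[d] = rng.nextDouble();
      }
      // NamedVector carries the vector id through clustering into the output
      VectorWritable vw = new VectorWritable(new NamedVector(new DenseVector(values), "v" + i));
      Text key = new Text("v" + i);
      single.append(key, vw);                     // everything into one file
      chunks[i % numChunks].append(key, vw);      // round-robin into 10 files
    }

    single.close();
    for (SequenceFile.Writer w : chunks) {
      w.close();
    }
    // Run the same clustering job (e.g. kmeans) over testdata/single and
    // testdata/chunked, then compare record counts of the two clusteredPoints.
  }
}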
                
> Problem in processing datasets as a single chunk vs many chunks in HADOOP mode in almost all the clustering algos
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1233
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1233
>             Project: Mahout
>          Issue Type: Question
>          Components: Clustering
>    Affects Versions: 0.7, 0.8
>            Reporter: yannis ats
>            Assignee: yannis ats
>            Priority: Minor
>             Fix For: 0.8
>
>
> I am processing a dataset in two ways: first as a single chunk (the whole dataset in one file), and second as many smaller chunks, in order to increase the throughput of my machine.
> When I run the single-chunk computation the results are fine, and by fine I mean that with 1000 vectors in the input I get 1000 vector IDs with their cluster IDs in the output (I have tried Canopy, k-Means, and Fuzzy k-Means).
> However, when I split the dataset to speed up the computation, strange phenomena occur. For instance, when the same 1000-vector dataset is split into, say, 10 files, the output contains more vector IDs (e.g. 1100 vector IDs with their corresponding cluster IDs).
> The question is: am I doing something wrong in the process? Is there a problem in clusterdump or seqdumper when the input is spread over many files? I have observed that while Mahout is running, the console reports that the correct number of vectors was processed (see the counting sketch after this issue text). Am I missing something?
> As input I use the Weka vectors transformed to mvc.
> I have tried this with v0.7 and the v0.8 snapshot.
> Thank you in advance for your time.
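
Not part of the original report, but one way to check where the extra IDs come from is to count the records in the clustered output directly, bypassing clusterdump and seqdumper. The sketch below (the path is illustrative; it assumes the usual Hadoop SequenceFile output under a clusteredPoints directory) simply sums the records across all part files, so the total can be compared against the 1000 input vectors:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

/** Counts the records across all part files of a SequenceFile directory. */
public class CountClusteredPoints {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path dir = new Path(args[0]);                 // e.g. output/clusteredPoints
    long total = 0;
    for (FileStatus status : fs.listStatus(dir)) {
      if (!status.getPath().getName().startsWith("part-")) {
        continue;                                 // skip _SUCCESS, _logs, etc.
      }
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
      try {
        // Instantiate key/value generically so this works for any clustering output
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
          total++;
        }
      } finally {
        reader.close();
      }
    }
    System.out.println("total records: " + total);
  }
}

If this count already exceeds 1000, the duplicates are produced by the clustering job itself; if it matches, the dump tools are the place to look.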

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira