Posted to dev@mahout.apache.org by "Paritosh Ranjan (Created) (JIRA)" <ji...@apache.org> on 2012/01/04 18:14:40 UTC

[jira] [Created] (MAHOUT-940) Clusterdumper - Get rid of map based implementation

Clusterdumper - Get rid of map based implementation
---------------------------------------------------

                 Key: MAHOUT-940
                 URL: https://issues.apache.org/jira/browse/MAHOUT-940
             Project: Mahout
          Issue Type: Improvement
          Components: Clustering
    Affects Versions: 0.6
            Reporter: Paritosh Ranjan
             Fix For: 0.7


Current implementation of ClusterDumper puts clusters and their related vectors in a map. This generally results in OOM.

Since ClusterOutputProcessor is available now, the ClusterDumper will first process the clusteredPoints and then write the clusters to a local file.

The inability to properly read the clustering output because ClusterDumper hits OOM is seen too often on the mailing list. This improvement will fix that problem.
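
To illustrate the difference, here is a minimal sketch in plain Java. It is not Mahout code: ClusterDumpSketch, ClusteredPoint, groupInMemory and streamToFiles are hypothetical names, and the one-vector-per-line text output is only an assumption for the example. The first method keeps every vector on the heap until the end (the map based idea); the second appends each vector to a per-cluster file and then lets it go.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in types; none of these names come from Mahout.
class ClusterDumpSketch {

  /** One clustered point: the id of its cluster and its vector values. */
  static final class ClusteredPoint {
    final int clusterId;
    final double[] vector;

    ClusteredPoint(int clusterId, double[] vector) {
      this.clusterId = clusterId;
      this.vector = vector;
    }
  }

  /** Map based idea: every vector stays referenced until the whole input is read. */
  static Map<Integer, List<double[]>> groupInMemory(Iterator<ClusteredPoint> points) {
    Map<Integer, List<double[]>> byCluster = new HashMap<>();
    while (points.hasNext()) {
      ClusteredPoint p = points.next();
      byCluster.computeIfAbsent(p.clusterId, id -> new ArrayList<>()).add(p.vector);
    }
    return byCluster; // heap usage grows with the total number of points
  }

  /** Streaming idea: each vector is appended to its cluster's file, then dropped. */
  static void streamToFiles(Iterator<ClusteredPoint> points, Path outputDir) throws IOException {
    Files.createDirectories(outputDir);
    while (points.hasNext()) {
      ClusteredPoint p = points.next();
      Path file = outputDir.resolve("cluster-" + p.clusterId + ".txt");
      // Reopening the file per point is slow but keeps the sketch short and memory flat.
      try (BufferedWriter writer = Files.newBufferedWriter(
          file, StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
        writer.write(Arrays.toString(p.vector));
        writer.newLine();
      }
    }
  }
}
```

A real change would presumably keep writers open per cluster and read from the clusteredPoints sequence files rather than from an in-memory iterator; both are left out here to keep the sketch self-contained.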

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (MAHOUT-940) Clusterdumper - Get rid of map based implementation

Posted by "Paritosh Ranjan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paritosh Ranjan resolved MAHOUT-940.
------------------------------------

       Resolution: Won't Fix
    Fix Version/s:     (was: Backlog)
                   0.8

Marking it Won't Fix for the reasons stated in the last few comments.
                
> Clusterdumper - Get rid of map based implementation
> ---------------------------------------------------
>
>                 Key: MAHOUT-940
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-940
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>            Assignee: Paritosh Ranjan
>              Labels: clustering
>             Fix For: 0.8
>
>
> Current implementation of ClusterDumper puts clusters and their related vectors in a map. This generally results in OOM.
> Since ClusterOutputProcessor is available now, the ClusterDumper will first process the clusteredPoints and then write the clusters to a local file.
> The inability to properly read the clustering output because ClusterDumper hits OOM is seen too often on the mailing list. This improvement will fix that problem.


        

[jira] [Assigned] (MAHOUT-940) Clusterdumper - Get rid of map based implementation

Posted by "Paritosh Ranjan (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paritosh Ranjan reassigned MAHOUT-940:
--------------------------------------

    Assignee: Paritosh Ranjan
    

        

[jira] [Commented] (MAHOUT-940) Clusterdumper - Get rid of map based implementation

Posted by "Paritosh Ranjan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399196#comment-13399196 ] 

Paritosh Ranjan commented on MAHOUT-940:
----------------------------------------

Considering that nobody has encountered this issue in the last few months, and that we now have ClusterOutputPostProcessor (clusterpp), which can handle any number of clusters/vectors, I would like to drop this issue.

I propose marking it Won't Fix unless someone feels otherwise. I will wait a week or so and then mark it Won't Fix.
                

        

[jira] [Commented] (MAHOUT-940) Clusterdumper - Get rid of map based implementation

Posted by "Saikat Kanjilal (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243165#comment-13243165 ] 

Saikat Kanjilal commented on MAHOUT-940:
----------------------------------------

Paritosh, 
I'm assuming OOM means out of memory; is that correct? So, to be clear, is the solution to use the ClusteredOutputProcessor instead of the ClusterDumper? I will research the code and move forward with an implementation. I will ask questions if I get stuck.
                

        

[jira] [Commented] (MAHOUT-940) Clusterdumper - Get rid of map based implementation

Posted by "Saikat Kanjilal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267676#comment-13267676 ] 

Saikat Kanjilal commented on MAHOUT-940:
----------------------------------------

Paritosh,
I have not had the bandwidth to do anything else on this. I will try to script this in the next week or so, but I am swamped at work.
Thanks
                

        

[jira] [Commented] (MAHOUT-940) Clusterdumper - Get rid of map based implementation

Posted by "Saikat Kanjilal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277088#comment-13277088 ] 

Saikat Kanjilal commented on MAHOUT-940:
----------------------------------------

Paritosh, I was going to start writing some code to generate these large files. Should we check this into the Mahout tree, or will this be a one-time thing?
                

        

[jira] [Commented] (MAHOUT-940) Clusterdumper - Get rid of map based implementation

Posted by "Saikat Kanjilal (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249578#comment-13249578 ] 

Saikat Kanjilal commented on MAHOUT-940:
----------------------------------------

Paritosh,
I have some time to work on this today. Are there particular unit tests that cause the OOM? Maybe I can start where the problem happens and work backwards from there. The unit tests on trunk seem to compile and run fine on my Mac. Some more examples would help.

Thanks
                

        

[jira] [Issue Comment Edited] (MAHOUT-940) Clusterdumper - Get rid of map based implementation

Posted by "Paritosh Ranjan (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244952#comment-13244952 ] 

Paritosh Ranjan edited comment on MAHOUT-940 at 4/3/12 3:52 AM:
----------------------------------------------------------------

1) yes
2) It might be a good idea to do some testing before and after your code change, i.e. running all JUnit tests and doing some manual testing with clusterdumper (dump a cluster with the new implementation that was getting OOM with the older implementation). That will make sure the code is working.

Also, you can try to test quality before and after using the post processor, i.e. the results should be the same whether you use the map based or the post processor based implementation.

So, to test it, do not get rid of the older code; rather, provide an option to use either the map based or the post processor based implementation. This will help in testing. Later it can be decided which version to keep, i.e. new or both.
                
      was (Author: paritoshranjan):
    1) yes
2) It might be a good idea to do some testing before/after your code change. i.e. Running all Junit tests, and some manual testing using clusterdumper ( dump a cluster using new implementation which was getting OOM with the older implementation). It will make sure that the code is working.

Also, you can also try to test quality before after using the post processor. i.e. The results should be same, whether you use the map based or post processor based implementation.

So, to test it, do not get rid of the older coder, rather provide an option to use the map based/post processor based implementation. This will help in testing. Later it can be decided which version to keep i.e. new/both.
                  

        

[jira] [Commented] (MAHOUT-940) Clusterdumper - Get rid of map based implementation

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399199#comment-13399199 ] 

Sean Owen commented on MAHOUT-940:
----------------------------------

(Fine by me.) 
(For what it's worth, I have no problem just marking things WontFix if you have any reasonable feeling that it will not be controversial. These things can always be un-done if anyone objects. Just saves time and another 'hop' in the process.)
                

        

[jira] [Commented] (MAHOUT-940) Clusterdumper - Get rid of map based implementation

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247007#comment-13247007 ] 

Paritosh Ranjan commented on MAHOUT-940:
----------------------------------------

I am not familiar with this code either. It would be great if you could find a way through.
The point is to fix the OOM.
                

        

[jira] [Commented] (MAHOUT-940) Clusterdumper - Get rid of map based implementation

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257303#comment-13257303 ] 

Paritosh Ranjan commented on MAHOUT-940:
----------------------------------------

I think the best way would be to write some code which generates sequence files with a large number of random vectors.
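
A rough sketch of such a generator, under loud assumptions: it writes a plain length-prefixed binary file, not a Hadoop SequenceFile, so it only shows the shape of the idea. RandomVectorFileSketch and writeRandomVectors are hypothetical names; a real version would write the vectors through the Hadoop and Mahout writable classes instead of DataOutputStream.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

// Hypothetical helper, not part of Mahout. The on-disk layout is an assumption:
// for each vector, an int key, an int dimension, then `dimension` doubles.
class RandomVectorFileSketch {

  /** Writes `count` random dense vectors to `file` and returns the file size in bytes. */
  static long writeRandomVectors(Path file, int count, int dimension, long seed)
      throws IOException {
    Random random = new Random(seed); // fixed seed so a failing run can be reproduced
    try (DataOutputStream out = new DataOutputStream(
        new BufferedOutputStream(Files.newOutputStream(file)))) {
      for (int i = 0; i < count; i++) {
        out.writeInt(i);          // key: running index of the vector
        out.writeInt(dimension);  // length prefix
        for (int d = 0; d < dimension; d++) {
          out.writeDouble(random.nextDouble());
        }
      }
    }
    return Files.size(file);
  }
}
```

Only one vector's worth of data is in flight at a time, so the generator itself should not run out of memory even for very large counts.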
                

        

[jira] [Commented] (MAHOUT-940) Clusterdumper - Get rid of map based implementation

Posted by "Saikat Kanjilal (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257279#comment-13257279 ] 

Saikat Kanjilal commented on MAHOUT-940:
----------------------------------------

OK, sorry about the delayed response; I have been swamped with work. I ran a few examples with the Reuters data set and couldn't repro the OOM issue. I believe this used the ClusterDumper, from what I could see in the output. Are there other datasets I can try?
                

        

[jira] [Commented] (MAHOUT-940) Clusterdumper - Get rid of map based implementation

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244952#comment-13244952 ] 

Paritosh Ranjan commented on MAHOUT-940:
----------------------------------------

1) yes
2) It might be a good idea to do some testing before and after your code change, i.e. running all JUnit tests and doing some manual testing with clusterdumper (dump a cluster with the new implementation that was getting OOM with the older implementation). That will make sure the code is working.

Also, you can try to test quality before and after using the post processor, i.e. the results should be the same whether you use the map based or the post processor based implementation.

So, to test it, do not get rid of the older code; rather, provide an option to use either the map based or the post processor based implementation. This will help in testing. Later it can be decided which version to keep, i.e. new or both.
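
One possible shape for such an option, as a sketch only: a mode flag selects the strategy, and a test compares the output of both. DumpModeSketch, Mode and pointsPerCluster are hypothetical names, and counting points per cluster stands in for whatever the dumper really outputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; not the ClusterDumper API.
class DumpModeSketch {

  enum Mode { MAP_BASED, STREAMING }

  /**
   * Returns cluster id -> number of points, computed with the selected strategy.
   * Each assignment is an int pair {clusterId, pointId}.
   */
  static Map<Integer, Integer> pointsPerCluster(Mode mode, Iterable<int[]> assignments) {
    Map<Integer, Integer> counts = new TreeMap<>();
    if (mode == Mode.MAP_BASED) {
      // Older style: hold every point id per cluster, then summarise at the end.
      Map<Integer, List<Integer>> held = new TreeMap<>();
      for (int[] a : assignments) {
        held.computeIfAbsent(a[0], id -> new ArrayList<>()).add(a[1]);
      }
      for (Map.Entry<Integer, List<Integer>> e : held.entrySet()) {
        counts.put(e.getKey(), e.getValue().size());
      }
    } else {
      // Newer style: update the summary per point and keep nothing else.
      for (int[] a : assignments) {
        counts.merge(a[0], 1, Integer::sum);
      }
    }
    return counts;
  }
}
```

A test can then assert that pointsPerCluster(Mode.MAP_BASED, data) equals pointsPerCluster(Mode.STREAMING, data) on the same input, which is the "results should be the same" check described above.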
                

        

[jira] [Commented] (MAHOUT-940) Clusterdumper - Get rid of map based implementation

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249859#comment-13249859 ] 

Paritosh Ranjan commented on MAHOUT-940:
----------------------------------------

Clustering a large dataset and then running clusterdumper will give an OOM, as all the data will be in a map.

A new test case is needed that fails with an OOM.
The new implementation can then be tried out, and it should eventually let the JUnit test pass.
                

        

[jira] [Commented] (MAHOUT-940) Clusterdumper - Get rid of map based implementation

Posted by "Saikat Kanjilal (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244943#comment-13244943 ] 

Saikat Kanjilal commented on MAHOUT-940:
----------------------------------------

So after researching this some more:
1) I don't see any class called ClusteredOutputProcessor, so I assume you mean ClusteredOutputPostProcessor. Is that correct?
2) Looking at ClusterDumper more closely, I see two maps, one for the postProcessingClusteredDirectories and the other called writersForClusters. I will replace both of those with the APIs nested inside ClusteredOutputPostProcessor.

Let me know if you see any concerns with the above approach.
                

        

[jira] [Commented] (MAHOUT-940) Clusterdumper - Get rid of map based implementation

Posted by "Saikat Kanjilal (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247008#comment-13247008 ] 

Saikat Kanjilal commented on MAHOUT-940:
----------------------------------------

Ha, that's funny. OK, time to keep digging and see what I can figure out.
                

        

[jira] [Commented] (MAHOUT-940) Clusterdumper - Get rid of map based implementation

Posted by "Paritosh Ranjan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267601#comment-13267601 ] 

Paritosh Ranjan commented on MAHOUT-940:
----------------------------------------

I will not be able to resolve this issue in 0.7, nor is it that important. But I would certainly like to take a shot at it later.
Can we move this to 0.8?
                
> Clusterdumper - Get rid of map based implementation
> ---------------------------------------------------
>
>                 Key: MAHOUT-940
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-940
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>            Assignee: Paritosh Ranjan
>             Fix For: 0.7
>
>
> The current implementation of ClusterDumper puts clusters and their related vectors in a map. This generally results in OOM.
> Since ClusterOutputProcessor is available now, the ClusterDumper will first process the clusteredPoints and then write the clusters to a local file.
> The inability to properly read the clustering output because ClusterDumper runs out of memory is seen too often on the mailing list. This improvement will fix that problem.


[jira] [Updated] (MAHOUT-940) Clusterdumper - Get rid of map based implementation

Posted by "Paritosh Ranjan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paritosh Ranjan updated MAHOUT-940:
-----------------------------------

    Fix Version/s:     (was: 0.7)
                   Backlog
           Labels: clustering  (was: )
    
> Clusterdumper - Get rid of map based implementation
> ---------------------------------------------------
>
>                 Key: MAHOUT-940
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-940
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>            Assignee: Paritosh Ranjan
>              Labels: clustering
>             Fix For: Backlog
>
>
> The current implementation of ClusterDumper puts clusters and their related vectors in a map. This generally results in OOM.
> Since ClusterOutputProcessor is available now, the ClusterDumper will first process the clusteredPoints and then write the clusters to a local file.
> The inability to properly read the clustering output because ClusterDumper runs out of memory is seen too often on the mailing list. This improvement will fix that problem.


[jira] [Commented] (MAHOUT-940) Clusterdumper - Get rid of map based implementation

Posted by "Saikat Kanjilal (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247003#comment-13247003 ] 

Saikat Kanjilal commented on MAHOUT-940:
----------------------------------------

Still reading the code to get a deeper understanding of what's happening. Some more questions:

1) The createClusterWriter method inside ClusterDumper creates one of three types of writers depending on the outputFormat. One of the arguments to these writers is the map in question, shown below:

private Map<Integer, List<WeightedVectorWritable>> clusterIdToPoints;

It's not clear to me whether we need a deeper refactoring to rewrite or replace these different types of writers with the ClusterOutputPostProcessor. Any thoughts on this? Should there be a choice between using the writers and using the ClusterOutputPostProcessor?

2) For the following line of code:
long numWritten = clusterWriter.write(new SequenceFileDirValueIterable<ClusterWritable>(new Path(seqFileDir, "part-*"), PathType.GLOB, conf));

Does the above just use an iterator to dump the points to different directories corresponding to the different clusters? The code is really hard to read, and SequenceFileDirValueIterable is not well commented.
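
To make the memory concern behind question 1 concrete, here is a minimal sketch contrasting the two approaches. It uses plain Java only and does not call any Mahout or Hadoop API: ClusterDumpSketch, ClusteredPoint, groupInMemory, streamToFiles and the cluster-<id>.txt file naming are all hypothetical names invented for illustration, and points are simplified to strings. The map-based method mirrors the shape of clusterIdToPoints and holds every point until the end, while the streaming method appends each point to a per-cluster file as soon as it is read.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Illustrative only: none of these names exist in Mahout.
class ClusterDumpSketch {

  /** Hypothetical stand-in for a (clusterId, point) record. */
  static final class ClusteredPoint {
    final int clusterId;
    final String point;

    ClusteredPoint(int clusterId, String point) {
      this.clusterId = clusterId;
      this.point = point;
    }
  }

  /** Map-based approach: memory grows with the total number of points. */
  static Map<Integer, List<String>> groupInMemory(Iterator<ClusteredPoint> points) {
    Map<Integer, List<String>> clusterIdToPoints = new HashMap<>();
    while (points.hasNext()) {
      ClusteredPoint p = points.next();
      clusterIdToPoints.computeIfAbsent(p.clusterId, k -> new ArrayList<>()).add(p.point);
    }
    return clusterIdToPoints;
  }

  /**
   * Streaming approach: each point is written to its cluster's file and then
   * dropped, so memory grows with the number of clusters, not points.
   * Returns the number of points written.
   */
  static long streamToFiles(Iterator<ClusteredPoint> points, Path outputDir) throws IOException {
    Files.createDirectories(outputDir);
    Map<Integer, BufferedWriter> writers = new HashMap<>();
    long written = 0;
    try {
      while (points.hasNext()) {
        ClusteredPoint p = points.next();
        BufferedWriter w = writers.get(p.clusterId);
        if (w == null) {
          // One file per cluster, opened lazily on first use.
          Path file = outputDir.resolve("cluster-" + p.clusterId + ".txt");
          w = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
          writers.put(p.clusterId, w);
        }
        w.write(p.point);
        w.newLine();
        written++;
      }
    } finally {
      for (BufferedWriter w : writers.values()) {
        w.close();
      }
    }
    return written;
  }

  public static void main(String[] args) throws IOException {
    List<ClusteredPoint> sample = Arrays.asList(
        new ClusteredPoint(1, "[0.1, 0.2]"),
        new ClusteredPoint(2, "[5.0, 5.1]"),
        new ClusteredPoint(1, "[0.3, 0.1]"));
    Path dir = Files.createTempDirectory("cluster-dump");
    long n = streamToFiles(sample.iterator(), dir);
    System.out.println(n + " points written under " + dir);
  }
}
```

The streaming variant keeps one open writer per cluster, so with a very large number of clusters it would trade the heap problem for a file-handle problem. A real implementation would probably need to cap or recycle the open writers.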

Thanks for your help in getting a better understanding of this.
                
> Clusterdumper - Get rid of map based implementation
> ---------------------------------------------------
>
>                 Key: MAHOUT-940
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-940
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>            Assignee: Paritosh Ranjan
>             Fix For: 0.7
>
>
> The current implementation of ClusterDumper puts clusters and their related vectors in a map. This generally results in OOM.
> Since ClusterOutputProcessor is available now, the ClusterDumper will first process the clusteredPoints and then write the clusters to a local file.
> The inability to properly read the clustering output because ClusterDumper runs out of memory is seen too often on the mailing list. This improvement will fix that problem.
