Posted to dev@mahout.apache.org by "Han Hui Wen (JIRA)" <ji...@apache.org> on 2010/08/13 07:27:17 UTC

[jira] Created: (MAHOUT-474) Should compress output of Job pairwiseSimilarity and Job asMatrix

Should compress output of Job pairwiseSimilarity and Job asMatrix
-----------------------------------------------------------------

                 Key: MAHOUT-474
                 URL: https://issues.apache.org/jira/browse/MAHOUT-474
             Project: Mahout
          Issue Type: Improvement
            Reporter: Han Hui Wen 


!https://issues.apache.org/jira/secure/thumbnail/12451985/12451985_RowSimilarityJob-CooccurrencesMapper-SimilarityReducer.jpg!

From the picture above, we can see that the output of pairwiseSimilarity is very large, so we should compress it.

      SequenceFileOutputFormat.setOutputCompressionType(job, style);
      SequenceFileOutputFormat.setCompressOutput(job, compress);
      SequenceFileOutputFormat.setOutputCompressorClass(job, codecClass);
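
The three calls above choose the compression granularity, whether output is compressed at all, and the codec. As a rough illustration of why the granularity matters, here is a stdlib-only sketch that gzips many small records one at a time versus all together. It is not Hadoop's implementation: the class, the helper names, and the sample "rowA,rowB,similarity" records are all hypothetical, and java.util.zip stands in for a Hadoop codec.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class CompressionGranularitySketch {

    /** Gzips the given bytes in memory and returns the compressed size. */
    static int gzipSize(byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    /** Builds hypothetical "rowA,rowB,similarity" text records. */
    static List<byte[]> sampleRecords(int count) {
        List<byte[]> records = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            String line = "item" + (i % 100) + ",item" + (i % 37) + ",0." + (i % 1000) + "\n";
            records.add(line.getBytes(StandardCharsets.UTF_8));
        }
        return records;
    }

    /** Sum of the uncompressed record sizes. */
    static int rawSize(List<byte[]> records) {
        int total = 0;
        for (byte[] record : records) {
            total += record.length;
        }
        return total;
    }

    /** Each record gzipped on its own, loosely like RECORD compression. */
    static int perRecordSize(List<byte[]> records) {
        int total = 0;
        for (byte[] record : records) {
            total += gzipSize(record);
        }
        return total;
    }

    /** All records gzipped as one unit, loosely like BLOCK compression. */
    static int blockSize(List<byte[]> records) {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        for (byte[] record : records) {
            all.write(record, 0, record.length);
        }
        return gzipSize(all.toByteArray());
    }

    public static void main(String[] args) {
        List<byte[]> records = sampleRecords(10_000);
        System.out.println("raw bytes:         " + rawSize(records));
        System.out.println("per-record gzip:   " + perRecordSize(records));
        System.out.println("single-block gzip: " + blockSize(records));
    }
}
```

With records this small, the fixed gzip header and trailer dominate when each record is compressed alone, while compressing them together lets the codec exploit repetition across records. Real SequenceFile block compression works on bounded blocks of keys and values rather than one giant buffer, so the numbers here are only indicative.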

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-474) Should compress output of Job pairwiseSimilarity and Job asMatrix

Posted by "Han Hui Wen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Hui Wen  updated MAHOUT-474:
--------------------------------

    Attachment:     (was: after_patch_20100813.jpg)




[jira] Updated: (MAHOUT-474) Should compress output of Job pairwiseSimilarity and Job asMatrix

Posted by "Han Hui Wen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Hui Wen  updated MAHOUT-474:
--------------------------------

    Attachment: after_patch_20100813.jpg




[jira] Resolved: (MAHOUT-474) Should compress output of Job pairwiseSimilarity and Job asMatrix

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-474.
------------------------------

         Assignee: Sean Owen
    Fix Version/s:     (was: 0.4)
       Resolution: Not A Problem

AbstractJob already helps make intermediate stages compress their output. I don't want to force the final output to be compressed. That is up to the caller to set if desired.




[jira] Commented: (MAHOUT-474) Should compress output of Job pairwiseSimilarity and Job asMatrix

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898346#action_12898346 ] 

Sean Owen commented on MAHOUT-474:
----------------------------------

I don't doubt that compression is a good idea. But it is up to the caller, not the code. The Hadoop default is to not compress and we follow that. This is how other jobs work.

But if you are finding a problem in passing arguments, you can identify that and provide a patch that fixes argument passing.
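
One way a caller can opt in without code changes is through job configuration properties given on the command line. The sketch below only illustrates that idea and is not Mahout or Hadoop code: parseDefines and compressionOnly are hypothetical helpers, only the joined -Dkey=value form is handled, and the mapred.output.* names are the Hadoop 0.20-era keys as far as I know, so they should be checked against the Hadoop version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CompressionArgsSketch {

    /** Collects "-Dkey=value" arguments into a map and ignores everything else. */
    static Map<String, String> parseDefines(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        for (String arg : args) {
            if (arg.startsWith("-D")) {
                int eq = arg.indexOf('=');
                if (eq > 2) {
                    props.put(arg.substring(2, eq), arg.substring(eq + 1));
                }
            }
        }
        return props;
    }

    /** Keeps only the output-compression settings, e.g. to hand on to a child job. */
    static Map<String, String> compressionOnly(Map<String, String> props) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : props.entrySet()) {
            // Assumed key prefix; covers the compress, type, and codec properties.
            if (entry.getKey().startsWith("mapred.output.compress")) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] example = {
            "-Dmapred.output.compress=true",
            "-Dmapred.output.compression.type=BLOCK",
            "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
            "--input", "ratings"
        };
        System.out.println(compressionOnly(parseDefines(example)));
    }
}
```

If a job launches further jobs, the same filtered map is what would have to be carried over to them for the caller's choice to take effect at every stage.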




[jira] Updated: (MAHOUT-474) Should compress output of Job pairwiseSimilarity and Job asMatrix

Posted by "Han Hui Wen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Hui Wen  updated MAHOUT-474:
--------------------------------

    Affects Version/s: 0.4
          Description: 
!https://issues.apache.org/jira/secure/attachment/12451985/RowSimilarityJob-CooccurrencesMapper-SimilarityReducer.jpg!

From the picture above, we can see that the output of pairwiseSimilarity is very large, so we should compress it.

      SequenceFileOutputFormat.setOutputCompressionType(job, style);
      SequenceFileOutputFormat.setCompressOutput(job, compress);
      SequenceFileOutputFormat.setOutputCompressorClass(job, codecClass);

  was:
!https://issues.apache.org/jira/secure/thumbnail/12451985/12451985_RowSimilarityJob-CooccurrencesMapper-SimilarityReducer.jpg!

From the picture above, we can see that the output of pairwiseSimilarity is very large, so we should compress it.

      SequenceFileOutputFormat.setOutputCompressionType(job, style);
      SequenceFileOutputFormat.setCompressOutput(job, compress);
      SequenceFileOutputFormat.setOutputCompressorClass(job, codecClass);

        Fix Version/s: 0.4
          Component/s: Collaborative Filtering




[jira] Commented: (MAHOUT-474) Should compress output of Job pairwiseSimilarity and Job asMatrix

Posted by "Han Hui Wen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898215#action_12898215 ] 

Han Hui Wen  commented on MAHOUT-474:
-------------------------------------

I have tested this.
Before applying the patch, the output of pairwiseSimilarity was 18.7 GB.
After applying the patch, the output of pairwiseSimilarity is about 3.952 GB (= 247 MB * 16).

{code}
+      SequenceFileOutputFormat.setCompressOutput(pairwiseSimilarity, true);
+      SequenceFileOutputFormat.setOutputCompressorClass(pairwiseSimilarity, GzipCodec.class);
+      SequenceFileOutputFormat.setOutputCompressionType(pairwiseSimilarity, CompressionType.BLOCK);
{code}

Also, because RowSimilarityJob runs in a separate process, the properties set on the main job are lost.

The properties must be set by the RecommenderJob.
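
For reference, the size figures reported in this comment can be checked with a few lines of arithmetic. This sketch assumes that "18,7G" means 18.7 GB, that the units are decimal (1 GB = 1000 MB), and that the factor 16 is a count of equally sized output files of about 247 MB each; none of that is stated explicitly, and the class and method names are made up for illustration.

```java
public class CompressionSavingsSketch {

    /** Total size in GB for a number of equally sized files, using 1 GB = 1000 MB. */
    static double totalGb(int files, double mbPerFile) {
        return files * mbPerFile / 1000.0;
    }

    /** How many times smaller the compressed output is. */
    static double ratio(double beforeGb, double afterGb) {
        return beforeGb / afterGb;
    }

    /** Fraction of the original size that compression removed. */
    static double savedFraction(double beforeGb, double afterGb) {
        return 1.0 - afterGb / beforeGb;
    }

    public static void main(String[] args) {
        double before = 18.7;               // reported size before the patch (assumed GB)
        double after = totalGb(16, 247.0);  // reported as 247M * 16
        System.out.println("after (GB):     " + after);
        System.out.println("ratio:          " + ratio(before, after));
        System.out.println("saved fraction: " + savedFraction(before, after));
    }
}
```

Under those assumptions the compressed output is roughly 4.7 times smaller, which is a saving of about 79 percent.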

