Posted to dev@mahout.apache.org by "Han Hui Wen (JIRA)" <ji...@apache.org> on 2010/08/13 07:27:17 UTC
[jira] Created: (MAHOUT-474) Should compress output of Job
pairwiseSimilarity and Job asMatrix
Should compress output of Job pairwiseSimilarity and Job asMatrix
-----------------------------------------------------------------
Key: MAHOUT-474
URL: https://issues.apache.org/jira/browse/MAHOUT-474
Project: Mahout
Issue Type: Improvement
Reporter: Han Hui Wen
!https://issues.apache.org/jira/secure/thumbnail/12451985/12451985_RowSimilarityJob-CooccurrencesMapper-SimilarityReducer.jpg!
From the picture above, we can see that the output of pairwiseSimilarity is very large; we should compress it.
SequenceFileOutputFormat.setOutputCompressionType(job, style);
SequenceFileOutputFormat.setCompressOutput(job, compress);
SequenceFileOutputFormat.setOutputCompressorClass(job, codecClass);
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-474) Should compress output of Job
pairwiseSimilarity and Job asMatrix
Posted by "Han Hui Wen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Han Hui Wen updated MAHOUT-474:
--------------------------------
Attachment: (was: after_patch_20100813.jpg)
> Should compress output of Job pairwiseSimilarity and Job asMatrix
> -----------------------------------------------------------------
>
> Key: MAHOUT-474
> URL: https://issues.apache.org/jira/browse/MAHOUT-474
> Project: Mahout
> Issue Type: Improvement
> Components: Collaborative Filtering
> Affects Versions: 0.4
> Reporter: Han Hui Wen
> Assignee: Sean Owen
>
> !https://issues.apache.org/jira/secure/attachment/12451985/RowSimilarityJob-CooccurrencesMapper-SimilarityReducer.jpg!
> From the picture above, we can see that the output of pairwiseSimilarity is very large; we should compress it.
> SequenceFileOutputFormat.setOutputCompressionType(job, style);
> SequenceFileOutputFormat.setCompressOutput(job, compress);
> SequenceFileOutputFormat.setOutputCompressorClass(job, codecClass);
[jira] Updated: (MAHOUT-474) Should compress output of Job
pairwiseSimilarity and Job asMatrix
Posted by "Han Hui Wen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Han Hui Wen updated MAHOUT-474:
--------------------------------
Attachment: after_patch_20100813.jpg
[jira] Resolved: (MAHOUT-474) Should compress output of Job
pairwiseSimilarity and Job asMatrix
Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved MAHOUT-474.
------------------------------
Assignee: Sean Owen
Fix Version/s: (was: 0.4)
Resolution: Not A Problem
AbstractJob already helps make intermediate stages compress their output. I don't want to force the final output to be compressed. That is up to the caller to set if desired.
[jira] Commented: (MAHOUT-474) Should compress output of Job
pairwiseSimilarity and Job asMatrix
Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898346#action_12898346 ]
Sean Owen commented on MAHOUT-474:
----------------------------------
I don't doubt that compression is a good idea. But it is up to the caller, not the code. The Hadoop default is to not compress and we follow that. This is how other jobs work.
But if you are finding a problem in passing arguments, you can identify that and provide a patch that fixes argument passing.
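As a rough sketch of what "up to the caller" could look like, the snippet below builds the generic `-D` options a caller might pass on the command line to request gzip block compression of job output. The property names are the ones the Hadoop 0.20-era `FileOutputFormat` and `SequenceFileOutputFormat` setters write into the job configuration. The class and method names are hypothetical and not part of Mahout or Hadoop. Whether such options actually reach a job launched from another job is exactly the argument-passing question raised here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Mahout or Hadoop.
final class OutputCompressionOptions {

    // Configuration keys set by the Hadoop 0.20-era output-format setters
    // (setCompressOutput, setOutputCompressorClass, setOutputCompressionType).
    static Map<String, String> gzipBlockCompression() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("mapred.output.compress", "true");
        props.put("mapred.output.compression.codec",
                  "org.apache.hadoop.io.compress.GzipCodec");
        props.put("mapred.output.compression.type", "BLOCK");
        return props;
    }

    // Renders each property as a -Dkey=value generic option, in insertion order.
    static List<String> toGenericOptions(Map<String, String> props) {
        List<String> options = new ArrayList<>();
        for (Map.Entry<String, String> entry : props.entrySet()) {
            options.add("-D" + entry.getKey() + "=" + entry.getValue());
        }
        return options;
    }

    public static void main(String[] args) {
        // Prints one option per line, ready to paste into a job invocation.
        for (String option : toGenericOptions(gzipBlockCompression())) {
            System.out.println(option);
        }
    }
}
```

Keeping the choice in caller-supplied options leaves the Hadoop default (no compression) untouched for everyone else.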
[jira] Updated: (MAHOUT-474) Should compress output of Job
pairwiseSimilarity and Job asMatrix
Posted by "Han Hui Wen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Han Hui Wen updated MAHOUT-474:
--------------------------------
Affects Version/s: 0.4
Description:
!https://issues.apache.org/jira/secure/attachment/12451985/RowSimilarityJob-CooccurrencesMapper-SimilarityReducer.jpg!
From the picture above, we can see that the output of pairwiseSimilarity is very large; we should compress it.
SequenceFileOutputFormat.setOutputCompressionType(job, style);
SequenceFileOutputFormat.setCompressOutput(job, compress);
SequenceFileOutputFormat.setOutputCompressorClass(job, codecClass);
was:
!https://issues.apache.org/jira/secure/thumbnail/12451985/12451985_RowSimilarityJob-CooccurrencesMapper-SimilarityReducer.jpg!
From the picture above, we can see that the output of pairwiseSimilarity is very large; we should compress it.
SequenceFileOutputFormat.setOutputCompressionType(job, style);
SequenceFileOutputFormat.setCompressOutput(job, compress);
SequenceFileOutputFormat.setOutputCompressorClass(job, codecClass);
Fix Version/s: 0.4
Component/s: Collaborative Filtering
[jira] Commented: (MAHOUT-474) Should compress output of Job
pairwiseSimilarity and Job asMatrix
Posted by "Han Hui Wen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898215#action_12898215 ]
Han Hui Wen commented on MAHOUT-474:
-------------------------------------
I have tested this.
Before applying the patch, the output of pairwiseSimilarity was 18.7 GB.
After applying the patch, the output of pairwiseSimilarity is about 3.952 GB (= 247 MB * 16):
{code}
+ SequenceFileOutputFormat.setCompressOutput(pairwiseSimilarity, true);
+ SequenceFileOutputFormat.setOutputCompressorClass(pairwiseSimilarity, GzipCodec.class);
+ SequenceFileOutputFormat.setOutputCompressionType(pairwiseSimilarity, CompressionType.BLOCK);
{code}
Also, because RowSimilarityJob runs in a separate process, the properties set on the main job are lost.
The properties must be set by RecommenderJob.
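The sizes quoted in this comment can be checked with a little arithmetic. The sketch below assumes the "16" refers to 16 output part files of roughly 247 MB each, that "18,7G" means 18.7 GB, and that decimal units are used (1 GB = 1000 MB). The class and method names are hypothetical.

```java
import java.util.Locale;

// Illustrative arithmetic only; class and method names are hypothetical.
final class CompressionSavings {

    // Total size in MB for a number of equally sized part files.
    static long totalMegabytes(long megabytesPerPart, int partCount) {
        return megabytesPerPart * partCount;
    }

    // Ratio of uncompressed to compressed size, assuming 1 GB = 1000 MB.
    static double compressionRatio(double uncompressedGigabytes, long compressedMegabytes) {
        return uncompressedGigabytes * 1000.0 / compressedMegabytes;
    }

    public static void main(String[] args) {
        long compressedMb = totalMegabytes(247, 16);          // 247 MB * 16 parts
        double ratio = compressionRatio(18.7, compressedMb);  // 18.7 GB before the patch
        System.out.printf(Locale.ROOT, "compressed: %d MB, ratio: %.2fx%n",
                          compressedMb, ratio);
    }
}
```

Under these assumptions the compressed output totals 3952 MB, which is a bit under a quarter of the uncompressed size.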