Posted to common-dev@hadoop.apache.org by "Arun C Murthy (JIRA)" <ji...@apache.org> on 2007/09/20 13:52:32 UTC
[jira] Created: (HADOOP-1926) Design/implement a set of compression
benchmarks for the map-reduce framework
Design/implement a set of compression benchmarks for the map-reduce framework
-----------------------------------------------------------------------------
Key: HADOOP-1926
URL: https://issues.apache.org/jira/browse/HADOOP-1926
Project: Hadoop
Issue Type: Improvement
Components: mapred
Reporter: Arun C Murthy
Assignee: Arun C Murthy
Fix For: 0.15.0
It would be nice to benchmark various compression codecs for use in Hadoop (existing codecs like zlib and lzo, and in future bzip2 etc.) and run these benchmarks along with our nightlies or weeklies.
Here are some steps:
a) Fix HADOOP-1851 (Map output compression codec cannot be set independently of job output compression codec)
b) Implement a random-text-writer along the lines of examples/randomwriter to generate large amounts of synthetic textual data for use in sort. One way to do this is to pick words at random from {{/usr/share/dict/words}} until we have enough bytes per map. To be safe, we could store a snapshot of the words as an array of Strings in examples/RandomTextWriter.java.
c) Take a dump of wikipedia (http://download.wikimedia.org/enwiki/) and/or the ebooks from Project Gutenberg (http://www.gutenberg.org/MIRRORS.ALL) and use them as non-synthetic data to run sort/wordcount against.
For both b) and c) we should set up nightly/weekly benchmark runs with different codecs for reduce-outputs and map-outputs (shuffle) and track each.
Thoughts?
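As a rough illustration of what one tracked measurement could look like, here is a minimal standalone sketch. It does not use Hadoop at all: it assumes zlib via java.util.zip.Deflater at three compression levels as a stand-in for "different codecs", and the class name, word list, and data size are all hypothetical. A real run would instead configure the codecs on the sort job and record job times.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

/** Illustrative only: times zlib at several levels on synthetic text. Not Hadoop code. */
final class CompressionBenchSketch {

    /** Hypothetical word list standing in for a dictionary snapshot. */
    static final String[] WORDS = {
        "map", "reduce", "shuffle", "codec", "sort", "block", "record", "hadoop"
    };

    /** Builds at least targetBytes of space-separated random words (seeded for repeatable runs). */
    static byte[] syntheticText(int targetBytes, long seed) {
        Random rnd = new Random(seed);
        StringBuilder sb = new StringBuilder(targetBytes + 16);
        while (sb.length() < targetBytes) {
            sb.append(WORDS[rnd.nextInt(WORDS.length)]).append(' ');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Compresses input at the given zlib level and returns the compressed size in bytes. */
    static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] buf = new byte[8192];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buf);
            }
            return total;
        } finally {
            deflater.end(); // release native zlib resources
        }
    }

    public static void main(String[] args) {
        byte[] data = syntheticText(1 << 20, 42L);
        int[] levels = {
            Deflater.BEST_SPEED, Deflater.DEFAULT_COMPRESSION, Deflater.BEST_COMPRESSION
        };
        for (int level : levels) {
            long start = System.nanoTime();
            int size = compressedSize(data, level);
            long micros = (System.nanoTime() - start) / 1_000;
            // Ratio and elapsed time are the two numbers worth tracking per codec setting.
            System.out.printf("level=%d ratio=%.3f time=%dus%n",
                    level, (double) size / data.length, micros);
        }
    }
}
```

A single timed pass like this is noisy; anything tracked nightly would want repeated runs and a warm-up pass before the numbers are compared.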
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-1926) Design/implement a set of compression
benchmarks for the map-reduce framework
Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Owen O'Malley updated HADOOP-1926:
----------------------------------
Resolution: Fixed
Status: Resolved (was: Patch Available)
I just committed this. Thanks, Arun!
[jira] Updated: (HADOOP-1926) Design/implement a set of compression
benchmarks for the map-reduce framework
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated HADOOP-1926:
----------------------------------
Status: Patch Available (was: Open)
[jira] Commented: (HADOOP-1926) Design/implement a set of
compression benchmarks for the map-reduce framework
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532072 ]
Hudson commented on HADOOP-1926:
--------------------------------
Integrated in Hadoop-Nightly #259 (See [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/259/])
[jira] Updated: (HADOOP-1926) Design/implement a set of compression
benchmarks for the map-reduce framework
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated HADOOP-1926:
----------------------------------
Status: Open (was: Patch Available)
[jira] Commented: (HADOOP-1926) Design/implement a set of
compression benchmarks for the map-reduce framework
Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529146 ]
Doug Cutting commented on HADOOP-1926:
--------------------------------------
FYI, Lucene uses the wikipedia text for benchmarking. It keeps a copy on people.apache.org. For details, see:
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/benchmark/build.xml
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/benchmark/README.enwiki
[jira] Updated: (HADOOP-1926) Design/implement a set of compression
benchmarks for the map-reduce framework
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated HADOOP-1926:
----------------------------------
Attachment: HADOOP-1926_1_20071002.patch
Here is an implementation of a *randomtextwriter* which can generate random textual data in any output-format (e.g. SequenceFileOutputFormat/TextOutputFormat etc.).
This patch also enhances examples/Sort and test/SortValidator to ensure they can be used with randomtextwriter.
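For readers without the patch at hand, the core idea from step b) can be sketched in plain Java as below. This is not the code in the attached patch: the class name, the tiny word array, and the 5 to 10 words-per-line choice are assumptions made for illustration, and the sketch returns lines in memory rather than writing through an output format.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative only: random words are emitted until a per-map byte budget is reached. */
final class RandomTextSketch {

    /** Hypothetical stand-in for a stored snapshot of dictionary words. */
    static final String[] WORDS = {
        "benchmark", "compression", "codec", "shuffle", "sort",
        "wordcount", "nightly", "synthetic", "textual", "data"
    };

    /** Joins wordCount randomly chosen words with single spaces. */
    static String randomSentence(Random rnd, int wordCount) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < wordCount; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(WORDS[rnd.nextInt(WORDS.length)]);
        }
        return sb.toString();
    }

    /** Generates lines until at least bytesPerMap bytes (counting newlines) are produced. */
    static List<String> generate(long bytesPerMap, long seed) {
        Random rnd = new Random(seed);
        List<String> lines = new ArrayList<>();
        long written = 0;
        while (written < bytesPerMap) {
            String line = randomSentence(rnd, 5 + rnd.nextInt(6)); // 5 to 10 words per line
            lines.add(line);
            written += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> lines = generate(1024, 2007L);
        System.out.println("generated " + lines.size() + " lines, first: " + lines.get(0));
    }
}
```

Seeding the generator keeps runs repeatable, which matters if the same synthetic input is meant to be compared across codecs.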
[jira] Commented: (HADOOP-1926) Design/implement a set of
compression benchmarks for the map-reduce framework
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529619 ]
Arun C Murthy commented on HADOOP-1926:
---------------------------------------
Devaraj had a very good comment to add: we should also benchmark performance of sort with both {{RECORD}} compression and {{BLOCK}} compression.
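To show why the two modes are worth measuring separately, here is a small standalone sketch that compresses the same records one at a time and then as a single block. It uses java.util.zip.Deflater directly and is only a stand-in: it does not reproduce the SequenceFile on-disk format, and the class name and sample records are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

/** Illustrative only: per-record versus whole-block compression of the same records. */
final class RecordVsBlockSketch {

    /** Hypothetical sample data: short, similar-looking text records. */
    static String[] sampleRecords(int n) {
        String[] records = new String[n];
        for (int i = 0; i < n; i++) {
            records[i] = "user-" + (i % 50) + " viewed /docs/compression/codec-" + (i % 7);
        }
        return records;
    }

    /** Returns the zlib-compressed size of input at the default level. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] buf = new byte[4096];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buf);
            }
            return total;
        } finally {
            deflater.end(); // release native zlib resources
        }
    }

    /** Compresses every record on its own and sums the sizes. */
    static int recordCompressedSize(String[] records) {
        int total = 0;
        for (String record : records) {
            total += deflatedSize(record.getBytes(StandardCharsets.UTF_8));
        }
        return total;
    }

    /** Concatenates all records and compresses them in one pass. */
    static int blockCompressedSize(String[] records) {
        return deflatedSize(String.join("\n", records).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String[] records = sampleRecords(1000);
        System.out.println("per-record bytes: " + recordCompressedSize(records));
        System.out.println("block bytes:      " + blockCompressedSize(records));
    }
}
```

Per-record compression pays the stream overhead on every record and cannot exploit repetition across records, so for short, similar records the block total should come out much smaller; the open question for the benchmark is what that costs or saves in sort time.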
[jira] Commented: (HADOOP-1926) Design/implement a set of
compression benchmarks for the map-reduce framework
Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531864 ]
Hadoop QA commented on HADOOP-1926:
-----------------------------------
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12366925/HADOOP-1926_1_20071002.patch
against trunk revision r581101.
@author +1. The patch does not contain any @author tags.
javadoc -1. The javadoc tool appears to have generated messages.
javac +1. The applied patch does not generate any new compiler warnings.
findbugs +1. The patch does not introduce any new Findbugs warnings.
core tests +1. The patch passed core unit tests.
contrib tests +1. The patch passed contrib unit tests.
Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/866/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/866/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/866/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/866/console
This message is automatically generated.
[jira] Updated: (HADOOP-1926) Design/implement a set of compression
benchmarks for the map-reduce framework
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated HADOOP-1926:
----------------------------------
Attachment: HADOOP-1926_2_20071002.patch
Fixed the javadoc oversight.
[jira] Commented: (HADOOP-1926) Design/implement a set of
compression benchmarks for the map-reduce framework
Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531905 ]
Hadoop QA commented on HADOOP-1926:
-----------------------------------
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12366943/HADOOP-1926_2_20071002.patch
against trunk revision r581345.
@author +1. The patch does not contain any @author tags.
javadoc +1. The javadoc tool did not generate any warning messages.
javac +1. The applied patch does not generate any new compiler warnings.
findbugs +1. The patch does not introduce any new Findbugs warnings.
core tests +1. The patch passed core unit tests.
contrib tests -1. The patch failed contrib unit tests.
Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/868/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/868/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/868/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/868/console
This message is automatically generated.
[jira] Updated: (HADOOP-1926) Design/implement a set of compression
benchmarks for the map-reduce framework
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated HADOOP-1926:
----------------------------------
Status: Patch Available (was: Open)