Posted to common-dev@hadoop.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2006/06/29 19:21:32 UTC
[jira] Created: (HADOOP-333) we should have some checks that the
sort benchmark generates correct outputs
we should have some checks that the sort benchmark generates correct outputs
----------------------------------------------------------------------------
Key: HADOOP-333
URL: http://issues.apache.org/jira/browse/HADOOP-333
Project: Hadoop
Type: Improvement
Reporter: Owen O'Malley
We should implement some checks of the input versus output of the sort benchmark to get some correctness guarantees:
1. the number of records
2. the number of bytes
3. the output records are in fact sorted
4. the xor of the md5 of each record's key/value pair
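Checks 1, 2 and 4 could be sketched roughly as below. This is only an illustration under stated assumptions: records are taken to be in-memory (key, value) byte pairs, the name record_stats is hypothetical, and prefixing the key length before hashing is a choice made here so that ("ab", "c") and ("a", "bc") do not hash alike. Because XOR is commutative, the combined digest does not depend on record order, so the unsorted input and the sorted output should produce the same triple.

```python
import hashlib
from typing import Iterable, Tuple

Record = Tuple[bytes, bytes]


def record_stats(records: Iterable[Record]) -> Tuple[int, int, bytes]:
    """Return (record count, total key+value bytes, XOR of per-record MD5s)."""
    count = 0
    total_bytes = 0
    xor_acc = 0
    for key, value in records:
        count += 1
        total_bytes += len(key) + len(value)
        # Assumption: length-prefix the key so the key/value boundary is hashed too.
        payload = len(key).to_bytes(4, "big") + key + value
        xor_acc ^= int.from_bytes(hashlib.md5(payload).digest(), "big")
    return count, total_bytes, xor_acc.to_bytes(16, "big")


if __name__ == "__main__":
    sort_input = [(b"b", b"2"), (b"a", b"1")]
    sort_output = sorted(sort_input)
    print(record_stats(sort_input) == record_stats(sort_output))
```

One caveat of XOR: a record that occurs an even number of times cancels itself out, so this digest alone cannot detect every duplication.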
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
[jira] Commented: (HADOOP-333) we should have some checks that the
sort benchmark generates correct outputs
Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472527 ]
Owen O'Malley commented on HADOOP-333:
--------------------------------------
Another reasonable check would be to join the input and output directories:
input dir: key, value -> (key,value), 1
output dir: key, value -> (key, value), 2
and ensure that you always have an equal number of 1's and 2's for each key,value pair.
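The join described above might look like the following in miniature. It is a sketch only: the records are assumed to fit in memory as (key, value) byte pairs, and the names tag and join_check are hypothetical. Each record is tagged with 1 (input) or 2 (output), grouped by its (key, value) pair, and any pair whose two tag counts differ is reported.

```python
from collections import defaultdict
from itertools import chain


def tag(records, source_tag):
    """Map step: emit ((key, value), tag) for every record."""
    for key, value in records:
        yield (key, value), source_tag


def join_check(input_records, output_records):
    """Reduce step: return {(key, value): (ones, twos)} for unbalanced pairs."""
    counts = defaultdict(lambda: [0, 0])
    for pair, source_tag in chain(tag(input_records, 1), tag(output_records, 2)):
        counts[pair][source_tag - 1] += 1
    return {pair: (c[0], c[1]) for pair, c in counts.items() if c[0] != c[1]}


if __name__ == "__main__":
    sort_input = [(b"a", b"1"), (b"a", b"1"), (b"b", b"2")]
    print(join_check(sort_input, sorted(sort_input)))
```

An empty result means every key/value pair appears equally often on both sides, which also catches duplicated records.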
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-333) we should have some checks that the
sort benchmark generates correct outputs
Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473143 ]
Owen O'Malley commented on HADOOP-333:
--------------------------------------
I forgot to ask for one more check: by looking at the part-[0-9]* filenames, it would be good to have the tool use the default partition function to make sure each key was put into the right partition.
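A rough sketch of such a partition check follows. Several things here are assumptions: the partition index is taken from the digits after "part-", the number of partitions is taken to be the number of part files, and zlib.crc32 is only a stand-in for the key's real hash function. The masking and modulo step mirrors the usual hash-partitioning form (hash & Integer.MAX_VALUE) % numPartitions; the helper names are hypothetical.

```python
import re
import zlib


def partition_of(key: bytes, num_partitions: int, hash_fn=zlib.crc32) -> int:
    """Hash partitioning: clear the sign bit, then take the modulo."""
    return (hash_fn(key) & 0x7FFFFFFF) % num_partitions


def part_index(filename: str) -> int:
    """Extract the partition number from a name such as 'part-00003'."""
    match = re.fullmatch(r"part-(\d+)", filename)
    if match is None:
        raise ValueError("not a part file: %r" % filename)
    return int(match.group(1))


def misplaced_keys(parts, hash_fn=zlib.crc32):
    """parts maps part filename -> keys; return (filename, key) for wrong placements."""
    num_partitions = len(parts)  # assumption: one file per partition, none missing
    wrong = []
    for filename, keys in parts.items():
        expected = part_index(filename)
        for key in keys:
            if partition_of(key, num_partitions, hash_fn) != expected:
                wrong.append((filename, key))
    return wrong
```

A real tool would pass in the same hash the job used for its key type; with a different hash function the check would report false mismatches.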
[jira] Commented: (HADOOP-333) we should have some checks that the
sort benchmark generates correct outputs
Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473659 ]
Hadoop QA commented on HADOOP-333:
----------------------------------
+1, because http://issues.apache.org/jira/secure/attachment/12351331/HADOOP-333_20070216_4.patch applied and successfully tested against trunk revision r508345.
[jira] Updated: (HADOOP-333) we should have some checks that the
sort benchmark generates correct outputs
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated HADOOP-333:
---------------------------------
Attachment: HADOOP-333_20070215_3.patch
Incorporated the check for partitioning. The patch also ensures that RecordStatsChecker is given whole files (sort input/output) as splits, so that it can check that each full output split is sorted correctly.
[jira] Assigned: (HADOOP-333) we should have some checks that the
sort benchmark generates correct outputs
Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-333?page=all ]
Owen O'Malley reassigned HADOOP-333:
------------------------------------
Assign To: Owen O'Malley
[jira] Commented: (HADOOP-333) we should have some checks that the
sort benchmark generates correct outputs
Posted by "Mike Smith (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472871 ]
Mike Smith commented on HADOOP-333:
-----------------------------------
Arun,
Do you have the input generator for the patch?
[jira] Commented: (HADOOP-333) we should have some checks that the
sort benchmark generates correct outputs
Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472874 ]
Owen O'Malley commented on HADOOP-333:
--------------------------------------
Arun is probably asleep, since he is in Bangalore, but the input for this patch comes from the randomwriter and sort example programs.
Look at http://wiki.apache.org/lucene-hadoop/Sort
[jira] Updated: (HADOOP-333) we should have some checks that the
sort benchmark generates correct outputs
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated HADOOP-333:
---------------------------------
Status: Patch Available (was: Open)
[jira] Updated: (HADOOP-333) we should have some checks that the
sort benchmark generates correct outputs
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated HADOOP-333:
---------------------------------
Attachment: HADOOP-333_20070213_1.patch
Here is a set of two utilities to validate the map-reduce framework's 'sort':
a) One checks the records in both the input and the output of the sort benchmark and ensures they are consistent.
b) The other checks that the records in each split are sorted correctly.
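The per-split ordering check in (b) can be illustrated with a small sketch. It assumes each split is available as a list of byte-string keys and that keys compare as unsigned bytes in lexicographic order, which is how Python compares bytes objects; the function names are hypothetical and not taken from the patch.

```python
def first_unsorted_index(keys):
    """Return the index of the first key smaller than its predecessor, else None."""
    previous = None
    for index, key in enumerate(keys):
        if previous is not None and key < previous:
            return index
        previous = key
    return None


def unsorted_splits(splits):
    """splits maps split name -> keys; return {name: first offending index}."""
    failures = {}
    for name, keys in splits.items():
        index = first_unsorted_index(keys)
        if index is not None:
            failures[name] = index
    return failures


if __name__ == "__main__":
    print(unsorted_splits({
        "part-00000": [b"a", b"b", b"b"],
        "part-00001": [b"z", b"y"],
    }))
```

Equal neighbouring keys are accepted, since a sort may legitimately emit duplicates next to each other.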
[jira] Updated: (HADOOP-333) we should have some checks that the
sort benchmark generates correct outputs
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated HADOOP-333:
---------------------------------
Attachment: HADOOP-333_20070214_2.patch
Here is another patch incorporating Owen's feedback:
a) Validates the number of bytes and records in the sort's input and output.
b) Validates the XOR of the MD5s of each key/value pair.
c) Ensures the same key/value pairs are present in both the input and the output.
Run this as:
hadoop jar hadoop-0.11.2-dev-examples.jar sortvalidator -sortInput /randomwriter/input -sortOutput /randomwriter/output
[jira] Updated: (HADOOP-333) we should have some checks that the
sort benchmark generates correct outputs
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated HADOOP-333:
---------------------------------
Attachment: HADOOP-333_20070216_4.patch
Moved SortValidator to src/test and it can now be run via the hadoop-test.jar:
$ hadoop jar hadoop-0.11.2-dev-test.jar testmapredsort -sortInput /randomwriter/input -sortOutput /randomwriter/output
[jira] Updated: (HADOOP-333) we should have some checks that the
sort benchmark generates correct outputs
Posted by "Sameer Paranjpye (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-333?page=all ]
Sameer Paranjpye updated HADOOP-333:
------------------------------------
Component/s: mapred
[jira] Updated: (HADOOP-333) we should have some checks that the
sort benchmark generates correct outputs
Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doug Cutting updated HADOOP-333:
--------------------------------
Resolution: Fixed
Fix Version/s: 0.12.0
Status: Resolved (was: Patch Available)
I just committed this. Thanks, Arun!
[jira] Assigned: (HADOOP-333) we should have some checks that the
sort benchmark generates correct outputs
Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Owen O'Malley reassigned HADOOP-333:
------------------------------------
Assignee: Arun C Murthy (was: Owen O'Malley)
[jira] Commented: (HADOOP-333) we should have some checks that the
sort benchmark generates correct outputs
Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472776 ]
Owen O'Malley commented on HADOOP-333:
--------------------------------------
I'd still like to see checksums, byte counts, and record counts compared between the input and output directories. Otherwise, a problem that changed all of the data to 0's would be counted as fine.
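To make the concern concrete, here is a toy illustration with made-up keys: output whose bytes have all been zeroed still passes an ordering-only check, while an order-independent content digest computed over input and output tells them apart. The helper names are hypothetical.

```python
import hashlib


def is_sorted(keys):
    """True if every key is >= the key before it."""
    return all(a <= b for a, b in zip(keys, keys[1:]))


def xor_md5(keys):
    """Order-independent digest: XOR of the MD5 of every key."""
    acc = 0
    for key in keys:
        acc ^= int.from_bytes(hashlib.md5(key).digest(), "big")
    return acc


sort_input = [b"pear", b"appl", b"figs"]
zeroed_output = [b"\x00" * len(key) for key in sort_input]  # simulated corruption

if __name__ == "__main__":
    print(is_sorted(zeroed_output))                       # ordering check passes
    print(xor_md5(zeroed_output) == xor_md5(sort_input))  # digest comparison fails
```

Note that the record count and byte count are also unchanged in this scenario, so only the content digest exposes the corruption.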