Posted to common-dev@hadoop.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2006/06/29 19:21:32 UTC

[jira] Created: (HADOOP-333) we should have some checks that the sort benchmark generates correct outputs

we should have some checks that the sort benchmark generates correct outputs
----------------------------------------------------------------------------

         Key: HADOOP-333
         URL: http://issues.apache.org/jira/browse/HADOOP-333
     Project: Hadoop
        Type: Improvement

    Reporter: Owen O'Malley


We should implement some checks of the input versus output of the sort benchmark to get some correctness guarantees:

1. the number of records
2. the number of bytes
3. the output records are in fact sorted
4. the xor of the md5 of each record's key/value pair
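
As a rough illustration of the four checks listed above, here is a minimal Python sketch. It is not the validator attached to this issue; the record representation (in-memory tuples of bytes), the helper names summarize and is_sorted, and the length prefix used to separate key from value before hashing are all assumptions made for the example.

```python
import hashlib
from typing import Iterable, List, Tuple

Record = Tuple[bytes, bytes]  # (key, value), both raw bytes


def summarize(records: Iterable[Record]) -> Tuple[int, int, int]:
    """Return (record count, total bytes, xor of per-record md5 digests)."""
    count = 0
    total_bytes = 0
    digest_xor = 0
    for key, value in records:
        count += 1
        total_bytes += len(key) + len(value)
        # Length-prefix the key so (b"ab", b"c") and (b"a", b"bc") hash differently.
        md5 = hashlib.md5(len(key).to_bytes(4, "big") + key + value).digest()
        # xor is order-independent, so sorting must not change the result.
        digest_xor ^= int.from_bytes(md5, "big")
    return count, total_bytes, digest_xor


def is_sorted(records: List[Record]) -> bool:
    """True if the keys are in non-decreasing byte order."""
    return all(a[0] <= b[0] for a, b in zip(records, records[1:]))


sort_input = [(b"pear", b"3"), (b"apple", b"1"), (b"fig", b"2")]
sort_output = sorted(sort_input, key=lambda kv: kv[0])

checks_pass = (
    summarize(sort_input) == summarize(sort_output) and is_sorted(sort_output)
)
```

Note that the xor on its own cancels out any record that appears an even number of times, which is one reason to compare the record and byte counts alongside it.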

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (HADOOP-333) we should have some checks that the sort benchmark generates correct outputs

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472527 ] 

Owen O'Malley commented on HADOOP-333:
--------------------------------------

Another reasonable check would be to join the input and output directories:

input dir:  key, value -> (key, value), 1
output dir: key, value -> (key, value), 2

and ensure that you always have an equal number of 1's and 2's for each (key, value) pair.
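
The join described above could be sketched roughly as follows. This is a single-process Python stand-in for what would be a map-reduce job in practice; the function names tagged and unbalanced_pairs are hypothetical, and records are assumed to fit in memory.

```python
from collections import Counter
from itertools import chain
from typing import Dict, Iterable, Iterator, Tuple

Record = Tuple[bytes, bytes]  # (key, value)


def tagged(records: Iterable[Record], tag: int) -> Iterator[Tuple[Record, int]]:
    """Map step: emit ((key, value), tag) for every record."""
    for key, value in records:
        yield (key, value), tag


def unbalanced_pairs(
    sort_input: Iterable[Record], sort_output: Iterable[Record]
) -> Dict[Record, Tuple[int, int]]:
    """Reduce step: count 1's and 2's per pair and report any mismatch."""
    ones: Counter = Counter()
    twos: Counter = Counter()
    for pair, tag in chain(tagged(sort_input, 1), tagged(sort_output, 2)):
        if tag == 1:
            ones[pair] += 1
        else:
            twos[pair] += 1
    return {
        pair: (ones[pair], twos[pair])
        for pair in ones.keys() | twos.keys()
        if ones[pair] != twos[pair]
    }
```

An empty result means every (key, value) pair occurs equally often on both sides; a non-empty result lists each offending pair with its (input count, output count).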



[jira] Commented: (HADOOP-333) we should have some checks that the sort benchmark generates correct outputs

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473143 ] 

Owen O'Malley commented on HADOOP-333:
--------------------------------------

I forgot to ask for one more check: it would be good to have the tool look at the part-[0-9]* filenames and use the default partition function to make sure each key was put into the right partition.
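
A rough Python sketch of such a partition check follows. The mask-and-modulo step mirrors the usual hash-partitioning form, (hash & Integer.MAX_VALUE) % numPartitions, but the hash itself (zlib.crc32) is only a stand-in for the key's real hashCode(), and the helper names are hypothetical.

```python
import re
import zlib
from typing import Iterable, List

INT_MAX = 0x7FFFFFFF  # Java's Integer.MAX_VALUE


def key_hash(key: bytes) -> int:
    """Stand-in hash (assumption); a real check would use the key's hashCode()."""
    return zlib.crc32(key)


def expected_partition(key: bytes, num_partitions: int) -> int:
    """Partition the key should land in under hash partitioning."""
    return (key_hash(key) & INT_MAX) % num_partitions


def partition_of(filename: str) -> int:
    """Extract the partition number from a part-[0-9]* filename."""
    match = re.fullmatch(r"part-(\d+)", filename)
    if match is None:
        raise ValueError("not a part file: " + filename)
    return int(match.group(1))


def misplaced_keys(
    filename: str, keys: Iterable[bytes], num_partitions: int
) -> List[bytes]:
    """Return the keys in this part file that belong to a different partition."""
    actual = partition_of(filename)
    return [k for k in keys if expected_partition(k, num_partitions) != actual]
```

An empty list from misplaced_keys means every key in that part file hashes to the partition its filename claims.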



[jira] Commented: (HADOOP-333) we should have some checks that the sort benchmark generates correct outputs

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473659 ] 

Hadoop QA commented on HADOOP-333:
----------------------------------

+1, because http://issues.apache.org/jira/secure/attachment/12351331/HADOOP-333_20070216_4.patch applied and successfully tested against trunk revision r508345.



[jira] Updated: (HADOOP-333) we should have some checks that the sort benchmark generates correct outputs

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-333:
---------------------------------

    Attachment: HADOOP-333_20070215_3.patch

Incorporated the check for partitioning, and also ensured that RecordStatsChecker gets full (sort input/output) files so that it can check that full output splits are sorted correctly.



[jira] Assigned: (HADOOP-333) we should have some checks that the sort benchmark generates correct outputs

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-333?page=all ]

Owen O'Malley reassigned HADOOP-333:
------------------------------------

    Assign To: Owen O'Malley



[jira] Commented: (HADOOP-333) we should have some checks that the sort benchmark generates correct outputs

Posted by "Mike Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472871 ] 

Mike Smith commented on HADOOP-333:
-----------------------------------

Arun, 

Do you have the input generator for the patch? 




[jira] Commented: (HADOOP-333) we should have some checks that the sort benchmark generates correct outputs

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472874 ] 

Owen O'Malley commented on HADOOP-333:
--------------------------------------

Arun is probably asleep, since he is in Bangalore, but the input to this patch is the randomwriter and sort example programs.

Look at http://wiki.apache.org/lucene-hadoop/Sort



[jira] Updated: (HADOOP-333) we should have some checks that the sort benchmark generates correct outputs

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-333:
---------------------------------

    Status: Patch Available  (was: Open)



[jira] Updated: (HADOOP-333) we should have some checks that the sort benchmark generates correct outputs

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-333:
---------------------------------

    Attachment: HADOOP-333_20070213_1.patch

Here is a set of two utilities to validate the map-reduce framework's 'sort':
a) Checks the records in both the input and the output of the sort benchmark and ensures they are consistent.
b) Checks that the records in each split are sorted correctly.




[jira] Updated: (HADOOP-333) we should have some checks that the sort benchmark generates correct outputs

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-333:
---------------------------------

    Attachment: HADOOP-333_20070214_2.patch

Here is another patch incorporating Owen's feedback:
a) Validates the number of bytes and records in sort's input and output.
b) Validates the xor of the md5s of each key/value pair.
c) Ensures the same key/value pairs are present in both input and output.

Run this as:
hadoop jar hadoop-0.11.2-dev-examples.jar sortvalidator -sortInput /randomwriter/input -sortOutput /randomwriter/output 




[jira] Updated: (HADOOP-333) we should have some checks that the sort benchmark generates correct outputs

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-333:
---------------------------------

    Attachment: HADOOP-333_20070216_4.patch

Moved SortValidator to src/test and it can now be run via the hadoop-test.jar:

$ hadoop jar hadoop-0.11.2-dev-test.jar testmapredsort -sortInput /randomwriter/input -sortOutput /randomwriter/output




[jira] Updated: (HADOOP-333) we should have some checks that the sort benchmark generates correct outputs

Posted by "Sameer Paranjpye (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-333?page=all ]

Sameer Paranjpye updated HADOOP-333:
------------------------------------

    Component/s: mapred


[jira] Updated: (HADOOP-333) we should have some checks that the sort benchmark generates correct outputs

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-333:
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.12.0
           Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Arun!



[jira] Assigned: (HADOOP-333) we should have some checks that the sort benchmark generates correct outputs

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley reassigned HADOOP-333:
------------------------------------

    Assignee: Arun C Murthy  (was: Owen O'Malley)



[jira] Commented: (HADOOP-333) we should have some checks that the sort benchmark generates correct outputs

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472776 ] 

Owen O'Malley commented on HADOOP-333:
--------------------------------------

I'd still like to see checksums, byte counts, and record counts for the input and output directories compared. Otherwise a problem that changed all of the data to 0's would be counted as fine.
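
To make the concern concrete, here is a small Python sketch (hypothetical helper names, in-memory records) in which an output whose data has all been turned into zero bytes still passes a sortedness-only check, while comparing a checksum against the input catches it.

```python
import hashlib
from typing import List, Tuple

Record = Tuple[bytes, bytes]  # (key, value)


def keys_sorted(records: List[Record]) -> bool:
    """Sortedness-only check: keys in non-decreasing byte order."""
    return all(a[0] <= b[0] for a, b in zip(records, records[1:]))


def digest_xor(records: List[Record]) -> int:
    """xor of the md5 of each record's key/value bytes."""
    acc = 0
    for key, value in records:
        acc ^= int.from_bytes(hashlib.md5(key + value).digest(), "big")
    return acc


sort_input = [(b"k2", b"v2"), (b"k1", b"v1")]
# Same record count and byte count as the input, but every byte is zero.
zeroed_output = [(b"\x00\x00", b"\x00\x00"), (b"\x00\x00", b"\x00\x00")]

sorted_only_verdict = keys_sorted(zeroed_output)
checksums_match = digest_xor(sort_input) == digest_xor(zeroed_output)
```

Here sorted_only_verdict is True even though the data is gone, and checksums_match is False. In this particular case the record and byte counts also happen to agree, so only the checksum comparison reveals the corruption.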

