You are viewing a plain text version of this content. The canonical link for it is here.

Posted to common-dev@hadoop.apache.org by "ZhuGuanyin (JIRA)" <ji...@apache.org> on 2009/05/06 13:11:30 UTC

[jira] Created: (HADOOP-5779) KeyFieldBasedPartitioner should encode free and handle ArrayOutOfIndex exception!

KeyFieldBasedPartitioner should encode free and handle ArrayOutOfIndex exception!
---------------------------------------------------------------------------------

                 Key: HADOOP-5779
                 URL: https://issues.apache.org/jira/browse/HADOOP-5779
             Project: Hadoop Core
          Issue Type: Bug
          Components: mapred
            Reporter: ZhuGuanyin
             Fix For: 0.20.1


1) Currently,  KeyFieldBasedPartitioner only support utf8 encoded recored,  we should use text or byteswriteable data types.
2) when using KeyFieldBasedPartitioner, if the record doesn't contain the specified field, the endChar would equal with array.length, which throw ArrayOutOfIndex exception, losting that record!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5779) KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-5779:
-------------------------------

    Attachment: HADOOP-5779-partial.patch

Hey. I tried testing your patch against my testcase and it failed. The code changes to do with the exception seems insufficient. Attaching the code changes that fixes the issue. Can you plz change the patch accordingly? 

Note that I havent coded for utf-8 issue. 

> KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5779
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5779
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: ZhuGuanyin
>             Fix For: 0.21.0
>
>         Attachments: encode-free-KeyFieldBasedPartitioner-v1.patch, encode-free-KeyFieldBasedPartitioner.patch, HADOOP-5779-partial.patch
>
>
> 1) Currently,  KeyFieldBasedPartitioner only support utf8 encoded recored,  we should use text or byteswriteable data types.
> 2) when using KeyFieldBasedPartitioner, if the record doesn't contain the specified field, the endChar would equal with array.length, which throw ArrayOutOfIndex exception, losting that record!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5779) KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8

Posted by "Jothi Padmanabhan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710191#action_12710191 ] 

Jothi Padmanabhan commented on HADOOP-5779:
-------------------------------------------

Some minor comments:

* I think it would be better to have the instance check for the keys in a single if block
{code}
if (key instanceof BytesWritable) {
// Handle BytesWritable
}
else if (key instanceof Text) {
// Handle Text
}
else {
// error
}
{code}

* A test case to test for Text and BytesWritable keys would be good to have for this patch. It could either be a new test case or could modify TestStreamDataProtocol. Also, if the test case can demonstrate the fix for ArrayOutOfBoundsException -- it should fail without this patch and run with this patch, it would be really nice.

> KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5779
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5779
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: ZhuGuanyin
>             Fix For: 0.21.0
>
>         Attachments: encode-free-KeyFieldBasedPartitioner-v1.patch, encode-free-KeyFieldBasedPartitioner.patch
>
>
> 1) Currently,  KeyFieldBasedPartitioner only support utf8 encoded recored,  we should use text or byteswriteable data types.
> 2) when using KeyFieldBasedPartitioner, if the record doesn't contain the specified field, the endChar would equal with array.length, which throw ArrayOutOfIndex exception, losting that record!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5779) KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8

Posted by "ZhuGuanyin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ZhuGuanyin updated HADOOP-5779:
-------------------------------

    Attachment: encode-free-KeyFieldBasedPartitioner-v1.patch

create patsh using svn diff instead of diff

> KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5779
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5779
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: ZhuGuanyin
>             Fix For: 0.21.0
>
>         Attachments: encode-free-KeyFieldBasedPartitioner-v1.patch, encode-free-KeyFieldBasedPartitioner.patch
>
>
> 1) Currently,  KeyFieldBasedPartitioner only support utf8 encoded recored,  we should use text or byteswriteable data types.
> 2) when using KeyFieldBasedPartitioner, if the record doesn't contain the specified field, the endChar would equal with array.length, which throw ArrayOutOfIndex exception, losting that record!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5779) KeyFieldBasedPartitioner should encode free and handle ArrayOutOfIndex exception!

Posted by "ZhuGuanyin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ZhuGuanyin updated HADOOP-5779:
-------------------------------

    Attachment: encode-free-KeyFieldBasedPartitioner.patch

here comes the patch :)

> KeyFieldBasedPartitioner should encode free and handle ArrayOutOfIndex exception!
> ---------------------------------------------------------------------------------
>
>                 Key: HADOOP-5779
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5779
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: ZhuGuanyin
>             Fix For: 0.20.1
>
>         Attachments: encode-free-KeyFieldBasedPartitioner.patch
>
>
> 1) Currently,  KeyFieldBasedPartitioner only support utf8 encoded recored,  we should use text or byteswriteable data types.
> 2) when using KeyFieldBasedPartitioner, if the record doesn't contain the specified field, the endChar would equal with array.length, which throw ArrayOutOfIndex exception, losting that record!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5779) KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8

Posted by "ZhuGuanyin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719503#action_12719503 ] 

ZhuGuanyin commented on HADOOP-5779:
------------------------------------

Thanks very much, I'm busy the last month and not followed this issue, I'll attach an example dataset to let key.toString().getBytes("UTF-8") throws exception soon, 

Thanks again!

> KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5779
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5779
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: ZhuGuanyin
>             Fix For: 0.21.0
>
>         Attachments: encode-free-KeyFieldBasedPartitioner-v1.patch, encode-free-KeyFieldBasedPartitioner.patch, HADOOP-5779-partial.patch, HADOOP-5779-v1.0.patch.patch
>
>
> 1) Currently,  KeyFieldBasedPartitioner only support utf8 encoded recored,  we should use text or byteswriteable data types.
> 2) when using KeyFieldBasedPartitioner, if the record doesn't contain the specified field, the endChar would equal with array.length, which throw ArrayOutOfIndex exception, losting that record!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5779) KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8

Posted by "Jothi Padmanabhan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718791#action_12718791 ] 

Jothi Padmanabhan commented on HADOOP-5779:
-------------------------------------------

Some minor comments:

# I think endChar cannot be negative, so the check endChar < 0 can be removed. Could you check?
# Instead of doing i <= end && i < b.length in the hashCode(), I think we should ideally fix the getEndOffset to return min (end, b.length -1).  But I would not -1 for that, I am OK with the existing simple change in the patch as well
# In the test case, adding an assert to verify the returned partition is 0 would be good.

> KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5779
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5779
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: ZhuGuanyin
>             Fix For: 0.21.0
>
>         Attachments: encode-free-KeyFieldBasedPartitioner-v1.patch, encode-free-KeyFieldBasedPartitioner.patch, HADOOP-5779-partial.patch, HADOOP-5779-v1.0.patch.patch
>
>
> 1) Currently,  KeyFieldBasedPartitioner only support utf8 encoded recored,  we should use text or byteswriteable data types.
> 2) when using KeyFieldBasedPartitioner, if the record doesn't contain the specified field, the endChar would equal with array.length, which throw ArrayOutOfIndex exception, losting that record!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (HADOOP-5779) KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8

Posted by "ZhuGuanyin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708734#action_12708734 ] 

ZhuGuanyin edited comment on HADOOP-5779 at 5/17/09 6:51 PM:
-------------------------------------------------------------

create patch using svn diff instead of diff

      was (Author: buptzhugy):
    create patsh using svn diff instead of diff
  
> KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5779
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5779
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: ZhuGuanyin
>             Fix For: 0.21.0
>
>         Attachments: encode-free-KeyFieldBasedPartitioner-v1.patch, encode-free-KeyFieldBasedPartitioner.patch
>
>
> 1) Currently,  KeyFieldBasedPartitioner only support utf8 encoded recored,  we should use text or byteswriteable data types.
> 2) when using KeyFieldBasedPartitioner, if the record doesn't contain the specified field, the endChar would equal with array.length, which throw ArrayOutOfIndex exception, losting that record!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5779) KeyFieldBasedPartitioner should encode free and handle ArrayOutOfIndex exception!

Posted by "ZhuGuanyin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ZhuGuanyin updated HADOOP-5779:
-------------------------------

    Release Note: Using BytesWritable or Text data types for KeyFieldBasedPartitioner, which would support not only utf8, and avoid the ArrayOutOfIndex exception.
          Status: Patch Available  (was: Open)

Using BytesWritable or Text data types for KeyFieldBasedPartitioner, which would support not only utf8, and avoid the ArrayOutOfIndex exception.

> KeyFieldBasedPartitioner should encode free and handle ArrayOutOfIndex exception!
> ---------------------------------------------------------------------------------
>
>                 Key: HADOOP-5779
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5779
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: ZhuGuanyin
>             Fix For: 0.20.1
>
>
> 1) Currently,  KeyFieldBasedPartitioner only support utf8 encoded recored,  we should use text or byteswriteable data types.
> 2) when using KeyFieldBasedPartitioner, if the record doesn't contain the specified field, the endChar would equal with array.length, which throw ArrayOutOfIndex exception, losting that record!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5779) KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-5779:
-------------------------------

    Attachment: HADOOP-5779-v1.0.patch.patch

Attaching a merged patch. Testing in progress.

> KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5779
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5779
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: ZhuGuanyin
>             Fix For: 0.21.0
>
>         Attachments: encode-free-KeyFieldBasedPartitioner-v1.patch, encode-free-KeyFieldBasedPartitioner.patch, HADOOP-5779-partial.patch, HADOOP-5779-v1.0.patch.patch
>
>
> 1) Currently,  KeyFieldBasedPartitioner only support utf8 encoded recored,  we should use text or byteswriteable data types.
> 2) when using KeyFieldBasedPartitioner, if the record doesn't contain the specified field, the endChar would equal with array.length, which throw ArrayOutOfIndex exception, losting that record!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5779) KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719486#action_12719486 ] 

Amar Kamat commented on HADOOP-5779:
------------------------------------

ZhuGuanyin, I am in the process of merging the patches and add a testcase. I hope its ok with you. Can you please let me know how to simulate the situation where key.toString().getBytes("UTF-8") throws exception. 

> KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5779
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5779
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: ZhuGuanyin
>             Fix For: 0.21.0
>
>         Attachments: encode-free-KeyFieldBasedPartitioner-v1.patch, encode-free-KeyFieldBasedPartitioner.patch, HADOOP-5779-partial.patch, HADOOP-5779-v1.0.patch.patch
>
>
> 1) Currently,  KeyFieldBasedPartitioner only support utf8 encoded recored,  we should use text or byteswriteable data types.
> 2) when using KeyFieldBasedPartitioner, if the record doesn't contain the specified field, the endChar would equal with array.length, which throw ArrayOutOfIndex exception, losting that record!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5779) KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8

Posted by "ZhuGuanyin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ZhuGuanyin updated HADOOP-5779:
-------------------------------

        Fix Version/s:     (was: 0.20.1)
                       0.21.0
          Description: 
1) Currently,  KeyFieldBasedPartitioner only support utf8 encoded recored,  we should use text or byteswriteable data types.

2) when using KeyFieldBasedPartitioner, if the record doesn't contain the specified field, the endChar would equal with array.length, which throw ArrayOutOfIndex exception, losting that record!

  was:
1) Currently,  KeyFieldBasedPartitioner only support utf8 encoded recored,  we should use text or byteswriteable data types.
2) when using KeyFieldBasedPartitioner, if the record doesn't contain the specified field, the endChar would equal with array.length, which throw ArrayOutOfIndex exception, losting that record!

    Affects Version/s: 0.20.0
              Summary: KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8  (was: KeyFieldBasedPartitioner should encode free and handle ArrayOutOfIndex exception!)

> KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5779
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5779
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: ZhuGuanyin
>             Fix For: 0.21.0
>
>         Attachments: encode-free-KeyFieldBasedPartitioner.patch
>
>
> 1) Currently,  KeyFieldBasedPartitioner only support utf8 encoded recored,  we should use text or byteswriteable data types.
> 2) when using KeyFieldBasedPartitioner, if the record doesn't contain the specified field, the endChar would equal with array.length, which throw ArrayOutOfIndex exception, losting that record!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5779) KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719989#action_12719989 ] 

Amar Kamat commented on HADOOP-5779:
------------------------------------

Opened HADOOP-6052 to address the ArrayOutOfIndex exception. ZhuGuanyin, could you plz provide a new patch having framework fix and testcase to solve utf-8 encoding issue?

> KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5779
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5779
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: ZhuGuanyin
>             Fix For: 0.21.0
>
>         Attachments: encode-free-KeyFieldBasedPartitioner-v1.patch, encode-free-KeyFieldBasedPartitioner.patch, HADOOP-5779-partial.patch, HADOOP-5779-v1.0.patch.patch
>
>
> 1) Currently,  KeyFieldBasedPartitioner only support utf8 encoded recored,  we should use text or byteswriteable data types.
> 2) when using KeyFieldBasedPartitioner, if the record doesn't contain the specified field, the endChar would equal with array.length, which throw ArrayOutOfIndex exception, losting that record!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5779) KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707525#action_12707525 ] 

Hadoop QA commented on HADOOP-5779:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12407333/encode-free-KeyFieldBasedPartitioner.patch
  against trunk revision 772960.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    -1 patch.  The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/307/console

This message is automatically generated.

> KeyFieldBasedPartitioner would lost data if specifed field not exist, and it should encode free not only support utf8
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5779
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5779
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: ZhuGuanyin
>             Fix For: 0.21.0
>
>         Attachments: encode-free-KeyFieldBasedPartitioner.patch
>
>
> 1) Currently,  KeyFieldBasedPartitioner only support utf8 encoded recored,  we should use text or byteswriteable data types.
> 2) when using KeyFieldBasedPartitioner, if the record doesn't contain the specified field, the endChar would equal with array.length, which throw ArrayOutOfIndex exception, losting that record!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.