Posted to dev@pig.apache.org by "Pradeep Kamath (JIRA)" <ji...@apache.org> on 2008/12/11 20:18:44 UTC

[jira] Created: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()
-------------------------------------------------------------------------------------------------------------------------------

                 Key: PIG-560
                 URL: https://issues.apache.org/jira/browse/PIG-560
             Project: Pig
          Issue Type: Bug
    Affects Versions: types_branch
            Reporter: Pradeep Kamath
             Fix For: types_branch


BinStorage() uses the Java DataOutput.writeUTF() and DataInput.readUTF() APIs to write Strings out as UTF-8 bytes and to read them back. From the Javadoc: "First, the total number of bytes needed to represent all the characters of s is calculated. If this number is larger than 65535, then a UTFDataFormatException is thrown." (This is because writeUTF() uses 2 bytes to represent the number of bytes.) A way to get around this would be to not use writeUTF()/readUTF() and instead convert the string by hand to the corresponding UTF-8 byte[] (using String.getBytes("UTF-8")) and then write the length of the byte array as an int. Since a Java int is signed, this allows a size of up to 2^31 - 1 bytes.
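The limit and the proposed workaround can be sketched as follows. This is only an illustration of the idea in the description, not the code from any attached patch; the class LongStringIO and its method names are hypothetical, and StandardCharsets.UTF_8 is used in place of the "UTF-8" charset name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.UTFDataFormatException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of Pig.
class LongStringIO {

    // Returns true if DataOutput.writeUTF() rejects s as too long.
    static boolean writeUtfFails(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes a 4-byte int length followed by the UTF-8 bytes of s.
    static byte[] encode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(utf8.length);
            out.write(utf8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Reads the int length, then exactly that many bytes, and decodes them.
    static String decode(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            byte[] utf8 = new byte[in.readInt()];
            in.readFully(utf8);
            return new String(utf8, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        char[] chars = new char[70000];
        Arrays.fill(chars, 'a');
        String big = new String(chars);
        System.out.println("writeUTF fails: " + writeUtfFails(big));
        System.out.println("round trip ok: " + big.equals(decode(encode(big))));
    }
}
```

The cost of this approach is two extra length bytes on every string compared with writeUTF(), whether or not the string is long.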

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669060#action_12669060 ] 

Olga Natkovich commented on PIG-560:
------------------------------------

I think the unit test was added for BinaryStorage not BinStorage.



[jira] Updated: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

Posted by "Santhosh Srinivasan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Santhosh Srinivasan updated PIG-560:
------------------------------------

    Attachment: PIG-560_1.patch

Incorporated comments from Laukik. Submitting a new patch and modified test case. Running the tests now.



[jira] Updated: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

Posted by "Santhosh Srinivasan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Santhosh Srinivasan updated PIG-560:
------------------------------------

    Attachment: PIG-560.patch

Final patch!



[jira] Commented: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

Posted by "Laukik Chitnis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669097#action_12669097 ] 

Laukik Chitnis commented on PIG-560:
------------------------------------

In the current patch, when the length is <65536, the string to UTF8 conversion is happening twice -- once with String::getBytes() and once with DataOutput::writeUTF()

To avoid that, instead of writeUTF(), how about using writeShort() followed by writeBytes() since we would already have the length and the UTF8 bytes? 
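A minimal sketch of this suggestion, under a few assumptions: the class ShortStringIO and its methods are hypothetical, and the byte array is written with DataOutput.write(byte[]), because DataOutput.writeBytes() takes a String rather than a byte[]. The reader uses readUnsignedShort() and readFully() instead of readUTF(); readUTF() expects Java's modified UTF-8, which differs from the output of getBytes() for U+0000 and for supplementary characters.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of Pig.
class ShortStringIO {
    static final int MAX_SHORT_LEN = 65535; // largest length that fits in 2 bytes

    // Encodes s once, then writes a 2-byte length followed by the bytes.
    static byte[] encode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8); // the only conversion
        if (utf8.length > MAX_SHORT_LEN) {
            throw new IllegalArgumentException("too long for a 2-byte length: " + utf8.length);
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeShort(utf8.length); // writes the low 16 bits
            out.write(utf8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Reads the length as an unsigned 16-bit value, then the bytes.
    static String decode(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            byte[] utf8 = new byte[in.readUnsignedShort()];
            in.readFully(utf8);
            return new String(utf8, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, ShortStringIO.decode(ShortStringIO.encode("pig")) returns the original string, and the encoded form is 2 length bytes plus 3 data bytes.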




[jira] Resolved: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

Posted by "Santhosh Srinivasan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Santhosh Srinivasan resolved PIG-560.
-------------------------------------

      Resolution: Fixed
        Assignee: Santhosh Srinivasan
    Hadoop Flags: [Reviewed]

Patch has been committed. Thanks Laukik for the initial patch and the review comments.



[jira] Commented: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

Posted by "Laukik Chitnis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669099#action_12669099 ] 

Laukik Chitnis commented on PIG-560:
------------------------------------

In the current patch, when the length is <65536, the string to UTF8 
conversion is happening twice -- once with String::getBytes() and once 
with DataOutput::writeUTF()
Instead of writeUTF(), how about using writeShort() followed by 
writeBytes() since we would already have the length and the UTF8 bytes?






[jira] Updated: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

Posted by "Santhosh Srinivasan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Santhosh Srinivasan updated PIG-560:
------------------------------------

    Attachment: PIG-560_1.patch

Uploading the right patch.



[jira] Updated: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

Posted by "Santhosh Srinivasan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Santhosh Srinivasan updated PIG-560:
------------------------------------

    Attachment:     (was: PIG-560.patch)



[jira] Updated: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

Posted by "Laukik Chitnis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laukik Chitnis updated PIG-560:
-------------------------------

    Comment: was deleted



[jira] Updated: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

Posted by "Laukik Chitnis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laukik Chitnis updated PIG-560:
-------------------------------

    Attachment: utf-limit-patch.diff

The patch uses the String object's getBytes(charsetName) method to convert the string to UTF-8 bytes instead of using the writeUTF() function. An int can now be used to store the length instead of the 2 bytes used by writeUTF(). The patch also includes the corresponding change for reading in a CHARARRAY.



[jira] Commented: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669098#action_12669098 ] 

Olga Natkovich commented on PIG-560:
------------------------------------

+1



[jira] Updated: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

Posted by "Santhosh Srinivasan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Santhosh Srinivasan updated PIG-560:
------------------------------------

    Attachment: PIG-560.patch

New patch (PIG-560.patch) adds the test case in TestEvalPipeline.java. Fixes a bug in the previous patch.



[jira] Commented: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668682#action_12668682 ] 

Alan Gates commented on PIG-560:
--------------------------------

I'm concerned here that we're adding 2 bytes to every string we store for a case which should be quite rare (how often do people have strings longer than 64K?). Would it be better to have bin storage define a long string type that uses 4 bytes to encode its length, then test a string's length before writing it out, leaving things as they are now for most strings and using the new long string type for anything over 64K?
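One way to picture this proposal, as a rough sketch only: it assumes each stored value is preceded by a one-byte type marker, and the class TaggedStringIO and the tag values SHORT_STRING and LONG_STRING are hypothetical, not Pig's actual type codes. The writer picks the marker from the encoded length, so only long strings pay for the 4-byte length.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of Pig.
class TaggedStringIO {
    static final byte SHORT_STRING = 1; // hypothetical tag: 2-byte length follows
    static final byte LONG_STRING = 2;  // hypothetical tag: 4-byte length follows
    static final int MAX_SHORT_LEN = 65535;

    // Chooses the type tag from the UTF-8 length, then writes length and bytes.
    static byte[] encode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            if (utf8.length <= MAX_SHORT_LEN) {
                out.writeByte(SHORT_STRING);
                out.writeShort(utf8.length);
            } else {
                out.writeByte(LONG_STRING);
                out.writeInt(utf8.length);
            }
            out.write(utf8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Reads the tag first and uses it to decide how wide the length field is.
    static String decode(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            byte tag = in.readByte();
            int len;
            switch (tag) {
                case SHORT_STRING: len = in.readUnsignedShort(); break;
                case LONG_STRING:  len = in.readInt(); break;
                default: throw new IOException("unknown type tag: " + tag);
            }
            byte[] utf8 = new byte[len];
            in.readFully(utf8);
            return new String(utf8, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout a 3-byte string costs 1 + 2 + 3 bytes, the same as before, while a 70000-byte string costs 1 + 4 + 70000 bytes.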



[jira] Updated: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

Posted by "Santhosh Srinivasan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Santhosh Srinivasan updated PIG-560:
------------------------------------

    Attachment:     (was: PIG-560.patch)



[jira] Updated: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

Posted by "Santhosh Srinivasan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Santhosh Srinivasan updated PIG-560:
------------------------------------

    Attachment: PIG-560.patch

The attached patch (PIG-560.patch) addresses the issue of storing strings larger than 65535 bytes using BinStorage. A new BIGCHARARRAY type has been added to Pig. This type is used internally for storing and loading Strings that are bigger than 64K bytes. A new unit test case that tests this code path has been added, and an existing test case has been modified.



[jira] Commented: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668710#action_12668710 ] 

Olga Natkovich commented on PIG-560:
------------------------------------

We know that since we put these changes into production, only one other person has complained, so we are pretty certain it is a very rare case. I agree with Alan that we should only pay the penalty on long strings.



[jira] Updated: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

Posted by "Santhosh Srinivasan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Santhosh Srinivasan updated PIG-560:
------------------------------------

    Attachment:     (was: PIG-560.patch)



[jira] Updated: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

Posted by "Santhosh Srinivasan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Santhosh Srinivasan updated PIG-560:
------------------------------------

    Attachment:     (was: PIG-560_1.patch)



[jira] Commented: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

Posted by "Laukik Chitnis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668688#action_12668688 ] 

Laukik Chitnis commented on PIG-560:
------------------------------------

The writeUTF() method was adding 2 bytes per string; we would actually be adding an int (32 bits) with this solution.

The new long string would then need to be a new DataType, right? To make it transparent to the user, this DataType could be used only internally. Also, to keep things efficient, maybe we can write the string as this datatype only when we hit the encoded-string-too-long UTFDataFormatException.

By the way, although it seems quite likely that the average string is far shorter than 64k, do we have any statistics on the average length of (UTF-converted) CHARARRAYs? That would also help us determine how big an overhead the additional 16 bits actually is.
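
For illustration, the "pay the penalty only on long strings" idea could be sketched roughly as below. This is not the attached patch. The class name LongStringCodec and the marker bytes SHORT_STRING and LONG_STRING are hypothetical and are not Pig's actual DataType constants. The sketch also assumes the short path writes a 2-byte length followed by standard UTF-8 bytes instead of calling writeUTF(), because writeUTF() uses modified UTF-8, whose byte count can differ from that of String.getBytes(). Note that a Java int is signed, so the 4-byte length caps the payload at 2^31 - 1 bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Pig: a sketch of a two-format string encoding.
final class LongStringCodec {
    // Hypothetical type markers (assumptions, not Pig's real DataType values).
    static final byte SHORT_STRING = 1;
    static final byte LONG_STRING = 2;
    // Largest payload that fits in an unsigned 16-bit length prefix.
    static final int MAX_SHORT_BYTES = 65535;

    // Writes marker + length prefix + UTF-8 bytes; only long strings pay for a 4-byte length.
    static void writeString(DataOutput out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length <= MAX_SHORT_BYTES) {
            out.writeByte(SHORT_STRING);
            out.writeShort(utf8.length); // low 16 bits; read back as unsigned
        } else {
            out.writeByte(LONG_STRING);
            out.writeInt(utf8.length);   // signed int, so at most 2^31 - 1 bytes
        }
        out.write(utf8);
    }

    // Reads back a string written by writeString().
    static String readString(DataInput in) throws IOException {
        byte marker = in.readByte();
        int length;
        if (marker == SHORT_STRING) {
            length = in.readUnsignedShort();
        } else if (marker == LONG_STRING) {
            length = in.readInt();
        } else {
            throw new IOException("Unknown string marker: " + marker);
        }
        byte[] utf8 = new byte[length];
        in.readFully(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Convenience wrappers for round-tripping through a byte array.
    static byte[] encode(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeString(new DataOutputStream(buffer), s);
        return buffer.toByteArray();
    }

    static String decode(byte[] data) throws IOException {
        return readString(new DataInputStream(new ByteArrayInputStream(data)));
    }
}
```

With this layout, a string whose UTF-8 form fits in 65535 bytes costs one marker byte plus a 2-byte length, and only longer strings carry the 4-byte length. If the marker byte is folded into a type byte that is already written per field, the common case adds no overhead over the current 2-byte prefix.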



[jira] Updated: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

Posted by "Santhosh Srinivasan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Santhosh Srinivasan updated PIG-560:
------------------------------------

    Attachment: PIG-560.patch

Updating the patch to delete a commented-out line.
