Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2009/01/30 01:41:59 UTC
[jira] Commented: (PIG-560) UTFDataFormatException (encoded string
too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using
BinStorage()
[ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668682#action_12668682 ]
Alan Gates commented on PIG-560:
--------------------------------
I'm concerned here that we're adding 2 bytes to every string we store for a case which should be quite rare (how often do people have strings longer than 64K?). Would it be better to have bin storage define a long string type that uses 4 bytes to encode its length, then test a string's length before writing it out, leaving things as they are now for most strings and using the new long string type only for anything over 64K?
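A minimal sketch of the long string idea described above, not the actual patch. It assumes each stored value is preceded by a one-byte type marker; the marker values and the names LongStringSketch, SHORT_STRING, LONG_STRING, writeString and readString are hypothetical and are not taken from Pig. The length test uses the modified UTF-8 length that writeUTF() itself computes, since that can differ from the standard UTF-8 length.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class LongStringSketch {
    // Hypothetical type markers for illustration; not Pig's actual constants.
    static final byte SHORT_STRING = 1;
    static final byte LONG_STRING = 2;

    // Largest encoded size writeUTF() accepts (its length prefix is 2 bytes).
    static final int MAX_WRITE_UTF_BYTES = 65535;

    // Number of bytes s occupies in the modified UTF-8 encoding used by writeUTF().
    static long modifiedUtf8Length(String s) {
        long len = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                len += 1;
            } else if (c > 0x07FF) {
                len += 3;
            } else {
                len += 2; // includes '\u0000', which writeUTF() encodes in 2 bytes
            }
        }
        return len;
    }

    static void writeString(DataOutput out, String s) throws IOException {
        if (modifiedUtf8Length(s) <= MAX_WRITE_UTF_BYTES) {
            // Common case: keep the existing writeUTF() format.
            out.writeByte(SHORT_STRING);
            out.writeUTF(s);
        } else {
            // Rare case: 4-byte length prefix followed by standard UTF-8 bytes.
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            out.writeByte(LONG_STRING);
            out.writeInt(utf8.length);
            out.write(utf8);
        }
    }

    static String readString(DataInput in) throws IOException {
        byte marker = in.readByte();
        if (marker == SHORT_STRING) {
            return in.readUTF();
        }
        if (marker == LONG_STRING) {
            byte[] utf8 = new byte[in.readInt()];
            in.readFully(utf8);
            return new String(utf8, StandardCharsets.UTF_8);
        }
        throw new IOException("Unknown string marker: " + marker);
    }
}
```

With this layout, strings that fit in writeUTF() pay no extra length bytes, and only the over-64K strings carry the 4-byte length.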
> UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: PIG-560
> URL: https://issues.apache.org/jira/browse/PIG-560
> Project: Pig
> Issue Type: Bug
> Affects Versions: types_branch
> Reporter: Pradeep Kamath
> Fix For: types_branch
>
> Attachments: utf-limit-patch.diff
>
>
> BinStorage() uses the Java DataOutput.writeUTF() and DataInput.readUTF() APIs to write Strings out as UTF-8 bytes and to read them back. From the Javadoc: "First, the total number of bytes needed to represent all the characters of s is calculated. If this number is larger than 65535, then a UTFDataFormatException is thrown." (This is because writeUTF() uses 2 bytes to represent the number of bytes.) A way around this would be to not use writeUTF()/readUTF() and instead convert the string by hand to the corresponding UTF-8 byte[] (using String.getBytes("UTF-8")) and then write the length of the byte array as an int. Because a Java int is signed, this allows a size of up to 2^31 - 1 bytes.
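The workaround in the issue description could look roughly like the following. This is an illustrative sketch only, not the contents of utf-limit-patch.diff; the class and method names (IntLengthString, write, read) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class IntLengthString {
    // Write the string as a 4-byte length followed by its UTF-8 bytes,
    // instead of relying on writeUTF() and its 2-byte length prefix.
    static void write(DataOutput out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length);
        out.write(utf8);
    }

    // Read back a string written by write().
    static String read(DataInput in) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("Negative string length: " + length);
        }
        byte[] utf8 = new byte[length];
        in.readFully(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

Compared with writeUTF(), every string written this way costs 2 more bytes for the length prefix.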
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.