You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-issues@hadoop.apache.org by "Chris Douglas (JIRA)" <ji...@apache.org> on 2009/11/18 02:05:39 UTC

[jira] Created: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

WritableUtils::*VLong utilities should be available for byte arrays
-------------------------------------------------------------------

                 Key: HADOOP-6381
                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
             Project: Hadoop Common
          Issue Type: Improvement
          Components: util
            Reporter: Chris Douglas
            Priority: Minor
         Attachments: C6381-0.patch

Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780235#action_12780235 ] 

Todd Lipcon commented on HADOOP-6381:
-------------------------------------

bq. How effective would JIT help eliminate the overhead of objects created and forgotten in a tight loop? This should be the kind of things that compilers are good at.

I agree they should be good at it, but in my experience they aren't quite that aggressive yet.


As Chris pointed out, you can actually get around the issue with a thread-local DataInputBuffer - I had forgotten about the .reset() method thereof.

I guess that leaves me -0 on this. The APIs might come in handy for some, but given you have to actually decode the first byte twice in order to figure out the length of the variable encoding, it's actually probably an efficiency loss in the case of raw comparators. So, I could swing either way; up to you guys.

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch, C6381-1.patch, C6381-2.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780413#action_12780413 ] 

Hong Tang commented on HADOOP-6381:
-----------------------------------

bq. On the other hand, we've lived without a set of static, byte[]-oriented DataInput functions that would admit identical arguments. If RawComparators are an advanced feature (in need of better documentation?) or something better handled transparently by Avro, Jute, et al. then perhaps the user that would appreciate these doesn't exist.

Agreed. I always feel that we lack a cohesive story in terms of raw-comparator (together with writable). For example, how do I write a raw comparator for a composite key where each field in the key defines a raw comparator.  protocol buffer would solve it easily because it does length prefixed encoding and is able to skip fields at any level. I think avro has the potential of untangling the mess. So I would also like to discourage the introduction of such ad-hoc apis in io.WritableUtil.

Could we first add these utility methods in the module where it is needed? And possibly migrate them to the io package when we see more usage cases? This is how BoundedByteArrayOutputStream is migrated from tfile to io.

Alternatively, if you feel WritableUtil is the right place, we probably should document it clearly that it should be limited in writing raw comparators and avro may obsolete such apis in the future.

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch, C6381-1.patch, C6381-2.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779676#action_12779676 ] 

Chris Douglas commented on HADOOP-6381:
---------------------------------------

bq. Doesn't it avoid the need to construct a ByteArrayInputStream?

Yes, but if one has a DataInputBuffer for the purpose, then I wouldn't expect it to be too expensive. Given a key type with two vint params, to sort by the second:
{code}
final DataInputBuffer di = new DataInputBuffer(0);

int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
  di.reset(b1, s1, l1);
  final int v1 = WritableUtils.readVInt(di);
  final int x1 = WritableUtils.decodeVIntSize(b1[s1]);
  // second param starts at b[s1 + v1 + x1]
...
{code}

Which isn't too expensive or inelegant. Thread safety and a memory leak from the DIB reference can both be handled, again, without imposing too high a penalty, even for an inner loop. It just struck me as a lot of ceremony.

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch, C6381-1.patch, C6381-2.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-6381:
----------------------------------

    Attachment: C6381-1.patch

Changed names to encode/decodeVInt, returning bytes written from encode.

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch, C6381-1.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-6381:
----------------------------------

    Status: Open  (was: Patch Available)

Good point; I'd only added the write for symmetry. It's not a legal overload of {{writeVInt}} if it returns the length, but these can just be a different set of functions.

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779315#action_12779315 ] 

Hong Tang commented on HADOOP-6381:
-----------------------------------

Sorry about the spam, There is apparently some network problem that led me to believe my post was lost (I am currently in China). I'd appreciate if somebody delete the duplicated comments.

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch, C6381-1.patch, C6381-2.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779278#action_12779278 ] 

Hadoop QA commented on HADOOP-6381:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12425284/C6381-1.patch
  against trunk revision 881509.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    -1 javadoc.  The javadoc tool appears to have generated 1 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/21/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/21/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/21/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/21/console

This message is automatically generated.

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch, C6381-1.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779818#action_12779818 ] 

Hong Tang commented on HADOOP-6381:
-----------------------------------

bq. That's what WritableUtils::decodeVIntSize does.
That means you would have to decode the first byte of the vint twice.

bq. where the object instantiation cost was just too much, performance-wise. No?
How effective would JIT help eliminate the overhead of objects created and forgotten in a tight loop? This should be the kind of things that compilers are good at.



> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch, C6381-1.patch, C6381-2.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779241#action_12779241 ] 

Hadoop QA commented on HADOOP-6381:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12425283/C6381-0.patch
  against trunk revision 881509.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    -1 javadoc.  The javadoc tool appears to have generated 1 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/141/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/141/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/141/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/141/console

This message is automatically generated.

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-6381:
----------------------------------

    Status: Open  (was: Patch Available)

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch, C6381-1.patch, C6381-2.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-6381:
----------------------------------

    Attachment: C6381-0.patch

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779311#action_12779311 ] 

Hong Tang commented on HADOOP-6381:
-----------------------------------

I am concerned about the code duplication this brings.

Is decodeVLong(buf, off) == readVLong(new DataInputStream(new ByteArrayInputStream(buf, off, buf.length - off)))?
And encodeVLong(buf, off, l) could be implemented as follows (with an additional constructor added to BoundedByteArrayOutputStream)
{
  BoundedByteArrayOutputStream bbaos = new BoundedByteArrayOutputStream(buf, off);
  writeVLong(new DataOutputStream(bbaos), l); 
  return bbaos.size() - off;
}


  

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch, C6381-1.patch, C6381-2.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780329#action_12780329 ] 

Chris Douglas commented on HADOOP-6381:
---------------------------------------

Frankly, we're all just speculating on the relative efficiency of querying the DIB position before and after the vint read and doing the decode, but I think we'd all agree that any difference is probably not significant (if we need an efficiency use case, sorting by vint length avoids the DIB costs). Again, the purpose of this is to require less than the set of steps to do this correctly now: creating the RawComparator with the DIB, including the arithmetic in the compare, clearing the reference to the array in the DIB after the compare, and (optionally) handling thread safety. Instead of requiring users to figure this out on their own, adding a utility method to decode the bytes seems merciful to me.

On the other hand, we've lived without a set of static, byte[]-oriented DataInput functions that would admit identical arguments. If RawComparators are an advanced feature (in need of better documentation?) or something better handled transparently by Avro, Jute, et al. then perhaps the user that would appreciate these doesn't exist.

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch, C6381-1.patch, C6381-2.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779313#action_12779313 ] 

Todd Lipcon commented on HADOOP-6381:
-------------------------------------

My assumption was that this code is meant for tight loop code (like RawComparators) where the object instantiation cost was just too much, performance-wise. No?

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch, C6381-1.patch, C6381-2.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-6381:
----------------------------------

    Resolution: Won't Fix
        Status: Resolved  (was: Patch Available)

bq. I would also like to discourage the introduction of such ad-hoc apis in io.WritableUtil

One could respond that WritableUtil contains only ad-hoc APIs, that many patches are still submitted without knowing best-practices for working with Writables (e.g. a patch recently posted to HADOOP-6373 uses ByteArray\*Streams to read vints from byte[]), and the day when Avro rescues Hadoop's users from byte-oriented interfaces is still in the future.

However, your point that what ails Writables can't be fixed with more utility functions is irrefutable and undisputed. Adding them so late in Writable's life makes the "confused user" argument even less compelling. I'll close this as "won't fix".

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch, C6381-1.patch, C6381-2.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779314#action_12779314 ] 

Hong Tang commented on HADOOP-6381:
-----------------------------------

I am a bit concerned with the code duplication this may introduce.

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch, C6381-1.patch, C6381-2.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-6381:
----------------------------------

    Attachment: C6381-2.patch

Fixed javadoc warning.

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch, C6381-1.patch, C6381-2.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-6381:
----------------------------------

    Status: Patch Available  (was: Open)

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch, C6381-1.patch, C6381-2.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779329#action_12779329 ] 

Chris Douglas commented on HADOOP-6381:
---------------------------------------

bq. decodeVint methods only return the integer and not the length consumed, so again they're less than useful in a lot of circumstances.

That's what {{WritableUtils::decodeVIntSize}} does.

bq. I am a bit concerned with the code duplication this may introduce.

I wrote the read to use with a RawComparator. For e.g. tuples that vint encode the length of each parameter, it's easier to skip over bytes using the form this introduces. One could keep a DataInputBuffer with the comparator, but that either requires an allocation or has threading issues. To be fair, WritableComparator isn't threadsafe either, but adding a ThreadLocal, etc. seemed like an unnecessary evasion for a byte-oriented decode.

We're committed to the vint format written from WritableUtils... it's not clear what problems duplicating this code would cause, unless the new code is incorrect. It's not an efficiency improvement; it's just a helpful API when writing RawComparators. It seemed like a natural form to have for working with vints, but I'm wholly ambivalent about closing this as "won't fix".

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch, C6381-1.patch, C6381-2.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779594#action_12779594 ] 

Todd Lipcon commented on HADOOP-6381:
-------------------------------------

I agree with Chris about the code duplication - it's not really a maintenance issue since the format is locked down.

bq. It's not an efficiency improvement

Doesn't it avoid the need to construct a ByteArrayInputStream? If it does, I'm +1 for a patch like this - avoiding the object allocations in a comparator seems worth it.

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch, C6381-1.patch, C6381-2.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779237#action_12779237 ] 

Todd Lipcon commented on HADOOP-6381:
-------------------------------------

I think the new functions should return the number of bytes actually written - otherwise it's impossible to use this as part of a serialization routine where you want to continue writing more stuff into an array after the vlong/vint, right?

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-6381:
----------------------------------

    Status: Patch Available  (was: Open)

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch, C6381-1.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779312#action_12779312 ] 

Hong Tang commented on HADOOP-6381:
-----------------------------------

I am concerned about the code duplication this brings.

Is decodeVLong(buf, off) == readVLong(new DataInputStream(new ByteArrayInputStream(buf, off, buf.length - off)))?
And encodeVLong(buf, off, l) could be implemented as follows (with an additional constructor added to BoundedByteArrayOutputStream)
{
  BoundedByteArrayOutputStream bbaos = new BoundedByteArrayOutputStream(buf, off);
  writeVLong(new DataOutputStream(bbaos), l); 
  return bbaos.size() - off;
}

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch, C6381-1.patch, C6381-2.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779293#action_12779293 ] 

Hadoop QA commented on HADOOP-6381:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12425300/C6381-2.patch
  against trunk revision 881509.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/142/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/142/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/142/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/142/console

This message is automatically generated.

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch, C6381-1.patch, C6381-2.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-6381:
----------------------------------

    Status: Patch Available  (was: Open)

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6381) WritableUtils::*VLong utilities should be available for byte arrays

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779304#action_12779304 ] 

Todd Lipcon commented on HADOOP-6381:
-------------------------------------

Hey Chris. I thought about this a bit more on the train (I'm an odd guy, I know) and realized another similar issue: the decodeVint methods only return the integer and not the length consumed, so again they're less than useful in a lot of circumstances. Namely, I'd imagine these sorts of routines would be most handy when writing RawComparators where you might see a vint length followed by a string. Without knowing how long the vint was, you can't really know where the string starts.

Since Java sadly doesn't have tuples, I don't have a great idea what to do about this. Option 1 is to just leave it like you've got it. Option 2 would be to make the decode routines be non-static methods of VIntWritable/VLongWritable. Any thoughts about this? I'm just hesitant to add these public-facing APIs if we know them to be faulty and don't have a use case yet.

> WritableUtils::*VLong utilities should be available for byte arrays
> -------------------------------------------------------------------
>
>                 Key: HADOOP-6381
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6381
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Chris Douglas
>            Priority: Minor
>         Attachments: C6381-0.patch, C6381-1.patch, C6381-2.patch
>
>
> Particularly when working with raw bytes in Writables, it is often useful to have versions of the vint utility functions for byte arrays.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.