Posted to common-dev@hadoop.apache.org by "Arun C Murthy (JIRA)" <ji...@apache.org> on 2008/05/19 10:46:55 UTC

[jira] Created: (HADOOP-3414) Facility to query serializable types such as Writables for 'raw length'

Facility to query serializable types such as Writables for 'raw length'
-----------------------------------------------------------------------

                 Key: HADOOP-3414
                 URL: https://issues.apache.org/jira/browse/HADOOP-3414
             Project: Hadoop Core
          Issue Type: Improvement
          Components: io
            Reporter: Arun C Murthy


Currently we need to jump through hoops to get the 'raw length' of serializable types; e.g., SequenceFile.Writer.append needs to copy the key/value into a buffer and then check the buffer's size to figure out the record/key/value lengths. Obviously this could be improved to do away with the extra copy if we had types which could be queried for their raw length.
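
For illustration, the extra copy described above looks roughly like the following sketch. It uses only java.io stand-ins rather than the actual SequenceFile.Writer code; the class and method names (BufferedAppendSketch, append) and the simplified record layout are assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class BufferedAppendSketch {
  // Hypothetical stand-in for a record append: key and value are first
  // serialized into a scratch buffer purely to learn their lengths.
  public static void append(DataOutputStream out, String key, String value)
      throws IOException {
    ByteArrayOutputStream scratch = new ByteArrayOutputStream();
    DataOutputStream scratchOut = new DataOutputStream(scratch);

    scratchOut.writeUTF(key);
    int keyLength = scratch.size();      // known only after serializing the key
    scratchOut.writeUTF(value);
    int recordLength = scratch.size();   // key bytes + value bytes

    out.writeInt(recordLength);
    out.writeInt(keyLength);
    scratch.writeTo(out);                // the extra copy the issue wants to avoid
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    append(new DataOutputStream(sink), "k1", "hello");
    System.out.println(sink.size() + " bytes written");
  }
}
```

Every byte of the key and value is written twice here: once into the scratch buffer and once into the real output.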

Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3414) Facility to query serializable types such as Writables for 'raw length'

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598000#action_12598000 ] 

Devaraj Das commented on HADOOP-3414:
-------------------------------------

This doesn't look straightforward (I wish it were), especially given that post HADOOP-1986 we could use any serialization technique (including Java's native one)...



[jira] Commented: (HADOOP-3414) Facility to query serializable types such as Writables for 'raw length'

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598141#action_12598141 ] 

Arun C Murthy commented on HADOOP-3414:
---------------------------------------

Sigh, we need to go further and tweak Writables too... we'd need to add a Writable.getSerializedLength. Unfortunately, Writable is an interface; we would have to make it an abstract class to play the same trick as Serializer.getSize! Thoughts? *wink*
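
A small sketch of the compatibility concern, using hypothetical names (SizedBase, OldType and FixedLong are not Hadoop types): Java at the time had no default methods on interfaces, so adding a method to an interface would break every existing implementor, whereas an abstract base class can ship the new method with a default that returns -1 for "unknown".

```java
import java.io.DataOutput;
import java.io.IOException;

public class AbstractDefaultSketch {
  // Hypothetical abstract base: the new method ships with a default body.
  public abstract static class SizedBase {
    public abstract void write(DataOutput out) throws IOException;

    // -1 means "length not known without actually serializing".
    public int getSerializedLength() { return -1; }
  }

  // Pre-existing subclass: compiles unchanged and inherits the -1 default.
  public static class OldType extends SizedBase {
    @Override public void write(DataOutput out) throws IOException {
      out.writeUTF("old");
    }
  }

  // Subclass that opts in by overriding the default.
  public static class FixedLong extends SizedBase {
    private final long value;
    public FixedLong(long value) { this.value = value; }
    @Override public void write(DataOutput out) throws IOException {
      out.writeLong(value);
    }
    @Override public int getSerializedLength() { return Long.BYTES; }
  }

  public static void main(String[] args) {
    System.out.println(new OldType().getSerializedLength());
    System.out.println(new FixedLong(7L).getSerializedLength());
  }
}
```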



[jira] Commented: (HADOOP-3414) Facility to query serializable types such as Writables for 'raw length'

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598134#action_12598134 ] 

Arun C Murthy commented on HADOOP-3414:
---------------------------------------

This would be useful even without LengthPrefixedSerializer, since typically we want to write out the record/key/value lengths first, before the actual data, à la SequenceFile.

Serializer#getSize in itself goes a long way! *smile*
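
To make the "lengths first" layout concrete, here is a minimal sketch under assumed names (LengthsFirstSketch and writeRecord are hypothetical, and the layout of record length, key length, key bytes, value bytes is a simplification). When the key and value sizes are known up front, the lengths can be written before the data with no intermediate buffer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class LengthsFirstSketch {
  // Sizes are known before any payload byte is written, so the lengths
  // go out first and the data is streamed directly after them.
  public static void writeRecord(DataOutputStream out, long key, byte[] value)
      throws IOException {
    int keyLength = Long.BYTES;                  // a long always takes 8 bytes
    int recordLength = keyLength + value.length;

    out.writeInt(recordLength);
    out.writeInt(keyLength);
    out.writeLong(key);
    out.write(value);
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    writeRecord(new DataOutputStream(sink), 42L, new byte[] {1, 2, 3, 4, 5});

    // Read the two length fields back to show they precede the data.
    DataInputStream in =
        new DataInputStream(new ByteArrayInputStream(sink.toByteArray()));
    int recordLength = in.readInt();
    int keyLength = in.readInt();
    System.out.println("record=" + recordLength + " key=" + keyLength);
  }
}
```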



[jira] Commented: (HADOOP-3414) Facility to query serializable types such as Writables for 'raw length'

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598133#action_12598133 ] 

Arun C Murthy commented on HADOOP-3414:
---------------------------------------

> Is that something like what you have in mind?

Absolutely!



[jira] Commented: (HADOOP-3414) Facility to query serializable types such as Writables for 'raw length'

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598126#action_12598126 ] 

Doug Cutting commented on HADOOP-3414:
--------------------------------------

So we'd:
 - make Serializer an abstract class;
 - add a method:
{code}
 public int getSize(T t) { return -1; }  // -1 means the size is not known up front
{code}
 - override this for some simple classes, like Text and BytesWritable;
 - add a utility somewhere like:
{code}
public class LengthPrefixedSerializer<T> extends Serializer<T> {
  private final DataOutputStream out;
  private final DataOutputBuffer buffer = new DataOutputBuffer();
  private final Serializer<T> serializer;        // writes straight to out
  private final Serializer<T> bufferSerializer;  // writes to the scratch buffer

  public LengthPrefixedSerializer(Class<T> c, DataOutputStream out,
                                  Configuration conf) throws IOException {
    this.out = out;
    // getSerializer is an instance method, so a factory is needed
    SerializationFactory factory = new SerializationFactory(conf);
    serializer = factory.getSerializer(c);
    serializer.open(out);
    bufferSerializer = factory.getSerializer(c);
    bufferSerializer.open(buffer);
  }

  public void serialize(T o) throws IOException {
    int size = serializer.getSize(o);
    if (size >= 0) {
      // can serialize directly w/o buffering
      WritableUtils.writeVInt(out, size);
      serializer.serialize(o);
    } else {
      // have to buffer before we can serialize
      buffer.reset();
      bufferSerializer.serialize(o);
      WritableUtils.writeVInt(out, buffer.getLength());
      out.write(buffer.getData(), 0, buffer.getLength());
    }
  }
}
{code}

Is that something like what you have in mind?



[jira] Commented: (HADOOP-3414) Facility to query serializable types such as Writables for 'raw length'

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598343#action_12598343 ] 

Doug Cutting commented on HADOOP-3414:
--------------------------------------

Perhaps we could add a SizedWritable sub-interface or abstract class, then change Text & BytesWritable to extend that instead.  Then WritableSerializer#getSize(Writable) could use instanceof to decide whether to call getSize() or simply return -1.  It's a little ugly to use instanceof, but it would be back-compatible, I think.
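
The instanceof approach could be sketched as follows. This is a self-contained illustration, not Hadoop code: SimpleWritable stands in for Writable, and SizedWritable, Bytes and Legacy are hypothetical names chosen for the example.

```java
import java.io.DataOutput;
import java.io.IOException;

public class SizedWritableSketch {
  // Stand-in for Hadoop's Writable, reduced to the write side.
  public interface SimpleWritable {
    void write(DataOutput out) throws IOException;
  }

  // Hypothetical opt-in sub-interface for types that know their size.
  public interface SizedWritable extends SimpleWritable {
    int getSize();
  }

  // Returns the serialized size if the type exposes it, else -1.
  public static int getSize(SimpleWritable w) {
    if (w instanceof SizedWritable) {
      return ((SizedWritable) w).getSize();
    }
    return -1;
  }

  // Opts in: its payload is a raw byte array of known length.
  public static class Bytes implements SizedWritable {
    private final byte[] data;
    public Bytes(byte[] data) { this.data = data; }
    @Override public void write(DataOutput out) throws IOException {
      out.write(data);
    }
    @Override public int getSize() { return data.length; }
  }

  // Does not opt in; existing code like this keeps compiling unchanged.
  public static class Legacy implements SimpleWritable {
    @Override public void write(DataOutput out) throws IOException {
      out.writeUTF("x");
    }
  }

  public static void main(String[] args) {
    System.out.println(getSize(new Bytes(new byte[] {1, 2, 3})));
    System.out.println(getSize(new Legacy()));
  }
}
```

Types that never implement the sub-interface are untouched, which is what keeps the change back-compatible; the cost is one instanceof check per call.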

What we really need is a benchmark that shows this provides some significant performance improvement.  If it does, then it's probably worth finding the best compromise between back-compatibility, elegance and performance.



[jira] Commented: (HADOOP-3414) Facility to query serializable types such as Writables for 'raw length'

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598135#action_12598135 ] 

Doug Cutting commented on HADOOP-3414:
--------------------------------------

> This would be useful even without LengthPrefixedSerializer, [ ... ]

Sure, that was mostly an example.  But examples are important when designing APIs.  If you have a different canonical use in mind, please present it, so folks can see how well the proposed API fits the application you have in mind.
