Posted to common-dev@hadoop.apache.org by "Arun C Murthy (JIRA)" <ji...@apache.org> on 2008/05/19 10:46:55 UTC
[jira] Created: (HADOOP-3414) Facility to query serializable types
such as Writables for 'raw length'
Facility to query serializable types such as Writables for 'raw length'
-----------------------------------------------------------------------
Key: HADOOP-3414
URL: https://issues.apache.org/jira/browse/HADOOP-3414
Project: Hadoop Core
Issue Type: Improvement
Components: io
Reporter: Arun C Murthy
Currently we need to jump through hoops to get the 'raw length' of serializable types; e.g. SequenceFile.Writer.append needs to copy the key/value into a buffer and then check the buffer's size to figure out the record/key/value lengths. Obviously this can be improved to do away with the extra copy if we had types which could be queried for their raw length.
Thoughts?
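For illustration, the extra copy described above looks roughly like the following. This is a minimal, self-contained sketch using only java.io; the RawWritable interface and the BufferedAppend.appendBuffered helper are hypothetical stand-ins, not Hadoop's SequenceFile code.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in for a Writable-like type: it can write itself,
// but it cannot report its serialized length up front.
interface RawWritable {
    void write(DataOutputStream out) throws IOException;
}

final class BufferedAppend {
    // Serializes the value into a scratch buffer only to learn its length,
    // then writes the length followed by the buffered bytes (the extra copy).
    static int appendBuffered(RawWritable value, DataOutputStream out) throws IOException {
        ByteArrayOutputStream scratch = new ByteArrayOutputStream();
        DataOutputStream scratchOut = new DataOutputStream(scratch);
        value.write(scratchOut);
        scratchOut.flush();

        int length = scratch.size();
        out.writeInt(length);   // length prefix
        scratch.writeTo(out);   // second pass over the same bytes
        return length;
    }
}
```

If the type could be asked for its length directly, the scratch buffer and the second pass over the bytes would not be needed.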
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-3414) Facility to query serializable
types such as Writables for 'raw length'
Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598000#action_12598000 ]
Devaraj Das commented on HADOOP-3414:
-------------------------------------
This doesn't look straightforward (I wish it were). Especially given that post HADOOP-1986, we could use any serialization technique (including Java's native one)...
[jira] Commented: (HADOOP-3414) Facility to query serializable
types such as Writables for 'raw length'
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598141#action_12598141 ]
Arun C Murthy commented on HADOOP-3414:
---------------------------------------
Sigh, we need to go further and tweak Writables too... we'd need to add a Writable.getSerializedLength. Unfortunately Writable is an interface, so we would have to make it an abstract class to play the same trick as Serializer.getSize! Thoughts? *wink*
[jira] Commented: (HADOOP-3414) Facility to query serializable
types such as Writables for 'raw length'
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598134#action_12598134 ]
Arun C Murthy commented on HADOOP-3414:
---------------------------------------
This would be useful even without LengthPrefixedSerializer, since typically we want to write out the record/key/value lengths first, before the actual data, à la SequenceFiles.
Serializer#getSize in itself goes a long way! *smile*
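A rough sketch of why the lengths have to be known early: in a layout of this kind the length fields precede the payload. The RecordFraming class below is hypothetical and uses plain byte arrays and fixed 4-byte lengths for simplicity; it is not the actual SequenceFile writer.

```java
import java.io.DataOutputStream;
import java.io.IOException;

final class RecordFraming {
    // Writes: record length, key length, key bytes, value bytes.
    // Both lengths go out before any payload, so they must be known
    // (or the payload buffered) before the first payload byte is written.
    static void writeRecord(DataOutputStream out, byte[] key, byte[] value) throws IOException {
        out.writeInt(key.length + value.length); // record length
        out.writeInt(key.length);                // key length
        out.write(key);
        out.write(value);
    }
}
```

With a size query on the serializer, the two writeInt calls could be issued without first serializing the key and value into temporary buffers.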
[jira] Commented: (HADOOP-3414) Facility to query serializable
types such as Writables for 'raw length'
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598133#action_12598133 ]
Arun C Murthy commented on HADOOP-3414:
---------------------------------------
> Is that something like what you have in mind?
Absolutely!
[jira] Commented: (HADOOP-3414) Facility to query serializable
types such as Writables for 'raw length'
Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598126#action_12598126 ]
Doug Cutting commented on HADOOP-3414:
--------------------------------------
So we'd:
- make Serializer an abstract class;
- add a method:
{code}
public int getSize() { return -1; }
{code}
- override this in some simple classes, like Text and BytesWritable;
- add a utility somewhere like:
{code}
public class LengthPrefixedSerializer<T> extends Serializer<T> {
  private DataOutputStream out;
  private DataOutputBuffer buffer = new DataOutputBuffer();
  private Serializer<T> serializer;        // writes straight to out
  private Serializer<T> bufferSerializer;  // writes to the scratch buffer

  public LengthPrefixedSerializer(Class<T> c, DataOutputStream out) throws IOException {
    this.out = out;
    serializer = SerializationFactory.getSerializer(c);
    serializer.open(out);
    bufferSerializer = SerializationFactory.getSerializer(c);
    bufferSerializer.open(buffer);
  }

  public void serialize(T o) throws IOException {
    int size = o.getSize();  // proposed method; -1 means "size unknown"
    if (size >= 0) {
      // can serialize directly w/o buffering
      WritableUtils.writeVInt(out, size);
      serializer.serialize(o);
    } else {
      // have to buffer before we can serialize
      buffer.reset();
      bufferSerializer.serialize(o);
      WritableUtils.writeVInt(out, buffer.getLength());
      out.write(buffer.getData(), 0, buffer.getLength());
    }
  }
}
{code}
Is that something like what you have in mind?
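The idea above can be tried out without Hadoop on the classpath. The following is a self-contained sketch under stated assumptions: SizedSerializer, BytesSerializer and LengthPrefixedWriter are hypothetical names, getSize takes the value as an argument (one possible reading of the proposal), and a fixed 4-byte int replaces the VInt prefix.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical serializer base class: getSize returns -1 unless a subclass
// can compute the serialized length without serializing.
abstract class SizedSerializer<T> {
    int getSize(T value) { return -1; }
    abstract void serialize(T value, DataOutputStream out) throws IOException;
}

// Knows its length up front: the payload is exactly the byte array.
final class BytesSerializer extends SizedSerializer<byte[]> {
    @Override int getSize(byte[] value) { return value.length; }
    @Override void serialize(byte[] value, DataOutputStream out) throws IOException {
        out.write(value);
    }
}

final class LengthPrefixedWriter<T> {
    private final SizedSerializer<T> serializer;
    private final DataOutputStream out;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    int bufferedWrites = 0; // counts how often the fallback copy was needed

    LengthPrefixedWriter(SizedSerializer<T> serializer, DataOutputStream out) {
        this.serializer = serializer;
        this.out = out;
    }

    void write(T value) throws IOException {
        int size = serializer.getSize(value);
        if (size >= 0) {
            // Size known: write the prefix, then serialize directly.
            out.writeInt(size);
            serializer.serialize(value, out);
        } else {
            // Size unknown: serialize into the buffer to measure, then copy.
            buffer.reset();
            DataOutputStream bufferOut = new DataOutputStream(buffer);
            serializer.serialize(value, bufferOut);
            bufferOut.flush();
            out.writeInt(buffer.size());
            buffer.writeTo(out);
            bufferedWrites++;
        }
    }
}
```

Serializers that do not override getSize keep working through the buffered path, so the size query is an optimization rather than a requirement.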
[jira] Commented: (HADOOP-3414) Facility to query serializable
types such as Writables for 'raw length'
Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598343#action_12598343 ]
Doug Cutting commented on HADOOP-3414:
--------------------------------------
Perhaps we could add a SizedWritable sub-interface or abstract class, then change Text & BytesWritable to extend that instead. Then WritableSerializer#getSize(Writable) could use instanceof to decide whether to call getSize() or simply return -1. It's a little ugly to use instanceof, but it would be back-compatible, I think.
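A minimal sketch of that instanceof approach, kept free of Hadoop dependencies. SimpleWritable stands in for Writable (write side only), and Utf8Text and WritableSizes are hypothetical illustrations rather than existing classes; the 4-byte length header in Utf8Text is likewise an assumption made for the example.

```java
import java.io.DataOutput;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for Writable (write side only).
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
}

// Hypothetical sub-interface for types that know their serialized length.
interface SizedWritable extends SimpleWritable {
    int getSize();
}

// Example type: a 4-byte length header followed by UTF-8 bytes.
final class Utf8Text implements SizedWritable {
    private final byte[] bytes;

    Utf8Text(String s) { bytes = s.getBytes(StandardCharsets.UTF_8); }

    @Override public int getSize() { return 4 + bytes.length; }

    @Override public void write(DataOutput out) throws IOException {
        out.writeInt(bytes.length);
        out.write(bytes);
    }
}

final class WritableSizes {
    // Returns the serialized size if the type can report it, else -1.
    static int getSize(SimpleWritable w) {
        return (w instanceof SizedWritable) ? ((SizedWritable) w).getSize() : -1;
    }
}
```

Existing implementations of the base interface are untouched, which is where the back-compatibility comes from; only types that opt in to SizedWritable skip the buffering.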
What we really need is a benchmark that shows this provides some significant performance improvement. If it does, then it's probably worth finding the best compromise between back-compatibility, elegance and performance.
[jira] Commented: (HADOOP-3414) Facility to query serializable
types such as Writables for 'raw length'
Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598135#action_12598135 ]
Doug Cutting commented on HADOOP-3414:
--------------------------------------
> This would be useful even without LengthPrefixedSerializer, [ ... ]
Sure, that was mostly an example. But examples are important when designing APIs. If you have a different canonical use in mind, please present it, so folks can see how well the proposed API fits the application you have in mind.