Posted to common-dev@hadoop.apache.org by "Pete Wyckoff (JIRA)" <ji...@apache.org> on 2008/09/19 20:04:44 UTC

[jira] Created: (HADOOP-4224) [Hive] Port Hive's serialization/deserialization to the new Serialization framework

[Hive] Port Hive's serialization/deserialization to the new Serialization framework
-----------------------------------------------------------------------------------

                 Key: HADOOP-4224
                 URL: https://issues.apache.org/jira/browse/HADOOP-4224
             Project: Hadoop Core
          Issue Type: Improvement
          Components: contrib/hive, contrib/serialization
            Reporter: Pete Wyckoff


Problem 1: legacy data

The port is non-trivial because legacy Hive data is written as BytesWritable in the value field of the SequenceFile. The specific RecordIO/Thrift/X class name is stored in the metastore.

If we write our own SequenceFileRecordReader, this is trivial. The standard reader, however, assumes the SequenceFile records the actual class name, so we cannot deserialize at this level: we would just get back a BytesWritable. We need the SequenceFileRecordReader to ask the Deserializer which class is actually being deserialized.
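
To make the lookup concrete, here is a rough sketch in plain Java of a reader-side helper that prefers the class name recorded outside the file over the one the file declares. Everything in it is hypothetical (LegacyValueResolver, recordRealClass, registerDecoder, and the map standing in for the metastore); it uses no Hadoop or Hive API and only illustrates the order of the lookup.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: none of these names are Hadoop or Hive APIs.
final class LegacyValueResolver {
    // Stand-in for the metastore: table name -> real record class name.
    private final Map<String, String> realClassByTable = new HashMap<>();
    // Stand-in for deserializers: class name -> function that decodes raw bytes.
    private final Map<String, Function<byte[], Object>> decoderByClass = new HashMap<>();

    void recordRealClass(String table, String className) {
        realClassByTable.put(table, className);
    }

    void registerDecoder(String className, Function<byte[], Object> decoder) {
        decoderByClass.put(className, decoder);
    }

    // Prefer the class name stored outside the file; fall back to the one the file declares.
    String resolveClassName(String table, String classDeclaredInFile) {
        return realClassByTable.getOrDefault(table, classDeclaredInFile);
    }

    // Decode the raw value with the resolved class; hand back the bytes untouched
    // when no decoder is registered for that class.
    Object decode(String table, String classDeclaredInFile, byte[] rawValue) {
        String className = resolveClassName(table, classDeclaredInFile);
        Function<byte[], Object> decoder = decoderByClass.get(className);
        return decoder == null ? rawValue : decoder.apply(rawValue);
    }
}
```

With this shape, a table whose file declares BytesWritable but whose real class is recorded elsewhere gets decoded with the real class, while tables with no such record keep the standard behaviour.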

I don't know whether writing data as plain BytesWritable and storing the real class name somewhere else is a common pattern, but for us it is an issue.

Otherwise, a ThriftSerialization set of classes is coming soon, and we can add equivalents for our other SerDes.

Problem 2: DynamicSerDe

DynamicSerDe is a serializer/deserializer that takes a Thrift DDL at *runtime* and can serialize/deserialize both Thrift and non-Thrift data. Thus, the class name DynamicSerDe alone doesn't give you what you need, namely the DDL and the protocol used for the serialization: Binary or Control Separated (in theory also JSON, XML, ...).

We can store this DDL in the metastore (and we do), but then DynamicSerDe can only be used with Hive. Maybe we should output only to TFiles, where we could put the DDL in the file's metadata.
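
As an illustration of the "DDL travels with the file" idea, here is a minimal plain-Java sketch that writes the DDL and protocol name ahead of the payload, so a reader needs nothing from the metastore. It does not use the TFile API; the class SelfDescribingRecords and its layout (two UTF strings, a length, then the bytes) are assumptions made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch: a plain-Java container, not the TFile API.
final class SelfDescribingRecords {
    final String ddl;       // the Thrift DDL text needed to interpret the payload
    final String protocol;  // e.g. "Binary" or "Control Separated"
    final byte[] payload;   // the serialized records, opaque to this class

    SelfDescribingRecords(String ddl, String protocol, byte[] payload) {
        this.ddl = ddl;
        this.protocol = protocol;
        this.payload = payload;
    }

    // Layout: DDL, protocol, payload length, payload bytes.
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(ddl);
            out.writeUTF(protocol);
            out.writeInt(payload.length);
            out.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the metadata back first, then the payload it describes.
    static SelfDescribingRecords fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            String ddl = in.readUTF();
            String protocol = in.readUTF();
            byte[] payload = new byte[in.readInt()];
            in.readFully(payload);
            return new SelfDescribingRecords(ddl, protocol, payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The trade-off is that every file carries its own copy of the DDL, but in exchange the data can be read outside Hive.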



