Posted to dev@hive.apache.org by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org> on 2009/01/05 18:59:44 UTC

[jira] Commented: (HIVE-207) Change SerDe API to allow skipping unused columns

    [ https://issues.apache.org/jira/browse/HIVE-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660832#action_12660832 ] 

Joydeep Sen Sarma commented on HIVE-207:
----------------------------------------

The deserializer API already allows access to one column at a time. The deserialize() call doesn't have to do any real work: it only has to return a handle for lazy deserialization (for example, the handle can hold a reference to a byte array). Later, specific operators invoke the ObjectInspector interfaces to access particular columns, and at that point the ObjectInspector implementation can deserialize just the relevant part of the byte array.

The default reflection-based ObjectInspector does not work this way, but that is a matter of implementation (we just haven't gotten around to lazy deserialization, and in any case it depends on the serialization format).

If you can try implementing lazy deserialization for Protocol Buffers, that will tell us what else needs to be added in terms of interfaces. Right now I am confident that we have enough interfaces to do lazy deserialization of, for example, a delimited string format.
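
To make the idea concrete, here is a minimal sketch of lazy deserialization for a delimited string format. It is not Hive's actual SerDe or ObjectInspector API: the names LazyRow, LazyDeserializer, LazyDelimitedInspector and getField are hypothetical, and it assumes a single-byte ASCII delimiter and UTF-8 text.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical names throughout; this is NOT Hive's real SerDe/ObjectInspector API.

/** Handle returned by deserialize(): it only keeps a reference to the raw bytes. */
final class LazyRow {
    final byte[] bytes;
    LazyRow(byte[] bytes) { this.bytes = bytes; }
}

final class LazyDeserializer {
    /** Does no parsing at all: wraps the bytes and returns immediately. */
    static LazyRow deserialize(byte[] raw) { return new LazyRow(raw); }
}

/** Stand-in for an inspector: decodes a single column only when asked for it. */
final class LazyDelimitedInspector {
    private final byte delimiter;   // assumed to be a single-byte ASCII character
    int fieldsDecoded = 0;          // counts columns that were actually materialized

    LazyDelimitedInspector(char delimiter) { this.delimiter = (byte) delimiter; }

    /** Returns column `index` as a String, or null if the row has fewer columns. */
    String getField(LazyRow row, int index) {
        byte[] b = row.bytes;
        int start = 0;      // byte offset where the current column begins
        int current = 0;    // index of the column being scanned
        for (int i = 0; i <= b.length; i++) {
            if (i == b.length || b[i] == delimiter) {
                if (current == index) {
                    fieldsDecoded++;
                    // Only the requested slice of the byte array is decoded.
                    return new String(b, start, i - start, StandardCharsets.UTF_8);
                }
                current++;
                start = i + 1;
            }
        }
        return null;
    }
}

class LazyDemo {
    public static void main(String[] args) {
        byte[] raw = "alice,42,seattle".getBytes(StandardCharsets.UTF_8);
        LazyRow row = LazyDeserializer.deserialize(raw);
        LazyDelimitedInspector inspector = new LazyDelimitedInspector(',');
        System.out.println(inspector.getField(row, 1));
        System.out.println("columns decoded: " + inspector.fieldsDecoded);
    }
}
```

In this sketch a query like "select count(1)" would never call getField, so no column would be decoded at all. Whether the same pattern carries over cleanly to Protocol Buffers is exactly the open question here.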

> Change SerDe API to allow skipping unused columns
> -------------------------------------------------
>
>                 Key: HIVE-207
>                 URL: https://issues.apache.org/jira/browse/HIVE-207
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor, Serializers/Deserializers
>            Reporter: David Phillips
>
> A deserializer shouldn't have to deserialize columns that are never used by the query processor.  A serializer shouldn't have to examine unused columns that are known to always be null.
> As an example, we store data as a Protocol Buffer structure with ~60 fields.  Running a "select count(1)" currently requires deserializing all fields, which includes checking if they exist and formatting the data appropriately.  This is expensive and unnecessary.
