Posted to user@avro.apache.org by Zheng Shao <zs...@gmail.com> on 2010/01/23 03:20:48 UTC

lazy deserialization?

I noticed that Avro has "skip" functions which can help skip a
field when deserializing data.
This is good for column pruning in most cases, but we might be able to
do better in the following case.


Let's say we have a query like this:

CREATE TABLE t (col1 STRING, col2 STRING, col3 STRING);
SELECT col2 FROM t WHERE col3 = 'abcde';

We want to read field col3 first; if it matches the predicate, we
then want to read field col2.


Is there any way to "remember" the current location of deserialization,
so that we can "resume" from that point?


-- 
Yours,
Zheng

Re: lazy deserialization?

Posted by Scott Carey <sc...@richrelevance.com>.
The binary decoder needs performance work that requires some extra buffering (AVRO-327). Once that is done, adding deferred lazy-load capabilities wouldn't be that intrusive, and I am willing to build them into the Java BinaryDecoder if needed.

-Scott

On Jan 22, 2010, at 6:38 PM, Philip Zeyliger wrote:

> Not with any of today's APIs.  "SELECT col1, col3 FROM t" is handled
> easily: you construct a schema that only has those columns, and col2
> is skipped at read time.
> 
> Does Hive have a use case for this that you're interested in?  If you
> don't mind paying the buffer copy, you could probably write a
> "DeferredFoo" class that doesn't de-serialize certain structures...
> 
> -- Philip
> 


Re: lazy deserialization?

Posted by Philip Zeyliger <ph...@cloudera.com>.
Not with any of today's APIs.  "SELECT col1, col3 FROM t" is handled
easily: you construct a schema that only has those columns, and col2
is skipped at read time.
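
To illustrate what "skipped at read time" means, here is a minimal
Python sketch. It is not the Avro API: the helper names (write_long,
read_projected, and so on) are hypothetical, and it hand-rolls only
Avro's binary encoding of strings (a zigzag varint length followed by
UTF-8 bytes). It also assumes a record made up solely of string
fields written in a fixed order.

```python
import io

FIELDS = ["col1", "col2", "col3"]  # writer's field order (assumed)

def write_long(out, n):
    """Write a long as zigzag + base-128 varint, low bits first."""
    z = (n << 1) ^ (n >> 63)
    while z & ~0x7F:
        out.write(bytes([(z & 0x7F) | 0x80]))
        z >>= 7
    out.write(bytes([z]))

def read_long(buf):
    """Read a zigzag varint long from a binary stream."""
    shift, z = 0, 0
    while True:
        b = buf.read(1)[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)

def write_string(out, s):
    data = s.encode("utf-8")
    write_long(out, len(data))
    out.write(data)

def read_string(buf):
    return buf.read(read_long(buf)).decode("utf-8")

def skip_string(buf):
    # Read only the length prefix, then jump over the payload.
    buf.seek(read_long(buf), io.SEEK_CUR)

def encode_row(row):
    out = io.BytesIO()
    for name in FIELDS:
        write_string(out, row[name])
    return out.getvalue()

def read_projected(data, wanted):
    """Decode only the fields in `wanted`; skip the rest undecoded."""
    buf = io.BytesIO(data)
    row = {}
    for name in FIELDS:
        if name in wanted:
            row[name] = read_string(buf)
        else:
            skip_string(buf)
    return row

data = encode_row({"col1": "a", "col2": "bb", "col3": "abcde"})
projected = read_projected(data, {"col1", "col3"})
print(projected)
```

The skipped field costs one length read and a seek; its bytes are
never decoded.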

Does Hive have a use case for this that you're interested in?  If you
don't mind paying the buffer copy, you could probably write a
"DeferredFoo" class that doesn't de-serialize certain structures...
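
A rough Python sketch of that idea, under stated assumptions: the
class name DeferredString and all helpers below are hypothetical and
are not part of Avro. The sketch copies the encoded bytes of col2
aside (the buffer copy mentioned above) and decodes them only if the
predicate on col3 matches. It assumes a record of three string
fields, each encoded the way Avro encodes strings (zigzag varint
length, then UTF-8 bytes).

```python
import io

def write_long(out, n):
    """Write a long as zigzag + base-128 varint, low bits first."""
    z = (n << 1) ^ (n >> 63)
    while z & ~0x7F:
        out.write(bytes([(z & 0x7F) | 0x80]))
        z >>= 7
    out.write(bytes([z]))

def read_long(buf):
    """Read a zigzag varint long from a binary stream."""
    shift, z = 0, 0
    while True:
        b = buf.read(1)[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)

def encode_row(col1, col2, col3):
    out = io.BytesIO()
    for s in (col1, col2, col3):
        data = s.encode("utf-8")
        write_long(out, len(data))
        out.write(data)
    return out.getvalue()

class DeferredString:
    """Holds still-encoded bytes; decodes them on first access."""
    def __init__(self, raw):
        self._raw = raw
        self._value = None
        self.decoded = False  # lets callers see whether decoding happened

    def get(self):
        if self._value is None:
            self._value = self._raw.decode("utf-8")
            self.decoded = True
        return self._value

def scan_row(data, target):
    """Evaluate: SELECT col2 FROM t WHERE col3 = target, for one row."""
    buf = io.BytesIO(data)
    buf.seek(read_long(buf), io.SEEK_CUR)           # col1: skip entirely
    col2 = DeferredString(buf.read(read_long(buf)))  # col2: copy, no decode
    col3 = buf.read(read_long(buf)).decode("utf-8")  # col3: decode now
    result = col2.get() if col3 == target else None
    return result, col2

hit, hit_field = scan_row(encode_row("x", "keep", "abcde"), "abcde")
miss, miss_field = scan_row(encode_row("x", "drop", "zzz"), "abcde")
print(hit, miss)
```

With a seekable buffer, an alternative would be to record the offset
of col2 and seek back to it instead of copying the bytes.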

-- Philip
