Posted to user@pig.apache.org by Mat Kelcey <ma...@gmail.com> on 2012/09/16 02:15:43 UTC

Approaches to storing arbitrary schema in a sequencefile

Hey all,

I've started using SequenceFiles more and more (in particular the
elephant bird load and storage functions) and am wondering what the
best approach is for marshaling between a schema from pig (which can
have some arbitrary number of fields) and a sequence file (which must
have two fields; key and value).

When I've got two fields it's trivial...

 %declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
 %declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
 %declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
 a = load 'x' as (f1:chararray, f2:chararray);
 store a into 'y' using $SEQFILE_STORAGE( '-c $TEXT_CONVERTER', '-c $TEXT_CONVERTER');

but what's the best way to handle something with 3+ fields?

 a = load 'x' as (f1:chararray, f2:chararray, f3:chararray);

I can see two options...
1) A simple Writable converter to convert to something like f1 and a
composite f2, f3 field
2) Packing the fields myself using something like "a = foreach a
generate f1, TOTUPLE(f2, f3)"

But both are super clumsy and require unpacking when I reread things.

Am I missing something obvious here?

Cheers,
Mat

Re: Approaches to storing arbitrary schema in a sequencefile

Posted by Mat Kelcey <ma...@gmail.com>.
I guess I was looking for a quick win for a simple flat schema; a
serialisation format feels a bit of overkill for what I'm doing.
I might be able to just JSON my way out of this specific problem...
Cheers!
Mat
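
The JSON idea above can be sketched in plain Python. This is only an
illustration of the packing logic, not elephant-bird or Pig API: the
function names pack and unpack are hypothetical, and it assumes the first
field serves as the SequenceFile key while the remaining fields are
serialised into a single JSON array used as the value.

```python
import json


def pack(fields):
    """Split a flat record into a (key, value) pair.

    Assumption: the first field is the key; all remaining fields are
    packed, in order, into one JSON array string.
    """
    key, rest = fields[0], list(fields[1:])
    return key, json.dumps(rest)


def unpack(key, value):
    """Reverse pack(): rebuild the flat record from key and JSON value."""
    return (key, *json.loads(value))


record = ("f1-value", "f2-value", "f3-value")
key, value = pack(record)       # two text fields, as a SequenceFile needs
restored = unpack(key, value)   # back to the original flat record
```

Both halves of the pair are plain text, so a text converter on each side
should be enough to store them; the unpacking step on reread is still
needed, it just becomes a single JSON parse.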

On 15 September 2012 19:44, Dmitriy Ryaboy <dv...@gmail.com> wrote:
> We tend to write protobuf or thrift definition for complex objects,
> but that introduces severe latency into the development process.
> I suppose you could try something like kryo (and create a
> corresponding deserializer for EB).. the core of the problem is that
> you need to carry around the schema, and you probably don't want to
> write it into every tuple.
>
> D

Re: Approaches to storing arbitrary schema in a sequencefile

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
We tend to write Protobuf or Thrift definitions for complex objects,
but that introduces severe latency into the development process.
I suppose you could try something like kryo (and create a
corresponding deserializer for EB).. the core of the problem is that
you need to carry around the schema, and you probably don't want to
write it into every tuple.

D
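
The point about carrying the schema around can be illustrated with a
small Python sketch. It is an assumption-laden toy, not anything from
elephant-bird: the schema list, the sample rows, and read_positional are
all hypothetical. It compares writing field names into every record with
writing positional values only and keeping the schema once, out of band.

```python
import json

# Hypothetical flat schema and sample rows.
schema = ["f1", "f2", "f3"]
rows = [("a", "b", "c"), ("d", "e", "f")]

# Self-describing records: the field names are repeated in every record.
with_names = [json.dumps(dict(zip(schema, row))) for row in rows]

# Positional records: values only; the schema is stored once elsewhere.
positional = [json.dumps(list(row)) for row in rows]


def read_positional(line, schema):
    """Re-attach field names to a positional record using the shared schema."""
    return dict(zip(schema, json.loads(line)))


decoded = [read_positional(line, schema) for line in positional]
```

The positional form is smaller per record, but the reader has to obtain
the schema from somewhere else, which is the trade-off described above.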
