Posted to user@pig.apache.org by Pete Wyckoff <pw...@facebook.com> on 2008/09/20 02:35:27 UTC

Tuple and Datum implementations

Is it possible to implement these using an object composed of List, Commons
bag, Integer, Double, String, Boolean and Byte? (I will also have type
information, so I don't need to endure the cost of reflection.)

Also, is it possible to implement them using a record IO Object and
reflection?

Thanks, pete

Re: Tuple and Datum implementations

Posted by Alan Gates <ga...@yahoo-inc.com>.


Pete Wyckoff wrote:
> For #1.4, could I not implement a new storage implementation and when given
> the file name, I choose the deserialization/serialization mechanism? This
> would not allow me to hide the location of the file from the user, but would
> still have the benefit of the storage implementation hiding the details of
> the deserialization.
>   
In the short term you could.  Eventually we'd like to be able to 
decouple the loading from the metadata for those who want to use 
external metadata sources so that they aren't forced to reimplement all 
the load and store functions.
> For #2, yes I see, I don't want to implement the full Bag API, just want to
> construct a default data bag from a Set or a List native object.
>
> As for describe, I mean describe on a symbolic name, presumably a name
> returned by a "show" command that I would also want to implement, both with
> basically MySQL semantics.
>   
In my earlier mail I was thinking mainly of file level metadata (schema, 
etc.).  Here you're proposing grid level metadata.  We talked about being 
able to do things like show and describe on "tables" instead of on 
files, but haven't fleshed it out yet.  I think we all agree it's 
something we'll want to be able to do.

Alan.

Re: Tuple and Datum implementations

Posted by Pete Wyckoff <pw...@facebook.com>.

For #1.4, could I not implement a new storage implementation and when given
the file name, I choose the deserialization/serialization mechanism? This
would not allow me to hide the location of the file from the user, but would
still have the benefit of the storage implementation hiding the details of
the deserialization.
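
A minimal sketch of the idea in the paragraph above: a storage class that picks its deserializer from the file name it is given. Everything here is hypothetical and self-contained. `FormatAwareStorage`, `register` and `parseLine` are illustrative names, not Pig's storage API, and matching on a file-name suffix is only one assumed way of choosing the mechanism.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical storage helper (not a Pig class) that selects a deserializer by file name. */
final class FormatAwareStorage {
    // Suffix -> deserializer; insertion order decides which suffix is tried first.
    private final Map<String, Function<String, List<String>>> bySuffix = new LinkedHashMap<>();

    /** Registers a deserializer for files whose name ends with the given suffix. */
    FormatAwareStorage register(String suffix, Function<String, List<String>> deserializer) {
        bySuffix.put(suffix, deserializer);
        return this;
    }

    /** Turns one line of the named file into fields, hiding the format from the caller. */
    List<String> parseLine(String fileName, String line) {
        for (Map.Entry<String, Function<String, List<String>>> e : bySuffix.entrySet()) {
            if (fileName.endsWith(e.getKey())) {
                return e.getValue().apply(line);
            }
        }
        throw new IllegalArgumentException("no deserializer registered for " + fileName);
    }

    public static void main(String[] args) {
        FormatAwareStorage storage = new FormatAwareStorage()
                .register(".tsv", line -> Arrays.asList(line.split("\t", -1)))
                .register(".csv", line -> Arrays.asList(line.split(",", -1)));
        System.out.println(storage.parseLine("/data/clicks.tsv", "pete\t42"));
        System.out.println(storage.parseLine("/data/clicks.csv", "alan,7"));
    }
}
```

As noted above, the user still has to supply the file location; only the choice of format is hidden behind the storage class.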

For #2, yes I see, I don't want to implement the full Bag API, just want to
construct a default data bag from a Set or a List native object.

As for describe, I mean describe on a symbolic name, presumably a name
returned by a "show" command that I would also want to implement, both with
basically MySQL semantics.

Thanks, pete


On 9/22/08 5:46 PM, "Alan Gates" <ga...@yahoo-inc.com> wrote:

> 
> 
> Pete Wyckoff wrote:
>> So, my use case would be to use the Hive MetaStore and
>> Serializers/Deserializers to implement:
>> 
>> 1. a new Pig storage class based on looking up the metadata from the
>> metastore
>>   
> Our general direction in metadata is to support 4 possible types of
> metadata:
> 
> 1) none, we assume everything is uninterpreted bytes.  This works.
> 2) user specified in script.  This works.
> 3) provided by load function (for example if it's reading JSON or XML or
> whatever and can tell the schema of the data).  This is coded but not yet
> tested.
> 4) provided by an external source.  In this scenario pig would somehow
> be made aware of an external metadata source, and when it sees a load it
> would query that source for info on the file.  I think this is what you
> want.  We hope to start design work on this in the next month or two.
> Any input you have to this design is certainly welcome.
>> 2. a new Tuple/datum/bag based on native Java list, bag, integer, ...
>>   
> In the types branch we've already reimplemented what was DataAtom as
> java String, Integer, Double, etc.  So it's only Tuple and DataBag that
> you'd need to give new implementations for.  But these interfaces are
> much more complex than just a List (for example, a DataBag has to be
> able to spill to disk if it runs out of memory).  If you have different
> underlying data representations then you might benefit from a
> re-implementation.  But if you just base it on Java types, I think
> you'll end up re-inventing what we have.  Take a look at
> org.apache.pig.data.DefaultTuple and org.apache.pig.data.DefaultDataBag
> in the types branch to get an idea of what Tuples and Bags look like in
> pig now.
>> 3. add a "describe" command to the grunt shell.
>>   
> Do you mean you'd like to do describe on a file (or more generically a
> data input) instead of just on an alias?  If so, yes, I think that would
> be a great idea.
>> Here, I will assume that the "filename" passed in to bindTo in storage is
>> the name of the "table".
>> 
>> Is this a plausible implementation? I don't know much about the Pig
>> internals. And would people find such an optional feature useful?
>> 
>> Thanks, pete
>>   
> Alan.

Re: Tuple and Datum implementations

Posted by Alan Gates <ga...@yahoo-inc.com>.


Pete Wyckoff wrote:
> So, my use case would be to use the Hive MetaStore and
> Serializers/Deserializers to implement:
>
> 1. a new Pig storage class based on looking up the metadata from the
> metastore
>   
Our general direction in metadata is to support 4 possible types of 
metadata:

1) none, we assume everything is uninterpreted bytes.  This works.
2) user specified in script.  This works.
3) provided by load function (for example if it's reading JSON or XML or 
whatever and can tell the schema of the data).  This is coded but not yet 
tested.
4) provided by an external source.  In this scenario pig would somehow 
be made aware of an external metadata source, and when it sees a load it 
would query that source for info on the file.  I think this is what you 
want.  We hope to start design work on this in the next month or two.  
Any input you have to this design is certainly welcome.
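
To make the fourth case above concrete, here is a minimal sketch of a load step that asks an external metadata source about a location and falls back to the first case (uninterpreted bytes) when nothing is known. All names are hypothetical: `MetadataSource`, `InMemoryMetadataSource` and `LoadPlanner` are not Pig or Hive classes, the schema strings are arbitrary, and the in-memory map only stands in for a real external service such as a metastore.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical planner step (not a Pig class) that consults an external metadata source. */
final class LoadPlanner {
    private final MetadataSource source;

    LoadPlanner(MetadataSource source) {
        this.source = source;
    }

    /** Describes how a load of the location would be typed. */
    String describeLoad(String location) {
        return source.schemaFor(location)
                .map(schema -> location + ": " + schema)
                // No metadata found: treat the data as uninterpreted bytes.
                .orElse(location + ": uninterpreted bytes");
    }

    public static void main(String[] args) {
        MetadataSource source = new InMemoryMetadataSource()
                .register("clicks", "(user: string, count: int)");
        LoadPlanner planner = new LoadPlanner(source);
        System.out.println(planner.describeLoad("clicks"));
        System.out.println(planner.describeLoad("unknown_file"));
    }
}

/** Hypothetical interface for anything that can describe a data location. */
interface MetadataSource {
    Optional<String> schemaFor(String location);
}

/** Toy stand-in for an external metadata service. */
final class InMemoryMetadataSource implements MetadataSource {
    private final Map<String, String> schemas = new HashMap<>();

    InMemoryMetadataSource register(String location, String schema) {
        schemas.put(location, schema);
        return this;
    }

    @Override
    public Optional<String> schemaFor(String location) {
        return Optional.ofNullable(schemas.get(location));
    }
}
```

How script-specified and loader-provided schemas would rank against the external source is deliberately left out, since the design work for that had not started.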
> 2. a new Tuple/datum/bag based on native Java list, bag, integer, ...
>   
In the types branch we've already reimplemented what was DataAtom as 
java String, Integer, Double, etc.  So it's only Tuple and DataBag that 
you'd need to give new implementations for.  But these interfaces are 
much more complex than just a List (for example, a DataBag has to be 
able to spill to disk if it runs out of memory).  If you have different 
underlying data representations then you might benefit from a 
re-implementation.  But if you just base it on Java types, I think 
you'll end up re-inventing what we have.  Take a look at 
org.apache.pig.data.DefaultTuple and org.apache.pig.data.DefaultDataBag 
in the types branch to get an idea of what Tuples and Bags look like in 
pig now.
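
To illustrate why a bag is more than a List, here is a rough, self-contained sketch of a bag that spills to disk. It is not Pig's DefaultDataBag: `SpillableBag` is a hypothetical class, it holds only strings, and it uses a fixed item count as a stand-in for running out of memory, which is an assumption made purely to keep the example small.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical bag of strings that writes full batches to temp files. */
final class SpillableBag implements Iterable<String> {
    private final int limit;                        // max items held in memory
    private final List<String> inMemory = new ArrayList<>();
    private final List<Path> spillFiles = new ArrayList<>();
    private long size = 0;

    SpillableBag(int limit) {
        if (limit < 1) throw new IllegalArgumentException("limit must be >= 1");
        this.limit = limit;
    }

    void add(String item) {
        inMemory.add(item);
        size++;
        if (inMemory.size() >= limit) spill();
    }

    long size() { return size; }

    int spillCount() { return spillFiles.size(); }

    /** Writes the in-memory batch to a temp file and clears it. */
    private void spill() {
        try {
            Path file = Files.createTempFile("bag-spill", ".bin");
            file.toFile().deleteOnExit();
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(Files.newOutputStream(file)))) {
                out.writeInt(inMemory.size());
                for (String s : inMemory) out.writeUTF(s);
            }
            spillFiles.add(file);
            inMemory.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static List<String> readSpill(Path file) {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            int n = in.readInt();
            List<String> items = new ArrayList<>(n);
            for (int i = 0; i < n; i++) items.add(in.readUTF());
            return items;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Iterates spilled batches one file at a time, then the in-memory remainder. */
    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int nextFile = 0;
            private boolean memoryDone = false;
            private Iterator<String> current = Collections.emptyIterator();

            @Override
            public boolean hasNext() {
                while (!current.hasNext()) {
                    if (nextFile < spillFiles.size()) {
                        current = readSpill(spillFiles.get(nextFile++)).iterator();
                    } else if (!memoryDone) {
                        current = new ArrayList<>(inMemory).iterator();
                        memoryDone = true;
                    } else {
                        return false;
                    }
                }
                return true;
            }

            @Override
            public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                return current.next();
            }
        };
    }

    public static void main(String[] args) {
        SpillableBag bag = new SpillableBag(2);
        for (String s : new String[] {"a", "b", "c"}) bag.add(s);
        for (String s : bag) System.out.println(s);
    }
}
```

Even this toy version needs file handling, batching and a custom iterator, which gives a feel for how much a plain List-backed bag would leave out.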
> 3. add a "describe" command to the grunt shell.
>   
Do you mean you'd like to do describe on a file (or more generically a 
data input) instead of just on an alias?  If so, yes, I think that would 
be a great idea.
> Here, I will assume that the "filename" passed in to bindTo in storage is
> the name of the "table".
>
> Is this a plausible implementation? I don't know much about the Pig
> internals. And would people find such an optional feature useful?
>
> Thanks, pete
>   
Alan.

Re: Tuple and Datum implementations

Posted by Pete Wyckoff <pw...@facebook.com>.

So, my use case would be to use the Hive MetaStore and
Serializers/Deserializers to implement:

1. a new Pig storage class based on looking up the metadata from the
metastore
2. a new Tuple/datum/bag based on native Java list, bag, integer, ...
3. add a "describe" command to the grunt shell.

Here, I will assume that the "filename" passed in to bindTo in storage is
the name of the "table".

Is this a plausible implementation? I don't know much about the Pig
internals. And would people find such an optional feature useful?

Thanks, pete

On 9/19/08 5:39 PM, "Olga Natkovich" <ol...@yahoo-inc.com> wrote:

> In types branch, we are doing a couple of things:
> 
> (1) Using native java types for scalar types and for DataMap;
> (2) Making Tuple and Bag interfaces so that new implementations can be
> written.
> 
> Olga
> 
>> -----Original Message-----
>> From: Pete Wyckoff [mailto:pwyckoff@facebook.com]
>> Sent: Friday, September 19, 2008 5:35 PM
>> To: pig-dev@incubator.apache.org
>> Subject: Tuple and Datum implementations
>> 
>> 
>> Is it possible to implement these using an object composed of
>> List, Commons bag, Integer, Double, String, Boolean and Byte?
>> (I will also have type information, so I don't need to endure
>> the cost of reflection.)
>> 
>> Also, is it possible to implement them using a record IO
>> Object and reflection?
>> 
>> Thanks, pete
>> 
>>

RE: Tuple and Datum implementations

Posted by Olga Natkovich <ol...@yahoo-inc.com>.

In types branch, we are doing a couple of things:

(1) Using native java types for scalar types and for DataMap;
(2) Making Tuple and Bag interfaces so that new implementations can be
written.
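
As a rough illustration of the two points above, the sketch below uses native Java objects for scalar fields and puts a tuple behind an interface so that another implementation could be swapped in. `SimpleTuple` and `ListTuple` are hypothetical names, and the interface is far smaller than Pig's actual Tuple interface in the types branch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical List-backed tuple whose fields are plain Java objects. */
final class ListTuple implements SimpleTuple {
    private final List<Object> fields;

    /** Copies the values so later changes to the caller's list do not leak in. */
    ListTuple(List<?> values) {
        this.fields = new ArrayList<>(values);
    }

    @Override
    public int size() { return fields.size(); }

    @Override
    public Object get(int fieldNum) { return fields.get(fieldNum); }

    @Override
    public void set(int fieldNum, Object value) { fields.set(fieldNum, value); }

    @Override
    public List<Object> getAll() { return Collections.unmodifiableList(fields); }

    public static void main(String[] args) {
        SimpleTuple t = new ListTuple(Arrays.asList("pete", 42, 3.5, true));
        t.set(1, 43);
        System.out.println(t.getAll());
    }
}

/** Hypothetical minimal tuple contract; an assumption for illustration only. */
interface SimpleTuple {
    int size();
    Object get(int fieldNum);
    void set(int fieldNum, Object value);
    List<Object> getAll();
}
```

Because the fields are ordinary String, Integer, Double and Boolean objects, a caller that already has type information can cast directly without reflection.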

Olga

> -----Original Message-----
> From: Pete Wyckoff [mailto:pwyckoff@facebook.com] 
> Sent: Friday, September 19, 2008 5:35 PM
> To: pig-dev@incubator.apache.org
> Subject: Tuple and Datum implementations
> 
> 
> Is it possible to implement these using an object composed of 
> List, Commons bag, Integer, Double, String, Boolean and Byte? 
> (I will also have type information, so I don't need to endure 
> the cost of reflection.)
> 
> Also, is it possible to implement them using a record IO 
> Object and reflection?
> 
> Thanks, pete
> 
>