Posted to dev@pig.apache.org by Keren Ouaknine <ke...@gmail.com> on 2013/08/08 23:42:35 UTC

schema definition and subschema

Hi,

A schema in Pig (LogicalSchema.java) is defined as an array list of
LogicalFieldSchema whose class members are:
- String alias
- byte type
- long uid
- LogicalSchema schema

I am wondering why LogicalFieldSchema contains a LogicalSchema member.
My guess so far is that there is perhaps a subschema used by some operators.
I tried to figure out which operators might be using it and categorized the
main ones as follows:

==> SCHEMA IS DEFINED BY INPUT SCHEMA ONLY
LOAD
DISTINCT
FILTER
ORDER BY
SPLIT

==> SCHEMA IS DEFINED BY THE LIST OF "AS" IN THE FOREACH STATEMENT
FOREACH

==> IF SCHEMA CAN BE DEFINED (SAME LENGTH AND CASTABLE) OR UNKNOWN SCHEMA
UNION

==> SCHEMA IS DEFINED BY THE CONCATENATION OF THE TWO INPUT SCHEMAS (+
ADDING THE ALIAS TO THE FIELD NAME x ==> A::x)
JOIN
*Are the two inputs here considered subschemas?*

==> SCHEMA: (key_to_order_by, bag)
GROUP

Thanks,
Keren

--
Keren Ouaknine
Web: www.kereno.com

Re: schema definition and subschema

Posted by Cheolsoo Park <pi...@gmail.com>.
Hi Keren,

Hope this is not too late.

>> I am wondering why LogicalFieldSchema contains a LogicalSchema member?

That's for nested tuple fields. For example, consider "( i:int,
t:tuple(j:int) )". The field t:tuple needs to contain a list of field
schemas, so you need a LogicalSchema. Here is how you can verify it.

1) Debug Pig's main class in Eclipse.
2) Set a breakpoint in the LogicalFieldSchema constructor.
3) Run "a = load '/dev/null' as (i:int, t:tuple(j:int));" in the Grunt shell.
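
To make the nesting concrete, here is a minimal, self-contained sketch of the
idea. It does not use the real Pig classes: Schema and FieldSchema below are
simplified, hypothetical stand-ins that only mirror the four members listed in
the question, and the type codes are placeholder values chosen for this
sketch. It builds the schema "(i:int, t:tuple(j:int))" by giving the field t
its own inner schema.

```java
import java.util.ArrayList;
import java.util.List;

public class NestedSchemaSketch {
    // Placeholder type codes for this sketch only (not Pig's actual constants).
    static final byte INTEGER = 1;
    static final byte TUPLE = 2;

    /** Hypothetical stand-in for LogicalSchema: an ordered list of fields. */
    static class Schema {
        final List<FieldSchema> fields = new ArrayList<>();

        Schema add(FieldSchema field) {
            fields.add(field);
            return this;
        }

        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder("(");
            for (int i = 0; i < fields.size(); i++) {
                if (i > 0) {
                    sb.append(", ");
                }
                sb.append(fields.get(i));
            }
            return sb.append(")").toString();
        }
    }

    /** Hypothetical stand-in for LogicalFieldSchema with the same four members. */
    static class FieldSchema {
        final String alias;
        final byte type;
        final long uid;
        final Schema schema; // inner schema; null for simple types such as int

        FieldSchema(String alias, byte type, long uid, Schema schema) {
            this.alias = alias;
            this.type = type;
            this.uid = uid;
            this.schema = schema;
        }

        @Override
        public String toString() {
            if (type == TUPLE && schema != null) {
                // A tuple field prints its nested schema recursively.
                return alias + ":tuple" + schema;
            }
            return alias + ":int";
        }
    }

    /** Builds the example schema: (i:int, t:tuple(j:int)). */
    static Schema example() {
        Schema inner = new Schema().add(new FieldSchema("j", INTEGER, 2L, null));
        return new Schema()
                .add(new FieldSchema("i", INTEGER, 0L, null))
                .add(new FieldSchema("t", TUPLE, 1L, inner));
    }

    public static void main(String[] args) {
        System.out.println(example()); // prints (i:int, t:tuple(j:int))
    }
}
```

In this sketch only the tuple field carries a non-null inner schema, which is
the recursive structure the schema member exists to represent.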

Thanks,
Cheolsoo




On Thu, Aug 8, 2013 at 2:42 PM, Keren Ouaknine <ke...@gmail.com> wrote:

> Hi,
>
> A schema in Pig (LogicalSchema.java) is defined as an array list of
> LogicalFieldSchema whose class members are:
> - String alias
> - byte type
> - long uid
> - LogicalSchema schema
>
> I am wondering why LogicalFieldSchema contains a LogicalSchema member.
> My guess so far is that there is perhaps a subschema used by some operators.
> I tried to figure out which operators might be using it and categorized the
> main ones as follows:
>
> ==> SCHEMA IS DEFINED BY INPUT SCHEMA ONLY
> LOAD
> DISTINCT
> FILTER
> ORDER BY
> SPLIT
>
> ==> SCHEMA IS DEFINED BY THE LIST OF "AS" IN THE FOREACH STATEMENT
> FOREACH
>
> ==> IF SCHEMA CAN BE DEFINED (SAME LENGTH AND CASTABLE) OR UNKNOWN SCHEMA
> UNION
>
> ==> SCHEMA IS DEFINED BY THE CONCATENATION OF THE TWO INPUT SCHEMAS (+
> ADDING THE ALIAS TO THE FIELD NAME x ==> A::x)
> JOIN
> *Are the two inputs here considered subschemas?*
>
> ==> SCHEMA: (key_to_order_by, bag)
> GROUP
>
> Thanks,
> Keren
>
> --
> Keren Ouaknine
> Web: www.kereno.com
>
