Posted to user@pig.apache.org by Costin Leau <co...@gmail.com> on 2013/12/19 15:08:38 UTC

accessing the schema within a LoadFunc

Hi,

I'm trying to get a hold of the schema specified for a loader through 'AS' using Apache Pig 0.12:

A = LOAD 'pig/tupleartists' USING MyStorage() AS (name: chararray, links: (url:chararray, picture:chararray));
B = FOREACH A GENERATE name, links.url;
DUMP B;

1.
My loader implements LoadPushDown#pushProjection(), which does not seem to be called at all (tried breakpoints,
System.out - nothing). The API docs and this thread [1] suggest it should be called, yet in my tests (using a local
PigServer) this does not happen. Am I missing something?

2.
As an alternative, I'm loading the POStore objects (from pig.map.store and pig.reduce.store) but the schema that I'm
getting is incorrect, namely:
"(name: chararray, url: chararray)" without any mention of the "links" field. Is there any way to recreate/retrieve the
actual schema defined by the user or at least determine which fields are nested ("links.url") as opposed to the top-level
ones ("name")?

Thanks,
-- 
Costin

Re: accessing the schema within a LoadFunc

Posted by Costin Leau <co...@gmail.com>.
Raised https://issues.apache.org/jira/browse/PIG-3646
Until it gets fixed though, are there any Pig internal APIs that I can use
to get a hold of the schema? As I mentioned in my initial email, I can't
seem to find a way to get access to the full declaration - even the POStore
objects contain only the FOREACH information (which would be helpful) but in
a dereferenced form, which is completely useless - i.e. links.url is returned
as url.
I'm assuming this information is available somewhere in Pig and is
reconstructed at some point, yet I'm unable to find it.

Any pointers would be helpful (no matter how tied to Pig internals they are).

Cheers!

P.S. I tried looking into the pig.script variable but for some reason it
seems to be empty in my case...
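
To make the dereferencing problem concrete, here is a minimal standalone sketch in plain Java. It does not use any Pig API. It assumes the loader already knows the declared schema by some other route (the thread does not establish that Pig exposes it), and the class and method names are hypothetical. Given a bare name such as "url", it lists every dotted path in the declared schema that ends in that name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Pig. */
final class DereferenceSketch {

    /** Collects every dotted path in the declared schema whose last segment equals leafName. */
    static List<String> candidatePaths(Map<String, Object> declared, String leafName) {
        List<String> matches = new ArrayList<>();
        collect(declared, "", leafName, matches);
        return matches;
    }

    private static void collect(Map<String, Object> fields, String prefix,
                                String leafName, List<String> out) {
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            String path = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            if (e.getKey().equals(leafName)) {
                out.add(path);
            }
            // A nested map stands in for a tuple field; descend into it.
            if (e.getValue() instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> nested = (Map<String, Object>) e.getValue();
                collect(nested, path, leafName, out);
            }
        }
    }

    /** The schema from the LOAD statement in this thread, written out by hand. */
    static Map<String, Object> exampleSchema() {
        Map<String, Object> links = new LinkedHashMap<>();
        links.put("url", "chararray");
        links.put("picture", "chararray");
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("name", "chararray");
        root.put("links", links);
        return root;
    }
}
```

With the example schema, looking up "url" yields the single candidate "links.url" and "name" yields "name". This is only a heuristic: if two tuples declare a field with the same name, the lookup returns several candidates and cannot tell them apart.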


On Sun, Dec 29, 2013 at 6:40 AM, Cheolsoo Park <pi...@gmail.com> wrote:

> Like Alan said in the thread that you're referring to, user-defined schema
> in the as-clause is not available within a LoadFunc. HBaseStorage is
> different since its schema is passed via a constructor parameter. As far as
> I know, most popular Pig storages do not require users to define schema in
> a load statement. For example, HCatLoader gets it from Hive metastore,
> AvroStorage gets it from Avro file, etc.
>
> But it shouldn't be hard to change this, and contribution is welcome! Feel
> free to file a jira. Thanks!

Re: accessing the schema within a LoadFunc

Posted by Cheolsoo Park <pi...@gmail.com>.
Like Alan said in the thread that you're referring to, user-defined schema
in the as-clause is not available within a LoadFunc. HBaseStorage is
different since its schema is passed via a constructor parameter. As far as
I know, most popular Pig storages do not require users to define schema in
a load statement. For example, HCatLoader gets it from Hive metastore,
AvroStorage gets it from Avro file, etc.

But it shouldn't be hard to change this, and contribution is welcome! Feel
free to file a jira. Thanks!
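
As an illustration of the constructor-parameter approach mentioned above, here is a minimal standalone sketch in plain Java. It assumes a loader that receives its schema as a string argument, for example MyStorage('name:chararray, links:(url:chararray, picture:chararray)'). The parser and its names are hypothetical and are not Pig APIs; Pig ships its own schema parsing utilities, which would likely be preferable in a real loader. The sketch only handles simple types and nested tuples.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Pig: parses a schema string handed to a loader's constructor. */
final class SchemaSketch {

    /** One field: a name, a type, and (for tuples) nested child fields. */
    static final class Field {
        final String name;
        final String type;
        final List<Field> children = new ArrayList<>();
        Field(String name, String type) { this.name = name; this.type = type; }
    }

    private final String src;
    private int pos;

    private SchemaSketch(String src) { this.src = src; }

    /** Parses e.g. "name:chararray, links:(url:chararray, picture:chararray)". */
    static List<Field> parse(String schema) {
        SchemaSketch p = new SchemaSketch(schema);
        List<Field> fields = p.fieldList();
        p.skipSpaces();
        if (p.pos != schema.length()) {
            throw new IllegalArgumentException("Unexpected input at " + p.pos);
        }
        return fields;
    }

    /** fieldList := field (',' field)* */
    private List<Field> fieldList() {
        List<Field> fields = new ArrayList<>();
        do {
            fields.add(field());
            skipSpaces();
        } while (consume(','));
        return fields;
    }

    /** field := name ':' ( type | '(' fieldList ')' ) */
    private Field field() {
        String name = identifier();
        skipSpaces();
        if (!consume(':')) throw new IllegalArgumentException("Expected ':' at " + pos);
        skipSpaces();
        if (consume('(')) {
            Field tuple = new Field(name, "tuple");
            tuple.children.addAll(fieldList());
            skipSpaces();
            if (!consume(')')) throw new IllegalArgumentException("Expected ')' at " + pos);
            return tuple;
        }
        return new Field(name, identifier());
    }

    private String identifier() {
        skipSpaces();
        int start = pos;
        while (pos < src.length()
                && (Character.isLetterOrDigit(src.charAt(pos)) || src.charAt(pos) == '_')) {
            pos++;
        }
        if (start == pos) throw new IllegalArgumentException("Expected identifier at " + pos);
        return src.substring(start, pos);
    }

    private boolean consume(char c) {
        if (pos < src.length() && src.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    /** Resolves a dotted path such as "links.url"; returns null if no such field exists. */
    static Field resolve(List<Field> fields, String path) {
        Field current = null;
        List<Field> scope = fields;
        for (String part : path.split("\\.")) {
            current = null;
            for (Field f : scope) {
                if (f.name.equals(part)) { current = f; break; }
            }
            if (current == null) return null;
            scope = current.children;
        }
        return current;
    }
}
```

With the schema from this thread, resolve(fields, "links.url") finds the nested field while resolve(fields, "url") returns null, so the loader keeps the distinction between nested and top-level fields that the AS clause alone does not give it.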

On Tue, Dec 24, 2013 at 1:41 AM, Costin Leau <co...@gmail.com> wrote:

> Thanks for the pointers regarding 1).
>
> Any ideas on 2) - namely why only the dereferenced schema is available and
> how to get a hold of the actual user declaration?
>
> Cheers and Merry Christmas!

Re: accessing the schema within a LoadFunc

Posted by Costin Leau <co...@gmail.com>.
Thanks for the pointers regarding 1).

Any ideas on 2) - namely why only the dereferenced schema is available and how to get a hold of the actual user declaration?

Cheers and Merry Christmas!

On 24/12/2013 1:05 AM, Cheolsoo Park wrote:
> As for #1, pushProjection() is called only if it's applicable:
> https://github.com/apache/pig/blob/trunk/src/org/apache/pig/newplan/optimizer/PlanOptimizer.java#L108
>
> Set a breakpoint in ColumnMapKeyPrune.java and see whether check() returns
> true or false:
> https://github.com/apache/pig/blob/trunk/src/org/apache/pig/newplan/logical/rules/ColumnMapKeyPrune.java#L85
>
> It probably returns false in your case, and that's why your
> pushProjection() is never called.

-- 
Costin

Re: accessing the schema within a LoadFunc

Posted by Cheolsoo Park <pi...@gmail.com>.
As for #1, pushProjection() is called only if it's applicable:
https://github.com/apache/pig/blob/trunk/src/org/apache/pig/newplan/optimizer/PlanOptimizer.java#L108

Set a breakpoint in ColumnMapKeyPrune.java and see whether check() returns
true or false:
https://github.com/apache/pig/blob/trunk/src/org/apache/pig/newplan/logical/rules/ColumnMapKeyPrune.java#L85

It probably returns false in your case, and that's why your
pushProjection() is never called.


On Thu, Dec 19, 2013 at 6:09 AM, Costin Leau <co...@gmail.com> wrote:

> Forgot to specify the aforementioned thread [1]
>
> [1] http://www.mail-archive.com/user@pig.apache.org/msg06285.html

Re: accessing the schema within a LoadFunc

Posted by Costin Leau <co...@gmail.com>.
Forgot to specify the aforementioned thread [1]

[1] http://www.mail-archive.com/user@pig.apache.org/msg06285.html

-- 
Costin