Posted to user@pig.apache.org by Rajgopal Vaithiyanathan <ra...@gmail.com> on 2012/04/13 10:32:06 UTC

Execution of outputSchema

Where will the outputSchema be executed? On the client, or as part of a
MapReduce job?

I've planned to keep the output schema as an XML file and let the outputSchema
method read it and generate the Schema object from the XML.

Where should I place this XML file? On the client, or on HDFS?

:)
Raj

Re: Execution of outputSchema

Posted by Jonathan Coveney <jc...@gmail.com>.
Oh, one last thing: why are you serializing the Schema to XML? Are you
adding any new metadata, or just enough to reconstruct it on the other
side? If it's the latter, you can just call a Schema's toString, and if you
trim the {}'s at the front and end, you can use
org.apache.pig.impl.util.Utils.getSchemaFromString(String) to reconstruct
it. No need to do anything more complex (if that's all you want/need to do).
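A sketch of that round trip. Only the brace-trimming helper is runnable without the Pig jar on the classpath, so the Pig calls appear as comments:

```java
public class SchemaRoundTrip {
    // Schema.toString() wraps the field list in braces, e.g.
    // "{name: chararray,qty: int}", while getSchemaFromString expects
    // the bare field list, so strip one outer pair if present.
    static String trimBraces(String s) {
        s = s.trim();
        if (s.startsWith("{") && s.endsWith("}")) {
            return s.substring(1, s.length() - 1);
        }
        return s;
    }

    // With the Pig jar on the classpath:
    //   String wire = schema.toString();
    //   Schema copy = org.apache.pig.impl.util.Utils
    //       .getSchemaFromString(trimBraces(wire));
}
```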

2012/4/13 Jonathan Coveney <jc...@gmail.com>

> Raj,
>
> If you serialize the inputSchema, getting the outputSchema is as easy as
> just running outputSchema on it. Either way, I looked into it and it's a
> feature in trunk, it's not even in 0.10...so yeah, if this is something you
> need to do you're going to have to cook up another solution.
>
> First, I would see whether this patch can be backported to 0.10:
> https://issues.apache.org/jira/browse/PIG-2337 if not, then you can
> leverage the work they did to make a unique signature.
>
> Be wary of using the UDFContext... the name is misleading. It is actually
> shared between UDFs, and isn't a safe place to put things (without jumping
> through hoops). Another issue that you have to contend with is multiple
> instances of your UDF, especially multiple instances with different input.
> Even if you push the data to Hadoop or the distributed cache or anywhere,
> if you have 3 instances of the same UDF with different input schemas (and
> thus potentially different output Schemas), how do you know which instances
> of the UDF on the backend should grab which XML files?
>
> Lastly, why do you need this information on the backend? There may be
> another way to do what you're trying to do.
>
>
> 2012/4/13 Rajgopal Vaithiyanathan <ra...@gmail.com>
>
>> Thanks Jonathan,
>>
>>
>> But the question is not about serializing the input schema. However, I'm
>> using 0.9.2 and I don't see getInputSchema in EvalFunc. Please tell me how
>> to use it. Right now, I'm serializing it using UDFContext.
>>
>> The question was:
>> I've implemented outputSchema this way:
>>
>>
>>    public Schema outputSchema(Schema input) {
>>
>>        if (input.getAliases().contains("sales")) {
>>            return generateOutputSchemaFrom("sales.xml");
>>        }
>>
>>        else if (input.getAliases().contains("others")) {
>>            return generateOutputSchemaFrom("others.xml");
>>        }
>>
>>        // no matching alias: pass the input schema through unchanged
>>        return input;
>>    }
>>
>> The question was: where should I place these *sales.xml* and *others.xml* files?
>>
>>
>>
>> On Fri, Apr 13, 2012 at 2:08 PM, Jonathan Coveney <jcoveney@gmail.com
>> >wrote:
>>
>> > Raj,
>> >
>> > The outputSchema is executed on the front end[1] (and beware: it can be
>> > called many times, and beyond that, UDFs are instantiated many times on
>> the
>> > front end).
>> >
>> > What is your goal with serializing the output schema to XML? What are
>> you
>> > trying to do? I should also mention that EvalFunc now has
>> > "getInputSchema()," as it serializes the input schema for you... but
>> yeah,
>> > some context around what you want to do is key.
>> >
>> > [1] front end meaning the client side where the script is parsed and the
>> > job jar created
>> >
>> > 2012/4/13 Rajgopal Vaithiyanathan <ra...@gmail.com>
>> >
>> > > Where will the outputSchema be executed? On the client, or as part of
>> > > a MapReduce job?
>> > >
>> > > I've planned to keep the output schema as an XML file and let the
>> > > outputSchema method read it and generate the Schema object from the
>> > > XML.
>> > >
>> > > Where should I place this XML file? On the client, or on HDFS?
>> > >
>> > > :)
>> > > Raj
>> > >
>> >
>>
>>
>>
>> --
>> Thanks and Regards,
>> Rajgopal Vaithiyanathan.
>>
>
>

Re: Execution of outputSchema

Posted by Jonathan Coveney <jc...@gmail.com>.
Raj,

If you serialize the inputSchema, getting the outputSchema is as easy as
just running outputSchema on it. Either way, I looked into it and it's a
feature in trunk, it's not even in 0.10...so yeah, if this is something you
need to do you're going to have to cook up another solution.

First, I would see whether this patch can be backported to 0.10:
https://issues.apache.org/jira/browse/PIG-2337 if not, then you can
leverage the work they did to make a unique signature.

Be wary of using the UDFContext... the name is misleading. It is actually
shared between UDFs, and isn't a safe place to put things (without jumping
through hoops). Another issue that you have to contend with is multiple
instances of your UDF, especially multiple instances with different input.
Even if you push the data to Hadoop or the distributed cache or anywhere,
if you have 3 instances of the same UDF with different input schemas (and
thus potentially different output Schemas), how do you know which instances
of the UDF on the backend should grab which XML files?
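One hedged sketch of those "hoops": namespace whatever goes into UDFContext by a per-instance signature, along the lines of LoadFunc.setUDFContextSignature (EvalFunc only gains an equivalent via the trunk work referenced above). The Pig calls need the Pig jar, so they appear as comments, and keyFor is a hypothetical helper:

```java
public class UdfContextKeying {
    // Hypothetical helper: build a Properties key unique to one use of
    // a UDF in a script, so two instances with different input schemas
    // cannot overwrite each other's entries in the shared UDFContext.
    static String keyFor(String udfClass, String signature) {
        return udfClass + "#" + signature;
    }

    // With the Pig jar on the classpath and a signature callback
    // (as in the patch above), usage would look roughly like:
    //
    //   Properties p = UDFContext.getUDFContext()
    //       .getUDFProperties(MyUdf.class, new String[] { signature });
    //   p.setProperty(keyFor("MyUdf", signature), "sales.xml");
}
```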

Lastly, why do you need this information on the backend? There may be
another way to do what you're trying to do.

2012/4/13 Rajgopal Vaithiyanathan <ra...@gmail.com>

> Thanks Jonathan,
>
>
> But the question is not about serializing the input schema. However, I'm
> using 0.9.2 and I don't see getInputSchema in EvalFunc. Please tell me how
> to use it. Right now, I'm serializing it using UDFContext.
>
> The question was:
> I've implemented outputSchema this way:
>
>
>    public Schema outputSchema(Schema input) {
>
>        if (input.getAliases().contains("sales")) {
>            return generateOutputSchemaFrom("sales.xml");
>        }
>
>        else if (input.getAliases().contains("others")) {
>            return generateOutputSchemaFrom("others.xml");
>        }
>
>        // no matching alias: pass the input schema through unchanged
>        return input;
>    }
>
> The question was: where should I place these *sales.xml* and *others.xml* files?
>
>
>
> On Fri, Apr 13, 2012 at 2:08 PM, Jonathan Coveney <jcoveney@gmail.com
> >wrote:
>
> > Raj,
> >
> > The outputSchema is executed on the front end[1] (and beware: it can be
> > called many times, and beyond that, UDFs are instantiated many times on
> the
> > front end).
> >
> > What is your goal with serializing the output schema to XML? What are you
> > trying to do? I should also mention that EvalFunc now has
> > "getInputSchema()," as it serializes the input schema for you... but
> yeah,
> > some context around what you want to do is key.
> >
> > [1] front end meaning the client side where the script is parsed and the
> > job jar created
> >
> > 2012/4/13 Rajgopal Vaithiyanathan <ra...@gmail.com>
> >
> > > Where will the outputSchema be executed? On the client, or as part of
> > > a MapReduce job?
> > >
> > > I've planned to keep the output schema as an XML file and let the
> > > outputSchema method read it and generate the Schema object from the
> > > XML.
> > >
> > > Where should I place this XML file? On the client, or on HDFS?
> > >
> > > :)
> > > Raj
> > >
> >
>
>
>
> --
> Thanks and Regards,
> Rajgopal Vaithiyanathan.
>

Re: Execution of outputSchema

Posted by Rajgopal Vaithiyanathan <ra...@gmail.com>.
Thanks Jonathan,


But the question is not about serializing the input schema. However, I'm using
0.9.2 and I don't see getInputSchema in EvalFunc. Please tell me how to use
it. Right now, I'm serializing it using UDFContext.

The question was:
I've implemented outputSchema this way:


    public Schema outputSchema(Schema input) {

        if (input.getAliases().contains("sales")) {
            return generateOutputSchemaFrom("sales.xml");
        }

        else if (input.getAliases().contains("others")) {
            return generateOutputSchemaFrom("others.xml");
        }

        // no matching alias: pass the input schema through unchanged
        return input;
    }

The question was: where should I place these *sales.xml* and *others.xml* files?
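One possible answer, given that outputSchema runs on the front end (per the reply quoted below): bundle the XML files into the UDF jar and read them from the client-side classpath rather than from HDFS. A minimal sketch; nothing in it is Pig-specific, and the file names follow the snippet above:

```java
import java.io.InputStream;

public class SchemaFileLoader {
    // Looks a mapping file up on the classpath (e.g. packed into the
    // UDF jar next to the class). Since outputSchema() runs on the
    // client, plain classpath resources are reachable there; no HDFS
    // round trip is needed.
    static InputStream open(String name) {
        return SchemaFileLoader.class.getClassLoader().getResourceAsStream(name);
    }
}
```

Note that getResourceAsStream returns null for a file that is not on the classpath, which is worth checking before handing the stream to an XML parser.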



On Fri, Apr 13, 2012 at 2:08 PM, Jonathan Coveney <jc...@gmail.com>wrote:

> Raj,
>
> The outputSchema is executed on the front end[1] (and beware: it can be
> called many times, and beyond that, UDFs are instantiated many times on the
> front end).
>
> What is your goal with serializing the output schema to XML? What are you
> trying to do? I should also mention that EvalFunc now has
> "getInputSchema()," as it serializes the input schema for you... but yeah,
> some context around what you want to do is key.
>
> [1] front end meaning the client side where the script is parsed and the
> job jar created
>
> 2012/4/13 Rajgopal Vaithiyanathan <ra...@gmail.com>
>
> > Where will the outputSchema be executed? On the client, or as part of a
> > MapReduce job?
> >
> > I've planned to keep the output schema as an XML file and let the
> > outputSchema method read it and generate the Schema object from the XML.
> >
> > Where should I place this XML file? On the client, or on HDFS?
> >
> > :)
> > Raj
> >
>



-- 
Thanks and Regards,
Rajgopal Vaithiyanathan.

Re: Execution of outputSchema

Posted by Jonathan Coveney <jc...@gmail.com>.
Raj,

The outputSchema is executed on the front end[1] (and beware: it can be
called many times, and beyond that, UDFs are instantiated many times on the
front end).

What is your goal with serializing the output schema to XML? What are you
trying to do? I should also mention that EvalFunc now has
"getInputSchema()," as it serializes the input schema for you... but yeah,
some context around what you want to do is key.

[1] front end meaning the client side where the script is parsed and the
job jar created

2012/4/13 Rajgopal Vaithiyanathan <ra...@gmail.com>

> Where will the outputSchema be executed? On the client, or as part of a
> MapReduce job?
>
> I've planned to keep the output schema as an XML file and let the outputSchema
> method read it and generate the Schema object from the XML.
>
> Where should I place this XML file? On the client, or on HDFS?
>
> :)
> Raj
>