Posted to user@pig.apache.org by Prashant Kommireddi <pr...@gmail.com> on 2011/11/24 22:38:18 UTC

Pig Data type question

I have a question regarding the pig data types.

If I have a UDF, say 'CustomUDF' and I do something like this:

REGISTER 'foo.jar';

A = LOAD '/shared/a.dat';

What would be the difference in the data types for UDF arguments between
-->

Case 1 : B = FOREACH A GENERATE CustomUDF(TOTUPLE(*), 'input'); AND
Case 2 : B = FOREACH A GENERATE CustomUDF(*, 'input');

I am sure Case 1 is (tuple, chararray). Can anyone let me know the data
type for Case 2 arguments?

Thanks,
Prashant

Re: Pig Data type question

Posted by Prashant Kommireddi <pr...@gmail.com>.

> Yeah, your use case makes sense, though have you done any benchmarking to
> see how significantly eliminating the TOTUPLE call will benefit
> performance? I'd be curious if it was so significant.

Yes, it's an O(n) vs. O(1) operation: TOTUPLE is O(n), whereas UDF(*, 'arg')
is O(1). Basically, the UDF reads the String argument 'arg' and looks up the
corresponding field in the Tuple using a HashMap that stores the
'arg'-to-index mapping.

> Also, if the string you're passing is just a static argument, it's probably
> cleaner to put it in the constructor, and then use a DEFINE statement to
> instantiate it.

Unfortunately, the String argument is not static.

> But yeah, I mean, even if Pig supported this functionality more cleanly,
> there is a problem matching TOTUPLE(*) and * because * could just be a
> simple Tuple, and there would be ambiguity there. I would test to see if
> there is actually a material benefit to doing this.

If * were a Tuple, Pig should invoke UDF(tuple, chararray). Notice that Pig
treats BETTERUDF(*, 'arg') as a single argument: a Tuple of fields containing
the values from *, followed by 'arg' as the last value. If * were itself a
Tuple, Pig should treat the call as UDF(tuple, 'arg').
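
The name-to-index lookup described above can be sketched in plain Java. This is only a model under stated assumptions, not the actual UDF: a List<Object> stands in for Pig's Tuple, the class name FieldLookup is hypothetical, and the field names are assumed to be known up front (a real UDF would have to obtain them from the relation's schema).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model; List<Object> stands in for org.apache.pig.data.Tuple.
class FieldLookup {
    // Hypothetical field-name -> position mapping, built once.
    private final Map<String, Integer> indexByName = new HashMap<String, Integer>();

    FieldLookup(List<String> fieldNames) {
        for (int i = 0; i < fieldNames.size(); i++) {
            indexByName.put(fieldNames.get(i), i);
        }
    }

    // Models exec() for UDF(*, 'arg'): the input is (f1, ..., fN, 'arg'),
    // so the name is the last element and the fields precede it.
    Object exec(List<Object> input) {
        String name = (String) input.get(input.size() - 1);
        Integer index = indexByName.get(name);
        if (index == null || index >= input.size() - 1) {
            return null; // unknown field name
        }
        return input.get(index); // constant-time access, no re-wrapping of the row
    }

    public static void main(String[] args) {
        FieldLookup udf = new FieldLookup(Arrays.asList("id", "name", "score"));
        System.out.println(udf.exec(Arrays.<Object>asList(7, "ann", 9.5, "name"))); // ann
    }
}
```

The lookup itself never copies the row, which is the point of the O(1) argument; whether that matters in practice is what the benchmarking question is about.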

On Fri, Nov 25, 2011 at 1:15 PM, Jonathan Coveney <jc...@gmail.com> wrote:

> Yeah, your use case makes sense, though have you done any benchmarking to
> see how significantly eliminating the TOTUPLE call will benefit
> performance? I'd be curious if it was so significant.
>
> Also, if the string you're passing is just a static argument, it's probably
> cleaner to put it in the constructor, and then use a DEFINE statement to
> instantiate it.
>
> But yeah, I mean, even if Pig supported this functionality more cleanly,
> there is a problem matching TOTUPLE(*) and * because * could just be a
> simple Tuple, and there would be ambiguity there. I would test to see if
> there is actually a material benefit to doing this.

Re: Pig Data type question

Posted by Jonathan Coveney <jc...@gmail.com>.

Yeah, your use case makes sense, though have you done any benchmarking to
see how significantly eliminating the TOTUPLE call will benefit
performance? I'd be curious if it was so significant.

Also, if the string you're passing is just a static argument, it's probably
cleaner to put it in the constructor, and then use a DEFINE statement to
instantiate it.

But yeah, I mean, even if Pig supported this functionality more cleanly,
there is a problem matching TOTUPLE(*) and * because * could just be a
simple Tuple, and there would be ambiguity there. I would test to see if
there is actually a material benefit to doing this.

2011/11/25 Prashant Kommireddi <pr...@gmail.com>

> In the case where arguments are UDF(TOTUPLE(*), 'arg'), the EvalFunc
> actually receives a single Tuple with 2 elements - first one being a Tuple
> and the 2nd a chararray. In case the arguments were UDF(*, 'arg') the
> EvalFunc receives a Tuple with multiple fields (* and 'arg' being the last
> element in that Tuple). I feel Pig should be able to distinguish between
> the 2 cases here.
>
> To answer your question,
>
> > what if * is in fact just a Tuple of something? So you have
> >
> > TOTUPLE(tuple), 'chararray'
> > tuple, 'chararray'
> >
> > which one should they match? The one intended for TOTUPLE(*), or the one
> > intended for just *? Because both would match just a tuple.
>
> It should match UDF(Tuple, chararray). It's for the UDF to handle the inner
> elements of Tuple but getArgToFuncMapping() should be able to invoke the
> right UDF, at least.
>
> The reason I am trying to overload the function is because I have already
> exposed UDF(TOTUPLE(*), chararray) to my users. I have now come up with a
> better UDF - BETTERUDF(*, 'arg') in terms of performance ( avoiding a
> TOTUPLE call ) and want users to be able to just change the arguments they
> pass to their UDF and be able to use the new one.
>
> The FloatAbs function is intended for Scalar values, so it makes sense not
> to wrap it in a TOTUPLE.

Re: Pig Data type question

Posted by Prashant Kommireddi <pr...@gmail.com>.

When the arguments are UDF(TOTUPLE(*), 'arg'), the EvalFunc actually
receives a single Tuple with 2 elements: the first a Tuple and the second a
chararray. When the arguments are UDF(*, 'arg'), the EvalFunc receives a
Tuple with multiple fields (the fields of *, with 'arg' as the last
element). I feel Pig should be able to distinguish between the 2 cases here.

To answer your question,

> what if * is in fact just a Tuple of something? So you have
>
> TOTUPLE(tuple), 'chararray'
> tuple, 'chararray'
>
> which one should they match? The one intended for TOTUPLE(*), or the one
> intended for just *? Because both would match just a tuple.

It should match UDF(Tuple, chararray). It's for the UDF to handle the inner
elements of the Tuple, but getArgToFuncMapping() should be able to invoke the
right UDF, at least.

The reason I am trying to overload the function is that I have already
exposed UDF(TOTUPLE(*), chararray) to my users. I have now come up with a
better UDF, BETTERUDF(*, 'arg'), in terms of performance (avoiding a
TOTUPLE call), and I want users to be able to just change the arguments they
pass to their UDF and be able to use the new one.

The FloatAbs function is intended for scalar values, so it makes sense not
to wrap it in a TOTUPLE.

On Fri, Nov 25, 2011 at 11:48 AM, Jonathan Coveney <jc...@gmail.com> wrote:

> I believe that this is a current limitation of Pig: you can't have a
> function that uses both getArgToFuncMapping and a variable number of
> arguments. In this case, it kind of makes sense that you can't though,
> example:
>
> what if * is in fact just a Tuple of something? So you have
>
> TOTUPLE(tuple), 'chararray'
> tuple, 'chararray'
>
> which one should they match? The one intended for TOTUPLE(*), or the one
> intended for just *? Because both would match just a tuple.
>
> Hmm, one more thing, though, which also is important: you're re-wrapping
> the argument in a Tuple. It is implicit that the input to your evalfunc
> will come in the form of a Tuple. In the UDF example, note that they don't
> rewrap in a tuple:
>
> funcList.add(new FuncSpec(FloatAbs.class.getName(),   new Schema(new
> Schema.FieldSchema(null, DataType.FLOAT))));
>
> So unless your argument will be explicitly rewrapped in a tuple, you don't
> need that piece.
>
> But yeah, someone else can chime in with whether getArgToFuncMapping can do
> what you want it to do, but I don't think it can. My suggestion would be to a)
> choose one form of input and stick to that, instead of trying to support
> two forms and b) you could have an initializer in your EvalFunc that on the
> first input, inspects the types and figures out which function to use to
> process the input.
>
> We do need to make funcspecs play nice with variable numbers of arguments,
> though, especially now that more schema info is available.

Re: Pig Data type question

Posted by Jonathan Coveney <jc...@gmail.com>.

I believe that this is a current limitation of Pig: you can't have a
function that uses both getArgToFuncMapping and a variable number of
arguments. In this case, though, it kind of makes sense that you can't.
For example:

what if * is in fact just a Tuple of something? So you have

TOTUPLE(tuple), 'chararray'
tuple, 'chararray'

which one should they match? The one intended for TOTUPLE(*), or the one
intended for just *? Because both would match just a tuple.
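
To make the ambiguity concrete, here is a small plain-Java sketch. It only models the situation and does not use Pig's own classes: a List<Object> stands in for a Tuple, the class name SignatureClash and the type labels are hypothetical, and only top-level types are compared, on the assumption that a (TUPLE, CHARARRAY) style mapping looks no deeper than that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model; List<Object> stands in for org.apache.pig.data.Tuple.
class SignatureClash {
    // Reports top-level types only, mirroring a (TUPLE, CHARARRAY) style schema.
    static List<String> signature(List<Object> args) {
        List<String> types = new ArrayList<String>();
        for (Object a : args) {
            types.add(a instanceof List ? "tuple" : a instanceof String ? "chararray" : "other");
        }
        return types;
    }

    public static void main(String[] args) {
        // Assume the relation has exactly one field, and that field is itself a tuple.
        List<Object> t = Arrays.<Object>asList(1, 2);

        List<Object> wrapped = Arrays.<Object>asList(Arrays.<Object>asList(t), "arg"); // UDF(TOTUPLE(*), 'arg')
        List<Object> flat = Arrays.<Object>asList(t, "arg");                           // UDF(*, 'arg')

        System.out.println(signature(wrapped)); // [tuple, chararray]
        System.out.println(signature(flat));    // [tuple, chararray]
    }
}
```

The two inputs differ in nesting depth, but at the level a type mapping sees they are indistinguishable.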

Hmm, one more thing, though, which also is important: you're re-wrapping
the argument in a Tuple. It is implicit that the input to your evalfunc
will come in the form of a Tuple. In the UDF example, note that they don't
rewrap in a tuple:

funcList.add(new FuncSpec(FloatAbs.class.getName(),
    new Schema(new Schema.FieldSchema(null, DataType.FLOAT))));

So unless your argument will be explicitly rewrapped in a tuple, you don't
need that piece.

But yeah, someone else can chime in on whether getArgToFuncMapping can do
what you want it to do, but I don't think it can. My suggestion would be to a)
choose one form of input and stick to that, instead of trying to support
two forms and b) you could have an initializer in your EvalFunc that, on the
first input, inspects the types and figures out which function to use to
process the input.
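
Suggestion b) might look roughly like the following. This is a sketch in plain Java rather than a real EvalFunc: a List<Object> stands in for the input Tuple, and the class name ShapeDispatch and the shape test are assumptions made for illustration. The shape test is a heuristic and inherits the ambiguity discussed above, since a one-field relation whose field is a tuple would be read as the wrapped form.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java model; List<Object> stands in for org.apache.pig.data.Tuple.
class ShapeDispatch {
    private Boolean wrapped; // decided once, on the first input

    // Returns the row's fields regardless of which call form produced the input.
    @SuppressWarnings("unchecked")
    List<Object> fields(List<Object> input) {
        if (wrapped == null) {
            // (tuple, chararray) is treated as the UDF(TOTUPLE(*), 'arg') form.
            wrapped = input.size() == 2
                    && input.get(0) instanceof List
                    && input.get(1) instanceof String;
        }
        if (wrapped) {
            return (List<Object>) input.get(0);       // fields are grouped in element 0
        }
        return input.subList(0, input.size() - 1);    // fields are everything before 'arg'
    }

    public static void main(String[] args) {
        List<Object> viaTotuple = Arrays.<Object>asList(Arrays.<Object>asList(1, "x"), "arg");
        List<Object> viaStar = Arrays.<Object>asList(1, "x", "arg");
        System.out.println(new ShapeDispatch().fields(viaTotuple)); // [1, x]
        System.out.println(new ShapeDispatch().fields(viaStar));    // [1, x]
    }
}
```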

We do need to make funcspecs play nice with variable numbers of arguments,
though, especially now that more schema info is available.

2011/11/25 Prashant Kommireddi <pr...@gmail.com>

> Thanks Jonathan.
>
> What do I check for as the input type, because DataType.TUPLE does not seem
> to work. I would like to use "getArgToFuncMapping()" to be able to invoke
> different functions based on input type, and I am not sure how to check for
> Case 2.
>
> In my implementation, Case 1 could be checked for (DataType.TUPLE,
> DataType.CHARARRAY) but for Case 2 I would assume it should be
> (DataType.TUPLE) but that does not work. Pig cannot infer a matching
> function.
>
>  @Override
>    public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
>        List<FuncSpec> funcList = new ArrayList<FuncSpec>();
>        Schema s = new Schema();
>        s.add(new Schema.FieldSchema(null, DataType.TUPLE));
>        s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
>        funcList.add(new FuncSpec(this.getClass().getName(), s));
>
>        s = new Schema();
>        s.add(new Schema.FieldSchema(null, DataType.TUPLE));
>        funcList.add(new FuncSpec(CustomUDF.class.getName(), s));
>
>        return funcList;
>    }

Re: Pig Data type question

Posted by Prashant Kommireddi <pr...@gmail.com>.

Thanks Jonathan.

What should I check for as the input type? DataType.TUPLE does not seem
to work. I would like to use getArgToFuncMapping() to be able to invoke
different functions based on input type, and I am not sure how to check for
Case 2.

In my implementation, Case 1 can be matched with (DataType.TUPLE,
DataType.CHARARRAY). For Case 2 I assumed it should be (DataType.TUPLE),
but that does not work: Pig cannot infer a matching function.

    @Override
    public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
        List<FuncSpec> funcList = new ArrayList<FuncSpec>();

        // Case 1: (tuple, chararray)
        Schema s = new Schema();
        s.add(new Schema.FieldSchema(null, DataType.TUPLE));
        s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
        funcList.add(new FuncSpec(this.getClass().getName(), s));

        // Case 2 attempt: a single tuple (this is the mapping that does not match)
        s = new Schema();
        s.add(new Schema.FieldSchema(null, DataType.TUPLE));
        funcList.add(new FuncSpec(CustomUDF.class.getName(), s));

        return funcList;
    }



On Fri, Nov 25, 2011 at 12:52 AM, Jonathan Coveney <jc...@gmail.com> wrote:

> The first case will give you a tuple which contains, as its first element, a
> tuple of all of the stuff in *, and as its second element, 'input'.
>
> The second will give you a tuple which contains all of the elements of *,
> and then as its last element, 'input'.
>
> This is what I thought, but to be sure I ran this UDF:
>
> import org.apache.pig.EvalFunc;
> import java.io.IOException;
> import org.apache.pig.data.Tuple;
>
> public class ATHING extends EvalFunc<String> {
>  public String exec(Tuple input) throws IOException {
>    System.out.println(input.toString());
>    return null;
>   }
> }

Re: Pig Data type question

Posted by Jonathan Coveney <jc...@gmail.com>.

The first case will give you a tuple which contains, as its first element, a
tuple of all of the fields in *, and as its second element, 'input'.

The second will give you a tuple which contains all of the elements of *,
and then as its last element, 'input'.

This is what I thought, but to be sure I ran this UDF:

import org.apache.pig.EvalFunc;
import java.io.IOException;
import org.apache.pig.data.Tuple;

public class ATHING extends EvalFunc<String> {
  public String exec(Tuple input) throws IOException {
    System.out.println(input.toString());
    return null;
  }
}
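
For readers without a Pig installation at hand, the two shapes can be modeled in plain Java. This is an illustrative sketch, not output captured from Pig: a List<Object> stands in for the input Tuple, and the class and method names (ArgShapes, wrapped, flattened) are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the two call forms; List<Object> stands in for Pig's Tuple.
class ArgShapes {
    // Case 1: CustomUDF(TOTUPLE(*), 'input') -> (tuple of all fields, 'input')
    static List<Object> wrapped(List<Object> fields, String arg) {
        List<Object> input = new ArrayList<Object>();
        input.add(new ArrayList<Object>(fields)); // the fields stay grouped as one element
        input.add(arg);
        return input;
    }

    // Case 2: CustomUDF(*, 'input') -> (field1, ..., fieldN, 'input')
    static List<Object> flattened(List<Object> fields, String arg) {
        List<Object> input = new ArrayList<Object>(fields); // fields are spliced in directly
        input.add(arg);
        return input;
    }

    public static void main(String[] args) {
        List<Object> row = Arrays.<Object>asList(1, "x", 2.5);
        System.out.println(wrapped(row, "input"));   // [[1, x, 2.5], input]
        System.out.println(flattened(row, "input")); // [1, x, 2.5, input]
    }
}
```

So Case 1 always has exactly two elements, while Case 2 has one element per field plus the trailing 'input'.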

2011/11/24 Prashant Kommireddi <pr...@gmail.com>

> I have a question regarding the pig data types.
>
> If I have a UDF, say 'CustomUDF' and I do something like this:
>
> REGISTER 'foo.jar';
>
> A = LOAD '/shared/a.dat';
>
> What would be the difference in the data types for UDF arguments between
> -->
>
> Case 1 : B = FOREACH A GENERATE CustomUDF(TOTUPLE(*), 'input'); AND
> Case 2 : B = FOREACH A GENERATE CustomUDF(*, 'input');
>
> I am sure Case 1 is (tuple, chararray). Can anyone let me know the data
> type for Case 2 arguments?
>
> Thanks,
> Prashant
>