Posted to user@pig.apache.org by pranjal rajput <fi...@gmail.com> on 2013/03/18 10:26:30 UTC

UDF that takes bag as input and returns another bag

Hi,
Can we define a UDF in pig that takes a bag as an input and returns another
bag as output?
How can this be done?
Thanks,
--
regards
Pranjal

Re: UDF that takes bag as input and returns another bag

Posted by "Dan DeCapria, CivicScience" <da...@civicscience.com>.
By extending an abstract class, you can reuse the generic validation of
Pig's input Tuple and get a consistent hook for your DataBag parsing
logic.  Consider the following abstract class ParseBagAsBag, which your own
MyDatabagParserToDataBag can extend by overriding the method parser_logic()
and adding its results to the output bag, super.bag:

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public abstract class ParseBagAsBag extends EvalFunc<DataBag> {

    public TupleFactory tuple_factory = TupleFactory.getInstance();
    public BagFactory bag_factory = BagFactory.getInstance();
    public DataBag bag;

    /**
     * Wrapper for deconstructing the input Tuple to extract its DataBag component.
     * @param input Tuple containing DataBag.
     * @return DataBag of parser logic, NULL iff bag is empty.
     * @throws IOException
     */
    @Override
    public DataBag exec(Tuple input) throws IOException {
        //  start every call with a new, empty output bag from the factory
        this.bag = this.bag_factory.newDefaultBag();
        //  @precondition check; tuple is non-null, non-empty and interesting
        if (input != null && input.size() > 0) {
            //  DataBag wrapped in a one-element Tuple
            Object oBag = input.get(0);
            //  @precondition check; type pig.DataBag
            if (oBag instanceof DataBag) {
                DataBag databag = (DataBag) oBag;
                parser_logic(databag);
            }
        }
        //  return the bag iff modified from factory instantiation, otherwise return NULL Object
        return (this.bag.size() > 0) ? this.bag : null;
    }

    public abstract void parser_logic(DataBag databag) throws IOException;
}
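
As a rough illustration of how a subclass plugs into this hook, here is a
Pig-free sketch of the same shape that can be run without a Pig
installation. Everything in it is an assumption made for illustration:
List<Object> stands in for Tuple, List<List<Object>> stands in for DataBag,
and the names BagParserSketch, ParseBag and KeepNonEmpty are hypothetical.
In a real UDF you would extend ParseBagAsBag and add Pig Tuples to super.bag
instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BagParserSketch {

    /** Stand-in for ParseBagAsBag: validates the input, then delegates to parserLogic(). */
    static abstract class ParseBag {
        protected List<List<Object>> bag;

        @SuppressWarnings("unchecked")
        public List<List<Object>> exec(List<Object> input) {
            bag = new ArrayList<>();                      // fresh output bag per call
            if (input != null && !input.isEmpty() && input.get(0) instanceof List) {
                parserLogic((List<List<Object>>) input.get(0));
            }
            return bag.isEmpty() ? null : bag;            // null iff nothing was added
        }

        protected abstract void parserLogic(List<List<Object>> databag);
    }

    /** Hypothetical subclass: keeps only the tuples that have at least one field. */
    static class KeepNonEmpty extends ParseBag {
        @Override
        protected void parserLogic(List<List<Object>> databag) {
            for (List<Object> tuple : databag) {
                if (!tuple.isEmpty()) {
                    bag.add(tuple);
                }
            }
        }
    }

    public static void main(String[] args) {
        List<List<Object>> inBag = new ArrayList<>();
        inBag.add(Arrays.<Object>asList("a", 1));         // a two-field "tuple"
        inBag.add(new ArrayList<Object>());               // an empty "tuple", to be dropped
        List<Object> input = new ArrayList<>();
        input.add(inBag);                                 // bag wrapped in a one-element "tuple"
        System.out.println(new KeepNonEmpty().exec(input)); // prints [[a, 1]]
    }
}
```

The subclass only decides what goes into the output bag; the null and type
checks stay in one place in the base class.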

Hope this helps.

-Dan

On Mon, Mar 18, 2013 at 11:01 AM, Jonathan Coveney <jc...@gmail.com> wrote:

> Ah, I suppose I was just proving it could be done.
>
> To make a new one, you'd do:
>
> public class MyUdf extends EvalFunc<DataBag> {
>   private static final BagFactory mBagFactory = BagFactory.getInstance();
>   public DataBag exec(Tuple input) throws IOException {
>     DataBag output = mBagFactory.newDefaultBag();
>     for (Tuple t : (DataBag)input.get(0)) {
>       output.add(t);
>     }
>     return output;
>   }
> }

Re: UDF that takes bag as input and returns another bag

Posted by Jonathan Coveney <jc...@gmail.com>.
Ah, I suppose I was just proving it could be done.

To make a new one, you'd do:

public class MyUdf extends EvalFunc<DataBag> {
  private static final BagFactory mBagFactory = BagFactory.getInstance();
  public DataBag exec(Tuple input) throws IOException {
    DataBag output = mBagFactory.newDefaultBag();
    for (Tuple t : (DataBag)input.get(0)) {
      output.add(t);
    }
    return output;
  }
}
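
To make the difference between handing back the input bag and building a new
one concrete, here is a small Pig-free sketch. It is only an illustration
under stated assumptions: List<Object> stands in for Tuple,
List<List<Object>> stands in for DataBag, and the names NewBagSketch,
passThrough and copy are hypothetical, not Pig API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NewBagSketch {

    /** Pass-through: hands back the very same bag object found in field 0. */
    static Object passThrough(List<Object> input) {
        return input.get(0);
    }

    /** Copy: builds a new bag and adds each tuple of the input bag to it. */
    @SuppressWarnings("unchecked")
    static List<List<Object>> copy(List<Object> input) {
        List<List<Object>> output = new ArrayList<>();
        for (List<Object> t : (List<List<Object>>) input.get(0)) {
            output.add(t);                                // same tuple objects, new container
        }
        return output;
    }

    public static void main(String[] args) {
        List<List<Object>> inBag = new ArrayList<>();
        inBag.add(Arrays.<Object>asList("x", 2));
        List<Object> input = new ArrayList<>();
        input.add(inBag);

        System.out.println(passThrough(input) == inBag);  // prints true
        System.out.println(copy(input) == inBag);         // prints false
        System.out.println(copy(input).equals(inBag));    // prints true
    }
}
```

Note that the copy is shallow: the new bag holds the same tuple objects as
the input, just as the loop above adds each input Tuple t to the output
unchanged.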




2013/3/18 Kris Coward <kr...@melon.org>

>
> But he asked for a function that returns *another* bag ;)
>
> Snark aside, when returning bags or tuples, it's also worthwhile to at
> least consider defining the output schema, which for your example
> code would probably mean
>
> public Schema outputSchema(Schema input){
>   Schema output = new Schema();
>   output.add(input.getField(0));
>   return output;
> }
>
> (possibly with some omitted exception handling)
>
> -Kris

Re: UDF that takes bag as input and returns another bag

Posted by Kris Coward <kr...@melon.org>.
But he asked for a function that returns *another* bag ;)

Snark aside, when returning bags or tuples, it's also worthwhile to at
least consider defining the output schema, which for your example
code would probably mean

public Schema outputSchema(Schema input){
  Schema output = new Schema();
  output.add(input.getField(0));
  return output;
}

(exception handling omitted; Schema.getField() declares a checked
FrontendException, so real code has to catch it)

-Kris

On Mon, Mar 18, 2013 at 11:19:17AM +0100, Jonathan Coveney wrote:
> Absolutely.
> 
> public class MyUdf extends EvalFunc<DataBag> {
>   public DataBag exec(Tuple input) throws IOException {
>     return (DataBag)input.get(0);
>   }
> }
> 
> 
> A dummy example, but there you go. DataBag is a valid pig type like any
> other, so you just return it like you would normally.

-- 
Kris Coward					http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3

Re: UDF that takes bag as input and returns another bag

Posted by Jonathan Coveney <jc...@gmail.com>.
Absolutely.

public class MyUdf extends EvalFunc<DataBag> {
  public DataBag exec(Tuple input) throws IOException {
    return (DataBag)input.get(0);
  }
}


A dummy example, but there you go. DataBag is a valid pig type like any
other, so you just return it like you would normally.


2013/3/18 pranjal rajput <fi...@gmail.com>

> Hi,
> Can we define a UDF in pig that takes a bag as an input and returns another
> bag as output?
> How can this be done?
> Thanks,
> --
> regards
> Pranjal
>