
Posted to user@pig.apache.org by Sheeba George <sh...@gmail.com> on 2010/11/26 04:01:05 UTC

Question on getting TotalCount - X records

Hi all,

I need some help with Pig. The requirement is to generate the top X
records for each group. I can easily do this in a Pig script by ordering
by visits DESC and then limiting to X. If there are more than X records in a
group, I need to aggregate the rest into a single record. How can I achieve
this?

I am generating the top X as below:

kwgroup = GROUP kws BY (type, category);

topkws = FOREACH kwgroup {
    sorted = ORDER kws BY visits DESC;
    ltd = LIMIT sorted 5;
    GENERATE FLATTEN(ltd);
}

For aggregating the rest, I was thinking of sorting ASC and truncating at
(TotalCount - X), then aggregating those records. How can I get the TotalCount
of records in a group? I tried the below, but it fails.

bottomkws = FOREACH kwgroup_cnt_gt_top {
    sorted_asc = ORDER kws BY visits ASC;
    ltd_bottom = LIMIT sorted_asc (COUNT(kws) - 5);
    GENERATE FLATTEN(ltd_bottom);
}

But this fails with an error message saying that we should use INTEGER instead of
COUNT(kws).

Is it better to do this using a UDF? In that case the UDF will have to sort, limit,
and aggregate. Could you point me to some samples that take a group of records and
return a group (bag)?



Any help in this regard is appreciated.



Thanks

Sheeba

Re: Question on getting TotalCount - X records

Posted by Daniel Dai <ji...@yahoo-inc.com>.

Yes, actually it is much easier:

public Schema outputSchema(Schema input) {
    return input;
}

Daniel

Sheeba George wrote:
> Hi Daniel
> Is it possible to get the schema string from the "input" param rather than
> hardcoding?
> Thanks
> Sheeba

Re: Question on getting TotalCount - X records

Posted by Sheeba George <sh...@gmail.com>.

Your suggestion helped me do the join. But I want the output schema to be
generic, in fact the same as the input schema, since this UDF will be shared
by different inputs. How do I do that?

thanks
Sheeba

On Tue, Dec 14, 2010 at 12:43 AM, Sheeba George <sh...@gmail.com> wrote:

>
> Hi Daniel
> Is it possible to get the schema string from the "input" param rather than
> hardcoding?
> Thanks
> Sheeba

-- 
Sheeba Ann George

Re: Question on getting TotalCount - X records

Posted by Sheeba George <sh...@gmail.com>.

Hi Daniel,
Is it possible to get the schema string from the "input" param rather than
hardcoding it?
Thanks
Sheeba
On Mon, Dec 13, 2010 at 11:53 PM, Daniel Dai <ji...@yahoo-inc.com> wrote:

> There is something wrong in outputSchema I gave you last time, try this:

-- 
Sheeba Ann George

Re: Question on getting TotalCount - X records

Posted by Daniel Dai <ji...@yahoo-inc.com>.

There was something wrong with the outputSchema I gave you last time. Try this:

    public Schema outputSchema(Schema input) {
        try {
            Schema schema = org.apache.pig.impl.util.Utils.getSchemaFromString(
                "topx_sorted:bag{t:tuple(reportdate:chararray,appid:int,keyword:chararray,searchengine:chararray,visits:long, etc....)}");
            return schema;
        } catch (Exception e) {
            return null;
        }
    }

Daniel


Re: Question on getting TotalCount - X records

Posted by Sheeba George <sh...@gmail.com>.

Hi Daniel,
    Thanks for your help. I have created a UDF that aggregates the rest.
My UDF takes a DataBag as input and returns a DataBag with the same schema
as the input. My outputSchema method is as below.

public Schema outputSchema(Schema input) {
    try {
        Schema bagSchema = new Schema();
        bagSchema.add(new Schema.FieldSchema(input.getField(0)));
        return new Schema(new Schema.FieldSchema(
            getSchemaName(this.getClass().getName().toLowerCase(), input),
            bagSchema, DataType.BAG));
    } catch (Exception e) {
        return null;
    }
}

I am using the UDF in Pig as

topkws = FOREACH kwgroup {
    sorted = ORDER kws BY visits DESC;
    GENERATE FLATTEN(AggregateOthers(sorted));
}

where AggregateOthers is my UDF. If I DESCRIBE topkws I get

topkws: {com.pig.udfs.topx_sorted_104::sorted: {reportdate: chararray,
appid: int,keyword: chararray,searchengine: chararray,visits: long,
f2: long,f3: long,f4: long,f5: long,f6: long,f7: long,f8: long,visitor: long}}

com.pig.udfs.topx_sorted is my package name. I am not sure what "104" stands for.

How do I access each field in topkws? I need to join topkws with another file
on reportdate, appid and keyword.

Appreciate any help

thanks
Sheeba


On Sun, Nov 28, 2010 at 2:07 AM, Daniel Dai <da...@gmail.com> wrote:

> Limit only takes constant. So "limit sorted_asc (COUNT(kws) - 5)" does not
> work.



-- 
Sheeba Ann George

Re: Question on getting TotalCount - X records

Posted by Daniel Dai <da...@gmail.com>.

LIMIT only takes a constant, so "limit sorted_asc (COUNT(kws) - 5)" does not
work.

You will need a UDF that returns a DataBag. One example is
org.apache.pig.builtin.COR, which returns a DataBag. Basically, you can write
a UDF like this:

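import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.DefaultDataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class BagTest extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        // The first argument is the bag passed in from the Pig script.
        DataBag inputDB = (DataBag)input.get(0);
        DataBag db = new DefaultDataBag();
        // Construct your db
        return db;
    }
    @Override
    public Schema outputSchema(Schema input) {
        // Declare that the UDF returns a bag.
        return new Schema(new Schema.FieldSchema(
            getSchemaName(this.getClass().getName().toLowerCase(), input),
            DataType.BAG));
    }
}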

Daniel
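
The "// Construct your db" step in the UDF skeleton above is where the top X
rows are kept and the remainder is folded into one record. What follows is only
a rough sketch of that logic in plain Java, so that it runs without Pig on the
classpath. KeywordRow, TopXWithOthers and the "others" label are hypothetical
names made up for illustration. They are not part of Pig or of the thread, and
a real UDF would read Tuples from the input DataBag and add Tuples to the
output bag instead of using lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for one Pig tuple: a keyword and its visit count. */
final class KeywordRow {
    final String keyword;
    final long visits;

    KeywordRow(String keyword, long visits) {
        this.keyword = keyword;
        this.visits = visits;
    }
}

/** Illustrative only: keeps the top X rows and sums the remaining rows into one. */
final class TopXWithOthers {

    /**
     * Returns the topX rows by visits (descending). If the group holds more
     * than topX rows, one extra row labelled "others" is appended, carrying
     * the summed visits of the remaining rows. Assumes topX >= 0.
     */
    static List<KeywordRow> topXWithOthers(List<KeywordRow> group, int topX) {
        // Copy before sorting so the caller's list is left untouched.
        List<KeywordRow> sorted = new ArrayList<>(group);
        sorted.sort(Comparator.comparingLong((KeywordRow r) -> r.visits).reversed());

        List<KeywordRow> result = new ArrayList<>();
        long restVisits = 0;
        for (int i = 0; i < sorted.size(); i++) {
            if (i < topX) {
                result.add(sorted.get(i));           // one of the top X rows
            } else {
                restVisits += sorted.get(i).visits;  // folded into "others"
            }
        }
        if (sorted.size() > topX) {
            result.add(new KeywordRow("others", restVisits));
        }
        return result;
    }

    public static void main(String[] args) {
        List<KeywordRow> group = Arrays.asList(
            new KeywordRow("a", 10), new KeywordRow("b", 30),
            new KeywordRow("c", 20), new KeywordRow("d", 5));
        for (KeywordRow row : topXWithOthers(group, 2)) {
            System.out.println(row.keyword + "\t" + row.visits);
        }
    }
}
```

With the four sample rows and X = 2, the result holds b (30), c (20) and a
single "others" row with 15 visits. The sketch sorts inside the method only to
stay self-contained. If the bag is already ordered in the Pig script before it
reaches the UDF, the sort could be dropped and the rows consumed in arrival
order.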
