Posted to user@pig.apache.org by Sheeba George <sh...@gmail.com> on 2010/12/14 03:58:02 UTC

Re: Question on getting TotalCount - X records

Hi Daniel
    Thanks for your help. I have created a UDF that aggregates the rest.
My UDF takes a DataBag as input and returns a DataBag with the same schema as
the input. My outputSchema method is below.

public Schema outputSchema(Schema input) {
    try {
        Schema bagSchema = new Schema();
        bagSchema.add(new Schema.FieldSchema(input.getField(0)));
        return new Schema(new Schema.FieldSchema(
                getSchemaName(this.getClass().getName().toLowerCase(), input),
                bagSchema, DataType.BAG));
    } catch (Exception e) {
        return null;
    }
}

I am using the UDF in Pig as:

topkws = FOREACH kwgroup {
   sorted = ORDER kws BY visits DESC;
   GENERATE FLATTEN(AggregateOthers(sorted));
}

where AggregateOthers is my UDF. If I DESCRIBE topkws, I get:

topkws: {com.pig.udfs.topx_sorted_104::sorted: {reportdate: chararray,
appid: int,keyword: chararray,searchengine: chararray,visits: long,f2: long,
f3: long,f4: long,f5: long,f6: long,f7: long,f8: long,visitor: long}}

com.pig.udfs.topx_sorted is my package name. I am not sure what "104" stands
for.

How do I access each field in topkws? I need to join topkws with another file
on reportdate, appid and keyword.

Appreciate any help

thanks
Sheeba
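
For readers following the thread, the "top X plus one aggregated remainder"
logic that such a UDF performs can be sketched in plain Java, without any Pig
classes. This is only an illustration under assumptions: the class
TopXWithOthers, the Row type, and the "others" label are hypothetical names
that do not appear in the thread, and a real UDF would read from and write to
a DataBag instead of a List.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; Row stands in for one tuple of the bag and is not a Pig class.
public class TopXWithOthers {

    public static final class Row {
        public final String keyword;
        public final long visits;

        public Row(String keyword, long visits) {
            this.keyword = keyword;
            this.visits = visits;
        }
    }

    // Returns the x rows with the most visits, followed by a single
    // "others" row that sums the visits of all remaining rows (if any).
    public static List<Row> topXWithOthers(List<Row> rows, int x) {
        // Sort a copy so the caller's list is left untouched.
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingLong((Row r) -> r.visits).reversed());

        // Clamp the cut point to the range [0, size].
        int cut = Math.min(Math.max(x, 0), sorted.size());
        List<Row> result = new ArrayList<>(sorted.subList(0, cut));

        // Fold everything past the cut point into one aggregated row.
        if (sorted.size() > cut) {
            long rest = 0;
            for (Row r : sorted.subList(cut, sorted.size())) {
                rest += r.visits;
            }
            result.add(new Row("others", rest));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("a", 10));
        rows.add(new Row("b", 40));
        rows.add(new Row("c", 30));
        rows.add(new Row("d", 5));
        for (Row r : topXWithOthers(rows, 2)) {
            System.out.println(r.keyword + "\t" + r.visits);
        }
    }
}
```

Because the cut point is computed inside the function, no "COUNT minus X"
expression is needed in the script itself; the remainder is simply whatever
lies past the first X sorted rows.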


On Sun, Nov 28, 2010 at 2:07 AM, Daniel Dai <da...@gmail.com> wrote:

> LIMIT only takes a constant, so "limit sorted_asc (COUNT(kws) - 5)" does
> not work.
>
> You will need a UDF that returns a DataBag. One example is
> org.apache.pig.builtin.COR. Basically, you can write a UDF like this:
>
> public class BagTest extends EvalFunc<DataBag> {
>   @Override
>   public DataBag exec(Tuple input) throws IOException {
>       DataBag inputDB = (DataBag)input.get(0);
>       DataBag db = new DefaultDataBag();
>       // Construct your db
>       return db;
>   }
>   @Override
>   public Schema outputSchema(Schema input) {
>       return new Schema(new Schema.FieldSchema(
>               getSchemaName(this.getClass().getName().toLowerCase(), input),
>               DataType.BAG));
>   }
> }
>
> Daniel
>
> -----Original Message----- From: Sheeba George
> Sent: Thursday, November 25, 2010 7:01 PM
> To: user@pig.apache.org
> Subject: Question on getting TotalCount - X records
>
>
> Hi all,
>
>    I need some help with Pig. The requirement is to generate the top X
> records for a group. I can easily do this in a Pig script by ordering
> DESC and then limiting to X. If there are more than X records in the
> group, I need to aggregate the rest as a single record. How can I achieve
> this?
>
> I am generating the top X as below:
>
> kwgroup = GROUP kws BY (type,category);
>
> topkws = FOREACH kwgroup {
>            sorted = ORDER kws BY visits DESC;
>            ltd = LIMIT sorted 5;
>            GENERATE FLATTEN(ltd);
> }
>
> To aggregate the rest, I was thinking of sorting ASC and truncating at
> (TotalCount - X), then aggregating those records. How can I get the
> TotalCount of records in a group? I tried the following, but it fails:
>
> bottomkws = FOREACH kwgroup_cnt_gt_top {
>            sorted_asc = ORDER kws BY visits ASC;
>            ltd_bottom = LIMIT sorted_asc (COUNT(kws) - 5);
>            GENERATE FLATTEN(ltd_bottom);
> }
>
> It fails with an error message saying that we should use INTEGER instead of
> COUNT(kws).
>
> Is it better to do this with a UDF? In that case the UDF will have to sort,
> limit, and aggregate. Could you point me to some samples that take a group
> of records and return a group (bag)?
>
>
>
> Any help in this regard is appreciated.
>
>
>
> Thanks
>
> Sheeba
>



-- 
Sheeba Ann George

Re: Question on getting TotalCount - X records

Posted by Sheeba George <sh...@gmail.com>.
Your suggestion helped me do the join. However, I want the output schema to
be generic, and in fact the same as the input schema, because this UDF will be
shared by different inputs. How do I do that?

thanks
Sheeba

On Tue, Dec 14, 2010 at 12:43 AM, Sheeba George <sh...@gmail.com> wrote:

>
> Hi Daniel
> Is it possible to get the schema string from the "input" param rather than
> hardcoding?
> Thanks
> Sheeba


-- 
Sheeba Ann George

Re: Question on getting TotalCount - X records

Posted by Daniel Dai <ji...@yahoo-inc.com>.
Yes, actually it is much easier:

public Schema outputSchema(Schema input) {
    return input;
}

Daniel

Sheeba George wrote:
> Hi Daniel
> Is it possible to get the schema string from the "input" param rather than
> hardcoding?
> Thanks
> Sheeba


Re: Question on getting TotalCount - X records

Posted by Sheeba George <sh...@gmail.com>.
Hi Daniel
Is it possible to get the schema string from the "input" parameter rather
than hardcoding it?
Thanks
Sheeba
On Mon, Dec 13, 2010 at 11:53 PM, Daniel Dai <ji...@yahoo-inc.com> wrote:

> There is something wrong in outputSchema I gave you last time, try this:
>
>
>   public Schema outputSchema(Schema input) {
>       try {
>           Schema schema = org.apache.pig.impl.util.Utils.getSchemaFromString(
>               "topx_sorted:bag{t:tuple(reportdate:chararray,appid:int,keyword:chararray,searchengine:chararray,visits:long, etc....)}");
>           return schema;
>
>       } catch (Exception e) {
>           return null;
>       }
>   }
>
> Daniel


-- 
Sheeba Ann George

Re: Question on getting TotalCount - X records

Posted by Daniel Dai <ji...@yahoo-inc.com>.
There is something wrong in the outputSchema I gave you last time. Try this:

    public Schema outputSchema(Schema input) {
        try {
            Schema schema = org.apache.pig.impl.util.Utils.getSchemaFromString(
                "topx_sorted:bag{t:tuple(reportdate:chararray,appid:int,keyword:chararray,searchengine:chararray,visits:long, etc....)}");
            return schema;
        } catch (Exception e) {
            return null;
        }
    }

Daniel
