Posted to user@pig.apache.org by Sheeba George <sh...@gmail.com> on 2010/11/26 04:01:05 UTC
Question on getting TotalCount - X records
Hi all,

I need some help with Pig. The requirement is to generate the top X
records for a group. I can easily do this in a Pig script by ordering
by visits DESC and then limiting to X. If there are more than X records in the
group, I need to aggregate the rest into a single record. How can I achieve
this?

I am generating the top X as below:

kwgroup = GROUP kws BY (type, category);
topkws = FOREACH kwgroup {
    sorted = ORDER kws BY visits DESC;
    ltd = LIMIT sorted 5;
    GENERATE FLATTEN(ltd);
}

For aggregating the rest, I was thinking of sorting ASC, truncating at
(TotalCount - X), and then aggregating those records. How can I get the
TotalCount of records in a group? I tried the below, but it fails:

bottomkws = FOREACH kwgroup_cnt_gt_top {
    sorted_asc = ORDER kws BY visits ASC;
    ltd_bottom = LIMIT sorted_asc (COUNT(kws) - 5);
    GENERATE FLATTEN(ltd_bottom);
}

It fails with an error message saying that an INTEGER should be used instead of
COUNT(kws).

Is it better to do this with a UDF? In that case the UDF will have to sort,
limit, and aggregate. Could you point me to some samples that take a group of
records and return a group (bag)?

Any help in this regard is appreciated.

Thanks
Sheeba
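The "sort ascending and keep TotalCount - X" idea from the question can be sketched outside Pig in plain Java. This is only an illustration of the arithmetic, not Pig code and not the poster's implementation: the Kw class, the aggregateRest method and the "OTHERS" label are hypothetical names, and summing visits is an assumed choice of aggregate.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Plain-Java illustration (not Pig) of aggregating everything outside the top X. */
final class BottomAggregate {

    /** Hypothetical stand-in for one row of the kws relation. */
    static final class Kw {
        final String keyword;
        final long visits;

        Kw(String keyword, long visits) {
            this.keyword = keyword;
            this.visits = visits;
        }
    }

    /**
     * Sorts one group ascending by visits, takes the first (count - x) rows,
     * and sums their visits into a single "OTHERS" row.
     * Returns null when the group has x rows or fewer, so nothing is left over.
     */
    static Kw aggregateRest(List<Kw> group, int x) {
        int restCount = group.size() - x;   // TotalCount - X
        if (restCount <= 0) {
            return null;
        }
        List<Kw> sortedAsc = new ArrayList<>(group);
        sortedAsc.sort(Comparator.comparingLong((Kw k) -> k.visits));
        long total = 0;
        for (Kw k : sortedAsc.subList(0, restCount)) {
            total += k.visits;
        }
        return new Kw("OTHERS", total);
    }
}
```

The size of the group is known before the limit is applied here, which is the step the Pig script above could not express with a plain LIMIT.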
Re: Question on getting TotalCount - X records
Posted by Daniel Dai <ji...@yahoo-inc.com>.
Yes, actually it is much easier:
public Schema outputSchema(Schema input) {
    return input;
}
Daniel
Sheeba George wrote:
> Hi Daniel
> Is it possible to get the schema string from the "input" param rather than
> hardcoding?
> Thanks
> Sheeba
Re: Question on getting TotalCount - X records
Posted by Sheeba George <sh...@gmail.com>.
Your suggestion helped me to do the join. But I want the output schema to be
generic, and in fact the same as the input schema, because this UDF will be
shared by different inputs. How do I do that?
thanks
Sheeba
On Tue, Dec 14, 2010 at 12:43 AM, Sheeba George <sh...@gmail.com> wrote:
>
> Hi Daniel
> Is it possible to get the schema string from the "input" param rather than
> hardcoding?
> Thanks
> Sheeba
--
Sheeba Ann George
Re: Question on getting TotalCount - X records
Posted by Sheeba George <sh...@gmail.com>.
Hi Daniel
Is it possible to get the schema string from the "input" param rather than
hardcoding?
Thanks
Sheeba
On Mon, Dec 13, 2010 at 11:53 PM, Daniel Dai <ji...@yahoo-inc.com> wrote:
> There is something wrong in outputSchema I gave you last time, try this:
>
>
> public Schema outputSchema(Schema input) {
>     try {
>         Schema schema = org.apache.pig.impl.util.Utils.getSchemaFromString(
>             "topx_sorted:bag{t:tuple(reportdate:chararray,appid:int,keyword:chararray,searchengine:chararray,visits:long, etc....)}");
>         return schema;
>     } catch (Exception e) {
>         return null;
>     }
> }
>
> Daniel
--
Sheeba Ann George
Re: Question on getting TotalCount - X records
Posted by Daniel Dai <ji...@yahoo-inc.com>.
There was something wrong in the outputSchema I gave you last time. Try this:

public Schema outputSchema(Schema input) {
    try {
        Schema schema = org.apache.pig.impl.util.Utils.getSchemaFromString(
            "topx_sorted:bag{t:tuple(reportdate:chararray,appid:int,keyword:chararray,searchengine:chararray,visits:long, etc....)}");
        return schema;
    } catch (Exception e) {
        return null;
    }
}
Daniel
Re: Question on getting TotalCount - X records
Posted by Sheeba George <sh...@gmail.com>.
Hi Daniel,

Thanks for your help. I have created a UDF that aggregates the rest.
The UDF takes a DataBag as input and returns a DataBag with the same schema
as the input. My outputSchema method is as below:

public Schema outputSchema(Schema input) {
    try {
        Schema bagSchema = new Schema();
        bagSchema.add(new Schema.FieldSchema(input.getField(0)));
        return new Schema(new Schema.FieldSchema(
            getSchemaName(this.getClass().getName().toLowerCase(), input),
            bagSchema, DataType.BAG));
    } catch (Exception e) {
        return null;
    }
}

I am using the UDF in Pig as:

topkws = FOREACH kwgroup {
    sorted = ORDER kws BY visits DESC;
    GENERATE FLATTEN(AggregateOthers(sorted));
}

where AggregateOthers is my UDF. If I DESCRIBE topkws I get:

topkws: {com.pig.udfs.topx_sorted_104::sorted: {reportdate: chararray,
appid: int,keyword: chararray,searchengine: chararray,visits: long,f2: long,
f3: long,f4: long,f5: long,f6: long,f7: long,f8: long,visitor: long}}

com.pig.udfs.topx_sorted is my package name. I am not sure what "104" stands for.

How do I access each field in topkws? I need to join reportdate, appid and
keyword in topkws with another file.

Appreciate any help
thanks
Sheeba
On Sun, Nov 28, 2010 at 2:07 AM, Daniel Dai <da...@gmail.com> wrote:
> Limit only takes constant. So "limit sorted_asc (COUNT(kws) - 5)" does not
> work.
>
> You will need a UDF, which returns DataBag. One example is
> org.apache.pig.builtin.COR, which returns DataBag. Basically, you can write
> a UDF like this:
>
> public class BagTest extends EvalFunc<DataBag> {
>     @Override
>     public DataBag exec(Tuple input) throws IOException {
>         DataBag inputDB = (DataBag)input.get(0);
>         DataBag db = new DefaultDataBag();
>         // Construct your db
>         return db;
>     }
>     @Override
>     public Schema outputSchema(Schema input) {
>         return new Schema(new Schema.FieldSchema(
>             getSchemaName(this.getClass().getName().toLowerCase(), input),
>             DataType.BAG));
>     }
> }
>
> Daniel
--
Sheeba Ann George
Re: Question on getting TotalCount - X records
Posted by Daniel Dai <da...@gmail.com>.
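LIMIT only takes a constant, so "limit sorted_asc (COUNT(kws) - 5)" does not
work.

You will need a UDF that returns a DataBag. One example is
org.apache.pig.builtin.COR, which returns a DataBag. Basically, you can write
a UDF like this:

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.DefaultDataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class BagTest extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        // The first field of the input tuple is the bag passed to the UDF.
        DataBag inputDB = (DataBag) input.get(0);
        DataBag db = new DefaultDataBag();
        // Construct your db
        return db;
    }

    @Override
    public Schema outputSchema(Schema input) {
        // Declares the output as a bag; no inner tuple schema is given here.
        return new Schema(new Schema.FieldSchema(
            getSchemaName(this.getClass().getName().toLowerCase(), input),
            DataType.BAG));
    }
}
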
Daniel
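One possible way to fill in the "Construct your db" step is a single pass over rows that are already sorted: copy the first X rows unchanged and fold the remainder into one extra row. The sketch below shows only that logic in plain Java with no Pig classes, so List<Object> stands in for a Tuple and List<List<Object>> for a DataBag. The names TopXWithRest and topWithRest, the labelIndex and sumIndex parameters, and the "OTHERS" label are all hypothetical, and it assumes the summed column holds Long values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Plain-Java model of the bag-in, bag-out logic; no Pig classes are used. */
final class TopXWithRest {

    /**
     * @param sortedRows rows already ordered by the ranking column, descending
     * @param x          number of leading rows to pass through unchanged
     * @param labelIndex column that receives the "OTHERS" label in the extra row
     * @param sumIndex   column (assumed to hold Long values) summed for the extra row
     * @return the first x rows, plus one aggregated row if more than x rows exist
     */
    static List<List<Object>> topWithRest(List<List<Object>> sortedRows, int x,
                                          int labelIndex, int sumIndex) {
        List<List<Object>> out = new ArrayList<>();
        List<Object> rest = null;   // created lazily, only if more than x rows exist
        long restSum = 0;
        int seen = 0;
        for (List<Object> row : sortedRows) {
            if (seen < x) {
                out.add(row);
            } else {
                if (rest == null) {
                    // Same width as the input rows; columns not set below stay null.
                    rest = new ArrayList<>(Collections.nCopies(row.size(), (Object) null));
                    rest.set(labelIndex, "OTHERS");
                }
                restSum += (Long) row.get(sumIndex);
            }
            seen++;
        }
        if (rest != null) {
            rest.set(sumIndex, restSum);
            out.add(rest);
        }
        return out;
    }
}
```

Because the input never needs to be counted up front, this avoids the non-constant LIMIT altogether. In a real UDF the same loop would iterate over the input DataBag and add tuples to the output bag.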