Posted to user@pig.apache.org by Mohammad Tariq <do...@gmail.com> on 2012/05/21 13:54:07 UTC

How to use TOP?

Hello list,

  I have an HDFS file with 6 columns of data that is also stored
in an HBase table. The data looks like this -

18.98   2000    1.21    193.46  2.64  58.17
52.49   2000.5  4.32    947.11  2.74  64.45
115.24  2001    16.8    878.58  2.66  94.49
55.55   2001.5  33.03   656.56  2.82  60.76
156.14  2002    35.52   83.75   2.6   59.57
138.77  2002.5  21.51   105.76  2.62  85.89
71.89   2003    27.79   709.01  2.63  85.44
59.84   2003.5  32.1    444.82  2.72  70.8
103.18  2004    4.09    413.15  2.8   54.37

Now I have to take each record along with its next 4 records and do
some processing (for example, in the first pass I have to take records
1-5, in the next pass records 2-6, and so on). I am trying to
use TOP for this, but I am getting the following error -
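The sliding window described above can be pinned down with a short sketch in plain Python (illustrative only; `sliding_windows` is not Pig code and not part of any library):

```python
def sliding_windows(records, size=5):
    """Return each record grouped with its next size-1 records."""
    return [records[i:i + size] for i in range(len(records) - size + 1)]

# Records 1..6 yield the two windows described above: 1-5 and 2-6.
print(sliding_windows([1, 2, 3, 4, 5, 6]))  # [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6]]
```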

2012-05-21 17:04:30,328 [main] ERROR org.apache.pig.tools.grunt.Grunt
- ERROR 1200: Pig script failed to parse:
<line 6, column 37> Invalid scalar projection: parameters : A column
needs to be projected from a relation for it to be used as a scalar
Details at logfile: /home/mohammad/pig-0.9.2/logs/pig_1337599211281.log

I am using the following commands -

grunt> a = load 'hbase://logdata'
>> using org.apache.pig.backend.hadoop.hbase.HBaseStorage(
>> 'cf:DGR cf:HD cf:POR cf:RES cf:RHOB cf:SON', '-loadKey true')
>> as (id, DGR, HD, POR, RES, RHOB, SON);
grunt> b = foreach a { c = TOP(5,3,a);
>> generate flatten(c);
>> }

Could anyone tell me how to achieve this? Many thanks.

Regards,
    Mohammad Tariq

Re: How to use TOP?

Posted by Mohammad Tariq <do...@gmail.com>.
Yes, it would be better if I do it at insertion time. I just have
to add one more column. Thanks again.

Regards,
    Mohammad Tariq


On Tue, May 22, 2012 at 2:36 PM, Abhinav Neelam <ab...@gmail.com> wrote:
> Doing it in the pig script is not feasible because pig doesn't have any
> notion of sequentiality - to maintain it, you'd need to have access to
> state that's shared globally by all the mappers and reducers. One way I can
> think of doing this is to have a UDF that maintains state - perhaps it can
> maintain a file that's NFS mounted/or in HDFS so that it's available on all
> the task nodes; then any call to the UDF can update that file (atomically)
> and return a 'row number' that you could associate with your current tuple.
> Something like:
> B = FOREACH A GENERATE $0, $1, $2, $3, MyUDFs.GETROWNUM() as rownum;
>
> However, AFAIK, you'd be better off doing it in HBase - perhaps at the time
> of record insert, you could also add a 'row number' into the record?

Re: How to use TOP?

Posted by Abhinav Neelam <ab...@gmail.com>.
Doing it in the Pig script is not feasible because Pig doesn't have any
notion of sequentiality - to maintain it, you'd need access to state
that's shared globally by all the mappers and reducers. One way I can
think of doing this is to have a UDF that maintains state - perhaps it can
maintain a file that's NFS-mounted or in HDFS so that it's available on all
the task nodes; then any call to the UDF can update that file (atomically)
and return a 'row number' that you could associate with the current tuple.
Something like:
B = FOREACH A GENERATE $0, $1, $2, $3, MyUDFs.GETROWNUM() as rownum;

However, AFAIK, you'd be better off doing it in HBase - perhaps at the time
of record insert, you could also add a 'row number' to the record?
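A minimal sketch of such a stateful UDF in plain Python (Pig can run Python UDFs via Jython; the name `get_row_num` is hypothetical). Note that the counter below is process-local, so without the shared NFS/HDFS file described above, each map task would number its rows independently:

```python
# Hypothetical row-number UDF sketch. The module-level counter is
# process-local: it gives a correct global sequence only when a single
# process sees the whole input, which is exactly the shared-state
# problem described above.
_counter = 0

def get_row_num():
    """Return the next row number seen by this process."""
    global _counter
    _counter += 1
    return _counter
```

In a real multi-task Pig job the increment would have to go through the shared, atomically updated file rather than a process-local variable.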

On 22 May 2012 12:43, Mohammad Tariq <do...@gmail.com> wrote:

> Hi Abhinav,
>
>   Thanks a lot for the valuable response..Actually I was thinking of
> doing the same thing, but being new to Pig I thought of asking it on
> the mailing list first..As far as the data is concerned, second column
> will always be in ascending order.But I don't think it will be of any
> help..I think whatever you have suggested here would be the
> appropriate solution..Although I would like to ask you one thing..Is
> it feasible to add that first column having count in my pig script or
> do I have to change the data in my Hbase table itself???If yes then
> how can I achieve it in my script??Many thanks.
>
> Regards,
>     Mohammad Tariq



-- 
Hacking is, and always has been, the Holy
Grail of computer science.

Re: How to use TOP?

Posted by Mohammad Tariq <do...@gmail.com>.
Hi Abhinav,

   Thanks a lot for the valuable response. Actually, I was thinking of
doing the same thing, but being new to Pig I thought of asking on
the mailing list first. As far as the data is concerned, the second
column will always be in ascending order, but I don't think that will
be of any help. I think what you have suggested here is the
appropriate solution. Although I would like to ask you one thing: is
it feasible to add that first column holding the count in my Pig
script, or do I have to change the data in my HBase table itself? If
so, how can I achieve it in my script? Many thanks.

Regards,
    Mohammad Tariq


On Tue, May 22, 2012 at 1:16 AM, Abhinav Neelam <ab...@gmail.com> wrote:
> Hey Mohammad,
>
> You need to have sorting requirements when you say 'top 5' records. Because
> relations/bags in Pig are unordered, it's natural to ask: 'top 5 by what
> parameter?' I'm unfamiliar with HBase, but if your data in HBase has an
> implicit ordering with say an auto-increment primary key, or an explicit
> one, you could include that field in your input to Pig and then apply TOP
> on that field.
>
> Having said that, if I understand your problem correctly, you don't need
> TOP at all - you just want to process your input in groups of 5 tuples at a
> time. Again, I can't think of a way of doing this without modifying your
> input. For example, if your input included an extra field like this:
> 1 18.98   2000    1.21    193.46  2.64  58.17
> 1 52.49   2000.5  4.32    947.11  2.74  64.45
> 1 115.24  2001    16.8    878.58  2.66  94.49
> 1 55.55   2001.5  33.03   656.56  2.82  60.76
> 1 156.14  2002    35.52   83.75   2.6   59.57
> 2 138.77  2002.5  21.51   105.76  2.62  85.89
> 2 71.89   2003    27.79   709.01  2.63  85.44
> 2 59.84   2003.5  32.1    444.82  2.72  70.8
> 2 103.18  2004    4.09    413.15  2.8   54.37
>
> you could do a group on that field and proceed. Even if you had a field
> like 'line number' or 'record number' in your input, you could still
> manipulate that field (say through integer division by 5) to use it for
> grouping. In any case, you need something to let Pig bring together your 5
> tuple groups.
>
> B = group A by $0;
> C = FOREACH B { <do some processing on your 5 tuple bag A> ...
>
> Thanks,
> Abhinav

Re: How to use TOP?

Posted by Abhinav Neelam <ab...@gmail.com>.
Hey Mohammad,

You need to have sorting requirements when you say 'top 5' records. Because
relations/bags in Pig are unordered, it's natural to ask: 'top 5 by what
parameter?' I'm unfamiliar with HBase, but if your data in HBase has an
implicit ordering with say an auto-increment primary key, or an explicit
one, you could include that field in your input to Pig and then apply TOP
on that field.

Having said that, if I understand your problem correctly, you don't need
TOP at all - you just want to process your input in groups of 5 tuples at a
time. Again, I can't think of a way of doing this without modifying your
input. For example, if your input included an extra field like this:
1 18.98   2000    1.21    193.46  2.64  58.17
1 52.49   2000.5  4.32    947.11  2.74  64.45
1 115.24  2001    16.8    878.58  2.66  94.49
1 55.55   2001.5  33.03   656.56  2.82  60.76
1 156.14  2002    35.52   83.75   2.6   59.57
2 138.77  2002.5  21.51   105.76  2.62  85.89
2 71.89   2003    27.79   709.01  2.63  85.44
2 59.84   2003.5  32.1    444.82  2.72  70.8
2 103.18  2004    4.09    413.15  2.8   54.37

you could do a group on that field and proceed. Even if you had a field
like 'line number' or 'record number' in your input, you could still
manipulate that field (say through integer division by 5) to use it for
grouping. In any case, you need something to let Pig bring together your 5
tuple groups.

B = group A by $0;
C = FOREACH B { <do some processing on your 5 tuple bag A> ...
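Spelled out, that pipeline might look like the following sketch (it assumes the data has been re-saved with a leading record number; `numbered_data`, the field names, and the COUNT/AVG at the end are placeholders for the real input and processing):

```pig
-- Hypothetical: input now carries a 1-based record number as its first field.
A = load 'numbered_data'
    as (recnum:long, DGR:double, HD:double, POR:double,
        RES:double, RHOB:double, SON:double);
-- Integer division maps records 1-5 to group 0, 6-10 to group 1, and so on.
B = group A by (recnum - 1) / 5;
C = foreach B generate group, COUNT(A), AVG(A.DGR);
dump C;
```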

Thanks,
Abhinav

On 21 May 2012 23:03, Mohammad Tariq <do...@gmail.com> wrote:

> Hi Ruslan,
>
>    Thanks for the response.I think I have made a mistake.Actually I
> just want the top 5 records each time.I don't have any sorting
> requirements.
>
> Regards,
>     Mohammad Tariq



-- 
Hacking is, and always has been, the Holy
Grail of computer science.

Re: How to use TOP?

Posted by Mohammad Tariq <do...@gmail.com>.
Hi Ruslan,

    Thanks for the response. I think I have made a mistake. Actually,
I just want 5 records at a time; I don't have any sorting
requirements.

Regards,
    Mohammad Tariq


On Mon, May 21, 2012 at 9:31 PM, Ruslan Al-fakikh
<ru...@jalent.ru> wrote:
> Hey Mohammad,
>
> Here
> c = TOP(5,3,a);
> you say: take 5 records out of a that have the biggest values in the third
> column. Do you really need that sorting by the third column?

RE: How to use TOP?

Posted by Ruslan Al-fakikh <ru...@jalent.ru>.
Hey Mohammad,

Here:

    c = TOP(5,3,a);

you say: take the 5 records out of a that have the biggest values in the
third column. Do you really need that sorting by the third column?
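For reference, TOP operates on a bag, which is why calling it inside a plain FOREACH over the relation fails to parse. If the top-5-by-a-column behaviour were actually wanted, one common pattern is to group first and apply TOP to the resulting bag - a sketch only, reusing the load statement from the script below:

```pig
a = load 'hbase://logdata'
    using org.apache.pig.backend.hadoop.hbase.HBaseStorage(
    'cf:DGR cf:HD cf:POR cf:RES cf:RHOB cf:SON', '-loadKey true')
    as (id, DGR, HD, POR, RES, RHOB, SON);
b = group a all;              -- collect every tuple into one bag
c = foreach b {
    top5 = TOP(5, 3, a);      -- 5 tuples with the largest values in field 3 (POR)
    generate flatten(top5);
};
```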
