Posted to user@pig.apache.org by charles du <ta...@gmail.com> on 2008/08/07 23:55:09 UTC

Re: how to get the size of a data bag

Thanks. It works.

My concern right now is performance. For example, I have 2 million
records that belong to two types. If I want to count the number of records
for each type, I need to group the records by type as follows:

A = LOAD <my file> as (type, ...);
B = GROUP A BY type;
C = foreach B generate COUNT(A);

I notice it usually takes Hadoop a long time to get the results back. My
experience with Hadoop is that when there are a large number of values for a
key, Hadoop is very slow in the reduce function. I understand this is more a
Hadoop problem than Pig's. Do you guys know of any way to speed up the
calculation?


Thanks.

Chuang

On Fri, Jul 18, 2008 at 2:10 PM, Olga Natkovich <ol...@yahoo-inc.com> wrote:

> How was your bag created?
>
> Normally, you would load the data then group it into a bag using group
> by or group all and then apply the count:
>
> A = load 'input';
> B = group A all;
> C = foreach B generate COUNT(A);
>
> Olga
>
> > -----Original Message-----
> > From: charles du [mailto:taiping.du@gmail.com]
> > Sent: Friday, July 18, 2008 12:23 PM
> > To: pig-user@incubator.apache.org
> > Subject: how to get the size of a data bag
> >
> > Hi:
> >
> > I just started learning Hadoop and Pig Latin. How can I get the
> > number of elements in a data bag?
> >
> > For example, a data bag like the following has four elements.
> >   B = {1, 2, 3, 5}
> >
> > I tried C = COUNT(B), but it did not work. Thanks.
> >
> > --
> > tp
> >
>



-- 
tp
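Written out as a complete script, the GROUP ALL counting pattern from the quoted reply looks like this (a sketch; 'input' is a placeholder file name):

```pig
A = load 'input';
B = group A all;                 -- one group containing every record
C = foreach B generate COUNT(A); -- the size of that single bag
dump C;
```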

RE: how to get the size of a data bag

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Sorry, group should be after generate:

C = foreach B generate group, COUNT(A); 
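Putting the corrected pieces together, the per-type count reads as follows (a sketch; the file name and the second column are illustrative placeholders, not from the thread):

```pig
A = load 'input' as (type, value);      -- hypothetical schema
B = group A by type;                    -- one group per distinct type
C = foreach B generate group, COUNT(A); -- group key and bag size
dump C;
```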


Re: how to get the size of a data bag

Posted by charles du <ta...@gmail.com>.
Hi:

I tried the command, and Pig threw an exception complaining about the
syntax.

Also, just to make sure: should the command be the following?

  A = LOAD <my file> as (type, ...);
  B = GROUP A BY type;
  C = foreach B group, generate COUNT(A);

How does Pig decide when to use a combiner, and when not to?

Thanks.

Chuang





-- 
tp

RE: how to get the size of a data bag

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
The way your query is formulated, the combiner is not called, and that
would account for the slowness.

Try this:

A = LOAD <my file> as (type, ...);
B = GROUP  A  BY  type;
C = foreach A group, generate COUNT(A); 

You can check whether the combiner will be called by running

Explain C;

Olga
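COUNT is algebraic, so when the combiner does run, each map task emits one partial count per key and the reducer only sums those partials instead of scanning millions of raw values. A rough sketch of the idea in Python (not Pig internals; the record layout is made up for illustration):

```python
from collections import Counter

def map_side_combine(records):
    """Simulate a combiner: a map task emits one partial
    count per key instead of one record per input row."""
    return Counter(r["type"] for r in records)

def reduce_counts(partials):
    """The reducer sums a handful of partial counts per key,
    rather than iterating over every original value."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

# two "map tasks" over a two-type data set, as in the thread
split1 = [{"type": "a"}] * 3 + [{"type": "b"}] * 1
split2 = [{"type": "a"}] * 2 + [{"type": "b"}] * 4
totals = reduce_counts([map_side_combine(split1), map_side_combine(split2)])
print(dict(totals))  # -> {'a': 5, 'b': 5}
```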
