Posted to user@pig.apache.org by Rodrigo Ferreira <we...@gmail.com> on 2014/07/24 15:11:58 UTC

Pigs don't fly

Hi everyone,

I have a question for you guys.

Well, I've started doing some experiments with the UDFs that I've created.
And at this point I'm interested in assessing their performance.

I have something like:

A = LOAD ... using JsonLoader();

B = FOREACH A GENERATE MyUDF();

This code, which is translated into a single map task (no reduce), takes 1:20
to execute. If I comment out the projection and just load the data, it takes
27 seconds. So the first assumption is that the rest of the time was spent in
MyUDF, right? Not quite.

I timed all the calls to exec() (using System.nanoTime()) and they don't
add up to more than 5 seconds. So where have the other 48 seconds gone?
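
A minimal sketch of how such a measurement might be taken, assuming the
per-call time is accumulated in a counter rather than printed on every call.
It is plain Java with no Pig classes; ExecTimingSketch, Timed and explode are
hypothetical names, and explode only stands in for a UDF body that turns one
input into several outputs. A timer like this covers the wrapped call only, so
whatever happens to the returned collection afterwards is outside the
measurement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class ExecTimingSketch {

    // Hypothetical wrapper: accumulates wall-clock time spent inside the
    // wrapped function, in the spirit of timing every exec() call.
    static final class Timed<I, O> implements Function<I, O> {
        private final Function<I, O> delegate;
        private long totalNanos = 0L;
        private long calls = 0L;

        Timed(Function<I, O> delegate) {
            this.delegate = delegate;
        }

        @Override
        public O apply(I input) {
            long start = System.nanoTime();
            try {
                return delegate.apply(input);
            } finally {
                // Count the call even if the delegate throws.
                totalNanos += System.nanoTime() - start;
                calls++;
            }
        }

        long totalNanos() { return totalNanos; }
        long calls() { return calls; }
    }

    // Stand-in for the UDF body: one input string becomes several outputs.
    static List<String> explode(String input) {
        List<String> out = new ArrayList<>();
        for (String part : input.split(",")) {
            out.add(part.trim());
        }
        return out;
    }

    public static void main(String[] args) {
        Timed<String, List<String>> timed = new Timed<>(ExecTimingSketch::explode);
        long outputs = 0;
        for (int i = 0; i < 100_000; i++) {
            outputs += timed.apply("a, b, c, d").size();
        }
        System.out.printf("calls=%d outputs=%d seconds in wrapped call=%.3f%n",
                timed.calls(), outputs, timed.totalNanos() / 1e9);
    }
}
```

With the figures above, 80 s in total minus 27 s for the load alone leaves
53 s, of which the timed calls explain about 5 s.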

The output of my UDF is a bag. Basically for each input tuple I "create"
several output tuples and put them in a bag.

Thanks,

Rodrigo Ferreira.

Re: Pigs don't fly

Posted by Suraj Nayak <sn...@gmail.com>.

Rodrigo,

How much data is the UDF processing? I have used a UDF on input containing
3 billion+ records that produced 1 billion+ records. In my case, at most
1200 records entered the UDF as a bag. That is, I was grouping the data to
isolate the task to those sets of records. Those grouped bags were sent to
the UDF. The Pig code with the UDF used to finish in approximately 15 minutes
on 90+ nodes.

If you send all your data to the UDF (in only one bag), you cannot utilize
Hadoop's parallelism. It will be as bad as processing all the data on one
machine.

Note: you can use the Algebraic interface.
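
For context, Pig's Algebraic interface lets an aggregate UDF be computed in
three stages (initial, intermediate, final), so partial results can be combined
before the final step. Below is a minimal plain-Java sketch of that shape,
using a count as the example. It uses no Pig classes and every name in it is
hypothetical; it only applies when the UDF is an aggregate over a bag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AlgebraicCountSketch {

    // Initial stage: each input record contributes a partial count of 1.
    static long initial(String record) {
        return 1L;
    }

    // Intermediate stage: merge partial counts into one partial count.
    // This step can be applied repeatedly on subsets of the partials.
    static long intermediate(List<Long> partials) {
        long sum = 0L;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    // Final stage: merge the remaining partial counts into the result.
    static long finalStage(List<Long> partials) {
        return intermediate(partials);
    }

    // Helper simulating one task that pre-aggregates its own records.
    static long partialCount(List<String> records) {
        List<Long> ones = new ArrayList<>();
        for (String r : records) {
            ones.add(initial(r));
        }
        return intermediate(ones);
    }

    public static void main(String[] args) {
        long p1 = partialCount(Arrays.asList("a", "b", "c"));
        long p2 = partialCount(Arrays.asList("d", "e"));
        System.out.println(finalStage(Arrays.asList(p1, p2)));
    }
}
```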

Using a UDF requires careful attention to the data.

If possible, kindly let us know the size of the input. How are you using the
UDF? What is the maximum amount of data going into the UDF? How many compute
nodes do you have? (This is just to understand whether the data is too much
to process on a small number of machines; if it is private data, it need not
be published.)

Thanks
Suraj Nayak
On 24-Jul-2014 10:10 PM, "Rodrigo Ferreira" <we...@gmail.com> wrote:

> Yes, Satish. I did. It seems that at the end of the day I won't get the
> performance that I'd like to have with Pig so easily. Right now, I'm
> considering some alternatives.
>
> The main one is to rewrite the scripts. In fact, we have an old system and
> I'm "translating" its language to Pig. Maybe this translation (a naïve one,
> I have to say) is not considering Pig's limitations and specific
> characteristics. I'll start with that.
>
> Thanks,
> Rodrigo.

Re: Pigs don't fly

Posted by Rodrigo Ferreira <we...@gmail.com>.

Yes, Satish. I did. It seems that at the end of the day I won't get the
performance that I'd like to have with Pig so easily. Right now, I'm
considering some alternatives.

The main one is to rewrite the scripts. In fact, we have an old system and
I'm "translating" its language to Pig. Maybe this translation (a naïve one,
I have to say) is not considering Pig's limitations and specific
characteristics. I'll start with that.

Thanks,
Rodrigo.


2014-07-24 18:25 GMT+02:00 Satish Kolli <fe...@gmail.com>:

> Generally, MapReduce jobs take some time to get set up and distributed. Did
> you account for that time?
>
> Thanks

Re: Pigs don't fly

Posted by Satish Kolli <fe...@gmail.com>.

Generally, MapReduce jobs take some time to get set up and distributed. Did
you account for that time?

Thanks
On Jul 24, 2014 12:18 PM, "Rodrigo Ferreira" <we...@gmail.com> wrote:

> You are right, Paul. No doubt about that. Unfortunately, the project I'm
> involved in is closely related to Pig so I have to get the best from it.
>
> Pig is great, don't get me wrong. I'm just trying to understand if there's
> still something that can be done to tune its performance or if this is the
> best I can get.
>
> Thanks,
> Rodrigo.

Re: Pigs don't fly

Posted by Rodrigo Ferreira <we...@gmail.com>.

You are right, Paul. No doubt about that. Unfortunately, the project I'm
involved in is closely related to Pig so I have to get the best from it.

Pig is great, don't get me wrong. I'm just trying to understand if there's
still something that can be done to tune its performance or if this is the
best I can get.

Thanks,
Rodrigo.


2014-07-24 18:06 GMT+02:00 Paul Houle <on...@gmail.com>:

> I don't think anybody uses Pig because it is an efficient use of a
> computer cluster. Instead people use it because it is an efficient
> use of their time.
>
> If you're getting to the point where CPU performance matters you can
> generally write a plain Hadoop job that is faster, particularly if
> you think a lot about the algorithms and data structures.

Re: Pigs don't fly

Posted by Paul Houle <on...@gmail.com>.

I don't think anybody uses Pig because it is an efficient use of a
computer cluster. Instead people use it because it is an efficient
use of their time.

If you're getting to the point where CPU performance matters you can
generally write a plain Hadoop job that is faster, particularly if
you think a lot about the algorithms and data structures.


-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254    paul.houle on Skype   ontology2@gmail.com