Posted to user@pig.apache.org by S Malligarjunan <sm...@yahoo.com.INVALID> on 2014/07/11 19:21:59 UTC

Apache PIG performance benchmark

Hello All,

I am a newbie to Apache Pig, and I would like to know about the performance benchmarks of Apache Pig.

My current requirement is as follows:
I have a few files in 2 S3 buckets.
Each file has a minimum of 1 million records. The file data are tab-separated.
I have to compare a few columns and filter the records.

Right now I am using Hive, and it is taking more than 2 days to filter the records.
Please find the Hive query below:

INSERT OVERWRITE TABLE cnv_algo3 
SELECT * FROM table1 t1 JOIN table2 t2

  WHERE unix_timestamp(t2.time, 'yyyy-MM-dd HH:mm:ss,SSS') > unix_timestamp(t1.time, 'yyyy-MM-dd HH:mm:ss,SSS')
and compare(t1.column1, t1.column2, t2.column1, t2.column4);

Here compare is a UDF.
Assume table1 has 20 million records and table2 has 5 million records.
Let me know how much time Pig would take to filter the records on a standard configuration.

It is pretty urgent, as I need to decide whether to move the project to Pig. I highly appreciate your help.

 
Thanks and Regards,
Malligarjunan S.  

Re: Apache PIG performance benchmark

Posted by Paul Houle <on...@gmail.com>.
Knowing very little about your case, it is hard to give specifics.

I run jobs in Amazon EMR that process billions of records stored in
Amazon S3, and I've never felt that S3 reading or writing was a
problem, although the jobs finish a little quicker (maybe 10-20%) if I
store temporary data in HDFS instead of S3.  This job saturates the
CPU during the map phase, so I have not been in a hurry to optimize
I/O or investigate in depth.

Of course, I am running this inside the Amazon cloud and everything is
in the same region, so S3 performance is at its best.  If your cluster
is far away from Amazon network-wise, then S3 is going to perform worse.

You should try to understand the query plan for your query.  The first
problem that bothers me is that if compare() is a UDF, I think the
join would be inefficient with most tools.  I'd expect the system
would have to do a Cartesian join, or something close to one, and it
would probably need to produce 0.5 * 20 million * 5 million
= 50 trillion intermediate records.  That's really a lot!  I think
also that evaluating the time comparison would be unnecessarily
expensive if the system ends up parsing the timestamps for every
comparison.
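
Paul's back-of-envelope figure can be checked directly (a trivial sketch; the 0.5 factor is an assumption that the strict ">" time comparison keeps about half of all pairs on average):

```python
# Rough cost of a Cartesian join between the two tables described above.
t1_rows = 20_000_000
t2_rows = 5_000_000

# Every pair of rows becomes an intermediate record before filtering.
all_pairs = t1_rows * t2_rows

# Assume the strict ">" timestamp predicate keeps roughly half the pairs.
surviving = all_pairs // 2

print(f"{surviving:,} intermediate records")  # 50,000,000,000,000
```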

It makes sense to convert the dates to integers as early as possible
in the calculation, and also to get as much of the comparison work
done in a way that is visible to the query optimizer, so it can find
an efficient way to execute the join.
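
As a plain Python sketch of this point (not Pig or Hive code; the sample rows and the helper name are made up for illustration), parsing each timestamp once per record keeps the parse count at O(n + m), instead of O(n * m) when parsing happens inside every pairwise comparison:

```python
from datetime import datetime

# Hypothetical sample rows; "time" uses the same format as the Hive query.
t1_rows = [{"time": "2014-07-01 10:00:00,123"}, {"time": "2014-07-02 09:30:00,000"}]
t2_rows = [{"time": "2014-07-01 12:00:00,000"}, {"time": "2014-06-30 08:00:00,500"}]

FMT = "%Y-%m-%d %H:%M:%S,%f"  # Python analogue of 'yyyy-MM-dd HH:mm:ss,SSS'

def to_epoch(s):
    """Parse the string once and keep a cheap-to-compare number."""
    return datetime.strptime(s, FMT).timestamp()

# Convert each side once, up front: O(n + m) parses ...
t1_ts = [to_epoch(r["time"]) for r in t1_rows]
t2_ts = [to_epoch(r["time"]) for r in t2_rows]

# ... so the pairwise comparison below only compares numbers,
# rather than re-parsing two timestamps for each of the n * m pairs.
pairs = [(a, b) for a in t1_ts for b in t2_ts if b > a]
```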

-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254    paul.houle on Skype   ontology2@gmail.com

Re: Apache PIG performance benchmark

Posted by Rodrigo Ferreira <we...@gmail.com>.
Hi Malligarjunan,

I agree partially with Suraj. Of course staging data directly in the
cluster and using distributed cache and all these things are going to give
you the best performance.

Anyway, I'm also using some input data from S3, and this doesn't prevent
me from taking advantage of Hadoop's parallelism. I don't know exactly how
your environment integrates Hive and S3 buckets, but in our case we only
"pay" for the download and upload time; once our data is in the Hadoop
cluster, everything works well and performs well.

Regarding your problem, I think you could take a look at the Mortar Data
website. We are using it for our project, at least for fast testing and
prototyping. They use Amazon EMR and S3 as the back end, and you can write
Pig scripts using their web interface. The main feature of this service is
that you don't have to pay if the code you create there is open source (you
only pay for the Amazon costs). Maybe you can use this website to test how
Pig performs with your data, or at least with a sample of it.

I hope this helps a bit.

Rodrigo Ferreira.



Re: Apache PIG performance benchmark

Posted by Suraj Nayak <sn...@gmail.com>.
Hi Malligarjunan,

Pig and Hive, if you are not using Tez, convert the statements or SQL into
(multiple) MapReduce jobs which are launched in the cluster, and thus you
achieve parallel processing. But if you use S3, you cannot use the core
principle of Hadoop, i.e. data locality. The data has to travel to where it
is processed, which needs more time, and only one machine is processing the
data, so it turns out to be a sequential read. That's why it is taking
2 days or more to process the data.

I think choosing Pig or Hive depends on the use case. That is, if your
logic involves a lot of processing pipelines, transformations, and custom
algorithms, Pig will be the choice. Hive will be a better choice if the
problem you are solving can be answered using SQL-like statements.

Thus, I suggest rethinking and using HDFS instead of S3, as I can see your
query involves a join and your input data set is relatively large.
Otherwise you'll end up running one process for a very long time.

It also depends on the cluster size and machine configuration.

--
Suraj Nayak