Posted to user@pig.apache.org by Thomas Foricher <th...@gmail.com> on 2013/04/30 16:09:46 UTC

Pig efficiency

Hello everybody,

I'm very new on Pig and Hadoop, and I'm trying to do some efficiency 
tests between different technologies. For that I'm using a movie 
database found here :

http://www.grouplens.org/taxonomy/term/14

It's a database of about 10 million entries, and it weighs about 250 MB. I know 
that is certainly not enough to see the real power of MapReduce, but I thought 
it could give an idea.

The data are formatted as follows:

movies.dat

     movieid::title::genres

ratings.dat

     userid::movieid::rating::timestamp
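
For illustration, records look something like this (note that in MovieLens the 
genres field is itself pipe-delimited):

     1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy   (movies.dat)
     1::122::5::838985046   (ratings.dat)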

1. MySQL

My first test, in MySQL, was to join the two tables on movieid, group them 
on genres, compute the average rating, and take the top 20, like so:

SELECT genres, AVG(a.rating)
FROM ratings a
JOIN movies b ON a.movieid = b.movieid
GROUP BY genres
ORDER BY AVG(a.rating) DESC
LIMIT 20;

It took about two minutes to get the results.

2. Python/Pandas

I did the same with Python and the pandas library, and it took about 4 
seconds to dump the results.

3. Hadoop/Pig

I ran an Ubuntu 12.04 64-bit server release on VMware Player, and installed 
Hadoop as a pseudo-distributed single-node cluster, along with Pig.

Here is my complete code to get the same results.

I had to clean the files first, since PigStorage's field delimiter must be a 
single character, so a two-character separator like '::' cannot be used directly:

-- Split each line on '::' and store the fields tab-delimited (PigStorage's
-- default), so they can be re-loaded with a typed schema afterwards.
movies = LOAD 'movie/movies.dat' AS (line:chararray);
moviesclean = FOREACH movies GENERATE FLATTEN(STRSPLIT(line, '::'));
STORE moviesclean INTO 'movies';

ratings = LOAD 'movie/ratings.dat' AS (line:chararray);
ratingsclean = FOREACH ratings GENERATE FLATTEN(STRSPLIT(line, '::'));
STORE ratingsclean INTO 'ratings';


-- Re-load the cleaned, tab-delimited files with typed schemas.
movies = LOAD 'movies' AS (movieid:int, title:chararray, genres:chararray);
ratings = LOAD 'ratings' AS (userid:int, movieid:int, rating:float, timestamp:chararray);

-- Join on movieid, group on the genres string, and average per group.
X = JOIN ratings BY movieid, movies BY movieid;
Y = GROUP X BY genres;
Z = FOREACH Y GENERATE group, AVG(X.rating);
STORE Z INTO 'results';

-- It took about 10 minutes to store the Z file

-- Re-load the averages, sort them descending, and keep the top 20.
Z = LOAD 'results' AS (genres:chararray, avgrating:float);
Zsorted = ORDER Z BY avgrating DESC;
results = LIMIT Zsorted 20;
DUMP results;

-- And here about 4 minutes

So in total it takes about 15 minutes to do the same analysis. Do you know 
if that is normal?
Is it just that Hadoop's power only shows up at terabytes of data?
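
One thing I will try next is putting everything in a single script, so Pig can 
plan the whole pipeline at once instead of materialising intermediate files on 
HDFS between jobs. A sketch of what I mean (untested):

-- Untested sketch: parse, join, aggregate, and rank in one script, so the
-- intermediate relations are never stored and re-loaded.
movies_raw  = LOAD 'movie/movies.dat'  AS (line:chararray);
movies      = FOREACH movies_raw GENERATE FLATTEN(STRSPLIT(line, '::'))
              AS (movieid, title, genres);
ratings_raw = LOAD 'movie/ratings.dat' AS (line:chararray);
ratings     = FOREACH ratings_raw GENERATE FLATTEN(STRSPLIT(line, '::'))
              AS (userid, movieid, rating, timestamp);
-- The untyped fields default to bytearray; AVG casts rating to double itself.
joined   = JOIN ratings BY movieid, movies BY movieid;
grouped  = GROUP joined BY genres;
averaged = FOREACH grouped GENERATE group AS genres, AVG(joined.rating) AS avgrating;
ranked   = ORDER averaged BY avgrating DESC;
top20    = LIMIT ranked 20;
DUMP top20;

I also understand that running the same script with "pig -x local" (local 
filesystem, no MapReduce jobs submitted) would show how much of the 15 minutes 
is per-job launch overhead rather than actual work.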

Thanks

Regards.

Thomas


Re: Pig efficiency

Posted by Thomas Foricher <th...@gmail.com>.
That's what I thought.

Thank you Gerrit!

Regards.

Thomas



Re: Pig efficiency

Posted by Gerrit Jansen van Vuuren <ge...@gmail.com>.
Hadoop is made for data starting at gigabytes. If you've got less than 1-5 GB 
of data, you can always do it faster on a single machine; the use case for 
Hadoop is when you have 40 GB, 200 GB, N TB, etc.

Remember that sending a job for distributed processing has a certain amount 
of overhead that you do not have when processing locally; this is why, for 
250 MB, you will always see faster processing times locally.


Regards,
 Gerrit
