Posted to user@pig.apache.org by Mark Snow <ma...@yahoo.com> on 2008/06/26 05:07:28 UTC

Slow tutorial?

Hi All,

I downloaded the Pig tutorial to give it a whirl, set it up on a Hadoop cluster I've used for a few other tasks (7 nodes, EC2), and followed the instructions to launch tutorial script1 with the excite bz2 file on HDFS. Two things jumped out:

1) Only one mapper launched
2) It's really slow. It has been running for almost 5 hours and the mapper is still under 10% complete

Have I misconfigured something? What's a good benchmark run time for the tutorial scripts to complete?



      

RE: Slow tutorial?

Posted by Amir Youssefi <am...@yahoo-inc.com>.
Also, using the bz2 file gives an error; my runs were with excite.log, the uncompressed version of excite.log.bz2:

[amiry@gsgw1011 pigtmp]$ java -cp pig_latest.jar org.apache.pig.Main -x local script1-local.pig
2008-06-26 20:27:09,708 [main] ERROR org.apache.pig.tools.grunt.Grunt - java.io.IOException: Unable to store alias null
        at org.apache.pig.impl.util.WrappedIOException.wrap(WrappedIOException.java:16)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:296)
        at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:457)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:233)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:63)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:60)
        at org.apache.pig.Main.main(Main.java:294)
[unreadable binary output omitted]
        at org.apache.pig.backend.local.executionengine.LocalExecutionEngine.execute(LocalExecutionEngine.java:136)
        at org.apache.pig.backend.local.executionengine.LocalExecutionEngine.execute(LocalExecutionEngine.java:27)
        at org.apache.pig.PigServer.optimizeAndRunQuery(PigServer.java:413)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:293)
        ... 5 more

Amir 


RE: Slow tutorial?

Posted by Amir Youssefi <am...@yahoo-inc.com>.
I built the latest pig.jar and tested defaults/pig.properties with PIG-235.

Local mode is still running after half an hour and may not finish for hours.

On 3 nodes, Hadoop/mapreduce mode finished in less than 10 minutes (similar to our old runs).

Amir



RE: Slow tutorial?

Posted by Amir Youssefi <am...@yahoo-inc.com>.
Checking the code that was just committed, I see that the defaults are there:

-    private static long gcActivationSize = Long.MAX_VALUE ;
-    private static long spillFileSizeThreshold = 0L ;
+    // if we freed at least this much, invoke GC 
+    // (default 40 MB - this can be overridden by user supplied property)
+    private static long gcActivationSize = 40000000L ;
     
+    // spill file size should be at least this much
+    // (default 5MB - this can be overridden by user supplied property)
+    private static long spillFileSizeThreshold = 5000000L ;
+    
+    // this will keep track of memory freed across spills
+    // and between GC invocations
+    private static long accumulatedFreeSize = 0L;
+    
+    // fraction of biggest heap for which we want to get
+    // "memory usage threshold exceeded" notifications
+    private static double memoryThresholdFraction = 0.7;
+    
+    // fraction of biggest heap for which we want to get
+    // "collection threshold exceeded" notifications
+    private static double collectionMemoryThresholdFraction = 0.5;


So I am running it again to see how it goes this time. 

Amir 
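
For readers following along, here is a minimal sketch of how the two thresholds in the diff above might interact, going only by the comments in that diff. It is not Pig's actual implementation: the class SpillPolicy and its methods shouldSpill and recordSpill are hypothetical names made up for illustration. Only the field names and the default values (5000000 and 40000000 bytes) come from the diff.

```java
// Hypothetical illustration only; this is NOT the code committed to Pig.
final class SpillPolicy {
    // Spill files should be at least this big (5 MB default in the diff).
    private final long spillFileSizeThreshold;
    // Request a GC once at least this much has been freed (40 MB default in the diff).
    private final long gcActivationSize;
    // Memory freed across spills since the last GC request.
    private long accumulatedFreeSize = 0L;

    SpillPolicy(long spillFileSizeThreshold, long gcActivationSize) {
        this.spillFileSizeThreshold = spillFileSizeThreshold;
        this.gcActivationSize = gcActivationSize;
    }

    /** Returns true if the data is large enough to be worth a temp file. */
    boolean shouldSpill(long estimatedBytes) {
        return estimatedBytes >= spillFileSizeThreshold;
    }

    /**
     * Records the bytes freed by one spill. Returns true when enough memory
     * has accumulated that the caller should request a garbage collection;
     * the counter is reset in that case.
     */
    boolean recordSpill(long freedBytes) {
        accumulatedFreeSize += freedBytes;
        if (accumulatedFreeSize >= gcActivationSize) {
            accumulatedFreeSize = 0L;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        SpillPolicy policy = new SpillPolicy(5000000L, 40000000L);
        long[] candidateSizes = {1000L, 6000000L, 30000000L, 15000000L};
        for (long size : candidateSizes) {
            if (!policy.shouldSpill(size)) {
                System.out.println(size + " bytes: too small, not spilled");
                continue;
            }
            boolean gc = policy.recordSpill(size);
            System.out.println(size + " bytes: spilled, request GC = " + gc);
        }
    }
}
```

With the old values shown on the removed lines (a threshold of 0 and an activation size of Long.MAX_VALUE), a policy like this would spill everything and never request a GC, which would be consistent with the large number of small spill files the new defaults are meant to avoid.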


RE: Slow tutorial?

Posted by Amir Youssefi <am...@yahoo-inc.com>.
Hi Mark, 

 The pig.jar that comes with the tutorial is old and doesn't have pig.properties.

 Try making a new build (June 26th or later) and make sure you have
these in pig.properties: 

#Do not spill temp files smaller than this size (bytes)
pig.spill.size.threshold=5000000
#EXPERIMENT: Activate garbage collection when spilling a file bigger than this size (bytes)
#This should help reduce the number of files being spilled.
pig.spill.gc.activation.size=40000000

or similar numbers...

Amir
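
As a rough sketch of how settings like the two above could be read with fallback defaults, the snippet below uses the standard java.util.Properties API. It is an assumption-laden illustration, not how Pig itself loads pig.properties: the class SpillSettings and its helper readLong are hypothetical. Only the two property keys and their values come from this message.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper for illustration; not part of Pig.
final class SpillSettings {
    // Fallback values, matching the numbers suggested in this thread.
    static final long DEFAULT_SPILL_SIZE_THRESHOLD = 5000000L;
    static final long DEFAULT_GC_ACTIVATION_SIZE = 40000000L;

    final long spillSizeThreshold;
    final long gcActivationSize;

    private SpillSettings(long spillSizeThreshold, long gcActivationSize) {
        this.spillSizeThreshold = spillSizeThreshold;
        this.gcActivationSize = gcActivationSize;
    }

    /** Builds settings from properties, using the defaults for missing or invalid keys. */
    static SpillSettings from(Properties props) {
        return new SpillSettings(
            readLong(props, "pig.spill.size.threshold", DEFAULT_SPILL_SIZE_THRESHOLD),
            readLong(props, "pig.spill.gc.activation.size", DEFAULT_GC_ACTIVATION_SIZE));
    }

    private static long readLong(Properties props, String key, long fallback) {
        String raw = props.getProperty(key);
        if (raw == null) {
            return fallback;
        }
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            // An unparsable value falls back to the default instead of failing.
            return fallback;
        }
    }

    public static void main(String[] args) throws IOException {
        // Inline text stands in for the contents of a pig.properties file.
        String text = "pig.spill.size.threshold=5000000\n"
                    + "pig.spill.gc.activation.size=40000000\n";
        Properties props = new Properties();
        props.load(new StringReader(text));

        SpillSettings settings = SpillSettings.from(props);
        System.out.println("spill threshold (bytes): " + settings.spillSizeThreshold);
        System.out.println("gc activation (bytes):   " + settings.gcActivationSize);
    }
}
```

Note that in a properties file every comment line needs its own leading #; a wrapped comment line without one would be parsed as a key/value pair.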
