Posted to user@pig.apache.org by Markus Resch <ma...@adtech.de> on 2012/05/31 11:38:51 UTC

Job setup for a pig run takes ages

Hi all,

when we run a pig job to aggregate some slightly compressed avro data
(~160 GByte), it takes ages until the first actual mapred job starts:
15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error messages
to:
[...]
15:57:27,816 [main] INFO  org.apache.pig.tools.pigstats.ScriptState -
Pig features used in the script: 
[...]
16:07:00,969 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
[...]
16:07:30,886 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
[...]
16:15:38,022 [Thread-16] INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
paths to process : 50353

These log messages are from our test cluster, which has a dedicated
jobtracker and a dedicated namenode plus 5 data nodes with a map task
capacity of 15 and a reduce task capacity of 10. There were 6899 map
tasks and 464 reduce tasks set up.

During the initialisation phase we were observing the workload and
memory usage of the jobtracker, the namenode and some data nodes using
top. They were fairly bored nearly all the time (e.g. 30% CPU load on
the namenode, totally idle on the data nodes). When the jobs were
running, most of the tasks were in "waiting for IO" most of the time.
It seemed some swap space was reserved but rarely used during those
times.

To us it looks like a hadoop config issue, but we have no idea what
exactly it could be. Jobs with about 10 GBytes of input data were
running quite well.

Any hint on where to tweak would be appreciated.

Thanks
Markus


Re: Job setup for a pig run takes ages

Posted by Thejas Nair <th...@hortonworks.com>.
Hi Markus,
Thanks for reporting the results of the change, and the jstack.
The jstack information is useful; as I suspected, the time is being
spent finding the schema from the load function (which takes really
long because avro must be stat'ing each of the large number of files
to determine the schema).
We can also improve things in pig by making fewer calls to the 
LoadFunc's getSchema().
Thanks,
Thejas
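
A possible user-side mitigation in the meantime is to keep AvroStorage from
opening a large number of input files just to discover and cross-check the
schema. The following is only a sketch: it assumes the piggybank AvroStorage
build in use supports the 'no_schema_check' and 'schema' constructor options
(documented for piggybank AvroStorage of that era, but verify against your
jar), and the path and alias names are made up:

-- (piggybank.jar and the avro jars need to be REGISTERed as usual)
-- Assumed option: skip the cross-file schema consistency check, so that
-- (presumably) only the first file is opened to discover the schema.
DEFINE AvroIn org.apache.pig.piggybank.storage.avro.AvroStorage('no_schema_check');

dataImport = LOAD '/data/avro/2012/05/part-*' USING AvroIn;

If the build also supports the 'schema' option, passing the full record
schema as a JSON string should avoid even that first read at plan time.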


On 6/4/12 10:28 AM, Markus Resch wrote:
> Hi Thejas,
>
> Starting from your assumption, we did some investigation: we generated
> some test data in chunks of 500 MByte and ran the script on that, and
> the result was extremely fast.
>
> Thanks for that hint.
>
> Markus
>
> I also did that jstack thing, and here is where the thread hangs (in
> both "lags", btw):
>
>
> "main" prio=10 tid=0x000000005cb1e800 nid=0x545d runnable [0x0000000041e85000]
>     java.lang.Thread.State: RUNNABLE
>          at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>          at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
>          at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>          at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>          - locked <0x00000000c2725498> (a sun.nio.ch.Util$2)
>          - locked <0x00000000c2725488> (a java.util.Collections$UnmodifiableSet)
>          - locked <0x00000000c2725260> (a sun.nio.ch.EPollSelectorImpl)
>          at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>          at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
>          at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>          at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>          at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>          at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>          at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>          - locked <0x00000000eb76dec0> (a java.io.BufferedInputStream)
>          at java.io.DataInputStream.readShort(DataInputStream.java:295)
>          at org.apache.hadoop.hdfs.DFSClient$RemoteBlockReader.newBlockReader(DFSClient.java:1664)
>          at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.getBlockReader(DFSClient.java:2383)
>          at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2056)
>          - locked <0x00000000eb768c20> (a org.apache.hadoop.hdfs.DFSClient$DFSInputStream)
>          at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2170)
>          - locked <0x00000000eb768c20> (a org.apache.hadoop.hdfs.DFSClient$DFSInputStream)
>          at java.io.DataInputStream.read(DataInputStream.java:132)
>          at org.apache.avro.io.BinaryDecoder$InputStreamByteSource.readRaw(BinaryDecoder.java:804)
>          at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:331)
>          at org.apache.avro.io.BinaryDecoder.readFixed(BinaryDecoder.java:287)
>          at org.apache.avro.io.Decoder.readFixed(Decoder.java:143)
>          at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:100)
>          at org.apache.avro.file.DataFileStream.<init>(DataFileStream.java:84)
>          at org.apache.pig.piggybank.storage.avro.AvroStorage.getSchema(AvroStorage.java:217)
>          at org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:168)
>          at org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:144)
>          at org.apache.pig.piggybank.storage.avro.AvroStorage.getSchema(AvroStorage.java:297)
>          at org.apache.pig.impl.logicalLayer.LOLoad.determineSchema(LOLoad.java:186)
>          at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:151)
>          at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:851)
>          at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
>          at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1612)
>          at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1562)
>          at org.apache.pig.PigServer.registerQuery(PigServer.java:534)
>          at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:871)
>          at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:388)
>          at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
>          at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
>          at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
>          at org.apache.pig.Main.run(Main.java:500)
>          at org.apache.pig.Main.main(Main.java:107)
>
>
>
>
> On Friday, 01.06.2012, at 09:45 -0700, Thejas Nair wrote:
>> Can you do a jstack <pid> on the pig client process a few times, see
>> what it is doing when it is taking so long, and send the results?
>> With that we should be able to easily identify what might be happening
>> and suggest a fix.
>>
>> I think pig calls getSchema() on load functions more times than actually
>> necessary during the query optimization phase (each time after a
>> transform). One theory I have is that your avro load function does a
>> stat on all the 50k files each time to determine the common schema. This
>> might be causing the delay in generating the query plan.
>>
>> The second large lag, before the job gets queued, seems to be time spent
>> in input split calculations, not within pig logic.
>>
>> These are just theories I have. The jstack output would help determine
>> if this is actually the case.
>>
>> Also, I think this delay is likely to be caused by the number of input
>> files, and not the actual data size. You might also want to look at ways
>> to reduce the actual number of input files. Is it generated by a
>> previous MR job ? If yes, given your cluster size, it works against you
>> to have so many maps or reducers in the previous MR job. If pig is
>> generating the data,  you should consider setting the parallelism
>> (reducers) or the number of maps (see
>> http://pig.apache.org/docs/r0.10.0/perf.html#combine-files) .
>> While reducing number of input files will most likely solve your
>> problem, I am still interested in knowing where the delays are coming
>> from and if we can fix something in pig to improve the situation.
>>
>> Thanks,
>> Thejas
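
The combine-files and parallelism settings pointed to above come down to a
few script-level switches. A rough sketch follows; the property names are the
ones in the linked Pig 0.10 docs, the sizes, reducer counts and alias/field
names are made-up examples, and whether split combination actually kicks in
depends on the load function:

-- pack many small input files into fewer, larger map splits
SET pig.splitCombination true;
SET pig.maxCombinedSplitSize 536870912;  -- aim for roughly 512 MB per map

-- cap the reducer count instead of letting it be derived from total input size
SET default_parallel 50;

-- or override parallelism per operator
grouped = GROUP records BY campaign_id PARALLEL 50;
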
>>
>>
>> On 6/1/12 6:34 AM, Markus Resch wrote:
>>> Hi Prashant, Hi Thejas,
>>>
>>> thanks for your very quick answer.
>>> No, this is not a typo. Those time stamps are true and as I said the
>>> machines are not very busy during this time.
>>>
>>> As this is our test cluster I am sure I am the only one who is running
>>> jobs on it. Another issue we have is that we are currently only able to
>>> run one job at a time but this shouldn't be the topic of this request.
>>> We don't even have a continuous input stream to that cluster; we just
>>> copied a bunch of data to it some time ago.
>>> From my perspective, the 464 GB of input data you mentioned is the
>>> uncompressed size of the 160 GByte of compressed files, which is what I
>>> get when I use hadoop fs -dus on that folder.
>>>
>>> Another interesting fact for you could be that we're running the
>>> cloudera CDH3 Update 3 version on our systems.
>>>
>>> Could this be due to some fancy avro schema validation implicitly
>>> executed by the avro storage? If so, can it be avoided?
>>>
>>> Sadly I'm currently not able to provide the actual script as it contains
>>> confidential information, but I will try to provide a version as soon as
>>> possible. I'd rather suspect a configuration problem in hadoop or pig
>>> anyway, as the script works fine with a smaller amount of input data.
>>>
>>> I would ask the hadoop mailing list if this issue occurred during the
>>> actual mapred run, but as it occurs even before a single mapred job is
>>> launched, I suspect pig has a problem.
>>>
>>> Thanks
>>> Markus
>>>
>>> This is the full log until the main work job starts:
>>> mapred@ournamenode$ pig OurScript.pig
>>> 2012-05-30 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error
>>> messages to: /tmp/pig_1338384441037.log
>>> 2012-05-30 15:27:21,368 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>> Connecting to hadoop file system at: hdfs://OurNamenode:9000
>>> 2012-05-30 15:27:21,609 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>> Connecting to map-reduce job tracker at:
>>> dev-jobtracker001.eu-fra.adtech.com:54311
>>> 2012-05-30 15:57:27,814 [main] WARN  org.apache.pig.PigServer -
>>> Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
>>> 2012-05-30 15:57:27,816 [main] INFO
>>> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
>>> script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
>>> 2012-05-30 15:57:27,816 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>> pig.usenewlogicalplan is set to true. New logical plan will be used.
>>> 2012-05-30 16:06:55,304 [main] INFO
>>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
>>> for CampaignInfo: $0, $1, $2, $4, $5, $6, $8, $9
>>> 2012-05-30 16:06:55,308 [main] INFO
>>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
>>> for dataImport: $2, $3, $4
>>> 2012-05-30 16:06:55,441 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>>> OutputData1:
>>> Store(SomeOutputFile1.csv:org.apache.pig.builtin.PigStorage) - scope-521
>>> Operator Key: scope-521)
>>> 2012-05-30 16:06:55,441 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>>> OutputData2:
>>> Store(/SomeOutputFile2.csv:org.apache.pig.builtin.PigStorage) -
>>> scope-524 Operator Key: scope-524)
>>> 2012-05-30 16:06:55,441 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>>> OutputData2:
>>> Store(/SomeOutputFile3.csv:org.apache.pig.builtin.PigStorage) -
>>> scope-483 Operator Key: scope-483)
>>> 2012-05-30 16:06:55,453 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
>>> - File concatenation threshold: 100 optimistic? false
>>> 2012-05-30 16:06:55,467 [main] INFO
>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
>>> paths to process : 1
>>> 2012-05-30 16:06:55,471 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
>>> input paths to process : 1
>>> 2012-05-30 16:06:55,483 [main] WARN
>>> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library
>>> is available
>>> 2012-05-30 16:06:55,484 [main] INFO
>>> org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop
>>> library
>>> 2012-05-30 16:06:55,484 [main] INFO
>>> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library
>>> loaded
>>> 2012-05-30 16:06:55,486 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
>>> input paths (combined) to process : 1
>>> 2012-05-30 16:06:55,486 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
>>> - number of input files: 1
>>> 2012-05-30 16:06:55,516 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 7
>>> 2012-05-30 16:06:55,516 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
>>> 2012-05-30 16:06:55,516 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
>>> 2012-05-30 16:06:55,517 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
>>> 2012-05-30 16:06:55,517 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
>>> 2012-05-30 16:06:55,517 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 0 out of total 2 MR operators.
>>> 2012-05-30 16:06:55,517 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 3
>>> 2012-05-30 16:06:56,131 [main] INFO
>>> org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
>>> added to the job
>>> 2012-05-30 16:06:56,138 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
>>> 2012-05-30 16:07:00,432 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
>>> 2012-05-30 16:07:00,468 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission
>>> 2012-05-30 16:07:00,969 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
>>> 2012-05-30 16:07:01,460 [Thread-4] INFO
>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
>>> paths to process : 1
>>> 2012-05-30 16:07:01,460 [Thread-4] INFO
>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
>>> input paths to process : 1
>>> 2012-05-30 16:07:01,462 [Thread-4] INFO
>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
>>> input paths (combined) to process : 1
>>> 2012-05-30 16:07:02,353 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201205080931_0178
>>> 2012-05-30 16:07:02,353 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://ourJobtracker:50030/jobdetails.jsp?jobid=job_201205080931_0178
>>> 2012-05-30 16:07:12,421 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 16% complete
>>> 2012-05-30 16:07:13,929 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 33% complete
>>> 2012-05-30 16:07:16,969 [main] INFO
>>> org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
>>> added to the job
>>> 2012-05-30 16:07:16,970 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
>>> 2012-05-30 16:07:26,888 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
>>> 2012-05-30 16:07:30,886 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
>>> 2012-05-30 16:07:30,886 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 464
>>> 2012-05-30 16:07:30,947 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
>>> 2012-05-30 16:15:38,022 [Thread-16] INFO
>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
>>> paths to process : 50353
>>> 2012-05-30 16:15:54,421 [Thread-16] INFO
>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
>>> input paths (combined) to process : 6899
>>> 2012-05-30 16:15:57,844 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201205080931_0179
>>>
>>>
>>>
>>>
>>> On Thursday, 31.05.2012, at 02:57 -0700, Prashant Kommireddi wrote:
>>>> Hi Markus,
>>>>
>>>> Have you checked on the JobTracker, at the time of launching the job,
>>>> whether Map slots were available?
>>>>
>>>> Looks like the input dataset size is ~464 GB. Since you mentioned 10 GB
>>>> jobs are running fine, there should be no reason a larger dataset should be
>>>> stuck, at least not on the Pig side. I can't think of a good reason why the job
>>>> does not take off other than the fact that cluster was busy running some
>>>> other job.
>>>>
>>>> I see that the number of files being processed is large, 50353. That could
>>>> be a reason for slowness, but ~8 minutes as shown in the logs seems to be
>>>> on the higher end for that.
>>>>
>>>> Maybe also post your script here.
>>>>
>>>> On Thu, May 31, 2012 at 2:38 AM, Markus Resch<ma...@adtech.de>wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> when we're running a pig job for aggregating some amount of slightly
>>>>> compressed avro data (~160GByte), the time until the first actual mapred
>>>>> job starts takes ages:
>>>>> 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error messages
>>>>> to:
>>>>> [...]
>>>>> 15:57:27,816 [main] INFO  org.apache.pig.tools.pigstats.ScriptState -
>>>>> Pig features used in the script:
>>>>> [...]
>>>>> 16:07:00,969 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>>>>> - 0% complete
>>>>> [...]
>>>>> 16:07:30,886 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
>>>>> - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
>>>>> [...]
>>>>> 16:15:38,022 [Thread-16] INFO
>>>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
>>>>> paths to process : 50353
>>>>>
>>>>> This log messages are from our test cluster which has a dedicated
>>>>> jobtracker and namenode each and 5 data nodes with a map task capacity
>>>>> of 15 and a reduce task capacity of 10. There were 6899 map tasks and
>>>>> 464 reduce tasks set up.
>>>>>
>>>>> During the initialisation phase we were observing the work load and
>>>>> memory usage of jobtracker, namenode and some data nodes using top.
>>>>> Those were nearly all the time kind of bored (e.g. 30% cpu load on the
>>>>> namenode, total idle on he data nodes). When the jobs were running most
>>>>> of the tasks where in "waiting for IO" most of the time. It seemed there
>>>>> was some swapping space reserved but rarely used in those times.
>>>>>
>>>>> In our eyes it looks like a hadoop config issue but we have no idea what
>>>>> it exaclty could be. Jobs with about 10GBytes of input data were running
>>>>> quite well.
>>>>>
>>>>> Any hint where to tweak will be appreciated
>>>>>
>>>>> Thanks
>>>>> Markus
>>>>>
>>>>>
>>>
>>
>


Re: Job setup for a pig run takes ages

Posted by Markus Resch <ma...@adtech.de>.
Hi Thejas,

Starting from your assumption, we did some investigation: we generated
some test data in chunks of 500 MByte and ran the script on that, and
the result was extremely fast.

Thanks for that hint. 

Markus

I also did that jstack thing, and here is where the thread hangs (in
both "lags", btw):


"main" prio=10 tid=0x000000005cb1e800 nid=0x545d runnable
[0x0000000041e85000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
        - locked <0x00000000c2725498> (a sun.nio.ch.Util$2)
        - locked <0x00000000c2725488> (a java.util.Collections$UnmodifiableSet)
        - locked <0x00000000c2725260> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        - locked <0x00000000eb76dec0> (a java.io.BufferedInputStream)
        at java.io.DataInputStream.readShort(DataInputStream.java:295)
        at org.apache.hadoop.hdfs.DFSClient$RemoteBlockReader.newBlockReader(DFSClient.java:1664)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.getBlockReader(DFSClient.java:2383)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2056)
        - locked <0x00000000eb768c20> (a org.apache.hadoop.hdfs.DFSClient$DFSInputStream)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2170)
        - locked <0x00000000eb768c20> (a org.apache.hadoop.hdfs.DFSClient$DFSInputStream)
        at java.io.DataInputStream.read(DataInputStream.java:132)
        at org.apache.avro.io.BinaryDecoder$InputStreamByteSource.readRaw(BinaryDecoder.java:804)
        at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:331)
        at org.apache.avro.io.BinaryDecoder.readFixed(BinaryDecoder.java:287)
        at org.apache.avro.io.Decoder.readFixed(Decoder.java:143)
        at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:100)
        at org.apache.avro.file.DataFileStream.<init>(DataFileStream.java:84)
        at org.apache.pig.piggybank.storage.avro.AvroStorage.getSchema(AvroStorage.java:217)
        at org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:168)
        at org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:144)
        at org.apache.pig.piggybank.storage.avro.AvroStorage.getSchema(AvroStorage.java:297)
        at org.apache.pig.impl.logicalLayer.LOLoad.determineSchema(LOLoad.java:186)
        at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:151)
        at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:851)
        at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
        at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1612)
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1562)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:534)
        at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:871)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:388)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
        at org.apache.pig.Main.run(Main.java:500)
        at org.apache.pig.Main.main(Main.java:107)




On Friday, 01.06.2012, at 09:45 -0700, Thejas Nair wrote:
> Can you do a jstack <pid> on the pig client process id a few times and 
> see what it is doing when it is taking so long and send the results ?
> With that we should be able to easily identify what might be happening 
> and suggest a fix.
> 
> I think pig calls getSchema() on load functions more times than actually 
> necessary during the query optimization phase (each time after a 
> transform). One theory I have is that your avro load function does a 
> stat on all the 50k files each time to determine the common schema. This 
> might be causing the delay to generate query plan.
> 
> About the second large lag before queue seems to be time spend in input 
> split calculations, not within pig logic.
> 
> These are just theories I have. The jstack output would help determine 
> if this is actually the case.
> 
> Also, I think this delay is likely to be caused by the number of input 
> files, and not the actual data size. You might also want to look at ways 
> to reduce the actual number of input files. Is it generated by a 
> previous MR job ? If yes, given your cluster size, it works against you 
> to have so many maps or reducers in the previous MR job. If pig is 
> generating the data,  you should consider setting the parallelism 
> (reducers) or the number of maps (see 
> http://pig.apache.org/docs/r0.10.0/perf.html#combine-files) .
> While reducing number of input files will most likely solve your 
> problem, I am still interested in knowing where the delays are coming 
> from and if we can fix something in pig to improve the situation.
> 
> Thanks,
> Thejas
> 
> 
> On 6/1/12 6:34 AM, Markus Resch wrote:
> > Hi Prashant, Hi Thejas,
> >
> > thanks for your very quick answer.
> > No, this is not a typo. Those time stamps are true and as I said the
> > machines are not very busy during this time.
> >
> > As this is our test cluster I am sure I am the only one who is running
> > jobs on it. Another issue we have is that we are currently only able to
> > run one job at a time but this shouldn't be the topic of this request.
> > We even have no continuous input stream to that cluster but copied a
> > bunch of data to it some time ago.
> >  From my perspective the 464 GB of input data you are mentioned is the
> > uncompressed amount of the 160GByte compressed files. Which I get when I
> > use hadoop -f dus on that folder.
> >
> > Another interesting fact for you could be that we're running the
> > cloudera CDH3 Update 3 version on our systems.
> >
> > I suspect this could be due to some fancy avro schema validation
> > implicitly executed by the avro storage? If so, can this be avoided?
> >
> > Sadly I'm currently not able to provide you the actual script currently
> > as it contains confidential information but I will try to provide you a
> > version as soon as possible. But I'd rather think of a configuration
> > problem to the hadoop or pig anyways as the script works fine with a
> > smaller amount of input data
> >
> > I would ask the hadoop mailing list if this issue would occur during the
> > actual mapred run but as this occur even before a single mapred job is
> > launched I suspect pig to have a problem.
> >
> > Thanks
> > Markus
> >
> > This is the full log until the main work job starts:
> > mapred@ournamenode$ pig OurScript.pig
> > 2012-05-30 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error
> > messages to: /tmp/pig_1338384441037.log
> > 2012-05-30 15:27:21,368 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> > Connecting to hadoop file system at: hdfs://OurNamenode:9000
> > 2012-05-30 15:27:21,609 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> > Connecting to map-reduce job tracker at:
> > dev-jobtracker001.eu-fra.adtech.com:54311
> > 2012-05-30 15:57:27,814 [main] WARN  org.apache.pig.PigServer -
> > Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
> > 2012-05-30 15:57:27,816 [main] INFO
> > org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
> > script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
> > 2012-05-30 15:57:27,816 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> > pig.usenewlogicalplan is set to true. New logical plan will be used.
> > 2012-05-30 16:06:55,304 [main] INFO
> > org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
> > for CampaignInfo: $0, $1, $2, $4, $5, $6, $8, $9
> > 2012-05-30 16:06:55,308 [main] INFO
> > org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
> > for dataImport: $2, $3, $4
> > 2012-05-30 16:06:55,441 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
> > OutputData1:
> > Store(SomeOutputFile1.csv:org.apache.pig.builtin.PigStorage) - scope-521
> > Operator Key: scope-521)
> > 2012-05-30 16:06:55,441 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
> > OutputData2:
> > Store(/SomeOutputFile2.csv:org.apache.pig.builtin.PigStorage) -
> > scope-524 Operator Key: scope-524)
> > 2012-05-30 16:06:55,441 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
> > OutputData2:
> > Store(/SomeOutputFile3.csv:org.apache.pig.builtin.PigStorage) -
> > scope-483 Operator Key: scope-483)
> > 2012-05-30 16:06:55,453 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
> > - File concatenation threshold: 100 optimistic? false
> > 2012-05-30 16:06:55,467 [main] INFO
> > org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> > paths to process : 1
> > 2012-05-30 16:06:55,471 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> > input paths to process : 1
> > 2012-05-30 16:06:55,483 [main] WARN
> > org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library
> > is available
> > 2012-05-30 16:06:55,484 [main] INFO
> > org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop
> > library
> > 2012-05-30 16:06:55,484 [main] INFO
> > org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library
> > loaded
> > 2012-05-30 16:06:55,486 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> > input paths (combined) to process : 1
> > 2012-05-30 16:06:55,486 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
> > - number of input files: 1
> > 2012-05-30 16:06:55,516 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 7
> > 2012-05-30 16:06:55,516 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
> > 2012-05-30 16:06:55,516 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
> > 2012-05-30 16:06:55,517 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
> > 2012-05-30 16:06:55,517 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
> > 2012-05-30 16:06:55,517 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 0 out of total 2 MR operators.
> > 2012-05-30 16:06:55,517 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 3
> > 2012-05-30 16:06:56,131 [main] INFO
> > org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
> > added to the job
> > 2012-05-30 16:06:56,138 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> > 2012-05-30 16:07:00,432 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
> > 2012-05-30 16:07:00,468 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission
> > 2012-05-30 16:07:00,969 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> > 2012-05-30 16:07:01,460 [Thread-4] INFO
> > org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> > paths to process : 1
> > 2012-05-30 16:07:01,460 [Thread-4] INFO
> > org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> > input paths to process : 1
> > 2012-05-30 16:07:01,462 [Thread-4] INFO
> > org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> > input paths (combined) to process : 1
> > 2012-05-30 16:07:02,353 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201205080931_0178
> > 2012-05-30 16:07:02,353 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://ourJobtracker:50030/jobdetails.jsp?jobid=job_201205080931_0178
> > 2012-05-30 16:07:12,421 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 16% complete
> > 2012-05-30 16:07:13,929 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 33% complete
> > 2012-05-30 16:07:16,969 [main] INFO
> > org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
> > added to the job
> > 2012-05-30 16:07:16,970 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> > 2012-05-30 16:07:26,888 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
> > 2012-05-30 16:07:30,886 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
> > 2012-05-30 16:07:30,886 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 464
> > 2012-05-30 16:07:30,947 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
> > 2012-05-30 16:15:38,022 [Thread-16] INFO
> > org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> > paths to process : 50353
> > 2012-05-30 16:15:54,421 [Thread-16] INFO
> > org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> > input paths (combined) to process : 6899
> > 2012-05-30 16:15:57,844 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201205080931_0179
> >
> >
> >
> >
> >> On Thursday, 31.05.2012, at 02:57 -0700, Prashant Kommireddi wrote:
> >> Hi Markus,
> >>
> >> Have you checked the JobTracker at the time of launching the job that Map
> >> slots were available?
> >>
> >> Looks like the input dataset size is ~464 GB. Since you mentioned 10 GB
> >> jobs are running fine, there should be no reason a larger dataset should be
> >> stuck, atleast not on Pig side. I can't think of a good reason why the job
> >> does not take off other than the fact that cluster was busy running some
> >> other job.
> >>
> >> I see that the number of files being processed is large, 50353. That could
> >> be a reason for slowness, but ~8 minutes as shown in the logs seems to be
> >> on the higher end for that.
> >>
> >> May be also post your script here.
> >>
> >> On Thu, May 31, 2012 at 2:38 AM, Markus Resch<ma...@adtech.de>wrote:
> >>
> >>> Hi all,
> >>>
> >>> when we're running a pig job for aggregating some amount of slightly
> >>> compressed avro data (~160GByte), the time until the first actual mapred
> >>> job starts takes ages:
> >>> 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error messages
> >>> to:
> >>> [...]
> >>> 15:57:27,816 [main] INFO  org.apache.pig.tools.pigstats.ScriptState -
> >>> Pig features used in the script:
> >>> [...]
> >>> 16:07:00,969 [main] INFO
> >>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> >>> - 0% complete
> >>> [...]
> >>> 16:07:30,886 [main] INFO
> >>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> >>> - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
> >>> [...]
> >>> 16:15:38,022 [Thread-16] INFO
> >>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> >>> paths to process : 50353
> >>>
> >>> This log messages are from our test cluster which has a dedicated
> >>> jobtracker and namenode each and 5 data nodes with a map task capacity
> >>> of 15 and a reduce task capacity of 10. There were 6899 map tasks and
> >>> 464 reduce tasks set up.
> >>>
> >>> During the initialisation phase we were observing the work load and
> >>> memory usage of jobtracker, namenode and some data nodes using top.
> >>> Those were nearly all the time kind of bored (e.g. 30% cpu load on the
> >>> namenode, total idle on he data nodes). When the jobs were running most
> >>> of the tasks where in "waiting for IO" most of the time. It seemed there
> >>> was some swapping space reserved but rarely used in those times.
> >>>
> >>> In our eyes it looks like a hadoop config issue but we have no idea what
> >>> it exaclty could be. Jobs with about 10GBytes of input data were running
> >>> quite well.
> >>>
> >>> Any hint where to tweak will be appreciated
> >>>
> >>> Thanks
> >>> Markus
> >>>
> >>>
> >
> 

-- 
Markus Resch
Software Developer 
P: +49 6103-5715-236 | F: +49 6103-5715-111 |
ADTECH GmbH | Robert-Bosch-Str. 32 | 63303 Dreieich | Germany
www.adtech.com

ADTECH | A Division of Advertising.com Group - Residence of the Company:
Dreieich, Germany - Registration Office: Offenbach, HRB 46021
Management Board: Erhard Neumann, Mark Thielen

This message contains privileged and confidential information. Any
dissemination, distribution, copying or other use of this
message or any of its content (data, prices...) to any third parties may
only occur with ADTECH's prior consent.


RE: Job setup for a pig run takes ages

Posted by Danfeng Li <dl...@operasolutions.com>.
I took some snapshots, the results are attached.

I use pig 0.8.1
[dli@hmaster run]$ pig -version
Apache Pig version 0.8.1-cdh3u2 (rexported)
compiled Oct 13 2011, 22:35:57

and default pig loader

A = load 'a.csv' USING PigStorage('|') AS ( ...);

Thanks.
Dan

-----Original Message-----
From: Dmitriy Ryaboy [mailto:dvryaboy@gmail.com]
Sent: Monday, June 18, 2012 4:26 PM
To: user@pig.apache.org
Subject: Re: Job setup for a pig run takes ages

Can you do a few successive ones?
Also, please let us know which version of pig you are using, and which loaders.

D

On Mon, Jun 18, 2012 at 2:51 PM, Danfeng Li <dl...@operasolutions.com> wrote:
> This is the jstack output during the setup time; I'm not exactly sure how to interpret it.
>
> Thanks.
> Dan
>
> [dli@hmaster run]$ jstack 15640
> 2012-06-18 17:32:47
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (17.0-b17 mixed mode):
>
> "Attach Listener" daemon prio=10 tid=0x0000000055dcb800 nid=0x431d waiting on condition [0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "Low Memory Detector" daemon prio=10 tid=0x0000000055105000 nid=0x3d3b runnable [0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "CompilerThread1" daemon prio=10 tid=0x0000000055103000 nid=0x3d3a waiting on condition [0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "CompilerThread0" daemon prio=10 tid=0x0000000055100000 nid=0x3d39 waiting on condition [0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "Signal Dispatcher" daemon prio=10 tid=0x00000000550fe000 nid=0x3d38 runnable [0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "Finalizer" daemon prio=10 tid=0x00000000550de800 nid=0x3d37 in Object.wait() [0x0000000041d7a000]
>   java.lang.Thread.State: WAITING (on object monitor)
>        at java.lang.Object.wait(Native Method)
>        - waiting on <0x00002aaab48a3cf8> (a java.lang.ref.ReferenceQueue$Lock)
>        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
>        - locked <0x00002aaab48a3cf8> (a java.lang.ref.ReferenceQueue$Lock)
>        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
>        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>
> "Reference Handler" daemon prio=10 tid=0x00000000550dc800 nid=0x3d36 in Object.wait() [0x0000000041093000]
>   java.lang.Thread.State: WAITING (on object monitor)
>        at java.lang.Object.wait(Native Method)
>        - waiting on <0x00002aaab48a3cb0> (a java.lang.ref.Reference$Lock)
>        at java.lang.Object.wait(Object.java:485)
>        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>        - locked <0x00002aaab48a3cb0> (a java.lang.ref.Reference$Lock)
>
> "main" prio=10 tid=0x0000000055065800 nid=0x3d25 runnable [0x0000000041653000]
>   java.lang.Thread.State: RUNNABLE
>        at org.apache.pig.newplan.logical.expression.ProjectExpression.getFieldSchema(ProjectExpression.java:164)
>        at org.apache.pig.newplan.logical.relational.LOInnerLoad.getSchema(LOInnerLoad.java:59)
>        at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:114)
>        at org.apache.pig.newplan.logical.relational.LOInnerLoad.accept(LOInnerLoad.java:109)
>        at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>        at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:94)
>        at org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:71)
>        at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>        at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
>        at org.apache.pig.newplan.logical.optimizer.SchemaPatcher.transformed(SchemaPatcher.java:43)
>        at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:113)
>        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:281)
>        at org.apache.pig.PigServer.compilePp(PigServer.java:1365)
>        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1207)
>        at org.apache.pig.PigServer.execute(PigServer.java:1201)
>        at org.apache.pig.PigServer.access$100(PigServer.java:129)
>        at org.apache.pig.PigServer$Graph.execute(PigServer.java:1528)
>        at org.apache.pig.PigServer.executeBatchEx(PigServer.java:373)
>        at org.apache.pig.PigServer.executeBatch(PigServer.java:340)
>        at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:115)
>        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:172)
>        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
>        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
>        at org.apache.pig.Main.run(Main.java:396)
>        at org.apache.pig.Main.main(Main.java:107)
>
> "VM Thread" prio=10 tid=0x00000000550d8800 nid=0x3d35 runnable
>
> "GC task thread#0 (ParallelGC)" prio=10 tid=0x0000000055078800 nid=0x3d26 runnable
>
> "GC task thread#1 (ParallelGC)" prio=10 tid=0x000000005507a800 nid=0x3d27 runnable
>
> "GC task thread#2 (ParallelGC)" prio=10 tid=0x000000005507c000 nid=0x3d28 runnable
>
> "GC task thread#3 (ParallelGC)" prio=10 tid=0x000000005507e000 nid=0x3d29 runnable
>
> "GC task thread#4 (ParallelGC)" prio=10 tid=0x0000000055080000 nid=0x3d2a runnable
>
> "GC task thread#5 (ParallelGC)" prio=10 tid=0x0000000055081800 nid=0x3d2b runnable
>
> "GC task thread#6 (ParallelGC)" prio=10 tid=0x0000000055083800 nid=0x3d2c runnable
>
> "GC task thread#7 (ParallelGC)" prio=10 tid=0x0000000055085800 nid=0x3d2d runnable
>
> "GC task thread#8 (ParallelGC)" prio=10 tid=0x0000000055087000 nid=0x3d2e runnable
>
> "GC task thread#9 (ParallelGC)" prio=10 tid=0x0000000055089000 nid=0x3d2f runnable
>
> "GC task thread#10 (ParallelGC)" prio=10 tid=0x000000005508b000 nid=0x3d30 runnable
>
> "GC task thread#11 (ParallelGC)" prio=10 tid=0x000000005508c800 nid=0x3d31 runnable
>
> "GC task thread#12 (ParallelGC)" prio=10 tid=0x000000005508e800 nid=0x3d32 runnable
>
> "GC task thread#13 (ParallelGC)" prio=10 tid=0x0000000055090800 nid=0x3d33 runnable
>
> "GC task thread#14 (ParallelGC)" prio=10 tid=0x0000000055092800 nid=0x3d34 runnable
>
> "VM Periodic Task Thread" prio=10 tid=0x0000000055110000 nid=0x3d3c waiting on condition
>
> JNI global references: 1463
>
>
> -----Original Message-----
> From: Dmitriy Ryaboy [mailto:dvryaboy@gmail.com]
> Sent: Saturday, June 16, 2012 8:24 AM
> To: user@pig.apache.org
> Subject: Re: Job setup for a pig run takes ages
>
> What loader are you using? The JT is not the place to look at; try jstacking your pig process. Most likely it's talking to the NameNode most of the time because the loader is doing some per-file lookups.
>
> On Jun 13, 2012, at 11:24 AM, Danfeng Li <dl...@operasolutions.com> wrote:
>
>> We also ran into the long setup-time issue, but our problem is
>> different:
>>
>> 1. The setup time takes about 20 minutes; we can't see anything on the jobtracker during this setup time.
>> 2. Our data is saved in flat files, uncompressed.
>> 3. Our code consists of many small pig files; they are used in the
>> following way in the main pig file:
>> data_1 = load ...
>> data_2 = load ...
>> ...
>> data_n = load ...
>>
>> run -param ... pigfile1.pig
>> run -param ... pigfile2.pig
>> ...
>>
>> store out1 ..
>> store out2 ..
>> ...
>> 4. Here's the part of the log file from the setup time; notice the time difference between "13:46:42" and "14:05:23" - during that time, we can't see anything on the jobtracker.
>> ...
>> 2012-06-13 13:46:30,488 [main] INFO  org.apache.pig.Main - Logging
>> error messages to: pig_1339609590477.log
>> 2012-06-13 13:46:30,796 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>> Connecting to hadoop file system at: hdfs://master:9000
>> 2012-06-13 13:46:30,950 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>> Connecting to map-reduce job tracker at: master:9001
>> 2012-06-13 13:46:32,766 [main] WARN
>> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
>> Multiple values found for rationale_fir. Using value : Account
>> position (\\$
>> 2012-06-13 13:46:32,766 [main] WARN
>> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
>> Multiple values found for rationale_sec. Using value K,
>> 2012-06-13 13:46:32,766 [main] WARN
>> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
>> Multiple values found for rationale_thi. Using value %)
>> 2012-06-13 13:46:32,767 [main] WARN
>> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
>> Multiple values found for detail_statment_pre. Using value  - matures
>> on
>> 2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for detail_statment_post. Using value .
>> 2012-06-13 13:46:32,767 [main] WARN
>> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
>> Multiple values found for rationale_fir. Using value : Maturity date
>> 2012-06-13 13:46:32,767 [main] WARN
>> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
>> Multiple values found for rationale_sec. Using value  Account
>> position (\\$
>> 2012-06-13 13:46:32,767 [main] WARN
>> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
>> Multiple values found for rationale_thi. Using value K,
>> 2012-06-13 13:46:32,767 [main] WARN
>> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
>> Multiple values found for catalyst_pre. Using value  matures on
>> 2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for catalyst_post. Using value .
>> 2012-06-13 13:46:42,749 [main] INFO
>> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
>> script:
>> REPLICATED_JOIN,HASH_JOIN,COGROUP,GROUP_BY,ORDER_BY,DISTINCT,STREAMING,FILTER,CROSS,UNION
>> 2012-06-13 13:46:42,749 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
>> 2012-06-13 14:05:23,460 [main] INFO
>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
>> pruned for var_raw: $0, $1, $2, $6, $7, $8, $9, $10
>> 2012-06-13 14:05:23,474 [main] INFO
>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
>> pruned for var_mf: $5, $6, $7, $8, $9, $11, $12, $14, $15, $16, $17,
>> $18, $19, $21, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33,
>> $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45
>> 2012-06-13 14:05:23,475 [main] INFO
>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for starmine: $0, $3, $4, $5, $6, $9, $10, $11 ...
>>
>> Any help will be appreciated.
>>
>> Thanks.
>> Dan
>>
>> -----Original Message-----
>> From: Markus Resch [mailto:markus.resch@adtech.de]
>> Sent: Wednesday, June 13, 2012 2:24 AM
>> To: user@pig.apache.org
>> Subject: Re: Job setup for a pig run takes ages
>>
>> Hey Alex,
>>
>> On one side I think you're right, but we need to keep in mind that the
>> schema could change within some files of a glob (e.g. after a schema
>> extension update). The Avro Storage should check at least some hash of the schema to verify that all schemas of all input files are the same, and/or split them into groups of different schemas if required.
>>
>> I'm currently about to check this issue with the cloudera cdh4 pig version. I'll let you know if we see significantly different behavior.
>>
>> Best
>> Markus
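
For the mixed-schema concern above, one user-side pattern (a sketch only; the
globs and aliases are made up, and it assumes the schema generations can be
told apart by path) is to load each schema generation from its own glob and
merge by field name, which UNION ONSCHEMA supports from Pig 0.8 on:

-- old and new schema versions kept apart by directory layout (assumption)
old_gen = LOAD '/data/avro/2012/0[1-4]/*' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
new_gen = LOAD '/data/avro/2012/05/*' USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- merges by field name; columns missing on one side come through as null
all_data = UNION ONSCHEMA old_gen, new_gen;
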
>>
>>> On Tuesday, 12.06.2012, at 19:16 -0400, Alex Rovner wrote:
>>> Thejas,
>>>
>>> Why can't Pig cache the results of getSchema() instead of repeatedly calling the load function? This is causing us to put lots of unnecessary workarounds in place, since during those calls one can't rely on the UDF context as it gets cleared somewhere in between those calls.
>>>
>>> Sent from my iPhone
>>>
>>> On Jun 1, 2012, at 12:45 PM, Thejas Nair <th...@hortonworks.com> wrote:
>>>
>>>> Can you do a jstack <pid> on the pig client process id a few times and see what it is doing when it is taking so long and send the results ?
>>>> With that we should be able to easily identify what might be happening and suggest a fix.
>>>>
>>>> I think pig calls getSchema() on load functions more times than actually necessary during the query optimization phase (each time after a transform). One theory I have is that your avro load function does a stat on all the 50k files each time to determine the common schema. This might be causing the delay to generate query plan.
>>>>
>>>> About the second large lag before queue seems to be time spend in input split calculations, not within pig logic.
>>>>
>>>> These are just theories I have. The jstack output would help determine if this is actually the case.
>>>>
>>>> Also, I think this delay is likely to be caused by the number of input files, and not the actual data size. You might also want to look at ways to reduce the actual number of input files. Is it generated by a previous MR job ? If yes, given your cluster size, it works against you to have so many maps or reducers in the previous MR job. If pig is generating the data,  you should consider setting the parallelism (reducers) or the number of maps (see http://pig.apache.org/docs/r0.10.0/perf.html#combine-files) .
>>>> While reducing number of input files will most likely solve your problem, I am still interested in knowing where the delays are coming from and if we can fix something in pig to improve the situation.
>>>>
>>>> Thanks,
>>>> Thejas
>>>>
>>>>
>>>> On 6/1/12 6:34 AM, Markus Resch wrote:
>>>>> Hi Prashant, Hi Thejas,
>>>>>
>>>>> thanks for your very quick answer.
>>>>> No, this is not a typo. Those time stamps are true and as I said
>>>>> the machines are not very busy during this time.
>>>>>
>>>>> As this is our test cluster I am sure I am the only one who is
>>>>> running jobs on it. Another issue we have is that we are currently
>>>>> only able to run one job at a time but this shouldn't be the topic of this request.
>>>>> We even have no continuous input stream to that cluster but copied
>>>>> a bunch of data to it some time ago.
>>>>> From my perspective the 464 GB of input data you are mentioned is
>>>>> the uncompressed amount of the 160GByte compressed files. Which I
>>>>> get when I use hadoop -f dus on that folder.
>>>>>
>>>>> Another interesting fact for you could be that we're running the
>>>>> cloudera CDH3 Update 3 version on our systems.
>>>>>
>>>>> I suspect this could be due to some fancy avro schema validation
>>>>> implicitly executed by the avro storage? If so, can this be avoided?
>>>>>
>>>>> Sadly I'm currently not able to provide you the actual script
>>>>> currently as it contains confidential information but I will try
>>>>> to provide you a version as soon as possible. But I'd rather think
>>>>> of a configuration problem to the hadoop or pig anyways as the
>>>>> script works fine with a smaller amount of input data
>>>>>
>>>>> I would ask the hadoop mailing list if this issue would occur
>>>>> during the actual mapred run but as this occur even before a
>>>>> single mapred job is launched I suspect pig to have a problem.
>>>>>
>>>>> Thanks
>>>>> Markus
>>>>>
>>>>> This is the full log until the main work job starts:
>>>>> mapred@ournamenode$ pig OurScript.pig
>>>>> 2012-05-30 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging
>>>>> error messages to: /tmp/pig_1338384441037.log
>>>>> 2012-05-30 15:27:21,368 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>>>> Connecting to hadoop file system at: hdfs://OurNamenode:9000
>>>>> 2012-05-30 15:27:21,609 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>>>> Connecting to map-reduce job tracker at:
>>>>> dev-jobtracker001.eu-fra.adtech.com:54311
>>>>> 2012-05-30 15:57:27,814 [main] WARN  org.apache.pig.PigServer -
>>>>> Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
>>>>> 2012-05-30 15:57:27,816 [main] INFO
>>>>> org.apache.pig.tools.pigstats.ScriptState - Pig features used in
>>>>> the
>>>>> script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
>>>>> 2012-05-30 15:57:27,816 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>>>> pig.usenewlogicalplan is set to true. New logical plan will be used.
>>>>> 2012-05-30 16:06:55,304 [main] INFO
>>>>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
>>>>> pruned for CampaignInfo: $0, $1, $2, $4, $5, $6, $8, $9
>>>>> 2012-05-30 16:06:55,308 [main] INFO
>>>>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
>>>>> pruned for dataImport: $2, $3, $4
>>>>> 2012-05-30 16:06:55,441 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>>>>> OutputData1:
>>>>> Store(SomeOutputFile1.csv:org.apache.pig.builtin.PigStorage) -
>>>>> scope-521 Operator Key: scope-521)
>>>>> 2012-05-30 16:06:55,441 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>>>>> OutputData2:
>>>>> Store(/SomeOutputFile2.csv:org.apache.pig.builtin.PigStorage) -
>>>>> scope-524 Operator Key: scope-524)
>>>>> 2012-05-30 16:06:55,441 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>>>>> OutputData2:
>>>>> Store(/SomeOutputFile3.csv:org.apache.pig.builtin.PigStorage) -
>>>>> scope-483 Operator Key: scope-483)
>>>>> 2012-05-30 16:06:55,453 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRComp
>>>>> iler
>>>>> - File concatenation threshold: 100 optimistic? false
>>>>> 2012-05-30 16:06:55,467 [main] INFO
>>>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total
>>>>> input paths to process : 1
>>>>> 2012-05-30 16:06:55,471 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
>>>>> Total input paths to process : 1
>>>>> 2012-05-30 16:06:55,483 [main] WARN
>>>>> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native
>>>>> library is available
>>>>> 2012-05-30 16:06:55,484 [main] INFO
>>>>> org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop
>>>>> library
>>>>> 2012-05-30 16:06:55,484 [main] INFO
>>>>> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native
>>>>> library loaded
>>>>> 2012-05-30 16:06:55,486 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
>>>>> Total input paths (combined) to process : 1
>>>>> 2012-05-30 16:06:55,486 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRComp
>>>>> iler
>>>>> - number of input files: 1
>>>>> 2012-05-30 16:06:55,516 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQ
>>>>> ueryOptimizer - MR plan size before optimization: 7
>>>>> 2012-05-30 16:06:55,516 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
>>>>> 20

Re: Job setup for a pig run takes ages

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Can you do a few successive ones?
Also, please let us know which version of pig you are using, and which loaders.
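
For example, something along these lines (just a sketch; 15640 is the pid from your dump below, and the count and interval are arbitrary):

# take 5 stack dumps of the Pig client JVM, 10 seconds apart
for i in 1 2 3 4 5; do
    jstack 15640 > pig-jstack.$i.txt
    sleep 10
done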

D

On Mon, Jun 18, 2012 at 2:51 PM, Danfeng Li <dl...@operasolutions.com> wrote:
> This is the jstack output during the setup time; not exactly sure how to interpret it.
>
> Thanks.
> Dan
>
> [dli@hmaster run]$ jstack 15640
> 2012-06-18 17:32:47
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (17.0-b17 mixed mode):
>
> "Attach Listener" daemon prio=10 tid=0x0000000055dcb800 nid=0x431d waiting on condition [0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "Low Memory Detector" daemon prio=10 tid=0x0000000055105000 nid=0x3d3b runnable [0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "CompilerThread1" daemon prio=10 tid=0x0000000055103000 nid=0x3d3a waiting on condition [0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "CompilerThread0" daemon prio=10 tid=0x0000000055100000 nid=0x3d39 waiting on condition [0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "Signal Dispatcher" daemon prio=10 tid=0x00000000550fe000 nid=0x3d38 runnable [0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "Finalizer" daemon prio=10 tid=0x00000000550de800 nid=0x3d37 in Object.wait() [0x0000000041d7a000]
>   java.lang.Thread.State: WAITING (on object monitor)
>        at java.lang.Object.wait(Native Method)
>        - waiting on <0x00002aaab48a3cf8> (a java.lang.ref.ReferenceQueue$Lock)
>        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
>        - locked <0x00002aaab48a3cf8> (a java.lang.ref.ReferenceQueue$Lock)
>        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
>        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>
> "Reference Handler" daemon prio=10 tid=0x00000000550dc800 nid=0x3d36 in Object.wait() [0x0000000041093000]
>   java.lang.Thread.State: WAITING (on object monitor)
>        at java.lang.Object.wait(Native Method)
>        - waiting on <0x00002aaab48a3cb0> (a java.lang.ref.Reference$Lock)
>        at java.lang.Object.wait(Object.java:485)
>        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>        - locked <0x00002aaab48a3cb0> (a java.lang.ref.Reference$Lock)
>
> "main" prio=10 tid=0x0000000055065800 nid=0x3d25 runnable [0x0000000041653000]
>   java.lang.Thread.State: RUNNABLE
>        at org.apache.pig.newplan.logical.expression.ProjectExpression.getFieldSchema(ProjectExpression.java:164)
>        at org.apache.pig.newplan.logical.relational.LOInnerLoad.getSchema(LOInnerLoad.java:59)
>        at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:114)
>        at org.apache.pig.newplan.logical.relational.LOInnerLoad.accept(LOInnerLoad.java:109)
>        at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>        at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:94)
>        at org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:71)
>        at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>        at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
>        at org.apache.pig.newplan.logical.optimizer.SchemaPatcher.transformed(SchemaPatcher.java:43)
>        at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:113)
>        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:281)
>        at org.apache.pig.PigServer.compilePp(PigServer.java:1365)
>        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1207)
>        at org.apache.pig.PigServer.execute(PigServer.java:1201)
>        at org.apache.pig.PigServer.access$100(PigServer.java:129)
>        at org.apache.pig.PigServer$Graph.execute(PigServer.java:1528)
>        at org.apache.pig.PigServer.executeBatchEx(PigServer.java:373)
>        at org.apache.pig.PigServer.executeBatch(PigServer.java:340)
>        at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:115)
>        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:172)
>        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
>        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
>        at org.apache.pig.Main.run(Main.java:396)
>        at org.apache.pig.Main.main(Main.java:107)
>
> "VM Thread" prio=10 tid=0x00000000550d8800 nid=0x3d35 runnable
>
> "GC task thread#0 (ParallelGC)" prio=10 tid=0x0000000055078800 nid=0x3d26 runnable
>
> "GC task thread#1 (ParallelGC)" prio=10 tid=0x000000005507a800 nid=0x3d27 runnable
>
> "GC task thread#2 (ParallelGC)" prio=10 tid=0x000000005507c000 nid=0x3d28 runnable
>
> "GC task thread#3 (ParallelGC)" prio=10 tid=0x000000005507e000 nid=0x3d29 runnable
>
> "GC task thread#4 (ParallelGC)" prio=10 tid=0x0000000055080000 nid=0x3d2a runnable
>
> "GC task thread#5 (ParallelGC)" prio=10 tid=0x0000000055081800 nid=0x3d2b runnable
>
> "GC task thread#6 (ParallelGC)" prio=10 tid=0x0000000055083800 nid=0x3d2c runnable
>
> "GC task thread#7 (ParallelGC)" prio=10 tid=0x0000000055085800 nid=0x3d2d runnable
>
> "GC task thread#8 (ParallelGC)" prio=10 tid=0x0000000055087000 nid=0x3d2e runnable
>
> "GC task thread#9 (ParallelGC)" prio=10 tid=0x0000000055089000 nid=0x3d2f runnable
>
> "GC task thread#10 (ParallelGC)" prio=10 tid=0x000000005508b000 nid=0x3d30 runnable
>
> "GC task thread#11 (ParallelGC)" prio=10 tid=0x000000005508c800 nid=0x3d31 runnable
>
> "GC task thread#12 (ParallelGC)" prio=10 tid=0x000000005508e800 nid=0x3d32 runnable
>
> "GC task thread#13 (ParallelGC)" prio=10 tid=0x0000000055090800 nid=0x3d33 runnable
>
> "GC task thread#14 (ParallelGC)" prio=10 tid=0x0000000055092800 nid=0x3d34 runnable
>
> "VM Periodic Task Thread" prio=10 tid=0x0000000055110000 nid=0x3d3c waiting on condition
>
> JNI global references: 1463
>
>
> -----Original Message-----
> From: Dmitriy Ryaboy [mailto:dvryaboy@gmail.com]
> Sent: Saturday, June 16, 2012 8:24 AM
> To: user@pig.apache.org
> Subject: Re: Job setup for a pig run takes ages
>
> What loader are you using? The JT is not the place to look; try jstacking your pig process. Most likely it's talking to the NameNode most of the time because the loader is doing some per-file lookups.
>
> On Jun 13, 2012, at 11:24 AM, Danfeng Li <dl...@operasolutions.com> wrote:
>
>> We also run into the long setup time issue, but our problem is
>> different
>>
>> 1. The setup time takes about 20 minutes; we can't see anything on the jobtracker during this setup time.
>> 2. Our data is saved in flat files, uncompressed.
>> 3. Our code consists of many small pig files; they are used in the
>> following way in the main pig file:
>> data_1 = load ...
>> data_2 = load ...
>> ...
>> data_n = load ...
>>
>> run -param ... pigfile1.pig
>> run -param ... pigfile2.pig
>> ...
>>
>> store out1 ..
>> store out2 ..
>> ...
>> 4. Here's the part of the log file from the setup time; notice the time difference between "13:46:42" and "14:05:23". During that time, we can't see anything on the jobtracker.
>> ...
>> 2012-06-13 13:46:30,488 [main] INFO  org.apache.pig.Main - Logging
>> error messages to: pig_1339609590477.log
>> 2012-06-13 13:46:30,796 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>> Connecting to hadoop file system at: hdfs://master:9000
>> 2012-06-13 13:46:30,950 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>> Connecting to map-reduce job tracker at: master:9001
>> 2012-06-13 13:46:32,766 [main] WARN
>> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
>> Multiple values found for rationale_fir. Using value : Account
>> position (\\$
>> 2012-06-13 13:46:32,766 [main] WARN
>> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
>> Multiple values found for rationale_sec. Using value K,
>> 2012-06-13 13:46:32,766 [main] WARN
>> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
>> Multiple values found for rationale_thi. Using value %)
>> 2012-06-13 13:46:32,767 [main] WARN
>> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
>> Multiple values found for detail_statment_pre. Using value  - matures
>> on
>> 2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for detail_statment_post. Using value .
>> 2012-06-13 13:46:32,767 [main] WARN
>> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
>> Multiple values found for rationale_fir. Using value : Maturity date
>> 2012-06-13 13:46:32,767 [main] WARN
>> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
>> Multiple values found for rationale_sec. Using value  Account position
>> (\\$
>> 2012-06-13 13:46:32,767 [main] WARN
>> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
>> Multiple values found for rationale_thi. Using value K,
>> 2012-06-13 13:46:32,767 [main] WARN
>> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
>> Multiple values found for catalyst_pre. Using value  matures on
>> 2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for catalyst_post. Using value .
>> 2012-06-13 13:46:42,749 [main] INFO
>> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
>> script:
>> REPLICATED_JOIN,HASH_JOIN,COGROUP,GROUP_BY,ORDER_BY,DISTINCT,STREAMING
>> ,FILTER,CROSS,UNION
>> 2012-06-13 13:46:42,749 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
>> 2012-06-13 14:05:23,460 [main] INFO
>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
>> pruned for var_raw: $0, $1, $2, $6, $7, $8, $9, $10
>> 2012-06-13 14:05:23,474 [main] INFO
>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
>> pruned for var_mf: $5, $6, $7, $8, $9, $11, $12, $14, $15, $16, $17,
>> $18, $19, $21, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33,
>> $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45
>> 2012-06-13 14:05:23,475 [main] INFO
>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for starmine: $0, $3, $4, $5, $6, $9, $10, $11 ...
>>
>> Any help will be appreciated.
>>
>> Thanks.
>> Dan
>>
>> -----Original Message-----
>> From: Markus Resch [mailto:markus.resch@adtech.de]
>> Sent: Wednesday, June 13, 2012 2:24 AM
>> To: user@pig.apache.org
>> Subject: Re: Job setup for a pig run takes ages
>>
>> Hey Alex,
>>
>> On one side I think you're right, but we need to keep in mind that the
>> schema could change within some files of a glob (e.g. a schema extension
>> update). The AvroStorage should check at least some hash of the schema to verify that all schemas of all input files are the same, and/or split them into groups of different schemas if required.
>>
>> I'm currently about to check this issue with the cloudera cdh4 pig version. I'll let you know if we see significantly different behavior.
>>
>> Best
>> Markus
>>
>> Am Dienstag, den 12.06.2012, 19:16 -0400 schrieb Alex Rovner:
>>> Thejas,
>>>
>>> Why can't Pig cache the results of getSchema() instead of repeatedly calling the load function? This is causing us to put lots of unnecessary workarounds in place, since during those calls one can't rely on the UDF context as it gets cleared somewhere in between those calls.
>>>
>>> Sent from my iPhone
>>>
>>> On Jun 1, 2012, at 12:45 PM, Thejas Nair <th...@hortonworks.com> wrote:
>>>
>>>> Can you do a jstack <pid> on the pig client process id a few times and see what it is doing when it is taking so long and send the results ?
>>>> With that we should be able to easily identify what might be happening and suggest a fix.
>>>>
>>>> I think pig calls getSchema() on load functions more times than actually necessary during the query optimization phase (each time after a transform). One theory I have is that your avro load function does a stat on all the 50k files each time to determine the common schema. This might be causing the delay to generate query plan.
>>>>
>>>> The second large lag, before the job is queued, seems to be time spent in input split calculations, not within pig logic.
>>>>
>>>> These are just theories I have. The jstack output would help determine if this is actually the case.
>>>>
>>>> Also, I think this delay is likely to be caused by the number of input files, and not the actual data size. You might also want to look at ways to reduce the actual number of input files. Is it generated by a previous MR job ? If yes, given your cluster size, it works against you to have so many maps or reducers in the previous MR job. If pig is generating the data,  you should consider setting the parallelism (reducers) or the number of maps (see http://pig.apache.org/docs/r0.10.0/perf.html#combine-files) .
>>>> While reducing number of input files will most likely solve your problem, I am still interested in knowing where the delays are coming from and if we can fix something in pig to improve the situation.
>>>>
>>>> Thanks,
>>>> Thejas
>>>>
>>>>
>>>> On 6/1/12 6:34 AM, Markus Resch wrote:
>>>>> Hi Prashant, Hi Thejas,
>>>>>
>>>>> thanks for your very quick answer.
>>>>> No, this is not a typo. Those time stamps are true and as I said
>>>>> the machines are not very busy during this time.
>>>>>
>>>>> As this is our test cluster I am sure I am the only one who is
>>>>> running jobs on it. Another issue we have is that we are currently
>>>>> only able to run one job at a time but this shouldn't be the topic of this request.
>>>>> We even have no continuous input stream to that cluster but copied
>>>>> a bunch of data to it some time ago.
>>>>> From my perspective, the 464 GB of input data you mentioned is the
>>>>> uncompressed size of the 160 GByte of compressed files, which is what
>>>>> I get when I use hadoop fs -dus on that folder.
>>>>>
>>>>> Another interesting fact for you could be that we're running the
>>>>> cloudera CDH3 Update 3 version on our systems.
>>>>>
>>>>> I suspect this could be due to some fancy avro schema validation
>>>>> implicitly executed by the avro storage? If so, can this be avoided?
>>>>>
>>>>> Sadly I'm currently not able to provide you the actual script as it
>>>>> contains confidential information, but I will try to provide you a
>>>>> version as soon as possible. But I'd rather think it's a configuration
>>>>> problem in hadoop or pig anyway, as the script works fine with a
>>>>> smaller amount of input data.
>>>>>
>>>>> I would ask the hadoop mailing list if this issue occurred during the
>>>>> actual mapred run, but as it occurs even before a single mapred job is
>>>>> launched, I suspect pig has the problem.
>>>>>
>>>>> Thanks
>>>>> Markus
>>>>>
>>>>> This is the full log until the main work job starts:
>>>>> mapred@ournamenode$ pig OurScript.pig
>>>>> 2012-05-30 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging
>>>>> error messages to: /tmp/pig_1338384441037.log
>>>>> 2012-05-30 15:27:21,368 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>>>> Connecting to hadoop file system at: hdfs://OurNamenode:9000
>>>>> 2012-05-30 15:27:21,609 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>>>> Connecting to map-reduce job tracker at:
>>>>> dev-jobtracker001.eu-fra.adtech.com:54311
>>>>> 2012-05-30 15:57:27,814 [main] WARN  org.apache.pig.PigServer -
>>>>> Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
>>>>> 2012-05-30 15:57:27,816 [main] INFO
>>>>> org.apache.pig.tools.pigstats.ScriptState - Pig features used in
>>>>> the
>>>>> script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
>>>>> 2012-05-30 15:57:27,816 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>>>> pig.usenewlogicalplan is set to true. New logical plan will be used.
>>>>> 2012-05-30 16:06:55,304 [main] INFO
>>>>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
>>>>> pruned for CampaignInfo: $0, $1, $2, $4, $5, $6, $8, $9
>>>>> 2012-05-30 16:06:55,308 [main] INFO
>>>>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
>>>>> pruned for dataImport: $2, $3, $4
>>>>> 2012-05-30 16:06:55,441 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>>>>> OutputData1:
>>>>> Store(SomeOutputFile1.csv:org.apache.pig.builtin.PigStorage) -
>>>>> scope-521 Operator Key: scope-521)
>>>>> 2012-05-30 16:06:55,441 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>>>>> OutputData2:
>>>>> Store(/SomeOutputFile2.csv:org.apache.pig.builtin.PigStorage) -
>>>>> scope-524 Operator Key: scope-524)
>>>>> 2012-05-30 16:06:55,441 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>>>>> OutputData2:
>>>>> Store(/SomeOutputFile3.csv:org.apache.pig.builtin.PigStorage) -
>>>>> scope-483 Operator Key: scope-483)
>>>>> 2012-05-30 16:06:55,453 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRComp
>>>>> iler
>>>>> - File concatenation threshold: 100 optimistic? false
>>>>> 2012-05-30 16:06:55,467 [main] INFO
>>>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
>>>>> paths to process : 1
>>>>> 2012-05-30 16:06:55,471 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
>>>>> Total input paths to process : 1
>>>>> 2012-05-30 16:06:55,483 [main] WARN
>>>>> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native
>>>>> library is available
>>>>> 2012-05-30 16:06:55,484 [main] INFO
>>>>> org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop
>>>>> library
>>>>> 2012-05-30 16:06:55,484 [main] INFO
>>>>> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native
>>>>> library loaded
>>>>> 2012-05-30 16:06:55,486 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
>>>>> Total input paths (combined) to process : 1
>>>>> 2012-05-30 16:06:55,486 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRComp
>>>>> iler
>>>>> - number of input files: 1
>>>>> 2012-05-30 16:06:55,516 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQ
>>>>> ueryOptimizer - MR plan size before optimization: 7
>>>>> 2012-05-30 16:06:55,516 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
>>>>> 20

RE: Job setup for a pig run takes ages

Posted by Danfeng Li <dl...@operasolutions.com>.
This is the jstack output during the setup time; not exactly sure how to interpret it.

Thanks.
Dan

[dli@hmaster run]$ jstack 15640
2012-06-18 17:32:47
Full thread dump Java HotSpot(TM) 64-Bit Server VM (17.0-b17 mixed mode):

"Attach Listener" daemon prio=10 tid=0x0000000055dcb800 nid=0x431d waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Low Memory Detector" daemon prio=10 tid=0x0000000055105000 nid=0x3d3b runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x0000000055103000 nid=0x3d3a waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x0000000055100000 nid=0x3d39 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00000000550fe000 nid=0x3d38 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00000000550de800 nid=0x3d37 in Object.wait() [0x0000000041d7a000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00002aaab48a3cf8> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
        - locked <0x00002aaab48a3cf8> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x00000000550dc800 nid=0x3d36 in Object.wait() [0x0000000041093000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00002aaab48a3cb0> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:485)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
        - locked <0x00002aaab48a3cb0> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x0000000055065800 nid=0x3d25 runnable [0x0000000041653000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.pig.newplan.logical.expression.ProjectExpression.getFieldSchema(ProjectExpression.java:164)
        at org.apache.pig.newplan.logical.relational.LOInnerLoad.getSchema(LOInnerLoad.java:59)
        at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:114)
        at org.apache.pig.newplan.logical.relational.LOInnerLoad.accept(LOInnerLoad.java:109)
        at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
        at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:94)
        at org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:71)
        at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
        at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
        at org.apache.pig.newplan.logical.optimizer.SchemaPatcher.transformed(SchemaPatcher.java:43)
        at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:113)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:281)
        at org.apache.pig.PigServer.compilePp(PigServer.java:1365)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1207)
        at org.apache.pig.PigServer.execute(PigServer.java:1201)
        at org.apache.pig.PigServer.access$100(PigServer.java:129)
        at org.apache.pig.PigServer$Graph.execute(PigServer.java:1528)
        at org.apache.pig.PigServer.executeBatchEx(PigServer.java:373)
        at org.apache.pig.PigServer.executeBatch(PigServer.java:340)
        at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:115)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:172)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
        at org.apache.pig.Main.run(Main.java:396)
        at org.apache.pig.Main.main(Main.java:107)

"VM Thread" prio=10 tid=0x00000000550d8800 nid=0x3d35 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x0000000055078800 nid=0x3d26 runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x000000005507a800 nid=0x3d27 runnable

"GC task thread#2 (ParallelGC)" prio=10 tid=0x000000005507c000 nid=0x3d28 runnable

"GC task thread#3 (ParallelGC)" prio=10 tid=0x000000005507e000 nid=0x3d29 runnable

"GC task thread#4 (ParallelGC)" prio=10 tid=0x0000000055080000 nid=0x3d2a runnable

"GC task thread#5 (ParallelGC)" prio=10 tid=0x0000000055081800 nid=0x3d2b runnable

"GC task thread#6 (ParallelGC)" prio=10 tid=0x0000000055083800 nid=0x3d2c runnable

"GC task thread#7 (ParallelGC)" prio=10 tid=0x0000000055085800 nid=0x3d2d runnable

"GC task thread#8 (ParallelGC)" prio=10 tid=0x0000000055087000 nid=0x3d2e runnable

"GC task thread#9 (ParallelGC)" prio=10 tid=0x0000000055089000 nid=0x3d2f runnable

"GC task thread#10 (ParallelGC)" prio=10 tid=0x000000005508b000 nid=0x3d30 runnable

"GC task thread#11 (ParallelGC)" prio=10 tid=0x000000005508c800 nid=0x3d31 runnable

"GC task thread#12 (ParallelGC)" prio=10 tid=0x000000005508e800 nid=0x3d32 runnable

"GC task thread#13 (ParallelGC)" prio=10 tid=0x0000000055090800 nid=0x3d33 runnable

"GC task thread#14 (ParallelGC)" prio=10 tid=0x0000000055092800 nid=0x3d34 runnable

"VM Periodic Task Thread" prio=10 tid=0x0000000055110000 nid=0x3d3c waiting on condition

JNI global references: 1463
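
For reference, the part of such a dump that usually matters is the "main" thread (the Pig client itself); it can be pulled out with something like this (a sketch):

# show only the "main" thread block of the dump
jstack 15640 | sed -n '/^"main"/,/^$/p'

In the dump above the main thread sits in the logical plan's SchemaResetter / getSchema calls, i.e. schema recomputation during plan optimization.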


-----Original Message-----
From: Dmitriy Ryaboy [mailto:dvryaboy@gmail.com]
Sent: Saturday, June 16, 2012 8:24 AM
To: user@pig.apache.org
Subject: Re: Job setup for a pig run takes ages

What loader are you using? The JT is not the place to look; try jstacking your pig process. Most likely it's talking to the NameNode most of the time because the loader is doing some per-file lookups.

On Jun 13, 2012, at 11:24 AM, Danfeng Li <dl...@operasolutions.com> wrote:

> We also run into the long setup time issue, but our problem is
> different
>
> 1. The setup time takes about 20 minutes; we can't see anything on the jobtracker during this setup time.
> 2. Our data is saved in flat files, uncompressed.
> 3. Our code consists of many small pig files; they are used in the
> following way in the main pig file:
> data_1 = load ...
> data_2 = load ...
> ...
> data_n = load ...
>
> run -param ... pigfile1.pig
> run -param ... pigfile2.pig
> ...
>
> store out1 ..
> store out2 ..
> ...
> 4. Here's the part of the log file from the setup time; notice the time difference between "13:46:42" and "14:05:23". During that time, we can't see anything on the jobtracker.
> ...
> 2012-06-13 13:46:30,488 [main] INFO  org.apache.pig.Main - Logging
> error messages to: pig_1339609590477.log
> 2012-06-13 13:46:30,796 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to hadoop file system at: hdfs://master:9000
> 2012-06-13 13:46:30,950 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to map-reduce job tracker at: master:9001
> 2012-06-13 13:46:32,766 [main] WARN
> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
> Multiple values found for rationale_fir. Using value : Account
> position (\\$
> 2012-06-13 13:46:32,766 [main] WARN
> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
> Multiple values found for rationale_sec. Using value K,
> 2012-06-13 13:46:32,766 [main] WARN
> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
> Multiple values found for rationale_thi. Using value %)
> 2012-06-13 13:46:32,767 [main] WARN
> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
> Multiple values found for detail_statment_pre. Using value  - matures
> on
> 2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for detail_statment_post. Using value .
> 2012-06-13 13:46:32,767 [main] WARN
> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
> Multiple values found for rationale_fir. Using value : Maturity date
> 2012-06-13 13:46:32,767 [main] WARN
> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
> Multiple values found for rationale_sec. Using value  Account position
> (\\$
> 2012-06-13 13:46:32,767 [main] WARN
> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
> Multiple values found for rationale_thi. Using value K,
> 2012-06-13 13:46:32,767 [main] WARN
> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
> Multiple values found for catalyst_pre. Using value  matures on
> 2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for catalyst_post. Using value .
> 2012-06-13 13:46:42,749 [main] INFO
> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
> script:
> REPLICATED_JOIN,HASH_JOIN,COGROUP,GROUP_BY,ORDER_BY,DISTINCT,STREAMING
> ,FILTER,CROSS,UNION
> 2012-06-13 13:46:42,749 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
> 2012-06-13 14:05:23,460 [main] INFO
> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
> pruned for var_raw: $0, $1, $2, $6, $7, $8, $9, $10
> 2012-06-13 14:05:23,474 [main] INFO
> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
> pruned for var_mf: $5, $6, $7, $8, $9, $11, $12, $14, $15, $16, $17,
> $18, $19, $21, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33,
> $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45
> 2012-06-13 14:05:23,475 [main] INFO
> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for starmine: $0, $3, $4, $5, $6, $9, $10, $11 ...
>
> Any help will be appreciated.
>
> Thanks.
> Dan
>
> -----Original Message-----
> From: Markus Resch [mailto:markus.resch@adtech.de]
> Sent: Wednesday, June 13, 2012 2:24 AM
> To: user@pig.apache.org
> Subject: Re: Job setup for a pig run takes ages
>
> Hey Alex,
>
> On one side I think you're right, but we need to keep in mind that the
> schema could change within some files of a glob (e.g. a schema extension
> update). The AvroStorage should check at least some hash of the schema to verify that all schemas of all input files are the same, and/or split them into groups of different schemas if required.
>
> I'm currently about to check this issue with the cloudera cdh4 pig version. I'll let you know if we see significantly different behavior.
>
> Best
> Markus
>
> Am Dienstag, den 12.06.2012, 19:16 -0400 schrieb Alex Rovner:
>> Thejas,
>>
>> Why can't Pig cache the results of getSchema() instead of repeatedly calling the load function? This is causing us to put lots of unnecessary workarounds in place, since during those calls one can't rely on the UDF context as it gets cleared somewhere in between those calls.
>>
>> Sent from my iPhone
>>
>> On Jun 1, 2012, at 12:45 PM, Thejas Nair <th...@hortonworks.com> wrote:
>>
>>> Can you do a jstack <pid> on the pig client process id a few times and see what it is doing when it is taking so long and send the results ?
>>> With that we should be able to easily identify what might be happening and suggest a fix.
>>>
>>> I think pig calls getSchema() on load functions more times than actually necessary during the query optimization phase (each time after a transform). One theory I have is that your avro load function does a stat on all the 50k files each time to determine the common schema. This might be causing the delay to generate query plan.
>>>
>>> The second large lag, before the job is queued, seems to be time spent in input split calculations, not within pig logic.
>>>
>>> These are just theories I have. The jstack output would help determine if this is actually the case.
>>>
>>> Also, I think this delay is likely to be caused by the number of input files, and not the actual data size. You might also want to look at ways to reduce the actual number of input files. Is it generated by a previous MR job ? If yes, given your cluster size, it works against you to have so many maps or reducers in the previous MR job. If pig is generating the data,  you should consider setting the parallelism (reducers) or the number of maps (see http://pig.apache.org/docs/r0.10.0/perf.html#combine-files) .
>>> While reducing number of input files will most likely solve your problem, I am still interested in knowing where the delays are coming from and if we can fix something in pig to improve the situation.
>>>
>>> Thanks,
>>> Thejas
>>>
>>>
>>> On 6/1/12 6:34 AM, Markus Resch wrote:
>>>> Hi Prashant, Hi Thejas,
>>>>
>>>> thanks for your very quick answer.
>>>> No, this is not a typo. Those time stamps are true and as I said
>>>> the machines are not very busy during this time.
>>>>
>>>> As this is our test cluster I am sure I am the only one who is
>>>> running jobs on it. Another issue we have is that we are currently
>>>> only able to run one job at a time but this shouldn't be the topic of this request.
>>>> We even have no continuous input stream to that cluster but copied
>>>> a bunch of data to it some time ago.
>>>> From my perspective, the 464 GB of input data you mentioned is the
>>>> uncompressed size of the 160 GByte of compressed files, which is what
>>>> I get when I use hadoop fs -dus on that folder.
>>>>
>>>> Another interesting fact for you could be that we're running the
>>>> cloudera CDH3 Update 3 version on our systems.
>>>>
>>>> I suspect this could be due to some fancy avro schema validation
>>>> implicitly executed by the avro storage? If so, can this be avoided?
>>>>
>>>> Sadly I'm currently not able to provide you the actual script as it
>>>> contains confidential information, but I will try to provide you a
>>>> version as soon as possible. But I'd rather think it's a configuration
>>>> problem in hadoop or pig anyway, as the script works fine with a
>>>> smaller amount of input data.
>>>>
>>>> I would ask the hadoop mailing list if this issue occurred during the
>>>> actual mapred run, but as it occurs even before a single mapred job is
>>>> launched, I suspect pig has the problem.
>>>>
>>>> Thanks
>>>> Markus
>>>>
>>>> This is the full log until the main work job starts:
>>>> mapred@ournamenode$ pig OurScript.pig
>>>> 2012-05-30 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging
>>>> error messages to: /tmp/pig_1338384441037.log
>>>> 2012-05-30 15:27:21,368 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>>> Connecting to hadoop file system at: hdfs://OurNamenode:9000
>>>> 2012-05-30 15:27:21,609 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>>> Connecting to map-reduce job tracker at:
>>>> dev-jobtracker001.eu-fra.adtech.com:54311
>>>> 2012-05-30 15:57:27,814 [main] WARN  org.apache.pig.PigServer -
>>>> Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
>>>> 2012-05-30 15:57:27,816 [main] INFO
>>>> org.apache.pig.tools.pigstats.ScriptState - Pig features used in
>>>> the
>>>> script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
>>>> 2012-05-30 15:57:27,816 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>>> pig.usenewlogicalplan is set to true. New logical plan will be used.
>>>> 2012-05-30 16:06:55,304 [main] INFO
>>>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
>>>> pruned for CampaignInfo: $0, $1, $2, $4, $5, $6, $8, $9
>>>> 2012-05-30 16:06:55,308 [main] INFO
>>>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
>>>> pruned for dataImport: $2, $3, $4
>>>> 2012-05-30 16:06:55,441 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>>>> OutputData1:
>>>> Store(SomeOutputFile1.csv:org.apache.pig.builtin.PigStorage) -
>>>> scope-521 Operator Key: scope-521)
>>>> 2012-05-30 16:06:55,441 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>>>> OutputData2:
>>>> Store(/SomeOutputFile2.csv:org.apache.pig.builtin.PigStorage) -
>>>> scope-524 Operator Key: scope-524)
>>>> 2012-05-30 16:06:55,441 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>>>> OutputData2:
>>>> Store(/SomeOutputFile3.csv:org.apache.pig.builtin.PigStorage) -
>>>> scope-483 Operator Key: scope-483)
>>>> 2012-05-30 16:06:55,453 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRComp
>>>> iler
>>>> - File concatenation threshold: 100 optimistic? false
>>>> 2012-05-30 16:06:55,467 [main] INFO
>>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
>>>> paths to process : 1
>>>> 2012-05-30 16:06:55,471 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
>>>> Total input paths to process : 1
>>>> 2012-05-30 16:06:55,483 [main] WARN
>>>> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native
>>>> library is available
>>>> 2012-05-30 16:06:55,484 [main] INFO
>>>> org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop
>>>> library
>>>> 2012-05-30 16:06:55,484 [main] INFO
>>>> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native
>>>> library loaded
>>>> 2012-05-30 16:06:55,486 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
>>>> Total input paths (combined) to process : 1
>>>> 2012-05-30 16:06:55,486 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRComp
>>>> iler
>>>> - number of input files: 1
>>>> 2012-05-30 16:06:55,516 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQ
>>>> ueryOptimizer - MR plan size before optimization: 7
>>>> 2012-05-30 16:06:55,516 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
>>>> 20

Re: Job setup for a pig run takes ages

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
What loader are you using? The JT is not the place to look; try jstacking your pig process. Most likely it's talking to the NameNode most of the time because the loader is doing some per-file lookups.
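
Something like this is usually enough (a rough sketch; assumes it is run on the box where the pig client was started and that the JDK's jps and jstack are on the path):

# find the Pig client JVM and dump its stacks
PIG_PID=$(jps -l | awk '/org.apache.pig.Main/ {print $1}')
jstack "$PIG_PID" > pig-client.jstack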

On Jun 13, 2012, at 11:24 AM, Danfeng Li <dl...@operasolutions.com> wrote:

> We also run into the long setup time issue, but our problem is different
> 
> 1. The setup time takes about 20 minutes; we can't see anything on the jobtracker during this setup time.
> 2. Our data is saved in flat files, uncompressed.
> 3. Our code consists of many small pig files; they are used in the following way in the main pig file:
> data_1 = load ...
> data_2 = load ...
> ...
> data_n = load ...
> 
> run -param ... pigfile1.pig
> run -param ... pigfile2.pig
> ...
> 
> store out1 ..
> store out2 ..
> ...
> 4. Here's the part of the log file from the setup time; notice the time difference between "13:46:42" and "14:05:23". During that time, we can't see anything on the jobtracker.
> ...
> 2012-06-13 13:46:30,488 [main] INFO  org.apache.pig.Main - Logging error messages to: pig_1339609590477.log
> 2012-06-13 13:46:30,796 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://master:9000
> 2012-06-13 13:46:30,950 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: master:9001
> 2012-06-13 13:46:32,766 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_fir. Using value : Account position (\\$
> 2012-06-13 13:46:32,766 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_sec. Using value K,
> 2012-06-13 13:46:32,766 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_thi. Using value %)
> 2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for detail_statment_pre. Using value  - matures on
> 2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for detail_statment_post. Using value .
> 2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_fir. Using value : Maturity date
> 2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_sec. Using value  Account position (\\$
> 2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_thi. Using value K,
> 2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for catalyst_pre. Using value  matures on
> 2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for catalyst_post. Using value .
> 2012-06-13 13:46:42,749 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: REPLICATED_JOIN,HASH_JOIN,COGROUP,GROUP_BY,ORDER_BY,DISTINCT,STREAMING,FILTER,CROSS,UNION
> 2012-06-13 13:46:42,749 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
> 2012-06-13 14:05:23,460 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for var_raw: $0, $1, $2, $6, $7, $8, $9, $10
> 2012-06-13 14:05:23,474 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for var_mf: $5, $6, $7, $8, $9, $11, $12, $14, $15, $16, $17, $18, $19, $21, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45
> 2012-06-13 14:05:23,475 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for starmine: $0, $3, $4, $5, $6, $9, $10, $11
> ...
> 
> Any help will be appreciated.
> 
> Thanks.
> Dan
> 
> -----Original Message-----
> From: Markus Resch [mailto:markus.resch@adtech.de]
> Sent: Wednesday, June 13, 2012 2:24 AM
> To: user@pig.apache.org
> Subject: Re: Job setup for a pig run takes ages
> 
> Hey Alex,
> 
> On one side I think you're right, but we need to keep in mind that the schema could change within some files of a glob (e.g. a schema extension
> update). The AvroStorage should check at least some hash of the schema to verify that all schemas of all input files are the same, and/or split them into groups of different schemas if required.
>
> I'm currently about to check this issue with the cloudera cdh4 pig version. I'll let you know if we see significantly different behavior.
> 
> Best
> Markus
> 
> Am Dienstag, den 12.06.2012, 19:16 -0400 schrieb Alex Rovner:
>> Thejas,
>> 
>> Why can't Pig cache the results of getSchema() instead of repeatedly calling the load function? This is causing us to put lots of unnecessary workarounds in place, since during those calls one can't rely on the UDF context as it gets cleared somewhere in between those calls.
>> 
>> Sent from my iPhone
>> 
>> On Jun 1, 2012, at 12:45 PM, Thejas Nair <th...@hortonworks.com> wrote:
>> 
>>> Can you do a jstack <pid> on the pig client process id a few times and see what it is doing when it is taking so long and send the results ?
>>> With that we should be able to easily identify what might be happening and suggest a fix.
>>> 
>>> I think pig calls getSchema() on load functions more times than actually necessary during the query optimization phase (each time after a transform). One theory I have is that your avro load function does a stat on all the 50k files each time to determine the common schema. This might be causing the delay to generate query plan.
>>> 
>>> The second large lag, before the job is queued, seems to be time spent in input split calculations, not within pig logic.
>>> 
>>> These are just theories I have. The jstack output would help determine if this is actually the case.
>>> 
>>> Also, I think this delay is likely to be caused by the number of input files, and not the actual data size. You might also want to look at ways to reduce the actual number of input files. Is it generated by a previous MR job ? If yes, given your cluster size, it works against you to have so many maps or reducers in the previous MR job. If pig is generating the data,  you should consider setting the parallelism (reducers) or the number of maps (see http://pig.apache.org/docs/r0.10.0/perf.html#combine-files) .
>>> While reducing number of input files will most likely solve your problem, I am still interested in knowing where the delays are coming from and if we can fix something in pig to improve the situation.
>>> 
>>> Thanks,
>>> Thejas
>>> 
>>> 
>>> On 6/1/12 6:34 AM, Markus Resch wrote:
>>>> Hi Prashant, Hi Thejas,
>>>> 
>>>> thanks for your very quick answer.
>>>> No, this is not a typo. Those time stamps are true and as I said
>>>> the machines are not very busy during this time.
>>>> 
>>>> As this is our test cluster I am sure I am the only one who is
>>>> running jobs on it. Another issue we have is that we are currently
>>>> only able to run one job at a time but this shouldn't be the topic of this request.
>>>> We even have no continuous input stream to that cluster but copied
>>>> a bunch of data to it some time ago.
>>>> From my perspective, the 464 GB of input data you mentioned is the
>>>> uncompressed size of the 160 GByte of compressed files, which is what
>>>> I get when I use hadoop fs -dus on that folder.
>>>> 
>>>> Another interesting fact for you could be that we're running the
>>>> cloudera CDH3 Update 3 version on our systems.
>>>> 
>>>> I suspect this could be due to some fancy avro schema validation
>>>> implicitly executed by the avro storage? If so, can this be avoided?
>>>> 
>>>> Sadly I'm currently not able to provide you the actual script as it
>>>> contains confidential information, but I will try to provide you a
>>>> version as soon as possible. But I'd rather think it's a configuration
>>>> problem in hadoop or pig anyway, as the script works fine with a
>>>> smaller amount of input data.
>>>>
>>>> I would ask the hadoop mailing list if this issue occurred during the
>>>> actual mapred run, but as it occurs even before a single mapred job is
>>>> launched, I suspect pig has the problem.
>>>> 
>>>> Thanks
>>>> Markus
>>>> 
>>>> This is the full log until the main work job starts:
>>>> mapred@ournamenode$ pig OurScript.pig
>>>> 2012-05-30 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging
>>>> error messages to: /tmp/pig_1338384441037.log
>>>> 2012-05-30 15:27:21,368 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>>> Connecting to hadoop file system at: hdfs://OurNamenode:9000
>>>> 2012-05-30 15:27:21,609 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>>> Connecting to map-reduce job tracker at:
>>>> dev-jobtracker001.eu-fra.adtech.com:54311
>>>> 2012-05-30 15:57:27,814 [main] WARN  org.apache.pig.PigServer -
>>>> Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
>>>> 2012-05-30 15:57:27,816 [main] INFO
>>>> org.apache.pig.tools.pigstats.ScriptState - Pig features used in
>>>> the
>>>> script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
>>>> 2012-05-30 15:57:27,816 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>>> pig.usenewlogicalplan is set to true. New logical plan will be used.
>>>> 2012-05-30 16:06:55,304 [main] INFO
>>>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
>>>> pruned for CampaignInfo: $0, $1, $2, $4, $5, $6, $8, $9
>>>> 2012-05-30 16:06:55,308 [main] INFO
>>>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
>>>> pruned for dataImport: $2, $3, $4
>>>> 2012-05-30 16:06:55,441 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>>>> OutputData1:
>>>> Store(SomeOutputFile1.csv:org.apache.pig.builtin.PigStorage) -
>>>> scope-521 Operator Key: scope-521)
>>>> 2012-05-30 16:06:55,441 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>>>> OutputData2:
>>>> Store(/SomeOutputFile2.csv:org.apache.pig.builtin.PigStorage) -
>>>> scope-524 Operator Key: scope-524)
>>>> 2012-05-30 16:06:55,441 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>>>> OutputData2:
>>>> Store(/SomeOutputFile3.csv:org.apache.pig.builtin.PigStorage) -
>>>> scope-483 Operator Key: scope-483)
>>>> 2012-05-30 16:06:55,453 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRComp
>>>> iler
>>>> - File concatenation threshold: 100 optimistic? false
>>>> 2012-05-30 16:06:55,467 [main] INFO
>>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
>>>> paths to process : 1
>>>> 2012-05-30 16:06:55,471 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
>>>> Total input paths to process : 1
>>>> 2012-05-30 16:06:55,483 [main] WARN
>>>> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native
>>>> library is available
>>>> 2012-05-30 16:06:55,484 [main] INFO
>>>> org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop
>>>> library
>>>> 2012-05-30 16:06:55,484 [main] INFO
>>>> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native
>>>> library loaded
>>>> 2012-05-30 16:06:55,486 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
>>>> Total input paths (combined) to process : 1
>>>> 2012-05-30 16:06:55,486 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRComp
>>>> iler
>>>> - number of input files: 1
>>>> 2012-05-30 16:06:55,516 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQ
>>>> ueryOptimizer - MR plan size before optimization: 7
>>>> 2012-05-30 16:06:55,516 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
>>>> 20

RE: Job setup for a pig run takes ages

Posted by Danfeng Li <dl...@operasolutions.com>.
We also run into the long setup time issue, but our problem is different

1. The setup time takes about 20 minutes; we can't see anything on the jobtracker during this setup time.
2. Our data is saved in flat files, uncompressed.
3. Our code consists of many small pig files; they are used in the following way in the main pig file:
data_1 = load ...
data_2 = load ...
...
data_n = load ...

run -param ... pigfile1.pig
run -param ... pigfile2.pig
...

store out1 ..
store out2 ..
...
4. Here's the part of the log file from the setup time; notice the time difference between "13:46:42" and "14:05:23". During that time, we can't see anything on the jobtracker.
...
2012-06-13 13:46:30,488 [main] INFO  org.apache.pig.Main - Logging error messages to: pig_1339609590477.log
2012-06-13 13:46:30,796 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://master:9000
2012-06-13 13:46:30,950 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: master:9001
2012-06-13 13:46:32,766 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_fir. Using value : Account position (\\$
2012-06-13 13:46:32,766 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_sec. Using value K,
2012-06-13 13:46:32,766 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_thi. Using value %)
2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for detail_statment_pre. Using value  - matures on
2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for detail_statment_post. Using value .
2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_fir. Using value : Maturity date
2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_sec. Using value  Account position (\\$
2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_thi. Using value K,
2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for catalyst_pre. Using value  matures on
2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for catalyst_post. Using value .
2012-06-13 13:46:42,749 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: REPLICATED_JOIN,HASH_JOIN,COGROUP,GROUP_BY,ORDER_BY,DISTINCT,STREAMING,FILTER,CROSS,UNION
2012-06-13 13:46:42,749 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
2012-06-13 14:05:23,460 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for var_raw: $0, $1, $2, $6, $7, $8, $9, $10
2012-06-13 14:05:23,474 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for var_mf: $5, $6, $7, $8, $9, $11, $12, $14, $15, $16, $17, $18, $19, $21, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45
2012-06-13 14:05:23,475 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for starmine: $0, $3, $4, $5, $6, $9, $10, $11
...

Any help will be appreciated.

Thanks.
Dan

-----Original Message-----
From: Markus Resch [mailto:markus.resch@adtech.de]
Sent: Wednesday, June 13, 2012 2:24 AM
To: user@pig.apache.org
Subject: Re: Job setup for a pig run takes ages

Hey Alex,

On one side I think you're right, but we need to keep in mind that the schema could change within some files of a glob (e.g. a schema extension
update). The AvroStorage should check at least some hash of the schema to verify that all schemas of all input files are the same, and/or split them into groups of different schemas if required.
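
As a quick manual cross-check, the writer schemas of a few files can be dumped and fingerprinted along these lines (only a sketch; the paths are placeholders and it assumes the avro-tools jar is available locally):

# compare the embedded schema of a handful of input files
for f in part-00000.avro part-00010.avro part-00020.avro; do
    hadoop fs -get /some/input/dir/$f /tmp/$f
    java -jar avro-tools.jar getschema /tmp/$f | md5sum
done | sort -u
# more than one distinct checksum means the glob mixes schemas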

I'm currently about to check this issue with the cloudera cdh4 pig version. I'll let you know if we see significantly different behavior.

Best
Markus

Am Dienstag, den 12.06.2012, 19:16 -0400 schrieb Alex Rovner:
> Thejas,
>
> Why can't Pig cache the results of getSchema() instead of repeatedly calling the load function? This is causing us to put lots of unnecessary workarounds in place, since during those calls one can't rely on the UDF context as it gets cleared somewhere in between those calls.
>
> Sent from my iPhone
>
> On Jun 1, 2012, at 12:45 PM, Thejas Nair <th...@hortonworks.com> wrote:
>
> > Can you do a jstack <pid> on the pig client process id a few times and see what it is doing when it is taking so long and send the results ?
> > With that we should be able to easily identify what might be happening and suggest a fix.
> >
> > I think pig calls getSchema() on load functions more times than actually necessary during the query optimization phase (each time after a transform). One theory I have is that your avro load function does a stat on all the 50k files each time to determine the common schema. This might be causing the delay to generate query plan.
> >
> > The second large lag, before the job is queued, seems to be time spent in input split calculations, not within pig logic.
> >
> > These are just theories I have. The jstack output would help determine if this is actually the case.
> >
> > Also, I think this delay is likely to be caused by the number of input files, and not the actual data size. You might also want to look at ways to reduce the actual number of input files. Is it generated by a previous MR job ? If yes, given your cluster size, it works against you to have so many maps or reducers in the previous MR job. If pig is generating the data,  you should consider setting the parallelism (reducers) or the number of maps (see http://pig.apache.org/docs/r0.10.0/perf.html#combine-files) .
> > While reducing number of input files will most likely solve your problem, I am still interested in knowing where the delays are coming from and if we can fix something in pig to improve the situation.
> >
> > Thanks,
> > Thejas
> >
> >
> > On 6/1/12 6:34 AM, Markus Resch wrote:
> >> Hi Prashant, Hi Thejas,
> >>
> >> thanks for your very quick answer.
> >> No, this is not a typo. Those time stamps are true and as I said
> >> the machines are not very busy during this time.
> >>
> >> As this is our test cluster I am sure I am the only one who is
> >> running jobs on it. Another issue we have is that we are currently
> >> only able to run one job at a time but this shouldn't be the topic of this request.
> >> We even have no continuous input stream to that cluster but copied
> >> a bunch of data to it some time ago.
> >> From my perspective the 464 GB of input data you are mentioned is
> >> the uncompressed amount of the 160GByte compressed files. Which I
> >> get when I use hadoop -f dus on that folder.
> >>
> >> Another interesting fact for you could be that we're running the
> >> cloudera CDH3 Update 3 version on our systems.
> >>
> >> I suspect this could be due to some fancy avro schema validation
> >> implicitly executed by the avro storage? If so, can this be avoided?
> >>
> >> Sadly I'm currently not able to provide you the actual script
> >> currently as it contains confidential information but I will try to
> >> provide you a version as soon as possible. But I'd rather think of
> >> a configuration problem to the hadoop or pig anyways as the script
> >> works fine with a smaller amount of input data
> >>
> >> I would ask the hadoop mailing list if this issue would occur
> >> during the actual mapred run but as this occur even before a single
> >> mapred job is launched I suspect pig to have a problem.
> >>
> >> Thanks
> >> Markus
> >>
> >> This is the full log until the main work job starts:
> >> mapred@ournamenode$ pig OurScript.pig
> >> 2012-05-30 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging
> >> error messages to: /tmp/pig_1338384441037.log
> >> 2012-05-30 15:27:21,368 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> >> Connecting to hadoop file system at: hdfs://OurNamenode:9000
> >> 2012-05-30 15:27:21,609 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> >> Connecting to map-reduce job tracker at:
> >> dev-jobtracker001.eu-fra.adtech.com:54311
> >> 2012-05-30 15:57:27,814 [main] WARN  org.apache.pig.PigServer -
> >> Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
> >> 2012-05-30 15:57:27,816 [main] INFO
> >> org.apache.pig.tools.pigstats.ScriptState - Pig features used in
> >> the
> >> script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
> >> 2012-05-30 15:57:27,816 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> >> pig.usenewlogicalplan is set to true. New logical plan will be used.
> >> 2012-05-30 16:06:55,304 [main] INFO
> >> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
> >> pruned for CampaignInfo: $0, $1, $2, $4, $5, $6, $8, $9
> >> 2012-05-30 16:06:55,308 [main] INFO
> >> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
> >> pruned for dataImport: $2, $3, $4
> >> 2012-05-30 16:06:55,441 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
> >> OutputData1:
> >> Store(SomeOutputFile1.csv:org.apache.pig.builtin.PigStorage) -
> >> scope-521 Operator Key: scope-521)
> >> 2012-05-30 16:06:55,441 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
> >> OutputData2:
> >> Store(/SomeOutputFile2.csv:org.apache.pig.builtin.PigStorage) -
> >> scope-524 Operator Key: scope-524)
> >> 2012-05-30 16:06:55,441 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
> >> OutputData2:
> >> Store(/SomeOutputFile3.csv:org.apache.pig.builtin.PigStorage) -
> >> scope-483 Operator Key: scope-483)
> >> 2012-05-30 16:06:55,453 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRComp
> >> iler
> >> - File concatenation threshold: 100 optimistic? false
> >> 2012-05-30 16:06:55,467 [main] INFO
> >> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> >> paths to process : 1
> >> 2012-05-30 16:06:55,471 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
> >> Total input paths to process : 1
> >> 2012-05-30 16:06:55,483 [main] WARN
> >> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native
> >> library is available
> >> 2012-05-30 16:06:55,484 [main] INFO
> >> org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop
> >> library
> >> 2012-05-30 16:06:55,484 [main] INFO
> >> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native
> >> library loaded
> >> 2012-05-30 16:06:55,486 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
> >> Total input paths (combined) to process : 1
> >> 2012-05-30 16:06:55,486 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRComp
> >> iler
> >> - number of input files: 1
> >> 2012-05-30 16:06:55,516 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQ
> >> ueryOptimizer - MR plan size before optimization: 7
> >> 2012-05-30 16:06:55,516 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
> >> 2012-05-30 16:06:55,516 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
> >> 2012-05-30 16:06:55,517 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
> >> 2012-05-30 16:06:55,517 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
> >> 2012-05-30 16:06:55,517 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 0 out of total 2 MR operators.
> >> 2012-05-30 16:06:55,517 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQ
> >> ueryOptimizer - MR plan size after optimization: 3
> >> 2012-05-30 16:06:56,131 [main] INFO
> >> org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
> >> added to the job
> >> 2012-05-30 16:06:56,138 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCon
> >> trolCompiler - mapred.job.reduce.markreset.buffer.percent is not
> >> set, set to default 0.3
> >> 2012-05-30 16:07:00,432 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCon
> >> trolCompiler - Setting up single store job
> >> 2012-05-30 16:07:00,468 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapRed
> >> uceLauncher - 1 map-reduce job(s) waiting for submission
> >> 2012-05-30 16:07:00,969 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapRed
> >> uceLauncher - 0% complete
> >> 2012-05-30 16:07:01,460 [Thread-4] INFO
> >> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> >> paths to process : 1
> >> 2012-05-30 16:07:01,460 [Thread-4] INFO
> >> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
> >> Total input paths to process : 1
> >> 2012-05-30 16:07:01,462 [Thread-4] INFO
> >> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
> >> Total input paths (combined) to process : 1
> >> 2012-05-30 16:07:02,353 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapRed
> >> uceLauncher - HadoopJobId: job_201205080931_0178
> >> 2012-05-30 16:07:02,353 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapRed
> >> uceLauncher - More information at:
> >> http://ourJobtracker:50030/jobdetails.jsp?jobid=job_201205080931_01
> >> 78
> >> 2012-05-30 16:07:12,421 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapRed
> >> uceLauncher - 16% complete
> >> 2012-05-30 16:07:13,929 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapRed
> >> uceLauncher - 33% complete
> >> 2012-05-30 16:07:16,969 [main] INFO
> >> org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
> >> added to the job
> >> 2012-05-30 16:07:16,970 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCon
> >> trolCompiler - mapred.job.reduce.markreset.buffer.percent is not
> >> set, set to default 0.3
> >> 2012-05-30 16:07:26,888 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCon
> >> trolCompiler - Setting up single store job
> >> 2012-05-30 16:07:30,886 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCon
> >> trolCompiler - BytesPerReducer=1000000000 maxReducers=999
> >> totalInputFileSize=463325937621
> >> 2012-05-30 16:07:30,886 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCon
> >> trolCompiler - Neither PARALLEL nor default parallelism is set for
> >> this job. Setting number of reducers to 464
> >> 2012-05-30 16:07:30,947 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
> >> 2012-05-30 16:15:38,022 [Thread-16] INFO
> >> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> >> paths to process : 50353
> >> 2012-05-30 16:15:54,421 [Thread-16] INFO
> >> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
> >> Total input paths (combined) to process : 6899
> >> 2012-05-30 16:15:57,844 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapRed
> >> uceLauncher - HadoopJobId: job_201205080931_0179
> >>
> >>
> >>
> >>
> >> Am Donnerstag, den 31.05.2012, 02:57 -0700 schrieb Prashant Kommireddi:
> >>> Hi Markus,
> >>>
> >>> Have you checked the JobTracker at the time of launching the job
> >>> that Map slots were available?
> >>>
> >>> Looks like the input dataset size is ~464 GB. Since you mentioned
> >>> 10 GB jobs are running fine, there should be no reason a larger
> >>> dataset should be stuck, atleast not on Pig side. I can't think of
> >>> a good reason why the job does not take off other than the fact
> >>> that cluster was busy running some other job.
> >>>
> >>> I see that the number of files being processed is large, 50353.
> >>> That could be a reason for slowness, but ~8 minutes as shown in
> >>> the logs seems to be on the higher end for that.
> >>>
> >>> May be also post your script here.
> >>>
> >>> On Thu, May 31, 2012 at 2:38 AM, Markus Resch<ma...@adtech.de>wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> when we're running a pig job for aggregating some amount of
> >>>> slightly compressed avro data (~160GByte), the time until the
> >>>> first actual mapred job starts takes ages:
> >>>> 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error
> >>>> messages
> >>>> to:
> >>>> [...]
> >>>> 15:57:27,816 [main] INFO
> >>>> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script:
> >>>> [...]
> >>>> 16:07:00,969 [main] INFO
> >>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapR
> >>>> educeLauncher
> >>>> - 0% complete
> >>>> [...]
> >>>> 16:07:30,886 [main] INFO
> >>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobC
> >>>> ontrolCompiler
> >>>> - BytesPerReducer=1000000000 maxReducers=999
> >>>> totalInputFileSize=463325937621 [...]
> >>>> 16:15:38,022 [Thread-16] INFO
> >>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total
> >>>> input paths to process : 50353
> >>>>
> >>>> This log messages are from our test cluster which has a dedicated
> >>>> jobtracker and namenode each and 5 data nodes with a map task
> >>>> capacity of 15 and a reduce task capacity of 10. There were 6899
> >>>> map tasks and
> >>>> 464 reduce tasks set up.
> >>>>
> >>>> During the initialisation phase we were observing the work load
> >>>> and memory usage of jobtracker, namenode and some data nodes using top.
> >>>> Those were nearly all the time kind of bored (e.g. 30% cpu load
> >>>> on the namenode, total idle on he data nodes). When the jobs were
> >>>> running most of the tasks where in "waiting for IO" most of the
> >>>> time. It seemed there was some swapping space reserved but rarely used in those times.
> >>>>
> >>>> In our eyes it looks like a hadoop config issue but we have no
> >>>> idea what it exaclty could be. Jobs with about 10GBytes of input
> >>>> data were running quite well.
> >>>>
> >>>> Any hint where to tweak will be appreciated
> >>>>
> >>>> Thanks
> >>>> Markus
> >>>>
> >>>>
> >>
> >

--
Markus Resch
Software Developer
P: +49 6103-5715-236 | F: +49 6103-5715-111 | ADTECH GmbH | Robert-Bosch-Str. 32 | 63303 Dreieich | Germany www.adtech.com<http://www.adtech.com>

ADTECH | A Division of Advertising.com Group - Residence of the Company:
Dreieich, Germany - Registration Office: Offenbach, HRB 46021 Management Board: Erhard Neumann, Mark Thielen

This message contains privileged and confidential information. Any dissemination, distribution, copying or other use of this message or any of its content (data, prices...) to any third parties may only occur with ADTECH's prior consent.


Re: Job setup for a pig run takes ages

Posted by Markus Resch <ma...@adtech.de>.
Hey Alex,

On one side I think you're right, but we need to keep in mind that the
schema could change within some files of a glob (e.g. a schema extension
update). The AvroStorage should check at least some hash of the schema
to verify that the schemas of all input files are the same, and/or split
them into groups of different schemas if required.
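
(A minimal sketch of the kind of check described above, not code from AvroStorage itself: read the header schema of each Avro file once and group the files by schema, so a glob containing mixed schemas can be detected or split. It assumes Avro's DataFileStream and Hadoop's FileSystem APIs; the class and method names are made up for the example, and the schema's JSON string stands in for the "hash".)

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch: read the header schema of every Avro file under a directory and
 *  group the files by schema, so identical schemas are only handled once. */
public class AvroSchemaGrouper {

    public static Map<String, List<Path>> groupBySchema(Path dir, Configuration conf)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Map<String, List<Path>> groups = new HashMap<String, List<Path>>();
        for (FileStatus status : fs.listStatus(dir)) {
            if (status.isDir()) {
                continue; // a real implementation would recurse and expand globs
            }
            DataFileStream<GenericRecord> stream = new DataFileStream<GenericRecord>(
                    fs.open(status.getPath()), new GenericDatumReader<GenericRecord>());
            try {
                // The schema's JSON string acts as the grouping key here;
                // newer Avro versions also offer schema fingerprints.
                String key = stream.getSchema().toString();
                List<Path> paths = groups.get(key);
                if (paths == null) {
                    paths = new ArrayList<Path>();
                    groups.put(key, paths);
                }
                paths.add(status.getPath());
            } finally {
                stream.close();
            }
        }
        return groups;
    }
}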

I'm currently about to check this issue with the Cloudera CDH4 Pig
version. I'll let you know if we see significantly different behavior.

Best
Markus

On Tuesday, 12.06.2012, at 19:16 -0400, Alex Rovner wrote:
> Thejas,
> 
> Why can't Pig cache the results of getSchema() instead of repeatedly calling the load function? This is causing us to put lots of unnecessary workarounds in place, since during those calls one can't rely on the UDF context, as it gets cleared somewhere in between those calls.
> 
> Sent from my iPhone
> 
> On Jun 1, 2012, at 12:45 PM, Thejas Nair <th...@hortonworks.com> wrote:
> 
> > Can you do a jstack <pid> on the pig client process id a few times and see what it is doing when it is taking so long and send the results ?
> > With that we should be able to easily identify what might be happening and suggest a fix.
> > 
> > I think pig calls getSchema() on load functions more times than actually necessary during the query optimization phase (each time after a transform). One theory I have is that your avro load function does a stat on all the 50k files each time to determine the common schema. This might be causing the delay to generate query plan.
> > 
> > The second large lag, before the job is queued, seems to be time spent in input split calculation, not within Pig logic.
> > 
> > These are just theories I have. The jstack output would help determine if this is actually the case.
> > 
> > Also, I think this delay is likely to be caused by the number of input files, and not the actual data size. You might also want to look at ways to reduce the actual number of input files. Is it generated by a previous MR job? If yes, given your cluster size, it works against you to have so many maps or reducers in the previous MR job. If Pig is generating the data, you should consider setting the parallelism (reducers) or the number of maps (see http://pig.apache.org/docs/r0.10.0/perf.html#combine-files).
> > While reducing number of input files will most likely solve your problem, I am still interested in knowing where the delays are coming from and if we can fix something in pig to improve the situation.
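
(As a rough illustration of the combine-files and PARALLEL suggestions in the quoted paragraph above, and not a recommendation from the thread: a sketch that enables Pig's split combination and caps reducer parallelism through the PigServer API. The property names follow the linked performance docs; the load function, field names and paths in the script fragment are hypothetical.)

import java.util.Properties;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class CombineSmallFilesExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Let Pig combine many small input files into fewer map splits
        // (see the combine-files section of the performance docs).
        props.setProperty("pig.splitCombination", "true");
        props.setProperty("pig.maxCombinedSplitSize", "268435456"); // ~256 MB per split

        PigServer pig = new PigServer(ExecType.MAPREDUCE, props);

        // Hypothetical script fragment: an explicit PARALLEL clause caps the
        // number of reducers and therefore the number of output files.
        pig.registerQuery("raw = LOAD '/data/events' USING SomeAvroLoader();");
        pig.registerQuery("grouped = GROUP raw BY campaignId PARALLEL 40;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(raw);");
        pig.store("counts", "/data/event_counts");
    }
}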
> > 
> > Thanks,
> > Thejas
> > 
> > 
> > On 6/1/12 6:34 AM, Markus Resch wrote:
> >> Hi Prashant, Hi Thejas,
> >> 
> >> thanks for your very quick answer.
> >> No, this is not a typo. Those time stamps are true and as I said the
> >> machines are not very busy during this time.
> >> 
> >> As this is our test cluster I am sure I am the only one who is running
> >> jobs on it. Another issue we have is that we are currently only able to
> >> run one job at a time but this shouldn't be the topic of this request.
> >> We even have no continuous input stream to that cluster but copied a
> >> bunch of data to it some time ago.
> >> From my perspective, the 464 GB of input data you mentioned is the
> >> uncompressed amount of the 160 GByte of compressed files, which is what I get
> >> when I run hadoop fs -dus on that folder.
> >> 
> >> Another interesting fact for you could be that we're running the
> >> cloudera CDH3 Update 3 version on our systems.
> >> 
> >> I suspect this could be due to some fancy Avro schema validation
> >> implicitly executed by the AvroStorage. If so, can this be avoided?
> >> 
> >> Sadly, I'm currently not able to provide you the actual script,
> >> as it contains confidential information, but I will try to provide you a
> >> version as soon as possible. I'd rather suspect a configuration
> >> problem in Hadoop or Pig anyway, as the script works fine with a
> >> smaller amount of input data.
> >> 
> >> I would ask the Hadoop mailing list if this issue occurred during the
> >> actual mapred run, but as it occurs even before a single mapred job is
> >> launched, I suspect Pig to be the problem.
> >> 
> >> Thanks
> >> Markus
> >> 
> >> This is the full log until the main work job starts:
> >> mapred@ournamenode$ pig OurScript.pig
> >> 2012-05-30 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error
> >> messages to: /tmp/pig_1338384441037.log
> >> 2012-05-30 15:27:21,368 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> >> Connecting to hadoop file system at: hdfs://OurNamenode:9000
> >> 2012-05-30 15:27:21,609 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> >> Connecting to map-reduce job tracker at:
> >> dev-jobtracker001.eu-fra.adtech.com:54311
> >> 2012-05-30 15:57:27,814 [main] WARN  org.apache.pig.PigServer -
> >> Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
> >> 2012-05-30 15:57:27,816 [main] INFO
> >> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
> >> script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
> >> 2012-05-30 15:57:27,816 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> >> pig.usenewlogicalplan is set to true. New logical plan will be used.
> >> 2012-05-30 16:06:55,304 [main] INFO
> >> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
> >> for CampaignInfo: $0, $1, $2, $4, $5, $6, $8, $9
> >> 2012-05-30 16:06:55,308 [main] INFO
> >> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
> >> for dataImport: $2, $3, $4
> >> 2012-05-30 16:06:55,441 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
> >> OutputData1:
> >> Store(SomeOutputFile1.csv:org.apache.pig.builtin.PigStorage) - scope-521
> >> Operator Key: scope-521)
> >> 2012-05-30 16:06:55,441 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
> >> OutputData2:
> >> Store(/SomeOutputFile2.csv:org.apache.pig.builtin.PigStorage) -
> >> scope-524 Operator Key: scope-524)
> >> 2012-05-30 16:06:55,441 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
> >> OutputData2:
> >> Store(/SomeOutputFile3.csv:org.apache.pig.builtin.PigStorage) -
> >> scope-483 Operator Key: scope-483)
> >> 2012-05-30 16:06:55,453 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
> >> - File concatenation threshold: 100 optimistic? false
> >> 2012-05-30 16:06:55,467 [main] INFO
> >> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> >> paths to process : 1
> >> 2012-05-30 16:06:55,471 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> >> input paths to process : 1
> >> 2012-05-30 16:06:55,483 [main] WARN
> >> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library
> >> is available
> >> 2012-05-30 16:06:55,484 [main] INFO
> >> org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop
> >> library
> >> 2012-05-30 16:06:55,484 [main] INFO
> >> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library
> >> loaded
> >> 2012-05-30 16:06:55,486 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> >> input paths (combined) to process : 1
> >> 2012-05-30 16:06:55,486 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
> >> - number of input files: 1
> >> 2012-05-30 16:06:55,516 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 7
> >> 2012-05-30 16:06:55,516 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
> >> 2012-05-30 16:06:55,516 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
> >> 2012-05-30 16:06:55,517 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
> >> 2012-05-30 16:06:55,517 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
> >> 2012-05-30 16:06:55,517 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 0 out of total 2 MR operators.
> >> 2012-05-30 16:06:55,517 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 3
> >> 2012-05-30 16:06:56,131 [main] INFO
> >> org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
> >> added to the job
> >> 2012-05-30 16:06:56,138 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> >> 2012-05-30 16:07:00,432 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
> >> 2012-05-30 16:07:00,468 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission
> >> 2012-05-30 16:07:00,969 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> >> 2012-05-30 16:07:01,460 [Thread-4] INFO
> >> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> >> paths to process : 1
> >> 2012-05-30 16:07:01,460 [Thread-4] INFO
> >> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> >> input paths to process : 1
> >> 2012-05-30 16:07:01,462 [Thread-4] INFO
> >> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> >> input paths (combined) to process : 1
> >> 2012-05-30 16:07:02,353 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201205080931_0178
> >> 2012-05-30 16:07:02,353 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://ourJobtracker:50030/jobdetails.jsp?jobid=job_201205080931_0178
> >> 2012-05-30 16:07:12,421 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 16% complete
> >> 2012-05-30 16:07:13,929 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 33% complete
> >> 2012-05-30 16:07:16,969 [main] INFO
> >> org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
> >> added to the job
> >> 2012-05-30 16:07:16,970 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> >> 2012-05-30 16:07:26,888 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
> >> 2012-05-30 16:07:30,886 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
> >> 2012-05-30 16:07:30,886 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 464
> >> 2012-05-30 16:07:30,947 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
> >> 2012-05-30 16:15:38,022 [Thread-16] INFO
> >> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> >> paths to process : 50353
> >> 2012-05-30 16:15:54,421 [Thread-16] INFO
> >> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> >> input paths (combined) to process : 6899
> >> 2012-05-30 16:15:57,844 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201205080931_0179
> >> 
> >> 
> >> 
> >> 
> >> On Thursday, 31.05.2012, at 02:57 -0700, Prashant Kommireddi wrote:
> >>> Hi Markus,
> >>> 
> >>> Have you checked the JobTracker at the time of launching the job that Map
> >>> slots were available?
> >>> 
> >>> Looks like the input dataset size is ~464 GB. Since you mentioned 10 GB
> >>> jobs are running fine, there should be no reason a larger dataset should be
> >>> stuck, at least not on the Pig side. I can't think of a good reason why the job
> >>> does not take off other than the fact that cluster was busy running some
> >>> other job.
> >>> 
> >>> I see that the number of files being processed is large, 50353. That could
> >>> be a reason for slowness, but ~8 minutes as shown in the logs seems to be
> >>> on the higher end for that.
> >>> 
> >>> Maybe also post your script here.
> >>> 
> > 

-- 
Markus Resch
Software Developer 
P: +49 6103-5715-236 | F: +49 6103-5715-111 |
ADTECH GmbH | Robert-Bosch-Str. 32 | 63303 Dreieich | Germany
www.adtech.com

ADTECH | A Division of Advertising.com Group - Residence of the Company:
Dreieich, Germany - Registration Office: Offenbach, HRB 46021
Management Board: Erhard Neumann, Mark Thielen

This message contains privileged and confidential information. Any
dissemination, distribution, copying or other use of this
message or any of its content (data, prices...) to any third parties may
only occur with ADTECH's prior consent.


Re: Job setup for a pig run takes ages

Posted by Julien Le Dem <ju...@ledem.net>.
It is now in trunk (future 0.11)
Julien

On Jun 19, 2012, at 9:35 AM, Thejas Nair wrote:

> Yes, Pig should cache the results of getSchema(). That is something to be fixed.
> Julien is working on some patches to fix the issues related to udfcontext, including reducing the number of times load function gets instantiated.
> 
> -Thejas


Re: Job setup for a pig run takes ages

Posted by Thejas Nair <th...@hortonworks.com>.
Yes, Pig should cache the results of getSchema(). That is something to be
fixed.
Julien is working on some patches to fix the issues related to 
udfcontext, including reducing the number of times load function gets 
instantiated.

-Thejas
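
(For illustration only, a minimal sketch of the kind of front-end caching discussed here: memoize the ResourceSchema per input location so a loader's getSchema() pays the file-listing cost at most once per script compilation. The helper class and its names are hypothetical; ResourceSchema and Job are the real Pig/Hadoop types a LoadMetadata implementation works with.)

import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.mapreduce.Job;
import org.apache.pig.ResourceSchema;

/** Hypothetical helper: memoizes the ResourceSchema per input location so a
 *  loader's getSchema() only pays the file-stat cost once per front-end run. */
public final class SchemaCache {

    private static final ConcurrentHashMap<String, ResourceSchema> CACHE =
            new ConcurrentHashMap<String, ResourceSchema>();

    /** Callback the loader supplies to compute the schema on a cache miss. */
    public interface SchemaComputer {
        ResourceSchema compute(String location, Job job) throws IOException;
    }

    public static ResourceSchema getOrCompute(String location, Job job,
                                              SchemaComputer computer) throws IOException {
        ResourceSchema cached = CACHE.get(location);
        if (cached == null) {
            cached = computer.compute(location, job); // expensive: stats the input files
            ResourceSchema previous = CACHE.putIfAbsent(location, cached);
            if (previous != null) {
                cached = previous;
            }
        }
        return cached;
    }

    private SchemaCache() {}
}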




Re: Job setup for a pig run takes ages

Posted by Alex Rovner <al...@gmail.com>.
Thejas,

Why can't Pig cache the result of getSchema() instead of repeatedly calling the load function? This forces us to put lots of unnecessary workarounds in place, since one can't rely on the UDFContext during those calls; it gets cleared somewhere in between them.

Sent from my iPhone

On Jun 1, 2012, at 12:45 PM, Thejas Nair <th...@hortonworks.com> wrote:

> Can you do a jstack <pid> on the pig client process id a few times and see what it is doing when it is taking so long and send the results ?
> With that we should be able to easily identify what might be happening and suggest a fix.
> 
> I think pig calls getSchema() on load functions more times than actually necessary during the query optimization phase (each time after a transform). One theory I have is that your avro load function does a stat on all the 50k files each time to determine the common schema. This might be causing the delay to generate query plan.
> 
> About the second large lag before queue seems to be time spend in input split calculations, not within pig logic.
> 
> These are just theories I have. The jstack output would help determine if this is actually the case.
> 
> Also, I think this delay is likely to be caused by the number of input files, and not the actual data size. You might also want to look at ways to reduce the actual number of input files. Is it generated by a previous MR job ? If yes, given your cluster size, it works against you to have so many maps or reducers in the previous MR job. If pig is generating the data,  you should consider setting the parallelism (reducers) or the number of maps (see http://pig.apache.org/docs/r0.10.0/perf.html#combine-files) .
> While reducing number of input files will most likely solve your problem, I am still interested in knowing where the delays are coming from and if we can fix something in pig to improve the situation.
> 
> Thanks,
> Thejas
> 
> 
> On 6/1/12 6:34 AM, Markus Resch wrote:
>> Hi Prashant, Hi Thejas,
>> 
>> thanks for your very quick answer.
>> No, this is not a typo. Those time stamps are true and as I said the
>> machines are not very busy during this time.
>> 
>> As this is our test cluster I am sure I am the only one who is running
>> jobs on it. Another issue we have is that we are currently only able to
>> run one job at a time but this shouldn't be the topic of this request.
>> We even have no continuous input stream to that cluster but copied a
>> bunch of data to it some time ago.
>> From my perspective the 464 GB of input data you are mentioned is the
>> uncompressed amount of the 160GByte compressed files. Which I get when I
>> use hadoop -f dus on that folder.
>> 
>> Another interesting fact for you could be that we're running the
>> cloudera CDH3 Update 3 version on our systems.
>> 
>> I suspect this could be due to some fancy avro schema validation
>> implicitly executed by the avro storage? If so, can this be avoided?
>> 
>> Sadly I'm currently not able to provide you the actual script currently
>> as it contains confidential information but I will try to provide you a
>> version as soon as possible. But I'd rather think of a configuration
>> problem to the hadoop or pig anyways as the script works fine with a
>> smaller amount of input data
>> 
>> I would ask the hadoop mailing list if this issue would occur during the
>> actual mapred run but as this occur even before a single mapred job is
>> launched I suspect pig to have a problem.
>> 
>> Thanks
>> Markus
>> 
>> This is the full log until the main work job starts:
>> mapred@ournamenode$ pig OurScript.pig
>> 2012-05-30 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error
>> messages to: /tmp/pig_1338384441037.log
>> 2012-05-30 15:27:21,368 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>> Connecting to hadoop file system at: hdfs://OurNamenode:9000
>> 2012-05-30 15:27:21,609 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>> Connecting to map-reduce job tracker at:
>> dev-jobtracker001.eu-fra.adtech.com:54311
>> 2012-05-30 15:57:27,814 [main] WARN  org.apache.pig.PigServer -
>> Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
>> 2012-05-30 15:57:27,816 [main] INFO
>> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
>> script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
>> 2012-05-30 15:57:27,816 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>> pig.usenewlogicalplan is set to true. New logical plan will be used.
>> 2012-05-30 16:06:55,304 [main] INFO
>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
>> for CampaignInfo: $0, $1, $2, $4, $5, $6, $8, $9
>> 2012-05-30 16:06:55,308 [main] INFO
>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
>> for dataImport: $2, $3, $4
>> 2012-05-30 16:06:55,441 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>> OutputData1:
>> Store(SomeOutputFile1.csv:org.apache.pig.builtin.PigStorage) - scope-521
>> Operator Key: scope-521)
>> 2012-05-30 16:06:55,441 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>> OutputData2:
>> Store(/SomeOutputFile2.csv:org.apache.pig.builtin.PigStorage) -
>> scope-524 Operator Key: scope-524)
>> 2012-05-30 16:06:55,441 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>> OutputData2:
>> Store(/SomeOutputFile3.csv:org.apache.pig.builtin.PigStorage) -
>> scope-483 Operator Key: scope-483)
>> 2012-05-30 16:06:55,453 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
>> - File concatenation threshold: 100 optimistic? false
>> 2012-05-30 16:06:55,467 [main] INFO
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
>> paths to process : 1
>> 2012-05-30 16:06:55,471 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
>> input paths to process : 1
>> 2012-05-30 16:06:55,483 [main] WARN
>> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library
>> is available
>> 2012-05-30 16:06:55,484 [main] INFO
>> org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop
>> library
>> 2012-05-30 16:06:55,484 [main] INFO
>> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library
>> loaded
>> 2012-05-30 16:06:55,486 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
>> input paths (combined) to process : 1
>> 2012-05-30 16:06:55,486 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
>> - number of input files: 1
>> 2012-05-30 16:06:55,516 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 7
>> 2012-05-30 16:06:55,516 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
>> 2012-05-30 16:06:55,516 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
>> 2012-05-30 16:06:55,517 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
>> 2012-05-30 16:06:55,517 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
>> 2012-05-30 16:06:55,517 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 0 out of total 2 MR operators.
>> 2012-05-30 16:06:55,517 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 3
>> 2012-05-30 16:06:56,131 [main] INFO
>> org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
>> added to the job
>> 2012-05-30 16:06:56,138 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
>> 2012-05-30 16:07:00,432 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
>> 2012-05-30 16:07:00,468 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission
>> 2012-05-30 16:07:00,969 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
>> 2012-05-30 16:07:01,460 [Thread-4] INFO
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
>> paths to process : 1
>> 2012-05-30 16:07:01,460 [Thread-4] INFO
>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
>> input paths to process : 1
>> 2012-05-30 16:07:01,462 [Thread-4] INFO
>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
>> input paths (combined) to process : 1
>> 2012-05-30 16:07:02,353 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201205080931_0178
>> 2012-05-30 16:07:02,353 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://ourJobtracker:50030/jobdetails.jsp?jobid=job_201205080931_0178
>> 2012-05-30 16:07:12,421 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 16% complete
>> 2012-05-30 16:07:13,929 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 33% complete
>> 2012-05-30 16:07:16,969 [main] INFO
>> org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
>> added to the job
>> 2012-05-30 16:07:16,970 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
>> 2012-05-30 16:07:26,888 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
>> 2012-05-30 16:07:30,886 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
>> 2012-05-30 16:07:30,886 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 464
>> 2012-05-30 16:07:30,947 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
>> 2012-05-30 16:15:38,022 [Thread-16] INFO
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
>> paths to process : 50353
>> 2012-05-30 16:15:54,421 [Thread-16] INFO
>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
>> input paths (combined) to process : 6899
>> 2012-05-30 16:15:57,844 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201205080931_0179
>> 
>> 
>> 
>> 
>> Am Donnerstag, den 31.05.2012, 02:57 -0700 schrieb Prashant Kommireddi:
>>> Hi Markus,
>>> 
>>> Have you checked the JobTracker at the time of launching the job that Map
>>> slots were available?
>>> 
>>> Looks like the input dataset size is ~464 GB. Since you mentioned 10 GB
>>> jobs are running fine, there should be no reason a larger dataset should be
>>> stuck, atleast not on Pig side. I can't think of a good reason why the job
>>> does not take off other than the fact that cluster was busy running some
>>> other job.
>>> 
>>> I see that the number of files being processed is large, 50353. That could
>>> be a reason for slowness, but ~8 minutes as shown in the logs seems to be
>>> on the higher end for that.
>>> 
>>> May be also post your script here.
>>> 
>>> On Thu, May 31, 2012 at 2:38 AM, Markus Resch<ma...@adtech.de>wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> when we're running a pig job for aggregating some amount of slightly
>>>> compressed avro data (~160GByte), the time until the first actual mapred
>>>> job starts takes ages:
>>>> 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error messages
>>>> to:
>>>> [...]
>>>> 15:57:27,816 [main] INFO  org.apache.pig.tools.pigstats.ScriptState -
>>>> Pig features used in the script:
>>>> [...]
>>>> 16:07:00,969 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>>>> - 0% complete
>>>> [...]
>>>> 16:07:30,886 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
>>>> - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
>>>> [...]
>>>> 16:15:38,022 [Thread-16] INFO
>>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
>>>> paths to process : 50353
>>>> 
>>>> This log messages are from our test cluster which has a dedicated
>>>> jobtracker and namenode each and 5 data nodes with a map task capacity
>>>> of 15 and a reduce task capacity of 10. There were 6899 map tasks and
>>>> 464 reduce tasks set up.
>>>> 
>>>> During the initialisation phase we were observing the work load and
>>>> memory usage of jobtracker, namenode and some data nodes using top.
>>>> Those were nearly all the time kind of bored (e.g. 30% cpu load on the
>>>> namenode, total idle on he data nodes). When the jobs were running most
>>>> of the tasks where in "waiting for IO" most of the time. It seemed there
>>>> was some swapping space reserved but rarely used in those times.
>>>> 
>>>> In our eyes it looks like a hadoop config issue but we have no idea what
>>>> it exaclty could be. Jobs with about 10GBytes of input data were running
>>>> quite well.
>>>> 
>>>> Any hint where to tweak will be appreciated
>>>> 
>>>> Thanks
>>>> Markus
>>>> 
>>>> 
>> 
> 

Re: Job setup for a pig run takes ages

Posted by Thejas Nair <th...@hortonworks.com>.
Can you run jstack <pid> on the Pig client process a few times while it
is taking so long, see what it is doing, and send us the results?
With that we should be able to identify what might be happening and
suggest a fix.
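
For example, a minimal way to collect a few samples while the client
appears to hang (the pid lookup and file names below are just
placeholders):

  # Find the pid of the Pig client JVM and take thread dumps ~30s apart.
  PIG_PID=$(jps -l | awk '/org.apache.pig.Main/ {print $1}' | head -1)
  for i in 1 2 3 4 5; do
      jstack "$PIG_PID" > "/tmp/jstack-${PIG_PID}-${i}.txt"
      sleep 30
  done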

I think Pig calls getSchema() on load functions more times than is
actually necessary during the query optimization phase (each time after
a transform). One theory I have is that your Avro load function does a
stat on all 50k files each time to determine the common schema. That
might be what delays generating the query plan.

The second large lag, just before the job is queued, seems to be time
spent in input split calculation, not within Pig logic.

These are just theories I have. The jstack output would help determine 
if this is actually the case.
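
A quick client-side sanity check of both theories (the path below is a
placeholder for your actual load location):

  # How many files does the load path expand to, and how long does a
  # plain recursive listing take? If this alone takes minutes, the time
  # is going into HDFS metadata calls, not into Pig's planning logic.
  INPUT=/path/to/avro/input
  hadoop fs -count "$INPUT"
  time hadoop fs -lsr "$INPUT" > /dev/null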

Also, I think this delay is likely to be caused by the number of input
files, not the actual data size, so you might want to look at ways to
reduce the number of input files. Is the data generated by a previous
MR job? If yes, given your cluster size, it works against you to have
so many maps or reducers in that previous MR job. If Pig is generating
the data, you should consider setting the parallelism (reducers) or the
number of maps (see
http://pig.apache.org/docs/r0.10.0/perf.html#combine-files).
While reducing the number of input files will most likely solve your
problem, I am still interested in knowing where the delays are coming
from and whether we can fix something in Pig to improve the situation.
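
For illustration only, the split-combination side of that could be
passed to the job roughly like this (the property names are the ones
documented on the combine-files page above; whether PIG_OPTS/-D is the
right channel for your install is an assumption, so treat this as a
sketch rather than a recipe):

  # Let Pig pack many small Avro files into fewer, larger input splits.
  # 268435456 bytes = 256 MB is an arbitrary example value.
  export PIG_OPTS="$PIG_OPTS -Dpig.splitCombination=true -Dpig.maxCombinedSplitSize=268435456"
  pig OurScript.pig

  # The number of files the upstream job writes equals its reducer
  # count, so the other half of the fix is a lower PARALLEL /
  # default_parallel in whatever job produces the Avro data.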

Thanks,
Thejas


On 6/1/12 6:34 AM, Markus Resch wrote:
> Hi Prashant, Hi Thejas,
>
> thanks for your very quick answer.
> No, this is not a typo. Those time stamps are true and as I said the
> machines are not very busy during this time.
>
> As this is our test cluster I am sure I am the only one who is running
> jobs on it. Another issue we have is that we are currently only able to
> run one job at a time but this shouldn't be the topic of this request.
> We even have no continuous input stream to that cluster but copied a
> bunch of data to it some time ago.
>  From my perspective the 464 GB of input data you are mentioned is the
> uncompressed amount of the 160GByte compressed files. Which I get when I
> use hadoop -f dus on that folder.
>
> Another interesting fact for you could be that we're running the
> cloudera CDH3 Update 3 version on our systems.
>
> I suspect this could be due to some fancy avro schema validation
> implicitly executed by the avro storage? If so, can this be avoided?
>
> Sadly I'm currently not able to provide you the actual script currently
> as it contains confidential information but I will try to provide you a
> version as soon as possible. But I'd rather think of a configuration
> problem to the hadoop or pig anyways as the script works fine with a
> smaller amount of input data
>
> I would ask the hadoop mailing list if this issue would occur during the
> actual mapred run but as this occur even before a single mapred job is
> launched I suspect pig to have a problem.
>
> Thanks
> Markus
>
> This is the full log until the main work job starts:
> mapred@ournamenode$ pig OurScript.pig
> 2012-05-30 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error
> messages to: /tmp/pig_1338384441037.log
> 2012-05-30 15:27:21,368 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to hadoop file system at: hdfs://OurNamenode:9000
> 2012-05-30 15:27:21,609 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to map-reduce job tracker at:
> dev-jobtracker001.eu-fra.adtech.com:54311
> 2012-05-30 15:57:27,814 [main] WARN  org.apache.pig.PigServer -
> Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
> 2012-05-30 15:57:27,816 [main] INFO
> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
> script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
> 2012-05-30 15:57:27,816 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> pig.usenewlogicalplan is set to true. New logical plan will be used.
> 2012-05-30 16:06:55,304 [main] INFO
> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
> for CampaignInfo: $0, $1, $2, $4, $5, $6, $8, $9
> 2012-05-30 16:06:55,308 [main] INFO
> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
> for dataImport: $2, $3, $4
> 2012-05-30 16:06:55,441 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
> OutputData1:
> Store(SomeOutputFile1.csv:org.apache.pig.builtin.PigStorage) - scope-521
> Operator Key: scope-521)
> 2012-05-30 16:06:55,441 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
> OutputData2:
> Store(/SomeOutputFile2.csv:org.apache.pig.builtin.PigStorage) -
> scope-524 Operator Key: scope-524)
> 2012-05-30 16:06:55,441 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
> OutputData2:
> Store(/SomeOutputFile3.csv:org.apache.pig.builtin.PigStorage) -
> scope-483 Operator Key: scope-483)
> 2012-05-30 16:06:55,453 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
> - File concatenation threshold: 100 optimistic? false
> 2012-05-30 16:06:55,467 [main] INFO
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> paths to process : 1
> 2012-05-30 16:06:55,471 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> input paths to process : 1
> 2012-05-30 16:06:55,483 [main] WARN
> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library
> is available
> 2012-05-30 16:06:55,484 [main] INFO
> org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop
> library
> 2012-05-30 16:06:55,484 [main] INFO
> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library
> loaded
> 2012-05-30 16:06:55,486 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> input paths (combined) to process : 1
> 2012-05-30 16:06:55,486 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
> - number of input files: 1
> 2012-05-30 16:06:55,516 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 7
> 2012-05-30 16:06:55,516 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
> 2012-05-30 16:06:55,516 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
> 2012-05-30 16:06:55,517 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
> 2012-05-30 16:06:55,517 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
> 2012-05-30 16:06:55,517 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 0 out of total 2 MR operators.
> 2012-05-30 16:06:55,517 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 3
> 2012-05-30 16:06:56,131 [main] INFO
> org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
> added to the job
> 2012-05-30 16:06:56,138 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> 2012-05-30 16:07:00,432 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
> 2012-05-30 16:07:00,468 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission
> 2012-05-30 16:07:00,969 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> 2012-05-30 16:07:01,460 [Thread-4] INFO
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> paths to process : 1
> 2012-05-30 16:07:01,460 [Thread-4] INFO
> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> input paths to process : 1
> 2012-05-30 16:07:01,462 [Thread-4] INFO
> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> input paths (combined) to process : 1
> 2012-05-30 16:07:02,353 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201205080931_0178
> 2012-05-30 16:07:02,353 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://ourJobtracker:50030/jobdetails.jsp?jobid=job_201205080931_0178
> 2012-05-30 16:07:12,421 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 16% complete
> 2012-05-30 16:07:13,929 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 33% complete
> 2012-05-30 16:07:16,969 [main] INFO
> org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
> added to the job
> 2012-05-30 16:07:16,970 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> 2012-05-30 16:07:26,888 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
> 2012-05-30 16:07:30,886 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
> 2012-05-30 16:07:30,886 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 464
> 2012-05-30 16:07:30,947 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
> 2012-05-30 16:15:38,022 [Thread-16] INFO
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> paths to process : 50353
> 2012-05-30 16:15:54,421 [Thread-16] INFO
> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> input paths (combined) to process : 6899
> 2012-05-30 16:15:57,844 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201205080931_0179
>
>
>
>
> Am Donnerstag, den 31.05.2012, 02:57 -0700 schrieb Prashant Kommireddi:
>> Hi Markus,
>>
>> Have you checked the JobTracker at the time of launching the job that Map
>> slots were available?
>>
>> Looks like the input dataset size is ~464 GB. Since you mentioned 10 GB
>> jobs are running fine, there should be no reason a larger dataset should be
>> stuck, atleast not on Pig side. I can't think of a good reason why the job
>> does not take off other than the fact that cluster was busy running some
>> other job.
>>
>> I see that the number of files being processed is large, 50353. That could
>> be a reason for slowness, but ~8 minutes as shown in the logs seems to be
>> on the higher end for that.
>>
>> May be also post your script here.
>>
>> On Thu, May 31, 2012 at 2:38 AM, Markus Resch<ma...@adtech.de>wrote:
>>
>>> Hi all,
>>>
>>> when we're running a pig job for aggregating some amount of slightly
>>> compressed avro data (~160GByte), the time until the first actual mapred
>>> job starts takes ages:
>>> 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error messages
>>> to:
>>> [...]
>>> 15:57:27,816 [main] INFO  org.apache.pig.tools.pigstats.ScriptState -
>>> Pig features used in the script:
>>> [...]
>>> 16:07:00,969 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>>> - 0% complete
>>> [...]
>>> 16:07:30,886 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
>>> - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
>>> [...]
>>> 16:15:38,022 [Thread-16] INFO
>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
>>> paths to process : 50353
>>>
>>> This log messages are from our test cluster which has a dedicated
>>> jobtracker and namenode each and 5 data nodes with a map task capacity
>>> of 15 and a reduce task capacity of 10. There were 6899 map tasks and
>>> 464 reduce tasks set up.
>>>
>>> During the initialisation phase we were observing the work load and
>>> memory usage of jobtracker, namenode and some data nodes using top.
>>> Those were nearly all the time kind of bored (e.g. 30% cpu load on the
>>> namenode, total idle on he data nodes). When the jobs were running most
>>> of the tasks where in "waiting for IO" most of the time. It seemed there
>>> was some swapping space reserved but rarely used in those times.
>>>
>>> In our eyes it looks like a hadoop config issue but we have no idea what
>>> it exaclty could be. Jobs with about 10GBytes of input data were running
>>> quite well.
>>>
>>> Any hint where to tweak will be appreciated
>>>
>>> Thanks
>>> Markus
>>>
>>>
>


Re: Job setup for a pig run takes ages

Posted by Ashutosh Chauhan <ha...@apache.org>.
Hey Markus,

I am also interested in having a look at your Pig script. I think there
is some insight to be gained here.

Thanks,
Ashutosh
On Fri, Jun 1, 2012 at 6:34 AM, Markus Resch <ma...@adtech.de> wrote:

> Hi Prashant, Hi Thejas,
>
> thanks for your very quick answer.
> No, this is not a typo. Those time stamps are true and as I said the
> machines are not very busy during this time.
>
> As this is our test cluster I am sure I am the only one who is running
> jobs on it. Another issue we have is that we are currently only able to
> run one job at a time but this shouldn't be the topic of this request.
> We even have no continuous input stream to that cluster but copied a
> bunch of data to it some time ago.
> From my perspective the 464 GB of input data you are mentioned is the
> uncompressed amount of the 160GByte compressed files. Which I get when I
> use hadoop -f dus on that folder.
>
> Another interesting fact for you could be that we're running the
> cloudera CDH3 Update 3 version on our systems.
>
> I suspect this could be due to some fancy avro schema validation
> implicitly executed by the avro storage? If so, can this be avoided?
>
> Sadly I'm currently not able to provide you the actual script currently
> as it contains confidential information but I will try to provide you a
> version as soon as possible. But I'd rather think of a configuration
> problem to the hadoop or pig anyways as the script works fine with a
> smaller amount of input data
>
> I would ask the hadoop mailing list if this issue would occur during the
> actual mapred run but as this occur even before a single mapred job is
> launched I suspect pig to have a problem.
>
> Thanks
> Markus
>
> This is the full log until the main work job starts:
> mapred@ournamenode$ pig OurScript.pig
> 2012-05-30 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error
> messages to: /tmp/pig_1338384441037.log
> 2012-05-30 15:27:21,368 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to hadoop file system at: hdfs://OurNamenode:9000
> 2012-05-30 15:27:21,609 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to map-reduce job tracker at:
> dev-jobtracker001.eu-fra.adtech.com:54311
> 2012-05-30 15:57:27,814 [main] WARN  org.apache.pig.PigServer -
> Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
> 2012-05-30 15:57:27,816 [main] INFO
> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
> script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
> 2012-05-30 15:57:27,816 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> pig.usenewlogicalplan is set to true. New logical plan will be used.
> 2012-05-30 16:06:55,304 [main] INFO
> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
> for CampaignInfo: $0, $1, $2, $4, $5, $6, $8, $9
> 2012-05-30 16:06:55,308 [main] INFO
> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
> for dataImport: $2, $3, $4
> 2012-05-30 16:06:55,441 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
> OutputData1:
> Store(SomeOutputFile1.csv:org.apache.pig.builtin.PigStorage) - scope-521
> Operator Key: scope-521)
> 2012-05-30 16:06:55,441 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
> OutputData2:
> Store(/SomeOutputFile2.csv:org.apache.pig.builtin.PigStorage) -
> scope-524 Operator Key: scope-524)
> 2012-05-30 16:06:55,441 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
> OutputData2:
> Store(/SomeOutputFile3.csv:org.apache.pig.builtin.PigStorage) -
> scope-483 Operator Key: scope-483)
> 2012-05-30 16:06:55,453 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
> - File concatenation threshold: 100 optimistic? false
> 2012-05-30 16:06:55,467 [main] INFO
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> paths to process : 1
> 2012-05-30 16:06:55,471 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> input paths to process : 1
> 2012-05-30 16:06:55,483 [main] WARN
> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library
> is available
> 2012-05-30 16:06:55,484 [main] INFO
> org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop
> library
> 2012-05-30 16:06:55,484 [main] INFO
> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library
> loaded
> 2012-05-30 16:06:55,486 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> input paths (combined) to process : 1
> 2012-05-30 16:06:55,486 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
> - number of input files: 1
> 2012-05-30 16:06:55,516 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
> - MR plan size before optimization: 7
> 2012-05-30 16:06:55,516 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
> - Merged 2 map-only splittees.
> 2012-05-30 16:06:55,516 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
> - Merged 2 out of total 3 MR operators.
> 2012-05-30 16:06:55,517 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
> - Merged 2 map-only splittees.
> 2012-05-30 16:06:55,517 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
> - Merged 2 out of total 3 MR operators.
> 2012-05-30 16:06:55,517 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
> - Merged 0 out of total 2 MR operators.
> 2012-05-30 16:06:55,517 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
> - MR plan size after optimization: 3
> 2012-05-30 16:06:56,131 [main] INFO
> org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
> added to the job
> 2012-05-30 16:06:56,138 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> 2012-05-30 16:07:00,432 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> - Setting up single store job
> 2012-05-30 16:07:00,468 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 1 map-reduce job(s) waiting for submission
> 2012-05-30 16:07:00,969 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 0% complete
> 2012-05-30 16:07:01,460 [Thread-4] INFO
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> paths to process : 1
> 2012-05-30 16:07:01,460 [Thread-4] INFO
> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> input paths to process : 1
> 2012-05-30 16:07:01,462 [Thread-4] INFO
> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> input paths (combined) to process : 1
> 2012-05-30 16:07:02,353 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - HadoopJobId: job_201205080931_0178
> 2012-05-30 16:07:02,353 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - More information at:
> http://ourJobtracker:50030/jobdetails.jsp?jobid=job_201205080931_0178
> 2012-05-30 16:07:12,421 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 16% complete
> 2012-05-30 16:07:13,929 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 33% complete
> 2012-05-30 16:07:16,969 [main] INFO
> org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
> added to the job
> 2012-05-30 16:07:16,970 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> 2012-05-30 16:07:26,888 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> - Setting up single store job
> 2012-05-30 16:07:30,886 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
> 2012-05-30 16:07:30,886 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> - Neither PARALLEL nor default parallelism is set for this job. Setting
> number of reducers to 464
> 2012-05-30 16:07:30,947 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 1 map-reduce job(s) waiting for submission.
> 2012-05-30 16:15:38,022 [Thread-16] INFO
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> paths to process : 50353
> 2012-05-30 16:15:54,421 [Thread-16] INFO
> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> input paths (combined) to process : 6899
> 2012-05-30 16:15:57,844 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - HadoopJobId: job_201205080931_0179
>
>
>
>
> Am Donnerstag, den 31.05.2012, 02:57 -0700 schrieb Prashant Kommireddi:
> > Hi Markus,
> >
> > Have you checked the JobTracker at the time of launching the job that Map
> > slots were available?
> >
> > Looks like the input dataset size is ~464 GB. Since you mentioned 10 GB
> > jobs are running fine, there should be no reason a larger dataset should
> be
> > stuck, atleast not on Pig side. I can't think of a good reason why the
> job
> > does not take off other than the fact that cluster was busy running some
> > other job.
> >
> > I see that the number of files being processed is large, 50353. That
> could
> > be a reason for slowness, but ~8 minutes as shown in the logs seems to be
> > on the higher end for that.
> >
> > May be also post your script here.
> >
> > On Thu, May 31, 2012 at 2:38 AM, Markus Resch <markus.resch@adtech.de
> >wrote:
> >
> > > Hi all,
> > >
> > > when we're running a pig job for aggregating some amount of slightly
> > > compressed avro data (~160GByte), the time until the first actual
> mapred
> > > job starts takes ages:
> > > 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error messages
> > > to:
> > > [...]
> > > 15:57:27,816 [main] INFO  org.apache.pig.tools.pigstats.ScriptState -
> > > Pig features used in the script:
> > > [...]
> > > 16:07:00,969 [main] INFO
> > >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > > - 0% complete
> > > [...]
> > > 16:07:30,886 [main] INFO
> > >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> > > - BytesPerReducer=1000000000 maxReducers=999
> totalInputFileSize=463325937621
> > > [...]
> > > 16:15:38,022 [Thread-16] INFO
> > > org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> > > paths to process : 50353
> > >
> > > This log messages are from our test cluster which has a dedicated
> > > jobtracker and namenode each and 5 data nodes with a map task capacity
> > > of 15 and a reduce task capacity of 10. There were 6899 map tasks and
> > > 464 reduce tasks set up.
> > >
> > > During the initialisation phase we were observing the work load and
> > > memory usage of jobtracker, namenode and some data nodes using top.
> > > Those were nearly all the time kind of bored (e.g. 30% cpu load on the
> > > namenode, total idle on he data nodes). When the jobs were running most
> > > of the tasks where in "waiting for IO" most of the time. It seemed
> there
> > > was some swapping space reserved but rarely used in those times.
> > >
> > > In our eyes it looks like a hadoop config issue but we have no idea
> what
> > > it exaclty could be. Jobs with about 10GBytes of input data were
> running
> > > quite well.
> > >
> > > Any hint where to tweak will be appreciated
> > >
> > > Thanks
> > > Markus
> > >
> > >
>
> --
> Markus Resch
> Software Developer
> P: +49 6103-5715-236 | F: +49 6103-5715-111 |
> ADTECH GmbH | Robert-Bosch-Str. 32 | 63303 Dreieich | Germany
> www.adtech.com<http://www.adtech.com>
>
> ADTECH | A Division of Advertising.com Group - Residence of the Company:
> Dreieich, Germany - Registration Office: Offenbach, HRB 46021
> Management Board: Erhard Neumann, Mark Thielen
>
> This message contains privileged and confidential information. Any
> dissemination, distribution, copying or other use of this
> message or any of its content (data, prices...) to any third parties may
> only occur with ADTECH's prior consent.
>
>

Re: Job setup for a pig run takes ages

Posted by Markus Resch <ma...@adtech.de>.
Hi Prashant, Hi Thejas,

thanks for your very quick answers.
No, this is not a typo. Those time stamps are real and, as I said, the
machines are not very busy during this time.

As this is our test cluster, I am sure I am the only one running jobs
on it. Another issue we have is that we are currently only able to run
one job at a time, but that shouldn't be the topic of this request. We
don't even have a continuous input stream to that cluster; we just
copied a bunch of data to it some time ago.
From my perspective, the 464 GB of input data you mention is the
uncompressed amount of the 160 GByte of compressed files, which is what
I get when I run hadoop fs -dus on that folder.

Another fact that might interest you: we're running Cloudera CDH3
Update 3 on our systems.

I suspect this could be due to some fancy Avro schema validation
implicitly executed by the Avro storage. If so, can it be avoided?

Sadly, I'm currently not able to provide the actual script as it
contains confidential information, but I will try to provide a version
as soon as possible. I'd rather suspect a configuration problem in
Hadoop or Pig anyway, as the script works fine with a smaller amount of
input data.

I would have asked the Hadoop mailing list if this issue occurred
during the actual mapred run, but as it occurs even before a single
mapred job is launched, I suspect the problem is on the Pig side.

Thanks
Markus

This is the full log until the main work job starts:
mapred@ournamenode$ pig OurScript.pig
2012-05-30 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error
messages to: /tmp/pig_1338384441037.log
2012-05-30 15:27:21,368 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
Connecting to hadoop file system at: hdfs://OurNamenode:9000
2012-05-30 15:27:21,609 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
Connecting to map-reduce job tracker at:
dev-jobtracker001.eu-fra.adtech.com:54311
2012-05-30 15:57:27,814 [main] WARN  org.apache.pig.PigServer -
Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
2012-05-30 15:57:27,816 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
2012-05-30 15:57:27,816 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
pig.usenewlogicalplan is set to true. New logical plan will be used.
2012-05-30 16:06:55,304 [main] INFO
org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
for CampaignInfo: $0, $1, $2, $4, $5, $6, $8, $9
2012-05-30 16:06:55,308 [main] INFO
org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
for dataImport: $2, $3, $4
2012-05-30 16:06:55,441 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
OutputData1:
Store(SomeOutputFile1.csv:org.apache.pig.builtin.PigStorage) - scope-521
Operator Key: scope-521)
2012-05-30 16:06:55,441 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
OutputData2:
Store(/SomeOutputFile2.csv:org.apache.pig.builtin.PigStorage) -
scope-524 Operator Key: scope-524)
2012-05-30 16:06:55,441 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
OutputData2:
Store(/SomeOutputFile3.csv:org.apache.pig.builtin.PigStorage) -
scope-483 Operator Key: scope-483)
2012-05-30 16:06:55,453 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
- File concatenation threshold: 100 optimistic? false
2012-05-30 16:06:55,467 [main] INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
paths to process : 1
2012-05-30 16:06:55,471 [main] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
input paths to process : 1
2012-05-30 16:06:55,483 [main] WARN
org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library
is available
2012-05-30 16:06:55,484 [main] INFO
org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop
library
2012-05-30 16:06:55,484 [main] INFO
org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library
loaded
2012-05-30 16:06:55,486 [main] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
input paths (combined) to process : 1
2012-05-30 16:06:55,486 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
- number of input files: 1
2012-05-30 16:06:55,516 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 7
2012-05-30 16:06:55,516 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
2012-05-30 16:06:55,516 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
2012-05-30 16:06:55,517 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
2012-05-30 16:06:55,517 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
2012-05-30 16:06:55,517 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 0 out of total 2 MR operators.
2012-05-30 16:06:55,517 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 3
2012-05-30 16:06:56,131 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
added to the job
2012-05-30 16:06:56,138 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-05-30 16:07:00,432 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-05-30 16:07:00,468 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission
2012-05-30 16:07:00,969 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2012-05-30 16:07:01,460 [Thread-4] INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
paths to process : 1
2012-05-30 16:07:01,460 [Thread-4] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
input paths to process : 1
2012-05-30 16:07:01,462 [Thread-4] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
input paths (combined) to process : 1
2012-05-30 16:07:02,353 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201205080931_0178
2012-05-30 16:07:02,353 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://ourJobtracker:50030/jobdetails.jsp?jobid=job_201205080931_0178
2012-05-30 16:07:12,421 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 16% complete
2012-05-30 16:07:13,929 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 33% complete
2012-05-30 16:07:16,969 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
added to the job
2012-05-30 16:07:16,970 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-05-30 16:07:26,888 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-05-30 16:07:30,886 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
2012-05-30 16:07:30,886 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 464
2012-05-30 16:07:30,947 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-05-30 16:15:38,022 [Thread-16] INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
paths to process : 50353
2012-05-30 16:15:54,421 [Thread-16] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
input paths (combined) to process : 6899
2012-05-30 16:15:57,844 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201205080931_0179




Am Donnerstag, den 31.05.2012, 02:57 -0700 schrieb Prashant Kommireddi:
> Hi Markus,
> 
> Have you checked the JobTracker at the time of launching the job that Map
> slots were available?
> 
> Looks like the input dataset size is ~464 GB. Since you mentioned 10 GB
> jobs are running fine, there should be no reason a larger dataset should be
> stuck, atleast not on Pig side. I can't think of a good reason why the job
> does not take off other than the fact that cluster was busy running some
> other job.
> 
> I see that the number of files being processed is large, 50353. That could
> be a reason for slowness, but ~8 minutes as shown in the logs seems to be
> on the higher end for that.
> 
> May be also post your script here.
> 
> On Thu, May 31, 2012 at 2:38 AM, Markus Resch <ma...@adtech.de>wrote:
> 
> > Hi all,
> >
> > when we're running a pig job for aggregating some amount of slightly
> > compressed avro data (~160GByte), the time until the first actual mapred
> > job starts takes ages:
> > 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error messages
> > to:
> > [...]
> > 15:57:27,816 [main] INFO  org.apache.pig.tools.pigstats.ScriptState -
> > Pig features used in the script:
> > [...]
> > 16:07:00,969 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > - 0% complete
> > [...]
> > 16:07:30,886 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> > - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
> > [...]
> > 16:15:38,022 [Thread-16] INFO
> > org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> > paths to process : 50353
> >
> > This log messages are from our test cluster which has a dedicated
> > jobtracker and namenode each and 5 data nodes with a map task capacity
> > of 15 and a reduce task capacity of 10. There were 6899 map tasks and
> > 464 reduce tasks set up.
> >
> > During the initialisation phase we were observing the work load and
> > memory usage of jobtracker, namenode and some data nodes using top.
> > Those were nearly all the time kind of bored (e.g. 30% cpu load on the
> > namenode, total idle on he data nodes). When the jobs were running most
> > of the tasks where in "waiting for IO" most of the time. It seemed there
> > was some swapping space reserved but rarely used in those times.
> >
> > In our eyes it looks like a hadoop config issue but we have no idea what
> > it exaclty could be. Jobs with about 10GBytes of input data were running
> > quite well.
> >
> > Any hint where to tweak will be appreciated
> >
> > Thanks
> > Markus
> >
> >

-- 
Markus Resch
Software Developer 
P: +49 6103-5715-236 | F: +49 6103-5715-111 |
ADTECH GmbH | Robert-Bosch-Str. 32 | 63303 Dreieich | Germany
www.adtech.com<http://www.adtech.com>

ADTECH | A Division of Advertising.com Group - Residence of the Company:
Dreieich, Germany - Registration Office: Offenbach, HRB 46021
Management Board: Erhard Neumann, Mark Thielen

This message contains privileged and confidential information. Any
dissemination, distribution, copying or other use of this
message or any of its content (data, prices...) to any third parties may
only occur with ADTECH's prior consent.


Re: Job setup for a pig run takes ages

Posted by Prashant Kommireddi <pr...@gmail.com>.
Hi Markus,

Did you check on the JobTracker, at the time the job was launched,
whether map slots were available?
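
For instance (illustrative only; the JobTracker web UI address is the
one already in your logs):

  # Was anything else holding the slots while the Pig client appeared
  # to hang?
  hadoop job -list      # jobs currently known to the JobTracker
  # or check the cluster summary / occupied map slots on the JT web UI:
  # http://ourJobtracker:50030/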

Looks like the input dataset size is ~464 GB. Since you mentioned that
10 GB jobs run fine, there should be no reason a larger dataset gets
stuck, at least not on the Pig side. I can't think of a good reason why
the job does not take off, other than the cluster being busy running
some other job.

I see that the number of files being processed is large, 50353. That could
be a reason for slowness, but ~8 minutes as shown in the logs seems to be
on the higher end for that.

Maybe also post your script here.

On Thu, May 31, 2012 at 2:38 AM, Markus Resch <ma...@adtech.de>wrote:

> Hi all,
>
> when we're running a pig job for aggregating some amount of slightly
> compressed avro data (~160GByte), the time until the first actual mapred
> job starts takes ages:
> 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error messages
> to:
> [...]
> 15:57:27,816 [main] INFO  org.apache.pig.tools.pigstats.ScriptState -
> Pig features used in the script:
> [...]
> 16:07:00,969 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 0% complete
> [...]
> 16:07:30,886 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
> [...]
> 16:15:38,022 [Thread-16] INFO
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> paths to process : 50353
>
> This log messages are from our test cluster which has a dedicated
> jobtracker and namenode each and 5 data nodes with a map task capacity
> of 15 and a reduce task capacity of 10. There were 6899 map tasks and
> 464 reduce tasks set up.
>
> During the initialisation phase we were observing the work load and
> memory usage of jobtracker, namenode and some data nodes using top.
> Those were nearly all the time kind of bored (e.g. 30% cpu load on the
> namenode, total idle on he data nodes). When the jobs were running most
> of the tasks where in "waiting for IO" most of the time. It seemed there
> was some swapping space reserved but rarely used in those times.
>
> In our eyes it looks like a hadoop config issue but we have no idea what
> it exaclty could be. Jobs with about 10GBytes of input data were running
> quite well.
>
> Any hint where to tweak will be appreciated
>
> Thanks
> Markus
>
>

Re: Job setup for a pig run takes ages

Posted by Thejas Nair <th...@hortonworks.com>.
Markus,
Can you try running jstack a few times on the Pig client process to see
what it is doing, and let us know?

Also, is this part a typo? Did it actually take 30 minutes to compile
the query?

> 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error messages
> to:
> [...]
> 15:57:27,816 [main] INFO  org.apache.pig.tools.pigstats.ScriptState -

Thanks,
Thejas