Posted to user@tez.apache.org by Sandeep Kumar <sa...@gmail.com> on 2015/09/03 09:50:41 UTC

Tuning parameters in Tez for improving performance of PIG script

Hi All,

I'm running some basic Pig scripts with Pig 0.14.0 on Tez 0.7.0, but I'm
not seeing any performance gain from Tez: my scripts take the same amount
of time as they do with the mapred executionType.

The following parameters come from mapred-site.xml and are being picked up
by Tez, and I'm not able to override them even when I set them in my
tez-site.xml:

 tez.runtime.shuffle.merge.percent=0.66
 tez.runtime.shuffle.fetch.buffer.percent=0.70
 tez.runtime.io.sort.mb=256
 tez.runtime.shuffle.memory.limit.percent=0.25
 tez.runtime.io.sort.factor=64
 tez.runtime.shuffle.connect.timeout=180000
 tez.runtime.internal.sorter.class=org.apache.hadoop.util.QuickSort
 tez.runtime.merge.progress.records=10000
 tez.runtime.compress=true
 tez.runtime.sort.spill.percent=0.8
 tez.runtime.shuffle.ssl.enable=false
 tez.runtime.ifile.readahead=true
 tez.runtime.shuffle.parallel.copies=10
 tez.runtime.ifile.readahead.bytes=4194304
 tez.runtime.task.input.post-merge.buffer.percent=0.0
 tez.runtime.shuffle.read.timeout=180000
 tez.runtime.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
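For reference, a minimal tez-site.xml override would look like the sketch below. Tez loads tez-site.xml from the client classpath, so the directory containing the file has to be on Pig's classpath for any value set there to take effect; the 512 here is purely illustrative and would need to fit within the task's heap:

    <!-- tez-site.xml (sketch): must be on the Pig/Tez client classpath
         to be picked up; the value is illustrative, not a recommendation. -->
    <configuration>
      <property>
        <name>tez.runtime.io.sort.mb</name>
        <value>512</value>
      </property>
    </configuration>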



PFA the list of task counters. I can see that a lot of data is being
spilled, but if I try to increase tez.runtime.io.sort.mb through
mapred-site.xml, my script terminates with an OOM exception.
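As a rough mental model of why spills and tez.runtime.io.sort.mb are linked (a back-of-the-envelope sketch, not Tez's actual sorter, which also accounts for record metadata and serialization overhead):

```python
import math

def estimate_spills(map_output_mb, sort_mb=256, spill_percent=0.8):
    """Rough spill estimate: the in-memory sort buffer starts flushing to
    disk once it is spill_percent full, so each spill writes roughly
    sort_mb * spill_percent of map output. There is always at least one
    final flush."""
    spill_chunk_mb = sort_mb * spill_percent
    return max(1, math.ceil(map_output_mb / spill_chunk_mb))

# With the settings above (256MB buffer, 0.8 spill threshold), a task
# producing 1GB of map output spills about 5 times; doubling the buffer
# cuts that to 3 -- which is exactly the trade-off against heap/OOM risk.
print(estimate_spills(1024))               # 1024 / 204.8 -> 5
print(estimate_spills(1024, sort_mb=512))  # 1024 / 409.6 -> 3
```

More spills mean more disk I/O and more merge passes (bounded by tez.runtime.io.sort.factor), which is why a larger buffer helps only as long as it still fits in the container's heap.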

Can you please suggest which parameters I should change to improve the
performance of Pig on Tez?

Regards,
Sandeep

Re: Tuning parameters in Tez for improving performance of PIG script

Posted by Rohini Palaniswamy <ro...@gmail.com>.
Sandeep,
   If you are grouping on a lot of keys, then you will need
https://issues.apache.org/jira/browse/PIG-4373, which is only in trunk and
not in any release yet. Without it, performance for grouping on multiple
keys, especially chararray keys, is poor. If your group-by keys give good
reduction, you can also invoke Pig with -Dpig.exec.mapPartAgg=true; that
boosts performance far more than the regular MapReduce Combiner.
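The idea behind in-map partial aggregation can be sketched as follows (a Python illustration of the concept, not Pig's actual implementation): instead of writing every raw record to the sort buffer and combining afterwards, keep a bounded in-memory hash of running aggregates and emit partial results when it fills up.

```python
from collections import defaultdict

def map_side_partial_agg(records, cache_limit=4):
    """Sketch of in-map partial aggregation: accumulate per-key counts in
    a bounded hash; when it overflows, emit the partial aggregates and
    reset. Far fewer records cross the shuffle when keys repeat often."""
    cache = defaultdict(int)
    for key in records:
        cache[key] += 1
        if len(cache) > cache_limit:   # hash full: flush partials, reset
            yield from cache.items()
            cache.clear()
    yield from cache.items()           # final flush

def reduce_counts(partials):
    """Reduce side: summing the partial counts per key gives the same
    totals as counting the raw records."""
    totals = defaultdict(int)
    for key, n in partials:
        totals[key] += n
    return dict(totals)
```

With low key cardinality this collapses most records before they ever hit the sort/spill path, which is why it tends to beat a Combiner (the Combiner only runs after records have been buffered and sorted).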

Regards,
Rohini

On Thu, Sep 3, 2015 at 5:38 AM, Sandeep Kumar <sa...@gmail.com>
wrote:

> Hi Rajesh,
>
> In RawPigLoader we are just loading files from HDFS and creating a map of
> elements, just like a normal PigLoader.
> In the MapSignallingPreProcessor step we are just reading elements from the
> map and creating tuples out of them.
>
> PFA the DAG created by Tez for our job.
>
> While reading records from HDFS files, some fields are occasionally
> missing, which leads to a NumberFormatException. Could these exceptions
> cause any performance issues? There are 286832 of them.
>
> We are reading 4GB of data split into 20 files of 200MB each. I tried
> different file sizes, e.g. 64MB, 100MB, and 128MB, but there is not much
> difference in performance.
>
> Regards,
> Sandeep
>
> On Thu, Sep 3, 2015 at 5:28 PM, Rajesh Balamohan <rb...@apache.org>
> wrote:
>
>> Attaching the job swimlane based on the log you provided. It appears
>> that "SCOPE_37" itself takes a lot of time per task attempt: almost 80%
>> of the time is spent processing "SCOPE_37", and there is not much the
>> other vertex can do apart from waiting for data from the previous
>> vertex.
>>
>> Can you please check whether there is anything expensive in
>> MapSignallingPreProcessor / RawPigLoader?
>>
>> ~Rajesh.B
>>
>> On Thu, Sep 3, 2015 at 5:15 PM, Sandeep Kumar <sa...@gmail.com>
>> wrote:
>>
>>> @Rohini, the following is a step-by-step description of my Pig script.
>>>
>>>
>>> 1. Loading Data from HDFS.
>>> 2. Flattening Map into tuples.
>>> 3. Grouping data over 15 fields.
>>> 4. Flatten grouped data with some additional information.
>>> 5. Store into HDFS.
>>>
>>> The following is a dummy version of my Pig script:
>>>
>>> r360map = LOAD 'input_200MB_each/' using
>>> com.RawPigLoader('conf/R360MapSignalling_new.xml','conf/R360MapSignalling_new.json','csv');
>>>
>>> normalized_map_data = foreach r360map generate
>>> flatten(com.MapSignallingPreProcessor($0..));
>>>
>>> normalized_aggr_data = GROUP normalized_map_data by
>>> (a,b,c,d,e,f,g,h,i,j,k,l,m,n,o);
>>>
>>> normalized_sum_data = foreach normalized_aggr_data generate
>>> flatten(group), COUNT(normalized_map_data),
>>> SUM(normalized_map_data.txn_time);
>>>
>>> store normalized_sum_data into 'tmp/abc' using
>>> com.MapSignallingStorageModel();
>>>
>>> @Rajesh, PFA the output of the command "yarn logs -applicationId appId
>>> | grep "HISTORY" > history.log". Unfortunately the container logs have
>>> been removed by YARN itself, so I could not find the history log files,
>>> as they are created inside the container logs. Let me know if you need
>>> anything more that I can provide.
>>>
>>> Can you please tell me how to get the AM logs? If possible, I'll get
>>> them.
>>>
>>>
>>>
>>> Regards,
>>> Sandeep
>>>
>>>
>>> On Thu, Sep 3, 2015 at 4:37 PM, Rohini Palaniswamy <
>>> rohini.aditya@gmail.com> wrote:
>>>
>>>> Sandeep,
>>>>    What does your Pig script do? If it launches just one MapReduce or
>>>> map-only job doing a simple group-by, there might not be much difference
>>>> beyond container reuse reducing launch overhead, and if parallelism is
>>>> low, containers might not even get reused. Could you attach a dummy
>>>> version of your Pig script, removing or changing all sensitive
>>>> information such as paths or field names?
>>>>
>>>> Regards,
>>>> Rohini
>>>>
>>>> On Thu, Sep 3, 2015 at 4:01 AM, Rajesh Balamohan <rbalamohan@apache.org
>>>> > wrote:
>>>>
>>>>> Is it possible to upload the AM logs alone? That would be helpful.
>>>>>
>>>>> It appears to be a problem with "scope_38_INPUT_scope_37". But without
>>>>> the logs and without knowing the DAG, it would be hard to locate the issue.
>>>>>
>>>>> Otherwise, try "yarn logs -applicationId appId | grep "HISTORY" >
>>>>> history.log".  If you are using SimpleHistoryLoggingService (which is the
>>>>> default), check whether "history.txt" logs are available to share. If you
>>>>> are not sure about the location, check "yarn logs -applicationId appId |
>>>>> grep 'Initializing SimpleHistoryLoggingService, logFileLocation='".
>>>>>
>>>>> ~Rajesh.B
>>>>>
>>>>> On Thu, Sep 3, 2015 at 3:30 PM, Sandeep Kumar <
>>>>> sandeepdas.cse@gmail.com> wrote:
>>>>>
>>>>>> @Rohini, I used the new version of Pig, i.e. 0.15.0; unfortunately,
>>>>>> the performance of my script degraded.
>>>>>> 2015-09-03 15:15:24,698 [main] INFO  org.apache.pig.Main - Pig script
>>>>>> completed in 4 minutes, 1 second and 22 milliseconds (241022 ms)
>>>>>>
>>>>>> whereas earlier it took only 3 minutes and 27 seconds.
>>>>>>
>>>>>> PFA the task counters. The following are the versions of the
>>>>>> software being used:
>>>>>>
>>>>>> HadoopVersion:
>>>>>> 2.6.0-cdh5.4.4
>>>>>>
>>>>>> PigVersion:
>>>>>> 0.15.1-SNAPSHOT
>>>>>>
>>>>>> TezVersion:
>>>>>> 0.7.0
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Sandeep
>>>>>>
>>>>>> On Thu, Sep 3, 2015 at 2:46 PM, Sandeep Kumar <
>>>>>> sandeepdas.cse@gmail.com> wrote:
>>>>>>
>>>>>>> @Rajesh, PFA the required statistics. It's difficult to share the
>>>>>>> application logs because they are huge (167MB). If you want anything
>>>>>>> specific from those logs, please let me know.
>>>>>>>
>>>>>>> @Rohini,
>>>>>>> Thanks for the suggestion regarding the new version of Pig. I'll
>>>>>>> give it a try for sure.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Sandeep
>>>>>>>
>>>>>>> On Thu, Sep 3, 2015 at 2:31 PM, Rohini Palaniswamy <
>>>>>>> rohini.aditya@gmail.com> wrote:
>>>>>>>
>>>>>>>> Sandeep,
>>>>>>>>    Can you try with Pig 0.15 first? There are a ton of fixes for
>>>>>>>> Pig on Tez that have gone into that release, and many of them are
>>>>>>>> performance fixes.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Rohini
>>>>>>>>
>>>>>>>> On Thu, Sep 3, 2015 at 1:05 AM, Rajesh Balamohan <
>>>>>>>> rbalamohan@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Can you post the application logs?  It would be helpful if you
>>>>>>>>> could run with "tez.task.generate.counters.per.io=true". This
>>>>>>>>> would generate per-IO statistics, which are useful for debugging.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ~Rajesh.B
>>>>>>>>>
Re: Tuning parameters in Tez for improving performance of PIG script

Posted by Rohini Palaniswamy <ro...@gmail.com>.
Sandeep,
   What does your pig script do? If the pig script was just launching 1
mapreduce or map only job doing simple group by, there might not be much
difference except for container reuse reducing launch overhead and that too
if parallelism is low, containers might not have to be reused. Can you
attach a dummy version of your pig script removing/changing all sensitive
information like paths or field names.

Regards,
Rohini

On Thu, Sep 3, 2015 at 4:01 AM, Rajesh Balamohan <rb...@apache.org>
wrote:

> Is it possible to upload the AM logs alone?. That would be helpful.
>
> It appears to be a problem with "scope_38_INPUT_scope_37". But without the
> logs and without knowing the DAG, it would be hard to locate the issue.
>
> Otherwise, try "yarn logs -applicationId appId | grep "HISTORY" >
> history.log".  If you have SimpleHistoryLoggingService (which is the
> default), check if "history.txt" logs are available which can be shared. If
> not sure about the location, check  "yarn logs -applicationId appId | |
> grep 'Initializing SimpleHistoryLoggingService, logFileLocation='".
>
> ~Rajesh.B

Re: Tuning parameters in Tez for improving performance of PIG script

Posted by Rajesh Balamohan <rb...@apache.org>.
Is it possible to upload the AM logs alone? That would be helpful.

It appears to be a problem with "scope_38_INPUT_scope_37". But without the
logs and without knowing the DAG, it would be hard to locate the issue.

Otherwise, try "yarn logs -applicationId appId | grep "HISTORY" >
history.log".  If you have SimpleHistoryLoggingService (which is the
default), check whether the "history.txt" logs are available, as those can
be shared. If you are not sure about the location, check "yarn logs
-applicationId appId | grep 'Initializing SimpleHistoryLoggingService,
logFileLocation='".
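Put together, the full sequence might look like this (the application ID
is a placeholder; substitute your own):

```
# Keep only the HISTORY events from the aggregated YARN logs
yarn logs -applicationId <appId> | grep "HISTORY" > history.log

# Find where SimpleHistoryLoggingService wrote its history.txt file
yarn logs -applicationId <appId> \
  | grep 'Initializing SimpleHistoryLoggingService, logFileLocation='
```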

~Rajesh.B

On Thu, Sep 3, 2015 at 3:30 PM, Sandeep Kumar <sa...@gmail.com>
wrote:

> @Rohini, I used new version of pig i.e. 0.15.0 unfortunately the
> performance of my script degraded.
> 2015-09-03 15:15:24,698 [main] INFO  org.apache.pig.Main - Pig script
> completed in 4 minutes, 1 second and 22 milliseconds (241022 ms)
>
> whereas earlier it was taking hardly 3 minutes and 27 seconds.
>
> PFA the task counters. Following are the version of softwares being used:
>
> HadoopVersion:
> 2.6.0-cdh5.4.4
>
> PigVersion:
> 0.15.1-SNAPSHOT
>
> TezVersion:
> 0.7.0
>
>
> Regards,
> Sandeep

Re: Tuning parameters in Tez for improving performance of PIG script

Posted by Sandeep Kumar <sa...@gmail.com>.
@Rohini, I used the new version of Pig, i.e. 0.15.0; unfortunately, the
performance of my script degraded.
2015-09-03 15:15:24,698 [main] INFO  org.apache.pig.Main - Pig script
completed in 4 minutes, 1 second and 22 milliseconds (241022 ms)

whereas earlier it was taking hardly 3 minutes and 27 seconds.

PFA the task counters. Following are the versions of the software being used:

HadoopVersion:
2.6.0-cdh5.4.4

PigVersion:
0.15.1-SNAPSHOT

TezVersion:
0.7.0


Regards,
Sandeep

On Thu, Sep 3, 2015 at 2:46 PM, Sandeep Kumar <sa...@gmail.com>
wrote:

> @Rajesh, PFA the required statistics. Its difficult to share application
> log because they are huge in size(i.e. 167MB). In case you want anything
> specific from those logs then please let me know.
>
> @Rohini,
> Thanks for suggesting regarding new version of Pig. I'll give it a try for
> sure.
>
> Regards,
> Sandeep

Re: Tuning parameters in Tez for improving performance of PIG script

Posted by Sandeep Kumar <sa...@gmail.com>.
@Rajesh, PFA the required statistics. It's difficult to share the
application logs because they are huge (167 MB). In case you want anything
specific from those logs, please let me know.

@Rohini,
Thanks for suggesting the new version of Pig. I'll give it a try for sure.

Regards,
Sandeep

On Thu, Sep 3, 2015 at 2:31 PM, Rohini Palaniswamy <ro...@gmail.com>
wrote:

> Sandeep,
>    Can you try with Pig 0.15 first? There is ton of fixes that has gone in
> for Pig on Tez into that release and many of them are performance fixes.
>
> Regards,
> Rohini

Re: Tuning parameters in Tez for improving performance of PIG script

Posted by Rohini Palaniswamy <ro...@gmail.com>.
Sandeep,
   Can you try with Pig 0.15 first? There are a ton of fixes that have gone
into that release for Pig on Tez, and many of them are performance fixes.

Regards,
Rohini

On Thu, Sep 3, 2015 at 1:05 AM, Rajesh Balamohan <rb...@apache.org>
wrote:

> Can you post the application logs?  It would be helpful if you could run
> with "tez.task.generate.counters.per.io=true". This would generate the
> per IO statistics which can be useful for debugging.
>
>
> ~Rajesh.B

Re: Tuning parameters in Tez for improving performance of PIG script

Posted by Rajesh Balamohan <rb...@apache.org>.
Can you post the application logs?  It would be helpful if you could run
with "tez.task.generate.counters.per.io=true". This would generate the
per-IO statistics, which can be useful for debugging.
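For example, one way to pass the flag is on the pig command line (a
sketch; the script name is a placeholder, and the property can also be set
with "set tez.task.generate.counters.per.io true;" inside the script):

```
# -D Java properties should precede the other arguments to the pig launcher
pig -Dtez.task.generate.counters.per.io=true -x tez yourscript.pig
```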


~Rajesh.B

On Thu, Sep 3, 2015 at 1:20 PM, Sandeep Kumar <sa...@gmail.com>
wrote:

> Hi All,
>
> I'm using Pig-0.14.0 over Tez-0.7.0 for running some basic pig scripts.
> I'm not able to see any performance gain using Tez. My pig scripts are
> taking same amount of time on mapred executionType as well.
>
> Following are the parameters which are in mapred-site.xml and being read
> by Tez and I'm not able to override them even if i mention them in my
> tez-site.xml:
>
>  tez.runtime.shuffle.merge.percent=0.66
>  tez.runtime.shuffle.fetch.buffer.percent=0.70
>  tez.runtime.io.sort.mb=256
>  tez.runtime.shuffle.memory.limit.percent=0.25
>  tez.runtime.io.sort.factor=64
>  tez.runtime.shuffle.connect.timeout=180000
>  tez.runtime.internal.sorter.class=org.apache.hadoop.util.QuickSort
>  tez.runtime.merge.progress.records=10000
>  tez.runtime.compress=true
>  tez.runtime.sort.spill.percent=0.8
>  tez.runtime.shuffle.ssl.enable=false
>  tez.runtime.ifile.readahead=true
>  tez.runtime.shuffle.parallel.copies=10
>  tez.runtime.ifile.readahead.bytes=4194304
>  tez.runtime.task.input.post-merge.buffer.percent=0.0
>  tez.runtime.shuffle.read.timeout=180000
>  tez.runtime.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
>
>
>
> PFA the list of task counter. I can see a lot of data is being spilled but
> if i try to increase tez.runtime.io.sort.mb through mapred-site.xml then
> my script terminates with OOM exception.
>
> Can you please suggest what parameters i should change to improve the
> performance of pig using Tez?
>
> Regards,
> Sandeep
>