You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@pig.apache.org by Sachin Sabbarwal <sa...@gmail.com> on 2015/07/07 11:43:23 UTC

Fwd: Same pig script running slower with Tez as compared with run in Mapred mode

Hi Guys
I'm using Apache Pig version 0.14.0 (r1640057) and  0.5.3 TEZ.
I am running a pig script in following 2 sceniors:
1. Without any data. In this case when run with TEZ mode script runs with
in 1 min. With Mapred it takes around 7 minutes.
2. With little data, In tez mode same pig script takes around 10 mins and
with mapred it takes around 14-15 mins.

I had sent a mail to user@tez.apache.org and got to know that the problem
is that pig auto-reduce isn't coming in to play for few of my scopes and
hence lots of tasks are getting created. This causes the run in tez mode to
become slow. As suggested i tried setting pig.exec.reducers.max to a
smaller value(tried with 150,25 and 10). And with this property set to 10,
my script ran under 4 minutes. Also from logs we could see that auto
-reducer is set to false for few scopes. For example:

2015-07-06 14:11:35,123 INFO [AsyncDispatcher event handler]
vertexmanager.ShuffleVertexManager: Shuffle Vertex Manager: settings
minFrac:0.25 maxFrac:0.75* auto:false* desiredTaskIput:104857600 minTasks:1
2015-07-06 14:11:35,123 INFO [AsyncDispatcher event handler]
impl.VertexImpl: Creating 999 for vertex:

Now the problem here is that in my pig.properties i have
set pig.tez.auto.parallelism=true and also in my pig script i have set this
propetry to true. But still we are seeing this getting set to false for few
scopes.

You can find the My conversation with Rajesh from TEZ below. Please lemme
know what i am missing here. Lemme know if any other information is
required.

Thanks in advance





---------- Forwarded message ----------
From: Rajesh Balamohan <ra...@gmail.com>
Date: Tue, Jul 7, 2015 at 2:12 PM
Subject: Re: Same pig script running slower with Tez as compared with run
in Mapred mode
To: user@tez.apache.org


Hi Sachin,

That was just a temporary workaround to ensure that was the issue.  Ideally
user does not need to set this parameter.  Real issue is why auto-reducer
is set to false in certain vertices in pig-tez. Will wait for Pig folks to
chime in.

For doc/tutorial related, you can start off with the following
- http://tez.apache.org/talks.html
- couple of youtube videos are available from hadoop summits and meetups.
-
http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing/
(this is pretty old)
- Pig on tez
http://www.slideshare.net/Hadoop_Summit/pig-on-tez-low-latency-data-processing-with-big-data

~Rajesh.B

On Tue, Jul 7, 2015 at 1:54 PM, Sachin Sabbarwal <sa...@gmail.com>
wrote:

> Hi Rajesh
> Thanks for your response. *This seems to be working for me.*
> By setting pig.exec.reducers.max to 10 i am able to complete my run in
> under 4 mins.(Initially it was running in 14-15 mins).
> I'm new to pig/tez/hadoop world. Do you write any blogs about
> pig/tez/hadoop etc? Can you suggest any tutorials/links to read about tez?
> I need to understand concepts like scope, DAG, parallelism etc. I just
> have a very basic understanding of tez. If i understand all these concepts
> i'll be able to tune my job.
>
> Thanks
>
>
> On Tue, Jul 7, 2015 at 1:23 PM, Rajesh Balamohan <
> rajesh.balamohan@gmail.com> wrote:
>
>> Forgot to add the following.  Ideally auto-reduce implementation should
>> have kicked-in and on need basis, it should have decreased the number of
>> reducers needed.  However, for the vertices of concern (scope-2037 &
>> scope-2162), auto-reducer has been turned off in configuration by Pig and
>> for the rest of the vertices it is turned on.
>>
>> Pig folks would be able to help in terms of providing details on why
>> auto-reduce parallelism is turned off in certain vertices.
>>
>> 2015-07-06 14:11:35,109 INFO [AsyncDispatcher event handler]
>> impl.VertexImpl: Setting vertexManager to ShuffleVertexManager for
>> vertex_1436152736518_0210_1_28* [scope-2037]*
>> 2015-07-06 14:11:35,123 INFO [AsyncDispatcher event handler]
>> vertexmanager.ShuffleVertexManager: Shuffle Vertex Manager: settings
>> minFrac:0.25 maxFrac:0.75* auto:false* desiredTaskIput:104857600
>> minTasks:1
>> 2015-07-06 14:11:35,123 INFO [AsyncDispatcher event handler]
>> impl.VertexImpl: Creating 999 for vertex: vertex_1436152736518_0210_1_28
>> [scope-2037]
>> ....
>>
>> 2015-07-06 14:11:35,245 INFO [AsyncDispatcher event handler]
>> impl.VertexImpl: Setting vertexManager to ShuffleVertexManager for
>> vertex_1436152736518_0210_1_39 *[scope-2162]*
>> 2015-07-06 14:11:35,257 INFO [AsyncDispatcher event handler]
>> vertexmanager.ShuffleVertexManager: Shuffle Vertex Manager: settings
>> minFrac:0.25 maxFrac:0.75 *auto:false* desiredTaskIput:104857600
>> minTasks:1
>> 2015-07-06 14:11:35,257 INFO [AsyncDispatcher event handler]
>> impl.VertexImpl: Creating 999 for vertex: vertex_1436152736518_0210_1_39
>> [scope-2162]
>> ....
>>
>> 2015-07-06 14:11:35,417 INFO [AsyncDispatcher event handler]
>> impl.VertexImpl: Setting user vertex manager plugin:
>> org.apache.tez.dag.library.vertexmanager.ShuffleVertexManager on vertex:*
>> scope-2185*
>> 2015-07-06 14:11:35,419 INFO [AsyncDispatcher event handler]
>> vertexmanager.ShuffleVertexManager: Shuffle Vertex Manager: settings
>> minFrac:0.25 maxFrac:0.75 *auto:true* desiredTaskIput:104857600
>> minTasks:1
>> ...
>>
>>
>> On Tue, Jul 7, 2015 at 12:47 PM, Rajesh Balamohan <
>> rajesh.balamohan@gmail.com> wrote:
>>
>>>
>>> Attaching the DAG and the swimlane for the job.
>>>
>>> scope-2052 which had to give the data to other vertices slowed down (~
>>> 150-180 seconds) due to multiple spills and NumberFormatExceptions in
>>> data.  You might want to try setting
>>> "tez.task.scale.memory.additional-reservation.fraction.max='PARTITIONED_UNSORTED_OUTPUT:12,UNSORTED_INPUT:1,UNSORTED_OUTPUT:12,SORTED_OUTPUT:12,SORTED_MERGED_INPUT:1,PROCESSOR:1,OTHER:1'
>>> " to allocate more memory for unordered outputs.
>>> Following are the details for this scope.
>>> - attempt_1436152736518_0210_1_31_000000_0,
>>> PigLatin:dmwith1tapin.pig-0_scope-0, VertexName: scope-2052,
>>> VertexParallelism: 1,
>>> TaskAttemptID:attempt_1436152736518_0210_1_31_000000_0,
>>> - numInputs=1, numOutputs=4, JVM.maxFree=734527488
>>> - 2015-07-06 14:11:40,047 INFO [TezChild] resources.MemoryDistributor:
>>> Informing: INPUT, scope-546, org.apache.tez.mapreduce.input.MRInput:
>>> requested=0, allocated=0
>>> - Small allocation of ~7 MB allocation to unordered output lead to
>>> multiple spills.
>>> - 2015-07-06 14:11:40,047 INFO [TezChild] resources.MemoryDistributor:
>>> Informing: OUTPUTPUT_RECORDS, scope-2117,
>>> org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput:
>>> requested=268435456, allocated=222303401
>>> - 2015-07-06 14:11:40,047 INFO [TezChild] resources.MemoryDistributor:
>>> Informing: OUTPUT, scope-2251,
>>> org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput:
>>> requested=268435456, allocated=222303401
>>> - 2015-07-06 14:11:40,048 INFO [TezChild] resources.MemoryDistributor:
>>> Informing: OUTPUT, scope-2063,
>>> org.apache.tez.runtime.library.output.UnorderedPartitionedKVOutput:
>>> requested=104857600, allocated=7236438
>>> - 2015-07-06 14:11:40,048 INFO [TezChild] resources.MemoryDistributor:
>>> Informing: OUTPUT, scope-2068,
>>> org.apache.tez.runtime.library.output.UnorderedPartitionedKVOutput:
>>> requested=104857600, allocated=7236438
>>>  - Too many number of records had issues in NumberFormatException
>>> leading to large amount of logs. This dragged the runtime of this task to
>>> around
>>>    e.g "mapReduceLayer.PigHadoopLogger:
>>> java.lang.Class(FIELD_DISCARDED_TYPE_CONVERSION_FAILED): Unable to
>>> interpret value  in field being converted to long, caught
>>> NumberFormatException <empty String> field discarded"
>>>
>>>
>>> - scope-2037 and scope-2162 had set the vertex parallelism to "999"
>>> affecting subsequent execution.
>>> - VertexName: scope-2037, VertexParallelism: 999 vertex:
>>> vertex_1436152736518_0210_1_28 finished in *410* seconds.  Tasks
>>> themselves were small, but due to large number of tasks that had to be
>>> executed in small containers (pretty much used the same container to
>>> execute this) it took time.
>>> - VertexName: scope-2162, VertexParallelism: 999 vertex:
>>> vertex_1436152736518_0210_1_39 finished in *697* seconds. Similar
>>> observation as previous vertex.
>>> *Above 2 vertices have caused the entire job to slow down.*
>>>
>>> "999" is set as the reducer parallelism at compile time by Pig. This is
>>> not for the input. I am not sure how pig sets the parallelism at compile
>>> time.  You can possibly try setting "pig.exec.reducers.max=50" in your case
>>> and give it a try. Pig folks would be in a better position to explain that.
>>>
>>>
>>> On Tue, Jul 7, 2015 at 11:22 AM, Sachin Sabbarwal <
>>> sachinsabbarwal@gmail.com> wrote:
>>>
>>>> 
>>>>  logs.gz
>>>> <https://drive.google.com/file/d/0B-RFcYxUIHzzUVJpRzVDZXB5TUk/view?usp=drive_web>
>>>> Hi Rajesh
>>>>
>>>> PFA the gziped logs.
>>>> FYI It's a single file, when you'll gunzip it, it'll be around 1.5gb in
>>>> size.
>>>> One more thing which you might find useful:
>>>>
>>>> In the dmOutputTez file i could see following line, which suggests that
>>>> TEZ created a total of 7660 tasks. This is surprising as my data is only
>>>> few mbs(10-15 mb max). How is this number of tasks decided? is there any
>>>> property to tune it?
>>>>
>>>> 2015-07-07 05:37:02,647 [Timer-0] INFO
>>>>  org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG
>>>> Status: status=RUNNING, progress=TotalTasks: 7660 Succeeded: 0 Running:
>>>> 0 Failed: 0 Killed: 0, diagnostics=
>>>>
>>>> Thanks
>>>>
>>>> On Mon, Jul 6, 2015 at 8:34 PM, Rajesh Balamohan <
>>>> rajesh.balamohan@gmail.com> wrote:
>>>>
>>>>> yarn logs -applicationId application_1436152736518_0210
>>>>>
>>>>> You can possibly send the output to a log file, gzip it and post it.
>>>>>
>>>>> ~Rajesh.B
>>>>>
>>>>> On Mon, Jul 6, 2015 at 8:12 PM, Sachin Sabbarwal <
>>>>> sachinsabbarwal@gmail.com> wrote:
>>>>>
>>>>>> Hi
>>>>>> Thanks for reply. My tez-site.xml contains only following:
>>>>>>
>>>>>> <configuration>
>>>>>> <property>
>>>>>>   <name>tez.lib.uris</name>
>>>>>>   <value>${fs.defaultFS}/apps/tez-0.5/tez-0.5.3.tar.gz,
>>>>>> ${fs.defaultFS}/apps/tez-0.5/*,${fs.defaultFS}/apps/tez-0.5/lib/*</value>
>>>>>> </property>
>>>>>> </configuration>
>>>>>>
>>>>>> PFA the application logs. Here is the version information:
>>>>>> 1. Hadoop version: Hadoop 2.5.0-cdh5.3.1
>>>>>> 2. Pig: Apache Pig version 0.14.0 (r1640057)
>>>>>> 3. Tez: 0.5.3
>>>>>>
>>>>>> Lemme know if anything else is needed.
>>>>>>
>>>>>> Thanks in advance
>>>>>>
>>>>>> On Mon, Jul 6, 2015 at 7:07 PM, Rajesh Balamohan <
>>>>>> rajesh.balamohan@gmail.com> wrote:
>>>>>>
>>>>>>> Can you post the application logs, tez-site.xml and also the version
>>>>>>> details?
>>>>>>>
>>>>>>> ~Rajesh.B
>>>>>>>
>>>>>>> On Mon, Jul 6, 2015 at 6:38 PM, Sachin Sabbarwal <
>>>>>>> sachinsabbarwal@gmail.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> ---------- Forwarded message ----------
>>>>>>>> From: Sachin Sabbarwal <sa...@gmail.com>
>>>>>>>> Date: Mon, Jul 6, 2015 at 5:34 PM
>>>>>>>> Subject: Same pig script running slower with Tez as compared with
>>>>>>>> run in Mapred mode
>>>>>>>> To: user@tez.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>> Hello Guys
>>>>>>>> Trying Apache Tez.
>>>>>>>> I've setup to use pig in TEZ mode.
>>>>>>>> I'm running a pig script against i) no data and ii) with some data.
>>>>>>>> In case i) when i run with pig using TEZ mode my pig scripts
>>>>>>>> completes run in ~40secs. Whereas when i run case i) with mapred it takes
>>>>>>>> around 7-8 mins.
>>>>>>>> in case ii) when run with pig using TEZ, same pig script takes
>>>>>>>> around 14-15 mins but with mapred it takes around 10 mins.
>>>>>>>> When i'm running same pig script with production data(which is much
>>>>>>>> more than the data i used here to run case i) and (ii) ) the job takes
>>>>>>>> hours to complete.
>>>>>>>> Hence I'm trying tez to run my pig job in a faster mode. I'm not
>>>>>>>> really sure what i might be missing here. Please help, ask for any further
>>>>>>>> info if required.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> --
>>>>>>>> Sachin Sabbarwal
>>>>>>>> Linkedin:
>>>>>>>> https://www.linkedin.com/profile?viewProfile=&key=95777265
>>>>>>>> Facebook: facebook.com/sachinsabbarwal
>>>>>>>> Quora: http://www.quora.com/Sachin-Sabbarwal
>>>>>>>> Blog: http://sachinsabbarwal.tumblr.com/
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sachin Sabbarwal
>>>>>>>> Linkedin:
>>>>>>>> https://www.linkedin.com/profile?viewProfile=&key=95777265
>>>>>>>> Facebook: facebook.com/sachinsabbarwal
>>>>>>>> Quora: http://www.quora.com/Sachin-Sabbarwal
>>>>>>>> Blog: http://sachinsabbarwal.tumblr.com/
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> ~Rajesh.B
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sachin Sabbarwal
>>>>>> Linkedin: https://www.linkedin.com/profile?viewProfile=&key=95777265
>>>>>> Facebook: facebook.com/sachinsabbarwal
>>>>>> Quora: http://www.quora.com/Sachin-Sabbarwal
>>>>>> Blog: http://sachinsabbarwal.tumblr.com/
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> ~Rajesh.B
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Sachin Sabbarwal
>>>> Linkedin: https://www.linkedin.com/profile?viewProfile=&key=95777265
>>>> Facebook: facebook.com/sachinsabbarwal
>>>> Quora: http://www.quora.com/Sachin-Sabbarwal
>>>> Blog: http://sachinsabbarwal.tumblr.com/
>>>>
>>>
>>>
>>>
>>> --
>>> ~Rajesh.B
>>>
>>
>>
>>
>> --
>> ~Rajesh.B
>>
>
>
>
> --
> Sachin Sabbarwal
> Linkedin: https://www.linkedin.com/profile?viewProfile=&key=95777265
> Facebook: facebook.com/sachinsabbarwal
> Quora: http://www.quora.com/Sachin-Sabbarwal
> Blog: http://sachinsabbarwal.tumblr.com/
>



-- 
~Rajesh.B



-- 
Sachin Sabbarwal
Linkedin: https://www.linkedin.com/profile?viewProfile=&key=95777265
Facebook: facebook.com/sachinsabbarwal
Quora: http://www.quora.com/Sachin-Sabbarwal
Blog: http://sachinsabbarwal.tumblr.com/

Re: Same pig script running slower with Tez as compared with run in Mapred mode

Posted by Sachin Sabbarwal <sa...@gmail.com>.

Guys
Couldn't send pig earlier because needed to ask my colleagues if i am
allowed to send them. Please help as my problem is still unresolved.

Thanks in advance

On Wed, Jul 8, 2015 at 11:42 AM, Sachin Sabbarwal <sachinsabbarwal@gmail.com
> wrote:

> Hi Rohini
> Attached zip contains the main.pig and other Macros used in this pig.
> Please lemme know if there is anything else missing/required.
>
> Thanks
>
> On Wed, Jul 8, 2015 at 4:47 AM, Rohini Palaniswamy <
> rohini.aditya@gmail.com> wrote:
>
>> Sachin,
>>     Can you attach your pig script and pig client log as well as I asked
>> earlier?
>>
>> Regards,
>> Rohini
>>
>> On Tue, Jul 7, 2015 at 2:43 AM, Sachin Sabbarwal <
>> sachinsabbarwal@gmail.com>
>> wrote:
>>
>> > Hi Guys
>> > I'm using Apache Pig version 0.14.0 (r1640057) and  0.5.3 TEZ.
>> > I am running a pig script in following 2 sceniors:
>> > 1. Without any data. In this case when run with TEZ mode script runs
>> with
>> > in 1 min. With Mapred it takes around 7 minutes.
>> > 2. With little data, In tez mode same pig script takes around 10 mins
>> and
>> > with mapred it takes around 14-15 mins.
>> >
>> > I had sent a mail to user@tez.apache.org and got to know that the
>> problem
>> > is that pig auto-reduce isn't coming in to play for few of my scopes and
>> > hence lots of tasks are getting created. This causes the run in tez
>> mode to
>> > become slow. As suggested i tried setting pig.exec.reducers.max to a
>> > smaller value(tried with 150,25 and 10). And with this property set to
>> 10,
>> > my script ran under 4 minutes. Also from logs we could see that auto
>> > -reducer is set to false for few scopes. For example:
>> >
>> > 2015-07-06 14:11:35,123 INFO [AsyncDispatcher event handler]
>> > vertexmanager.ShuffleVertexManager: Shuffle Vertex Manager: settings
>> > minFrac:0.25 maxFrac:0.75* auto:false* desiredTaskIput:104857600
>> minTasks:1
>> > 2015-07-06 14:11:35,123 INFO [AsyncDispatcher event handler]
>> > impl.VertexImpl: Creating 999 for vertex:
>> >
>> > Now the problem here is that in my pig.properties i have
>> > set pig.tez.auto.parallelism=true and also in my pig script i have set
>> this
>> > propetry to true. But still we are seeing this getting set to false for
>> few
>> > scopes.
>> >
>> > You can find the My conversation with Rajesh from TEZ below. Please
>> lemme
>> > know what i am missing here. Lemme know if any other information is
>> > required.
>> >
>> > Thanks in advance
>> >
>> >
>> >
>> >
>> >
>> > ---------- Forwarded message ----------
>> > From: Rajesh Balamohan <ra...@gmail.com>
>> > Date: Tue, Jul 7, 2015 at 2:12 PM
>> > Subject: Re: Same pig script running slower with Tez as compared with
>> run
>> > in Mapred mode
>> > To: user@tez.apache.org
>> >
>> >
>> > Hi Sachin,
>> >
>> > That was just a temporary workaround to ensure that was the issue.
>> Ideally
>> > user does not need to set this parameter.  Real issue is why
>> auto-reducer
>> > is set to false in certain vertices in pig-tez. Will wait for Pig folks
>> to
>> > chime in.
>> >
>> > For doc/tutorial related, you can start off with the following
>> > - http://tez.apache.org/talks.html
>> > - couple of youtube videos are available from hadoop summits and
>> meetups.
>> > -
>> >
>> >
>> http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing/
>> > (this is pretty old)
>> > - Pig on tez
>> >
>> >
>> http://www.slideshare.net/Hadoop_Summit/pig-on-tez-low-latency-data-processing-with-big-data
>> >
>> > ~Rajesh.B
>> >
>> > On Tue, Jul 7, 2015 at 1:54 PM, Sachin Sabbarwal <
>> > sachinsabbarwal@gmail.com>
>> > wrote:
>> >
>> > > Hi Rajesh
>> > > Thanks for your response. *This seems to be working for me.*
>> > > By setting pig.exec.reducers.max to 10 i am able to complete my run in
>> > > under 4 mins.(Initially it was running in 14-15 mins).
>> > > I'm new to pig/tez/hadoop world. Do you write any blogs about
>> > > pig/tez/hadoop etc? Can you suggest any tutorials/links to read about
>> > tez?
>> > > I need to understand concepts like scope, DAG, parallelism etc. I just
>> > > have a very basic understanding of tez. If i understand all these
>> > concepts
>> > > i'll be able to tune my job.
>> > >
>> > > Thanks
>> > >
>> > >
>> > > On Tue, Jul 7, 2015 at 1:23 PM, Rajesh Balamohan <
>> > > rajesh.balamohan@gmail.com> wrote:
>> > >
>> > >> Forgot to add the following.  Ideally auto-reduce implementation
>> should
>> > >> have kicked-in and on need basis, it should have decreased the
>> number of
>> > >> reducers needed.  However, for the vertices of concern (scope-2037 &
>> > >> scope-2162), auto-reducer has been turned off in configuration by Pig
>> > and
>> > >> for the rest of the vertices it is turned on.
>> > >>
>> > >> Pig folks would be able to help in terms of providing details on why
>> > >> auto-reduce parallelism is turned off in certain vertices.
>> > >>
>> > >> 2015-07-06 14:11:35,109 INFO [AsyncDispatcher event handler]
>> > >> impl.VertexImpl: Setting vertexManager to ShuffleVertexManager for
>> > >> vertex_1436152736518_0210_1_28* [scope-2037]*
>> > >> 2015-07-06 14:11:35,123 INFO [AsyncDispatcher event handler]
>> > >> vertexmanager.ShuffleVertexManager: Shuffle Vertex Manager: settings
>> > >> minFrac:0.25 maxFrac:0.75* auto:false* desiredTaskIput:104857600
>> > >> minTasks:1
>> > >> 2015-07-06 14:11:35,123 INFO [AsyncDispatcher event handler]
>> > >> impl.VertexImpl: Creating 999 for vertex:
>> vertex_1436152736518_0210_1_28
>> > >> [scope-2037]
>> > >> ....
>> > >>
>> > >> 2015-07-06 14:11:35,245 INFO [AsyncDispatcher event handler]
>> > >> impl.VertexImpl: Setting vertexManager to ShuffleVertexManager for
>> > >> vertex_1436152736518_0210_1_39 *[scope-2162]*
>> > >> 2015-07-06 14:11:35,257 INFO [AsyncDispatcher event handler]
>> > >> vertexmanager.ShuffleVertexManager: Shuffle Vertex Manager: settings
>> > >> minFrac:0.25 maxFrac:0.75 *auto:false* desiredTaskIput:104857600
>> > >> minTasks:1
>> > >> 2015-07-06 14:11:35,257 INFO [AsyncDispatcher event handler]
>> > >> impl.VertexImpl: Creating 999 for vertex:
>> vertex_1436152736518_0210_1_39
>> > >> [scope-2162]
>> > >> ....
>> > >>
>> > >> 2015-07-06 14:11:35,417 INFO [AsyncDispatcher event handler]
>> > >> impl.VertexImpl: Setting user vertex manager plugin:
>> > >> org.apache.tez.dag.library.vertexmanager.ShuffleVertexManager on
>> > vertex:*
>> > >> scope-2185*
>> > >> 2015-07-06 14:11:35,419 INFO [AsyncDispatcher event handler]
>> > >> vertexmanager.ShuffleVertexManager: Shuffle Vertex Manager: settings
>> > >> minFrac:0.25 maxFrac:0.75 *auto:true* desiredTaskIput:104857600
>> > >> minTasks:1
>> > >> ...
>> > >>
>> > >>
>> > >> On Tue, Jul 7, 2015 at 12:47 PM, Rajesh Balamohan <
>> > >> rajesh.balamohan@gmail.com> wrote:
>> > >>
>> > >>>
>> > >>> Attaching the DAG and the swimlane for the job.
>> > >>>
>> > >>> scope-2052 which had to give the data to other vertices slowed down
>> (~
>> > >>> 150-180 seconds) due to multiple spills and NumberFormatExceptions
>> in
>> > >>> data.  You might want to try setting
>> > >>>
>> >
>> "tez.task.scale.memory.additional-reservation.fraction.max='PARTITIONED_UNSORTED_OUTPUT:12,UNSORTED_INPUT:1,UNSORTED_OUTPUT:12,SORTED_OUTPUT:12,SORTED_MERGED_INPUT:1,PROCESSOR:1,OTHER:1'
>> > >>> " to allocate more memory for unordered outputs.
>> > >>> Following are the details for this scope.
>> > >>> - attempt_1436152736518_0210_1_31_000000_0,
>> > >>> PigLatin:dmwith1tapin.pig-0_scope-0, VertexName: scope-2052,
>> > >>> VertexParallelism: 1,
>> > >>> TaskAttemptID:attempt_1436152736518_0210_1_31_000000_0,
>> > >>> - numInputs=1, numOutputs=4, JVM.maxFree=734527488
>> > >>> - 2015-07-06 14:11:40,047 INFO [TezChild]
>> resources.MemoryDistributor:
>> > >>> Informing: INPUT, scope-546, org.apache.tez.mapreduce.input.MRInput:
>> > >>> requested=0, allocated=0
>> > >>> - Small allocation of ~7 MB allocation to unordered output lead to
>> > >>> multiple spills.
>> > >>> - 2015-07-06 14:11:40,047 INFO [TezChild]
>> resources.MemoryDistributor:
>> > >>> Informing: OUTPUTPUT_RECORDS, scope-2117,
>> > >>> org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput:
>> > >>> requested=268435456, allocated=222303401
>> > >>> - 2015-07-06 14:11:40,047 INFO [TezChild]
>> resources.MemoryDistributor:
>> > >>> Informing: OUTPUT, scope-2251,
>> > >>> org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput:
>> > >>> requested=268435456, allocated=222303401
>> > >>> - 2015-07-06 14:11:40,048 INFO [TezChild]
>> resources.MemoryDistributor:
>> > >>> Informing: OUTPUT, scope-2063,
>> > >>> org.apache.tez.runtime.library.output.UnorderedPartitionedKVOutput:
>> > >>> requested=104857600, allocated=7236438
>> > >>> - 2015-07-06 14:11:40,048 INFO [TezChild]
>> resources.MemoryDistributor:
>> > >>> Informing: OUTPUT, scope-2068,
>> > >>> org.apache.tez.runtime.library.output.UnorderedPartitionedKVOutput:
>> > >>> requested=104857600, allocated=7236438
>> > >>>  - Too many number of records had issues in NumberFormatException
>> > >>> leading to large amount of logs. This dragged the runtime of this
>> task
>> > to
>> > >>> around
>> > >>>    e.g "mapReduceLayer.PigHadoopLogger:
>> > >>> java.lang.Class(FIELD_DISCARDED_TYPE_CONVERSION_FAILED): Unable to
>> > >>> interpret value  in field being converted to long, caught
>> > >>> NumberFormatException <empty String> field discarded"
>> > >>>
>> > >>>
>> > >>> - scope-2037 and scope-2162 had set the vertex parallelism to "999"
>> > >>> affecting subsequent execution.
>> > >>> - VertexName: scope-2037, VertexParallelism: 999 vertex:
>> > >>> vertex_1436152736518_0210_1_28 finished in *410* seconds.  Tasks
>> > >>> themselves were small, but due to large number of tasks that had to
>> be
>> > >>> executed in small containers (pretty much used the same container to
>> > >>> execute this) it took time.
>> > >>> - VertexName: scope-2162, VertexParallelism: 999 vertex:
>> > >>> vertex_1436152736518_0210_1_39 finished in *697* seconds. Similar
>> > >>> observation as previous vertex.
>> > >>> *Above 2 vertices have caused the entire job to slow down.*
>> > >>>
>> > >>> "999" is set as the reducer parallelism at compile time by Pig.
>> This is
>> > >>> not for the input. I am not sure how pig sets the parallelism at
>> > compile
>> > >>> time.  You can possibly try setting "pig.exec.reducers.max=50" in
>> your
>> > case
>> > >>> and give it a try. Pig folks would be in a better position to
>> explain
>> > that.
>> > >>>
>> > >>>
>> > >>> On Tue, Jul 7, 2015 at 11:22 AM, Sachin Sabbarwal <
>> > >>> sachinsabbarwal@gmail.com> wrote:
>> > >>>
>> > >>>> 
>> > >>>>  logs.gz
>> > >>>> <
>> >
>> https://drive.google.com/file/d/0B-RFcYxUIHzzUVJpRzVDZXB5TUk/view?usp=drive_web
>> > >
>> > >>>> Hi Rajesh
>> > >>>>
>> > >>>> PFA the gziped logs.
>> > >>>> FYI It's a single file, when you'll gunzip it, it'll be around
>> 1.5gb
>> > in
>> > >>>> size.
>> > >>>> One more thing which you might find useful:
>> > >>>>
>> > >>>> In the dmOutputTez file i could see following line, which suggests
>> > that
>> > >>>> TEZ created a total of 7660 tasks. This is surprising as my data is
>> > only
>> > >>>> few mbs(10-15 mb max). How is this number of tasks decided? is
>> there
>> > any
>> > >>>> property to tune it?
>> > >>>>
>> > >>>> 2015-07-07 05:37:02,647 [Timer-0] INFO
>> > >>>>  org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG
>> > >>>> Status: status=RUNNING, progress=TotalTasks: 7660 Succeeded: 0
>> > Running:
>> > >>>> 0 Failed: 0 Killed: 0, diagnostics=
>> > >>>>
>> > >>>> Thanks
>> > >>>>
>> > >>>> On Mon, Jul 6, 2015 at 8:34 PM, Rajesh Balamohan <
>> > >>>> rajesh.balamohan@gmail.com> wrote:
>> > >>>>
>> > >>>>> yarn logs -applicationId application_1436152736518_0210
>> > >>>>>
>> > >>>>> You can possibly send the output to a log file, gzip it and post
>> it.
>> > >>>>>
>> > >>>>> ~Rajesh.B
>> > >>>>>
>> > >>>>> On Mon, Jul 6, 2015 at 8:12 PM, Sachin Sabbarwal <
>> > >>>>> sachinsabbarwal@gmail.com> wrote:
>> > >>>>>
>> > >>>>>> Hi
>> > >>>>>> Thanks for reply. My tez-site.xml contains only following:
>> > >>>>>>
>> > >>>>>> <configuration>
>> > >>>>>> <property>
>> > >>>>>>   <name>tez.lib.uris</name>
>> > >>>>>>   <value>${fs.defaultFS}/apps/tez-0.5/tez-0.5.3.tar.gz,
>> > >>>>>>
>> >
>> ${fs.defaultFS}/apps/tez-0.5/*,${fs.defaultFS}/apps/tez-0.5/lib/*</value>
>> > >>>>>> </property>
>> > >>>>>> </configuration>
>> > >>>>>>
>> > >>>>>> PFA the application logs. Here is the version information:
>> > >>>>>> 1. Hadoop version: Hadoop 2.5.0-cdh5.3.1
>> > >>>>>> 2. Pig: Apache Pig version 0.14.0 (r1640057)
>> > >>>>>> 3. Tez: 0.5.3
>> > >>>>>>
>> > >>>>>> Lemme know if anything else is needed.
>> > >>>>>>
>> > >>>>>> Thanks in advance
>> > >>>>>>
>> > >>>>>> On Mon, Jul 6, 2015 at 7:07 PM, Rajesh Balamohan <
>> > >>>>>> rajesh.balamohan@gmail.com> wrote:
>> > >>>>>>
>> > >>>>>>> Can you post the application logs, tez-site.xml and also the
>> > version
>> > >>>>>>> details?
>> > >>>>>>>
>> > >>>>>>> ~Rajesh.B
>> > >>>>>>>
>> > >>>>>>> On Mon, Jul 6, 2015 at 6:38 PM, Sachin Sabbarwal <
>> > >>>>>>> sachinsabbarwal@gmail.com> wrote:
>> > >>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> ---------- Forwarded message ----------
>> > >>>>>>>> From: Sachin Sabbarwal <sa...@gmail.com>
>> > >>>>>>>> Date: Mon, Jul 6, 2015 at 5:34 PM
>> > >>>>>>>> Subject: Same pig script running slower with Tez as compared
>> with
>> > >>>>>>>> run in Mapred mode
>> > >>>>>>>> To: user@tez.apache.org
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> Hello Guys
>> > >>>>>>>> Trying Apache Tez.
>> > >>>>>>>> I've setup to use pig in TEZ mode.
>> > >>>>>>>> I'm running a pig script against i) no data and ii) with some
>> > data.
>> > >>>>>>>> In case i) when i run with pig using TEZ mode my pig scripts
>> > >>>>>>>> completes run in ~40secs. Whereas when i run case i) with
>> mapred
>> > it takes
>> > >>>>>>>> around 7-8 mins.
>> > >>>>>>>> in case ii) when run with pig using TEZ, same pig script takes
>> > >>>>>>>> around 14-15 mins but with mapred it takes around 10 mins.
>> > >>>>>>>> When i'm running same pig script with production data(which is
>> > much
>> > >>>>>>>> more than the data i used here to run case i) and (ii) ) the
>> job
>> > takes
>> > >>>>>>>> hours to complete.
>> > >>>>>>>> Hence I'm trying tez to run my pig job in a faster mode. I'm
>> not
>> > >>>>>>>> really sure what i might be missing here. Please help, ask for
>> > any further
>> > >>>>>>>> info if required.
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> Thanks
>> > >>>>>>>> --
>> > >>>>>>>> Sachin Sabbarwal
>> > >>>>>>>> Linkedin:
>> > >>>>>>>> https://www.linkedin.com/profile?viewProfile=&key=95777265
>> > >>>>>>>> Facebook: facebook.com/sachinsabbarwal
>> > >>>>>>>> Quora: http://www.quora.com/Sachin-Sabbarwal
>> > >>>>>>>> Blog: http://sachinsabbarwal.tumblr.com/
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> --
>> > >>>>>>>> Sachin Sabbarwal
>> > >>>>>>>> Linkedin:
>> > >>>>>>>> https://www.linkedin.com/profile?viewProfile=&key=95777265
>> > >>>>>>>> Facebook: facebook.com/sachinsabbarwal
>> > >>>>>>>> Quora: http://www.quora.com/Sachin-Sabbarwal
>> > >>>>>>>> Blog: http://sachinsabbarwal.tumblr.com/
>> > >>>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> --
>> > >>>>>>> ~Rajesh.B
>> > >>>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> --
>> > >>>>>> Sachin Sabbarwal
>> > >>>>>> Linkedin:
>> > https://www.linkedin.com/profile?viewProfile=&key=95777265
>> > >>>>>> Facebook: facebook.com/sachinsabbarwal
>> > >>>>>> Quora: http://www.quora.com/Sachin-Sabbarwal
>> > >>>>>> Blog: http://sachinsabbarwal.tumblr.com/
>> > >>>>>>
>> > >>>>>
>> > >>>>>
>> > >>>>>
>> > >>>>> --
>> > >>>>> ~Rajesh.B
>> > >>>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>> --
>> > >>>> Sachin Sabbarwal
>> > >>>> Linkedin:
>> https://www.linkedin.com/profile?viewProfile=&key=95777265
>> > >>>> Facebook: facebook.com/sachinsabbarwal
>> > >>>> Quora: http://www.quora.com/Sachin-Sabbarwal
>> > >>>> Blog: http://sachinsabbarwal.tumblr.com/
>> > >>>>
>> > >>>
>> > >>>
>> > >>>
>> > >>> --
>> > >>> ~Rajesh.B
>> > >>>
>> > >>
>> > >>
>> > >>
>> > >> --
>> > >> ~Rajesh.B
>> > >>
>> > >
>> > >
>> > >
>> > > --
>> > > Sachin Sabbarwal
>> > > Linkedin: https://www.linkedin.com/profile?viewProfile=&key=95777265
>> > > Facebook: facebook.com/sachinsabbarwal
>> > > Quora: http://www.quora.com/Sachin-Sabbarwal
>> > > Blog: http://sachinsabbarwal.tumblr.com/
>> > >
>> >
>> >
>> >
>> > --
>> > ~Rajesh.B
>> >
>> >
>> >
>> > --
>> > Sachin Sabbarwal
>> > Linkedin: https://www.linkedin.com/profile?viewProfile=&key=95777265
>> > Facebook: facebook.com/sachinsabbarwal
>> > Quora: http://www.quora.com/Sachin-Sabbarwal
>> > Blog: http://sachinsabbarwal.tumblr.com/
>> >
>>
>
>
>
> --
> Sachin Sabbarwal
> Linkedin: https://www.linkedin.com/profile?viewProfile=&key=95777265
> Facebook: facebook.com/sachinsabbarwal
> Quora: http://www.quora.com/Sachin-Sabbarwal
> Blog: http://sachinsabbarwal.tumblr.com/
>



-- 
Sachin Sabbarwal
Linkedin: https://www.linkedin.com/profile?viewProfile=&key=95777265
Facebook: facebook.com/sachinsabbarwal
Quora: http://www.quora.com/Sachin-Sabbarwal
Blog: http://sachinsabbarwal.tumblr.com/

Re: Same pig script running slower with Tez as compared with run in Mapred mode

Posted by Sachin Sabbarwal <sa...@gmail.com>.

Hi Rohini
Attached zip contains the main.pig and other Macros used in this pig.
Please lemme know if there is anything else missing/required.

Thanks

On Wed, Jul 8, 2015 at 4:47 AM, Rohini Palaniswamy <ro...@gmail.com>
wrote:

> Sachin,
>     Can you attach your pig script and pig client log as well as I asked
> earlier?
>
> Regards,
> Rohini
>
> On Tue, Jul 7, 2015 at 2:43 AM, Sachin Sabbarwal <
> sachinsabbarwal@gmail.com>
> wrote:
>
> > Hi Guys
> > I'm using Apache Pig version 0.14.0 (r1640057) and  0.5.3 TEZ.
> > I am running a pig script in following 2 sceniors:
> > 1. Without any data. In this case when run with TEZ mode script runs with
> > in 1 min. With Mapred it takes around 7 minutes.
> > 2. With little data, In tez mode same pig script takes around 10 mins and
> > with mapred it takes around 14-15 mins.
> >
> > I had sent a mail to user@tez.apache.org and got to know that the
> problem
> > is that pig auto-reduce isn't coming in to play for few of my scopes and
> > hence lots of tasks are getting created. This causes the run in tez mode
> to
> > become slow. As suggested i tried setting pig.exec.reducers.max to a
> > smaller value(tried with 150,25 and 10). And with this property set to
> 10,
> > my script ran under 4 minutes. Also from logs we could see that auto
> > -reducer is set to false for few scopes. For example:
> >
> > 2015-07-06 14:11:35,123 INFO [AsyncDispatcher event handler]
> > vertexmanager.ShuffleVertexManager: Shuffle Vertex Manager: settings
> > minFrac:0.25 maxFrac:0.75* auto:false* desiredTaskIput:104857600
> minTasks:1
> > 2015-07-06 14:11:35,123 INFO [AsyncDispatcher event handler]
> > impl.VertexImpl: Creating 999 for vertex:
> >
> > Now the problem here is that in my pig.properties i have
> > set pig.tez.auto.parallelism=true and also in my pig script i have set
> this
> > propetry to true. But still we are seeing this getting set to false for
> few
> > scopes.
> >
> > You can find the My conversation with Rajesh from TEZ below. Please lemme
> > know what i am missing here. Lemme know if any other information is
> > required.
> >
> > Thanks in advance
> >
> >
> >
> >
> >
> > ---------- Forwarded message ----------
> > From: Rajesh Balamohan <ra...@gmail.com>
> > Date: Tue, Jul 7, 2015 at 2:12 PM
> > Subject: Re: Same pig script running slower with Tez as compared with run
> > in Mapred mode
> > To: user@tez.apache.org
> >
> >
> > Hi Sachin,
> >
> > That was just a temporary workaround to ensure that was the issue.
> Ideally
> > user does not need to set this parameter.  Real issue is why auto-reducer
> > is set to false in certain vertices in pig-tez. Will wait for Pig folks
> to
> > chime in.
> >
> > For doc/tutorial related, you can start off with the following
> > - http://tez.apache.org/talks.html
> > - couple of youtube videos are available from hadoop summits and meetups.
> > -
> >
> >
> http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing/
> > (this is pretty old)
> > - Pig on tez
> >
> >
> http://www.slideshare.net/Hadoop_Summit/pig-on-tez-low-latency-data-processing-with-big-data
> >
> > ~Rajesh.B
> >
> > On Tue, Jul 7, 2015 at 1:54 PM, Sachin Sabbarwal <
> > sachinsabbarwal@gmail.com>
> > wrote:
> >
> > > Hi Rajesh
> > > Thanks for your response. *This seems to be working for me.*
> > > By setting pig.exec.reducers.max to 10 i am able to complete my run in
> > > under 4 mins.(Initially it was running in 14-15 mins).
> > > I'm new to pig/tez/hadoop world. Do you write any blogs about
> > > pig/tez/hadoop etc? Can you suggest any tutorials/links to read about
> > tez?
> > > I need to understand concepts like scope, DAG, parallelism etc. I just
> > > have a very basic understanding of tez. If i understand all these
> > concepts
> > > i'll be able to tune my job.
> > >
> > > Thanks
> > >
> > >
> > > On Tue, Jul 7, 2015 at 1:23 PM, Rajesh Balamohan <
> > > rajesh.balamohan@gmail.com> wrote:
> > >
> > >> Forgot to add the following.  Ideally auto-reduce implementation
> should
> > >> have kicked-in and on need basis, it should have decreased the number
> of
> > >> reducers needed.  However, for the vertices of concern (scope-2037 &
> > >> scope-2162), auto-reducer has been turned off in configuration by Pig
> > and
> > >> for the rest of the vertices it is turned on.
> > >>
> > >> Pig folks would be able to help in terms of providing details on why
> > >> auto-reduce parallelism is turned off in certain vertices.
> > >>
> > >> 2015-07-06 14:11:35,109 INFO [AsyncDispatcher event handler]
> > >> impl.VertexImpl: Setting vertexManager to ShuffleVertexManager for
> > >> vertex_1436152736518_0210_1_28* [scope-2037]*
> > >> 2015-07-06 14:11:35,123 INFO [AsyncDispatcher event handler]
> > >> vertexmanager.ShuffleVertexManager: Shuffle Vertex Manager: settings
> > >> minFrac:0.25 maxFrac:0.75* auto:false* desiredTaskIput:104857600
> > >> minTasks:1
> > >> 2015-07-06 14:11:35,123 INFO [AsyncDispatcher event handler]
> > >> impl.VertexImpl: Creating 999 for vertex:
> vertex_1436152736518_0210_1_28
> > >> [scope-2037]
> > >> ....
> > >>
> > >> 2015-07-06 14:11:35,245 INFO [AsyncDispatcher event handler]
> > >> impl.VertexImpl: Setting vertexManager to ShuffleVertexManager for
> > >> vertex_1436152736518_0210_1_39 *[scope-2162]*
> > >> 2015-07-06 14:11:35,257 INFO [AsyncDispatcher event handler]
> > >> vertexmanager.ShuffleVertexManager: Shuffle Vertex Manager: settings
> > >> minFrac:0.25 maxFrac:0.75 *auto:false* desiredTaskIput:104857600
> > >> minTasks:1
> > >> 2015-07-06 14:11:35,257 INFO [AsyncDispatcher event handler]
> > >> impl.VertexImpl: Creating 999 for vertex:
> vertex_1436152736518_0210_1_39
> > >> [scope-2162]
> > >> ....
> > >>
> > >> 2015-07-06 14:11:35,417 INFO [AsyncDispatcher event handler]
> > >> impl.VertexImpl: Setting user vertex manager plugin:
> > >> org.apache.tez.dag.library.vertexmanager.ShuffleVertexManager on
> > vertex:*
> > >> scope-2185*
> > >> 2015-07-06 14:11:35,419 INFO [AsyncDispatcher event handler]
> > >> vertexmanager.ShuffleVertexManager: Shuffle Vertex Manager: settings
> > >> minFrac:0.25 maxFrac:0.75 *auto:true* desiredTaskIput:104857600
> > >> minTasks:1
> > >> ...
> > >>
> > >>
> > >> On Tue, Jul 7, 2015 at 12:47 PM, Rajesh Balamohan <
> > >> rajesh.balamohan@gmail.com> wrote:
> > >>
> > >>>
> > >>> Attaching the DAG and the swimlane for the job.
> > >>>
> > >>> scope-2052 which had to give the data to other vertices slowed down
> (~
> > >>> 150-180 seconds) due to multiple spills and NumberFormatExceptions in
> > >>> data.  You might want to try setting
> > >>>
> >
> "tez.task.scale.memory.additional-reservation.fraction.max='PARTITIONED_UNSORTED_OUTPUT:12,UNSORTED_INPUT:1,UNSORTED_OUTPUT:12,SORTED_OUTPUT:12,SORTED_MERGED_INPUT:1,PROCESSOR:1,OTHER:1'
> > >>> " to allocate more memory for unordered outputs.
> > >>> Following are the details for this scope.
> > >>> - attempt_1436152736518_0210_1_31_000000_0,
> > >>> PigLatin:dmwith1tapin.pig-0_scope-0, VertexName: scope-2052,
> > >>> VertexParallelism: 1,
> > >>> TaskAttemptID:attempt_1436152736518_0210_1_31_000000_0,
> > >>> - numInputs=1, numOutputs=4, JVM.maxFree=734527488
> > >>> - 2015-07-06 14:11:40,047 INFO [TezChild]
> resources.MemoryDistributor:
> > >>> Informing: INPUT, scope-546, org.apache.tez.mapreduce.input.MRInput:
> > >>> requested=0, allocated=0
> > >>> - Small allocation of ~7 MB allocation to unordered output lead to
> > >>> multiple spills.
> > >>> - 2015-07-06 14:11:40,047 INFO [TezChild]
> resources.MemoryDistributor:
> > >>> Informing: OUTPUTPUT_RECORDS, scope-2117,
> > >>> org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput:
> > >>> requested=268435456, allocated=222303401
> > >>> - 2015-07-06 14:11:40,047 INFO [TezChild]
> resources.MemoryDistributor:
> > >>> Informing: OUTPUT, scope-2251,
> > >>> org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput:
> > >>> requested=268435456, allocated=222303401
> > >>> - 2015-07-06 14:11:40,048 INFO [TezChild]
> resources.MemoryDistributor:
> > >>> Informing: OUTPUT, scope-2063,
> > >>> org.apache.tez.runtime.library.output.UnorderedPartitionedKVOutput:
> > >>> requested=104857600, allocated=7236438
> > >>> - 2015-07-06 14:11:40,048 INFO [TezChild]
> resources.MemoryDistributor:
> > >>> Informing: OUTPUT, scope-2068,
> > >>> org.apache.tez.runtime.library.output.UnorderedPartitionedKVOutput:
> > >>> requested=104857600, allocated=7236438
> > >>>  - Too many number of records had issues in NumberFormatException
> > >>> leading to large amount of logs. This dragged the runtime of this
> task
> > to
> > >>> around
> > >>>    e.g "mapReduceLayer.PigHadoopLogger:
> > >>> java.lang.Class(FIELD_DISCARDED_TYPE_CONVERSION_FAILED): Unable to
> > >>> interpret value  in field being converted to long, caught
> > >>> NumberFormatException <empty String> field discarded"
> > >>>
> > >>>
> > >>> - scope-2037 and scope-2162 had set the vertex parallelism to "999"
> > >>> affecting subsequent execution.
> > >>> - VertexName: scope-2037, VertexParallelism: 999 vertex:
> > >>> vertex_1436152736518_0210_1_28 finished in *410* seconds.  Tasks
> > >>> themselves were small, but due to large number of tasks that had to
> be
> > >>> executed in small containers (pretty much used the same container to
> > >>> execute this) it took time.
> > >>> - VertexName: scope-2162, VertexParallelism: 999 vertex:
> > >>> vertex_1436152736518_0210_1_39 finished in *697* seconds. Similar
> > >>> observation as previous vertex.
> > >>> *Above 2 vertices have caused the entire job to slow down.*
> > >>>
> > >>> "999" is set as the reducer parallelism at compile time by Pig. This
> is
> > >>> not for the input. I am not sure how pig sets the parallelism at
> > compile
> > >>> time.  You can possibly try setting "pig.exec.reducers.max=50" in
> your
> > case
> > >>> and give it a try. Pig folks would be in a better position to explain
> > that.
> > >>>
> > >>>
> > >>> On Tue, Jul 7, 2015 at 11:22 AM, Sachin Sabbarwal <
> > >>> sachinsabbarwal@gmail.com> wrote:
> > >>>
> > >>>> 
> > >>>>  logs.gz
> > >>>> <
> >
> https://drive.google.com/file/d/0B-RFcYxUIHzzUVJpRzVDZXB5TUk/view?usp=drive_web
> > >
> > >>>> Hi Rajesh
> > >>>>
> > >>>> PFA the gziped logs.
> > >>>> FYI It's a single file, when you'll gunzip it, it'll be around 1.5gb
> > in
> > >>>> size.
> > >>>> One more thing which you might find useful:
> > >>>>
> > >>>> In the dmOutputTez file i could see following line, which suggests
> > that
> > >>>> TEZ created a total of 7660 tasks. This is surprising as my data is
> > only
> > >>>> few mbs(10-15 mb max). How is this number of tasks decided? is there
> > any
> > >>>> property to tune it?
> > >>>>
> > >>>> 2015-07-07 05:37:02,647 [Timer-0] INFO
> > >>>>  org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG
> > >>>> Status: status=RUNNING, progress=TotalTasks: 7660 Succeeded: 0
> > Running:
> > >>>> 0 Failed: 0 Killed: 0, diagnostics=
> > >>>>
> > >>>> Thanks
> > >>>>
> > >>>> On Mon, Jul 6, 2015 at 8:34 PM, Rajesh Balamohan <
> > >>>> rajesh.balamohan@gmail.com> wrote:
> > >>>>
> > >>>>> yarn logs -applicationId application_1436152736518_0210
> > >>>>>
> > >>>>> You can possibly send the output to a log file, gzip it and post
> it.
> > >>>>>
> > >>>>> ~Rajesh.B
> > >>>>>
> > >>>>> On Mon, Jul 6, 2015 at 8:12 PM, Sachin Sabbarwal <
> > >>>>> sachinsabbarwal@gmail.com> wrote:
> > >>>>>
> > >>>>>> Hi
> > >>>>>> Thanks for reply. My tez-site.xml contains only following:
> > >>>>>>
> > >>>>>> <configuration>
> > >>>>>> <property>
> > >>>>>>   <name>tez.lib.uris</name>
> > >>>>>>   <value>${fs.defaultFS}/apps/tez-0.5/tez-0.5.3.tar.gz,
> > >>>>>>
> > ${fs.defaultFS}/apps/tez-0.5/*,${fs.defaultFS}/apps/tez-0.5/lib/*</value>
> > >>>>>> </property>
> > >>>>>> </configuration>
> > >>>>>>
> > >>>>>> PFA the application logs. Here is the version information:
> > >>>>>> 1. Hadoop version: Hadoop 2.5.0-cdh5.3.1
> > >>>>>> 2. Pig: Apache Pig version 0.14.0 (r1640057)
> > >>>>>> 3. Tez: 0.5.3
> > >>>>>>
> > >>>>>> Lemme know if anything else is needed.
> > >>>>>>
> > >>>>>> Thanks in advance
> > >>>>>>
> > >>>>>> On Mon, Jul 6, 2015 at 7:07 PM, Rajesh Balamohan <
> > >>>>>> rajesh.balamohan@gmail.com> wrote:
> > >>>>>>
> > >>>>>>> Can you post the application logs, tez-site.xml and also the
> > version
> > >>>>>>> details?
> > >>>>>>>
> > >>>>>>> ~Rajesh.B
> > >>>>>>>
> > >>>>>>> On Mon, Jul 6, 2015 at 6:38 PM, Sachin Sabbarwal <
> > >>>>>>> sachinsabbarwal@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>>
> > >>>>>>>> ---------- Forwarded message ----------
> > >>>>>>>> From: Sachin Sabbarwal <sa...@gmail.com>
> > >>>>>>>> Date: Mon, Jul 6, 2015 at 5:34 PM
> > >>>>>>>> Subject: Same pig script running slower with Tez as compared
> with
> > >>>>>>>> run in Mapred mode
> > >>>>>>>> To: user@tez.apache.org
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Hello Guys
> > >>>>>>>> Trying Apache Tez.
> > >>>>>>>> I've setup to use pig in TEZ mode.
> > >>>>>>>> I'm running a pig script against i) no data and ii) with some
> > data.
> > >>>>>>>> In case i) when i run with pig using TEZ mode my pig scripts
> > >>>>>>>> completes run in ~40secs. Whereas when i run case i) with mapred
> > it takes
> > >>>>>>>> around 7-8 mins.
> > >>>>>>>> in case ii) when run with pig using TEZ, same pig script takes
> > >>>>>>>> around 14-15 mins but with mapred it takes around 10 mins.
> > >>>>>>>> When i'm running same pig script with production data(which is
> > much
> > >>>>>>>> more than the data i used here to run case i) and (ii) ) the job
> > takes
> > >>>>>>>> hours to complete.
> > >>>>>>>> Hence I'm trying tez to run my pig job in a faster mode. I'm not
> > >>>>>>>> really sure what i might be missing here. Please help, ask for
> > any further
> > >>>>>>>> info if required.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Thanks
> > >>>>>>>> --
> > >>>>>>>> Sachin Sabbarwal
> > >>>>>>>> Linkedin:
> > >>>>>>>> https://www.linkedin.com/profile?viewProfile=&key=95777265
> > >>>>>>>> Facebook: facebook.com/sachinsabbarwal
> > >>>>>>>> Quora: http://www.quora.com/Sachin-Sabbarwal
> > >>>>>>>> Blog: http://sachinsabbarwal.tumblr.com/
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> --
> > >>>>>>>> Sachin Sabbarwal
> > >>>>>>>> Linkedin:
> > >>>>>>>> https://www.linkedin.com/profile?viewProfile=&key=95777265
> > >>>>>>>> Facebook: facebook.com/sachinsabbarwal
> > >>>>>>>> Quora: http://www.quora.com/Sachin-Sabbarwal
> > >>>>>>>> Blog: http://sachinsabbarwal.tumblr.com/
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> ~Rajesh.B
> > >>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> --
> > >>>>>> Sachin Sabbarwal
> > >>>>>> Linkedin:
> > https://www.linkedin.com/profile?viewProfile=&key=95777265
> > >>>>>> Facebook: facebook.com/sachinsabbarwal
> > >>>>>> Quora: http://www.quora.com/Sachin-Sabbarwal
> > >>>>>> Blog: http://sachinsabbarwal.tumblr.com/
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> --
> > >>>>> ~Rajesh.B
> > >>>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> Sachin Sabbarwal
> > >>>> Linkedin:
> https://www.linkedin.com/profile?viewProfile=&key=95777265
> > >>>> Facebook: facebook.com/sachinsabbarwal
> > >>>> Quora: http://www.quora.com/Sachin-Sabbarwal
> > >>>> Blog: http://sachinsabbarwal.tumblr.com/
> > >>>>
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> ~Rajesh.B
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >> ~Rajesh.B
> > >>
> > >
> > >
> > >
> > > --
> > > Sachin Sabbarwal
> > > Linkedin: https://www.linkedin.com/profile?viewProfile=&key=95777265
> > > Facebook: facebook.com/sachinsabbarwal
> > > Quora: http://www.quora.com/Sachin-Sabbarwal
> > > Blog: http://sachinsabbarwal.tumblr.com/
> > >
> >
> >
> >
> > --
> > ~Rajesh.B
> >
> >
> >
> > --
> > Sachin Sabbarwal
> > Linkedin: https://www.linkedin.com/profile?viewProfile=&key=95777265
> > Facebook: facebook.com/sachinsabbarwal
> > Quora: http://www.quora.com/Sachin-Sabbarwal
> > Blog: http://sachinsabbarwal.tumblr.com/
> >
>



-- 
Sachin Sabbarwal
Linkedin: https://www.linkedin.com/profile?viewProfile=&key=95777265
Facebook: facebook.com/sachinsabbarwal
Quora: http://www.quora.com/Sachin-Sabbarwal
Blog: http://sachinsabbarwal.tumblr.com/

Re: Same pig script running slower with Tez as compared with run in Mapred mode

Posted by Rohini Palaniswamy <ro...@gmail.com>.

Sachin,
    Can you attach your pig script and pig client log as well as I asked
earlier?

Regards,
Rohini

On Tue, Jul 7, 2015 at 2:43 AM, Sachin Sabbarwal <sa...@gmail.com>
wrote:

> Hi Guys
> I'm using Apache Pig version 0.14.0 (r1640057) and  0.5.3 TEZ.
> I am running a pig script in following 2 sceniors:
> 1. Without any data. In this case when run with TEZ mode script runs with
> in 1 min. With Mapred it takes around 7 minutes.
> 2. With little data, In tez mode same pig script takes around 10 mins and
> with mapred it takes around 14-15 mins.
>
> I had sent a mail to user@tez.apache.org and got to know that the problem
> is that pig auto-reduce isn't coming in to play for few of my scopes and
> hence lots of tasks are getting created. This causes the run in tez mode to
> become slow. As suggested i tried setting pig.exec.reducers.max to a
> smaller value(tried with 150,25 and 10). And with this property set to 10,
> my script ran under 4 minutes. Also from logs we could see that auto
> -reducer is set to false for few scopes. For example:
>
> 2015-07-06 14:11:35,123 INFO [AsyncDispatcher event handler]
> vertexmanager.ShuffleVertexManager: Shuffle Vertex Manager: settings
> minFrac:0.25 maxFrac:0.75* auto:false* desiredTaskIput:104857600 minTasks:1
> 2015-07-06 14:11:35,123 INFO [AsyncDispatcher event handler]
> impl.VertexImpl: Creating 999 for vertex:
>
> Now the problem here is that in my pig.properties i have
> set pig.tez.auto.parallelism=true and also in my pig script i have set this
> propetry to true. But still we are seeing this getting set to false for few
> scopes.
>
> You can find the My conversation with Rajesh from TEZ below. Please lemme
> know what i am missing here. Lemme know if any other information is
> required.
>
> Thanks in advance
>
>
>
>
>
> ---------- Forwarded message ----------
> From: Rajesh Balamohan <ra...@gmail.com>
> Date: Tue, Jul 7, 2015 at 2:12 PM
> Subject: Re: Same pig script running slower with Tez as compared with run
> in Mapred mode
> To: user@tez.apache.org
>
>
> Hi Sachin,
>
> That was just a temporary workaround to ensure that was the issue.  Ideally
> user does not need to set this parameter.  Real issue is why auto-reducer
> is set to false in certain vertices in pig-tez. Will wait for Pig folks to
> chime in.
>
> For doc/tutorial related, you can start off with the following
> - http://tez.apache.org/talks.html
> - couple of youtube videos are available from hadoop summits and meetups.
> -
>
> http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing/
> (this is pretty old)
> - Pig on tez
>
> http://www.slideshare.net/Hadoop_Summit/pig-on-tez-low-latency-data-processing-with-big-data
>
> ~Rajesh.B
>
> On Tue, Jul 7, 2015 at 1:54 PM, Sachin Sabbarwal <
> sachinsabbarwal@gmail.com>
> wrote:
>
> > Hi Rajesh
> > Thanks for your response. *This seems to be working for me.*
> > By setting pig.exec.reducers.max to 10 i am able to complete my run in
> > under 4 mins.(Initially it was running in 14-15 mins).
> > I'm new to pig/tez/hadoop world. Do you write any blogs about
> > pig/tez/hadoop etc? Can you suggest any tutorials/links to read about
> tez?
> > I need to understand concepts like scope, DAG, parallelism etc. I just
> > have a very basic understanding of tez. If i understand all these
> concepts
> > i'll be able to tune my job.
> >
> > Thanks
> >
> >
> > On Tue, Jul 7, 2015 at 1:23 PM, Rajesh Balamohan <
> > rajesh.balamohan@gmail.com> wrote:
> >
> >> Forgot to add the following.  Ideally auto-reduce implementation should
> >> have kicked-in and on need basis, it should have decreased the number of
> >> reducers needed.  However, for the vertices of concern (scope-2037 &
> >> scope-2162), auto-reducer has been turned off in configuration by Pig
> and
> >> for the rest of the vertices it is turned on.
> >>
> >> Pig folks would be able to help in terms of providing details on why
> >> auto-reduce parallelism is turned off in certain vertices.
> >>
> >> 2015-07-06 14:11:35,109 INFO [AsyncDispatcher event handler]
> >> impl.VertexImpl: Setting vertexManager to ShuffleVertexManager for
> >> vertex_1436152736518_0210_1_28* [scope-2037]*
> >> 2015-07-06 14:11:35,123 INFO [AsyncDispatcher event handler]
> >> vertexmanager.ShuffleVertexManager: Shuffle Vertex Manager: settings
> >> minFrac:0.25 maxFrac:0.75* auto:false* desiredTaskIput:104857600
> >> minTasks:1
> >> 2015-07-06 14:11:35,123 INFO [AsyncDispatcher event handler]
> >> impl.VertexImpl: Creating 999 for vertex: vertex_1436152736518_0210_1_28
> >> [scope-2037]
> >> ....
> >>
> >> 2015-07-06 14:11:35,245 INFO [AsyncDispatcher event handler]
> >> impl.VertexImpl: Setting vertexManager to ShuffleVertexManager for
> >> vertex_1436152736518_0210_1_39 *[scope-2162]*
> >> 2015-07-06 14:11:35,257 INFO [AsyncDispatcher event handler]
> >> vertexmanager.ShuffleVertexManager: Shuffle Vertex Manager: settings
> >> minFrac:0.25 maxFrac:0.75 *auto:false* desiredTaskIput:104857600
> >> minTasks:1
> >> 2015-07-06 14:11:35,257 INFO [AsyncDispatcher event handler]
> >> impl.VertexImpl: Creating 999 for vertex: vertex_1436152736518_0210_1_39
> >> [scope-2162]
> >> ....
> >>
> >> 2015-07-06 14:11:35,417 INFO [AsyncDispatcher event handler]
> >> impl.VertexImpl: Setting user vertex manager plugin:
> >> org.apache.tez.dag.library.vertexmanager.ShuffleVertexManager on
> vertex:*
> >> scope-2185*
> >> 2015-07-06 14:11:35,419 INFO [AsyncDispatcher event handler]
> >> vertexmanager.ShuffleVertexManager: Shuffle Vertex Manager: settings
> >> minFrac:0.25 maxFrac:0.75 *auto:true* desiredTaskIput:104857600
> >> minTasks:1
> >> ...
> >>
> >>
> >> On Tue, Jul 7, 2015 at 12:47 PM, Rajesh Balamohan <
> >> rajesh.balamohan@gmail.com> wrote:
> >>
> >>>
> >>> Attaching the DAG and the swimlane for the job.
> >>>
> >>> scope-2052 which had to give the data to other vertices slowed down (~
> >>> 150-180 seconds) due to multiple spills and NumberFormatExceptions in
> >>> data.  You might want to try setting
> >>>
> "tez.task.scale.memory.additional-reservation.fraction.max='PARTITIONED_UNSORTED_OUTPUT:12,UNSORTED_INPUT:1,UNSORTED_OUTPUT:12,SORTED_OUTPUT:12,SORTED_MERGED_INPUT:1,PROCESSOR:1,OTHER:1'
> >>> " to allocate more memory for unordered outputs.
> >>> Following are the details for this scope.
> >>> - attempt_1436152736518_0210_1_31_000000_0,
> >>> PigLatin:dmwith1tapin.pig-0_scope-0, VertexName: scope-2052,
> >>> VertexParallelism: 1,
> >>> TaskAttemptID:attempt_1436152736518_0210_1_31_000000_0,
> >>> - numInputs=1, numOutputs=4, JVM.maxFree=734527488
> >>> - 2015-07-06 14:11:40,047 INFO [TezChild] resources.MemoryDistributor:
> >>> Informing: INPUT, scope-546, org.apache.tez.mapreduce.input.MRInput:
> >>> requested=0, allocated=0
> >>> - Small allocation of ~7 MB allocation to unordered output lead to
> >>> multiple spills.
> >>> - 2015-07-06 14:11:40,047 INFO [TezChild] resources.MemoryDistributor:
> >>> Informing: OUTPUTPUT_RECORDS, scope-2117,
> >>> org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput:
> >>> requested=268435456, allocated=222303401
> >>> - 2015-07-06 14:11:40,047 INFO [TezChild] resources.MemoryDistributor:
> >>> Informing: OUTPUT, scope-2251,
> >>> org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput:
> >>> requested=268435456, allocated=222303401
> >>> - 2015-07-06 14:11:40,048 INFO [TezChild] resources.MemoryDistributor:
> >>> Informing: OUTPUT, scope-2063,
> >>> org.apache.tez.runtime.library.output.UnorderedPartitionedKVOutput:
> >>> requested=104857600, allocated=7236438
> >>> - 2015-07-06 14:11:40,048 INFO [TezChild] resources.MemoryDistributor:
> >>> Informing: OUTPUT, scope-2068,
> >>> org.apache.tez.runtime.library.output.UnorderedPartitionedKVOutput:
> >>> requested=104857600, allocated=7236438
> >>>  - Too many number of records had issues in NumberFormatException
> >>> leading to large amount of logs. This dragged the runtime of this task
> to
> >>> around
> >>>    e.g "mapReduceLayer.PigHadoopLogger:
> >>> java.lang.Class(FIELD_DISCARDED_TYPE_CONVERSION_FAILED): Unable to
> >>> interpret value  in field being converted to long, caught
> >>> NumberFormatException <empty String> field discarded"
> >>>
> >>>
> >>> - scope-2037 and scope-2162 had set the vertex parallelism to "999"
> >>> affecting subsequent execution.
> >>> - VertexName: scope-2037, VertexParallelism: 999 vertex:
> >>> vertex_1436152736518_0210_1_28 finished in *410* seconds.  Tasks
> >>> themselves were small, but due to large number of tasks that had to be
> >>> executed in small containers (pretty much used the same container to
> >>> execute this) it took time.
> >>> - VertexName: scope-2162, VertexParallelism: 999 vertex:
> >>> vertex_1436152736518_0210_1_39 finished in *697* seconds. Similar
> >>> observation as previous vertex.
> >>> *Above 2 vertices have caused the entire job to slow down.*
> >>>
> >>> "999" is set as the reducer parallelism at compile time by Pig. This is
> >>> not for the input. I am not sure how pig sets the parallelism at
> compile
> >>> time.  You can possibly try setting "pig.exec.reducers.max=50" in your
> case
> >>> and give it a try. Pig folks would be in a better position to explain
> that.
> >>>
> >>>
> >>> On Tue, Jul 7, 2015 at 11:22 AM, Sachin Sabbarwal <
> >>> sachinsabbarwal@gmail.com> wrote:
> >>>
> >>>> 
> >>>>  logs.gz
> >>>> <
> https://drive.google.com/file/d/0B-RFcYxUIHzzUVJpRzVDZXB5TUk/view?usp=drive_web
> >
> >>>> Hi Rajesh
> >>>>
> >>>> PFA the gziped logs.
> >>>> FYI It's a single file, when you'll gunzip it, it'll be around 1.5gb
> in
> >>>> size.
> >>>> One more thing which you might find useful:
> >>>>
> >>>> In the dmOutputTez file i could see following line, which suggests
> that
> >>>> TEZ created a total of 7660 tasks. This is surprising as my data is
> only
> >>>> few mbs(10-15 mb max). How is this number of tasks decided? is there
> any
> >>>> property to tune it?
> >>>>
> >>>> 2015-07-07 05:37:02,647 [Timer-0] INFO
> >>>>  org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG
> >>>> Status: status=RUNNING, progress=TotalTasks: 7660 Succeeded: 0
> Running:
> >>>> 0 Failed: 0 Killed: 0, diagnostics=
> >>>>
> >>>> Thanks
> >>>>
> >>>> On Mon, Jul 6, 2015 at 8:34 PM, Rajesh Balamohan <
> >>>> rajesh.balamohan@gmail.com> wrote:
> >>>>
> >>>>> yarn logs -applicationId application_1436152736518_0210
> >>>>>
> >>>>> You can possibly send the output to a log file, gzip it and post it.
> >>>>>
> >>>>> ~Rajesh.B
> >>>>>
> >>>>> On Mon, Jul 6, 2015 at 8:12 PM, Sachin Sabbarwal <
> >>>>> sachinsabbarwal@gmail.com> wrote:
> >>>>>
> >>>>>> Hi
> >>>>>> Thanks for reply. My tez-site.xml contains only following:
> >>>>>>
> >>>>>> <configuration>
> >>>>>> <property>
> >>>>>>   <name>tez.lib.uris</name>
> >>>>>>   <value>${fs.defaultFS}/apps/tez-0.5/tez-0.5.3.tar.gz,
> >>>>>>
> ${fs.defaultFS}/apps/tez-0.5/*,${fs.defaultFS}/apps/tez-0.5/lib/*</value>
> >>>>>> </property>
> >>>>>> </configuration>
> >>>>>>
> >>>>>> PFA the application logs. Here is the version information:
> >>>>>> 1. Hadoop version: Hadoop 2.5.0-cdh5.3.1
> >>>>>> 2. Pig: Apache Pig version 0.14.0 (r1640057)
> >>>>>> 3. Tez: 0.5.3
> >>>>>>
> >>>>>> Lemme know if anything else is needed.
> >>>>>>
> >>>>>> Thanks in advance
> >>>>>>
> >>>>>> On Mon, Jul 6, 2015 at 7:07 PM, Rajesh Balamohan <
> >>>>>> rajesh.balamohan@gmail.com> wrote:
> >>>>>>
> >>>>>>> Can you post the application logs, tez-site.xml and also the
> version
> >>>>>>> details?
> >>>>>>>
> >>>>>>> ~Rajesh.B
> >>>>>>>
> >>>>>>> On Mon, Jul 6, 2015 at 6:38 PM, Sachin Sabbarwal <
> >>>>>>> sachinsabbarwal@gmail.com> wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>> ---------- Forwarded message ----------
> >>>>>>>> From: Sachin Sabbarwal <sa...@gmail.com>
> >>>>>>>> Date: Mon, Jul 6, 2015 at 5:34 PM
> >>>>>>>> Subject: Same pig script running slower with Tez as compared with
> >>>>>>>> run in Mapred mode
> >>>>>>>> To: user@tez.apache.org
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hello Guys
> >>>>>>>> Trying Apache Tez.
> >>>>>>>> I've setup to use pig in TEZ mode.
> >>>>>>>> I'm running a pig script against i) no data and ii) with some
> data.
> >>>>>>>> In case i) when i run with pig using TEZ mode my pig scripts
> >>>>>>>> completes run in ~40secs. Whereas when i run case i) with mapred
> it takes
> >>>>>>>> around 7-8 mins.
> >>>>>>>> in case ii) when run with pig using TEZ, same pig script takes
> >>>>>>>> around 14-15 mins but with mapred it takes around 10 mins.
> >>>>>>>> When i'm running same pig script with production data(which is
> much
> >>>>>>>> more than the data i used here to run case i) and (ii) ) the job
> takes
> >>>>>>>> hours to complete.
> >>>>>>>> Hence I'm trying tez to run my pig job in a faster mode. I'm not
> >>>>>>>> really sure what i might be missing here. Please help, ask for
> any further
> >>>>>>>> info if required.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> --
> >>>>>>>> Sachin Sabbarwal
> >>>>>>>> Linkedin:
> >>>>>>>> https://www.linkedin.com/profile?viewProfile=&key=95777265
> >>>>>>>> Facebook: facebook.com/sachinsabbarwal
> >>>>>>>> Quora: http://www.quora.com/Sachin-Sabbarwal
> >>>>>>>> Blog: http://sachinsabbarwal.tumblr.com/
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Sachin Sabbarwal
> >>>>>>>> Linkedin:
> >>>>>>>> https://www.linkedin.com/profile?viewProfile=&key=95777265
> >>>>>>>> Facebook: facebook.com/sachinsabbarwal
> >>>>>>>> Quora: http://www.quora.com/Sachin-Sabbarwal
> >>>>>>>> Blog: http://sachinsabbarwal.tumblr.com/
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> ~Rajesh.B
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Sachin Sabbarwal
> >>>>>> Linkedin:
> https://www.linkedin.com/profile?viewProfile=&key=95777265
> >>>>>> Facebook: facebook.com/sachinsabbarwal
> >>>>>> Quora: http://www.quora.com/Sachin-Sabbarwal
> >>>>>> Blog: http://sachinsabbarwal.tumblr.com/
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> ~Rajesh.B
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Sachin Sabbarwal
> >>>> Linkedin: https://www.linkedin.com/profile?viewProfile=&key=95777265
> >>>> Facebook: facebook.com/sachinsabbarwal
> >>>> Quora: http://www.quora.com/Sachin-Sabbarwal
> >>>> Blog: http://sachinsabbarwal.tumblr.com/
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> ~Rajesh.B
> >>>
> >>
> >>
> >>
> >> --
> >> ~Rajesh.B
> >>
> >
> >
> >
> > --
> > Sachin Sabbarwal
> > Linkedin: https://www.linkedin.com/profile?viewProfile=&key=95777265
> > Facebook: facebook.com/sachinsabbarwal
> > Quora: http://www.quora.com/Sachin-Sabbarwal
> > Blog: http://sachinsabbarwal.tumblr.com/
> >
>
>
>
> --
> ~Rajesh.B
>
>
>
> --
> Sachin Sabbarwal
> Linkedin: https://www.linkedin.com/profile?viewProfile=&key=95777265
> Facebook: facebook.com/sachinsabbarwal
> Quora: http://www.quora.com/Sachin-Sabbarwal
> Blog: http://sachinsabbarwal.tumblr.com/
>