Posted to dev@pig.apache.org by Rohini Palaniswamy <ro...@gmail.com> on 2016/07/26 13:57:39 UTC

Re: Can anyone who has experience with PigMix share configuration and expected results?

Let us take just one script, L9, for analysis.
    - What was the failure error/stack trace? We run PigMix with just 1G of
heap, so it cannot be going out of memory.
    - Where were the 6 hours spent? Can you give a breakdown? Are all the
reducer tasks being launched in parallel? For example, if a reducer normally
takes 30 mins but the reducers are launched in 6 waves, it can take 3 hrs.
Try lowering reducer memory from -Xmx3276m to -Xmx2048m or -Xmx1638m if
that is the case.
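
For example (just a sketch; the container size below is illustrative, not a
recommendation), these can also be overridden per script instead of editing
mapred-site.xml:

-- sketch: lower only the reducer heap/container for this run
set mapreduce.reduce.java.opts '-Xmx2048m';
set mapreduce.reduce.memory.mb '3072';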



On Tue, Jul 26, 2016 at 12:18 AM, Zhang, Liyun <li...@intel.com>
wrote:

> Hi all:
>
>   Now I'm using PigMix to test the performance of Pig on Spark (PIG-4937
> <https://issues.apache.org/jira/browse/PIG-4937>). The test data is 1 TB.
> After generating all the test data, I have run a first round of tests in
> MR mode.
>
> The cluster has 8 nodes (each node has 40 cores and 60 GB of memory; 28
> cores and 56 GB are assigned to the NodeManager on each node). In total
> the cluster has 224 cores and 448 GB of memory.
>
>
>
> The snippet of yarn-site.xml:
>
>   <property>
>     <name>yarn.nodemanager.resource.memory-mb</name>
>     <value>57344</value>
>     <description>the amount of memory on the NodeManager in MB</description>
>   </property>
>   <property>
>     <name>yarn.nodemanager.resource.cpu-vcores</name>
>     <value>28</value>
>   </property>
>   <property>
>     <name>yarn.scheduler.minimum-allocation-mb</name>
>     <value>2048</value>
>   </property>
>   <property>
>     <name>yarn.scheduler.maximum-allocation-mb</name>
>     <value>57344</value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.vmem-check-enabled</name>
>     <value>false</value>
>     <description>Whether virtual memory limits will be enforced for containers</description>
>   </property>
>   <property>
>     <name>yarn.nodemanager.vmem-pmem-ratio</name>
>     <value>4</value>
>     <description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
>   </property>
>
>
>
> The snippet of mapred-site.xml is
>
>   <property>
>     <name>mapreduce.map.java.opts</name>
>     <value>-Xmx1638m</value>
>   </property>
>   <property>
>     <name>mapreduce.reduce.java.opts</name>
>     <value>-Xmx3276m</value>
>   </property>
>   <property>
>     <name>mapreduce.map.memory.mb</name>
>     <value>2048</value>
>   </property>
>   <property>
>     <name>mapreduce.reduce.memory.mb</name>
>     <value>4096</value>
>   </property>
>   <property>
>     <name>mapreduce.task.io.sort.mb</name>
>     <value>820</value>
>   </property>
>   <property>
>     <name>mapred.task.timeout</name>
>     <value>1200000</value>
>   </property>
>
>
>
> The snippet of hdfs-site.xml
>
>   <property>
>     <name>dfs.blocksize</name>
>     <value>1124217344</value>
>   </property>
>   <property>
>     <name>dfs.replication</name>
>     <value>1</value>
>   </property>
>   <property>
>     <name>dfs.socket.timeout</name>
>     <value>1200000</value>
>   </property>
>   <property>
>     <name>dfs.datanode.socket.write.timeout</name>
>     <value>1200000</value>
>   </property>
>
>
>
> The result of the last run of PigMix in MR mode is below (L9, L10, L13, L14
> and L17 fail). It shows that the average time spent on one script is nearly
> 6 hours. I don't know whether it really needs that much time to run L1~L17.
> Can anyone who has experience with PigMix share his/her configuration and
> expected results with me?
>
>
>
>
>
> Script    MR (sec)
> L1        21544
> L2        20482
> L3        21629
> L4        20905
> L5        20738
> L6        24131
> L7        21983
> L8        24549
> L9         6585  (Fail)
> L10       22286  (Fail)
> L11       21849
> L12       21266
> L13       11099  (Fail)
> L14          43  (Fail)
> L15       23808
> L16       42889
> L17          10  (Fail)
>
>
>
>
>
>
>
> Kelly Zhang/Zhang,Liyun
>
> Best Regards
>
>
>

RE: Can anyone who has experience with PigMix share configuration and expected results?

Posted by "Zhang, Liyun" <li...@intel.com>.
Hi Rohini:
  I viewed the web UI; all the tasks are executed in parallel.
  After investigating the logs, I found the following points about the L9 failure.
L9.pig
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = order A by query_term parallel 40;
store B into 'L9out';

There are 3 map-reduce jobs (scope-23, scope-26, scope-41) in this case.
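
The map-reduce plan below can be regenerated with Pig's explain facility; a
minimal sketch, assuming the same script and default settings:

# print the logical, physical and map-reduce plans without running the job
pig -x mapreduce -e 'explain -script L9.pig'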
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-23
Map Plan
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp-287979498/tmp1627657499:org.apache.pig.impl.io.InterStorage) - scope-24
|
|---A: New For Each(false,false,false,false,false,false,false,false,false)[bag] - scope-19
    |   |
    |   Project[bytearray][0] - scope-1
    |   |
    |   Project[bytearray][1] - scope-3
    |   |
    |   Project[bytearray][2] - scope-5
    |   |
    |   Project[bytearray][3] - scope-7
    |   |
    |   Project[bytearray][4] - scope-9
    |   |
    |   Project[bytearray][5] - scope-11
    |   |
    |   Project[bytearray][6] - scope-13
    |   |
    |   Project[bytearray][7] - scope-15
    |   |
    |   Project[bytearray][8] - scope-17
    |
    |---A: Load(hdfs://bdpe16.sh.intel.com:8020/user/pig/tests/data/pigmix/page_views:org.apache.pig.test.pigmix.udf.PigPerformanceLoader) - scope-0--------
Global sort: false
----------------

MapReduce node scope-26
Map Plan
B: Local Rearrange[tuple]{tuple}(false) - scope-30
|   |
|   Constant(all) - scope-29
|
|---New For Each(false)[tuple] - scope-28
    |   |
    |   Project[bytearray][3] - scope-27
    |
    |---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp-287979498/tmp1627657499:org.apache.pig.impl.builtin.RandomSampleLoader('org.apache.pig.impl.io.InterStorage','100')) - scope-25--------
Reduce Plan
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp-287979498/tmp610018336:org.apache.pig.impl.io.InterStorage) - scope-39
|
|---New For Each(false)[tuple] - scope-38
    |   |
    |   POUserFunc(org.apache.pig.impl.builtin.FindQuantiles)[tuple] - scope-37
    |   |
    |   |---Project[tuple][*] - scope-36
    |
    |---New For Each(false,false)[tuple] - scope-35
        |   |
        |   Constant(10) - scope-34
        |   |
        |   Project[bag][1] - scope-32
        |
        |---Package(Packager)[tuple]{chararray} - scope-31--------
Global sort: false
Secondary sort: true
----------------

MapReduce node scope-41
Map Plan
B: Local Rearrange[tuple]{bytearray}(false) - scope-42
|   |
|   Project[bytearray][3] - scope-20
|
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp-287979498/tmp1627657499:org.apache.pig.impl.io.InterStorage) - scope-40--------
Reduce Plan
B: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-22
|
|---New For Each(true)[tuple] - scope-45
    |   |
    |   Project[bag][1] - scope-44
    |
    |---Package(LitePackager)[tuple]{bytearray} - scope-43--------
Global sort: true
Quantile file: hdfs://zly1.sh.intel.com:8020/tmp/temp-287979498/tmp610018336


Scope-26 does the sampling and generates the quantile file.
It is always scope-26 that fails.
# hadoop job -history job_1469651298110_0002-1469672332355-root-PigLatin%3AL9.pig-1469678558094-6414-0-FAILED-default-1469672377395.jhist
Hadoop job: job_1469651298110_0002
=====================================
User: root
JobName: PigLatin:L9.pig
JobConf: hdfs://bdpe41:8020/tmp/hadoop-yarn/staging/root/.staging/job_1469651298110_0002/job.xml
Submitted At: 27-Jul-2016 22:18:52
Launched At: 27-Jul-2016 22:19:37 (45sec)
Finished At: 28-Jul-2016 00:02:38 (1hrs, 43mins, 0sec)
Status: FAILED

=====================================

Task Summary
============================
Kind    Total   Successful      Failed  Killed  StartTime       FinishTime

Setup   0       0               0       0
Map     7197    6414            572     211     27-Jul-2016 22:19:41    28-Jul-2016 00:02:40 (1hrs, 42mins, 59sec)
Reduce  1       0               0       1       27-Jul-2016 22:21:20    28-Jul-2016 00:02:40 (1hrs, 41mins, 19sec)
Cleanup 0       0               0       0

Querying the logs for why the reduce task fails, I only find "Task KILL is received. Killing attempt!". I do not know why the reduce task was killed.

{"type":"REDUCE_ATTEMPT_KILLED","event":{"org.apache.hadoop.mapreduce.jobhistory.TaskAttemptUnsuccessfulCompletion":{"taskid":"task_1469651298110_0002_r_000000","taskType":"REDUCE","attemptId":"attempt_1469651298110_0002_r_000000_0","finishTime":1469678560791,"hostname":"bdpe15","port":41213,"rackname":"/default-rack","status":"KILLED","error":"Task KILL is received. Killing attempt!","counters":{"org.apache.hadoop.mapreduce.jobhistory.JhCounters":{"name":"COUNTERS","groups":[{"name":"org.apache.hadoop.mapreduce.FileSystemCounter","displayName":"File System Counters","counts":[{"name":"FILE_BYTES_READ","displayName":"FILE: Number of bytes read","value":0},{"name":"FILE_BYTES_WRITTEN","displayName":"FILE: Number of bytes written","value":169316},{"name":"FILE_READ_OPS","displayName":"FILE: Number of read operations","value":0},{"name":"FILE_LARGE_READ_OPS","displayName":"FILE: Number of large read operations","value":0},{"name":"FILE_WRITE_OPS","displayName":"FILE: Number of write operations","value":0},{"name":"HDFS_BYTES_READ","displayName":"HDFS: Number of bytes read","value":0},{"name":"HDFS_BYTES_WRITTEN","displayName":"HDFS: Number of bytes written","value":0},{"name":"HDFS_READ_OPS","displayName":"HDFS: Number of read operations","value":0},{"name":"HDFS_LARGE_READ_OPS","displayName":"HDFS: Number of large read operations","value":0},{"name":"HDFS_WRITE_OPS","displayName":"HDFS: Number of write operations","value":0}]},{"name":"org.apache.hadoop.mapreduce.TaskCounter","displayName":"Map-Reduce Framework","counts":[{"name":"COMBINE_INPUT_RECORDS","displayName":"Combine input records","value":0},{"name":"COMBINE_OUTPUT_RECORDS","displayName":"Combine output records","value":0},{"name":"REDUCE_INPUT_GROUPS","displayName":"Reduce input groups","value":0},{"name":"REDUCE_SHUFFLE_BYTES","displayName":"Reduce shuffle bytes","value":21039704},{"name":"REDUCE_INPUT_RECORDS","displayName":"Reduce input records","value":0},{"name":"REDUCE_OUTPUT_RECORDS","displayName":"Reduce output records","value":0},{"name":"SPILLED_RECORDS","displayName":"Spilled Records","value":0},{"name":"SHUFFLED_MAPS","displayName":"Shuffled Maps ","value":6405},{"name":"FAILED_SHUFFLE","displayName":"Failed Shuffles","value":0},{"name":"MERGED_MAP_OUTPUTS","displayName":"Merged Map outputs","value":0},{"name":"GC_TIME_MILLIS","displayName":"GC time elapsed (ms)","value":3617},{"name":"CPU_MILLISECONDS","displayName":"CPU time spent (ms)","value":148570},{"name":"PHYSICAL_MEMORY_BYTES","displayName":"Physical memory (bytes) snapshot","value":346775552},{"name":"VIRTUAL_MEMORY_BYTES","displayName":"Virtual memory (bytes) snapshot","value":2975604736},{"name":"COMMITTED_HEAP_BYTES","displayName":"Total committed heap usage (bytes)","value":1490026496}]},{"name":"Shuffle Errors","displayName":"Shuffle Errors","counts":[{"name":"BAD_ID","displayName":"BAD_ID","value":0},{"name":"CONNECTION","displayName":"CONNECTION","value":0},{"name":"IO_ERROR","displayName":"IO_ERROR","value":0},{"name":"WRONG_LENGTH","displayName":"WRONG_LENGTH","value":0},{"name":"WRONG_MAP","displayName":"WRONG_MAP","value":0},{"name":"WRONG_REDUCE","displayName":"WRONG_REDUCE","value":0}]},{"name":"org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter","displayName":"File Output Format Counters ","counts":[{"name":"BYTES_WRITTEN","displayName":"Bytes 
Written","value":0}]}]}},"clockSplits":[363810,597022,913686,4199950,340,339,340,340,339,340,340,340],"cpuUsages":[14016,15693,22227,96634,0,0,0,0,0,0,0,0],"vMemKbytes":[2635265,2905863,2905864,2905863,2905864,2905863,2905864,2905863,2905864,2905864,2905863,2905864],"physMemKbytes":[534798,640361,522500,355737,338648,338647,338648,338647,338648,338648,338647,338648]}}}
{"type":"TASK_FAILED","event":{"org.apache.hadoop.mapreduce.jobhistory.TaskFailed":{"taskid":"task_1469651298110_0002_r_000000","taskType":"REDUCE","finishTime":1469678560792,"error":"","failedDueToAttempt":null,"status":"KILLED","counters":{"org.apache.hadoop.mapreduce.jobhistory.JhCounters":{"name":"COUNTERS","groups":[{"name":"org.apache.hadoop.mapreduce.TaskCounter","displayName":"Map-Reduce Framework","counts":[{"name":"CPU_MILLISECONDS","displayName":"CPU time spent (ms)","value":0},{"name":"PHYSICAL_MEMORY_BYTES","displayName":"Physical memory (bytes) snapshot","value":0},{"name":"VIRTUAL_MEMORY_BYTES","displayName":"Virtual memory (bytes) snapshot","value":0}]}]}}}}}
{"type":"JOB_FAILED","event":{"org.apache.hadoop.mapreduce.jobhistory.JobUnsuccessfulCompletion":{"jobid":"job_1469651298110_0002","finishTime":1469678558094,"finishedMaps":6414,"finishedReduces":0,"jobStatus":"FAILED","diagnostics":{"string":"Task failed task_1469651298110_0002_m_003030\nJob failed as tasks failed. failedMaps:1 failedReduces:0"}}}}
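
The JOB_FAILED diagnostics above say the job failed because map task
task_1469651298110_0002_m_003030 failed. A possible next step (a sketch,
assuming YARN log aggregation is enabled on this cluster) is to pull the
container logs for that attempt and look for the underlying error:

# fetch the aggregated logs for the application behind job_1469651298110_0002
yarn logs -applicationId application_1469651298110_0002 > app_0002.log
# look at the context around the failed map task
grep -B 5 -A 30 "task_1469651298110_0002_m_003030" app_0002.log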




Kelly Zhang/Zhang,Liyun
Best Regards




