Posted to dev@pig.apache.org by "liyunzhang_intel (JIRA)" <ji...@apache.org> on 2017/04/11 06:27:41 UTC
[jira] [Updated] (PIG-5212) SkewedJoin_6 is failing on Spark
[ https://issues.apache.org/jira/browse/PIG-5212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liyunzhang_intel updated PIG-5212:
----------------------------------
Attachment: PIG-5212.patch
After applying PIG-5212.patch, the Spark plan changes to:
{code}
scope-57
--------
scope-51->scope-71
scope-56->scope-71
scope-71
#--------------------------------------------------
# Spark Plan
#--------------------------------------------------
Spark node scope-51
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp-2120872783/tmp-1165619913:org.apache.pig.impl.io.InterStorage) - scope-52
|
|---a: Load(hdfs://zly1.sh.intel.com:8020/user/root/studenttab10k.mk:org.apache.pig.builtin.PigStorage) - scope-36
--------
Spark node scope-71
c: Store(hdfs://zly1.sh.intel.com:8020/user/root/skewed.out:org.apache.pig.builtin.PigStorage) - scope-50
|
|---c: SkewedJoin[tuple] - scope-49
| |
| Project[bytearray][0] - scope-47
| |
| Project[bytearray][0] - scope-48
|
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp-2120872783/tmp-1165619913:org.apache.pig.impl.io.InterStorage) - scope-36
|
|---b: Filter[bag] - scope-42
| |
| Greater Than[boolean] - scope-46
| |
| |---Cast[int] - scope-44
| | |
| | |---Project[bytearray][1] - scope-43
| |
| |---Constant(25) - scope-45
|
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp-2120872783/tmp-1165619913:org.apache.pig.impl.io.InterStorage) - scope-54
--------
Spark node scope-56
BroadcastSpark - scope-70
|
|---New For Each(false)[tuple] - scope-69
| |
| POUserFunc(org.apache.pig.impl.builtin.PartitionSkewedKeys)[tuple] - scope-68
| |
| |---Project[tuple][*] - scope-67
|
|---New For Each(false,false)[tuple] - scope-66
| |
| Constant(7) - scope-65
| |
| Project[bag][1] - scope-64
|
|---POSparkSort[tuple]() - scope-49
| |
| Project[bytearray][0] - scope-47
|
|---New For Each(false,true)[tuple] - scope-63
| |
| Project[bytearray][0] - scope-47
| |
| POUserFunc(org.apache.pig.impl.builtin.GetMemNumRows)[tuple] - scope-61
| |
| |---Project[tuple][*] - scope-60
|
|---PoissonSampleSpark - scope-62
|
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp-2120872783/tmp-1165619913:org.apache.pig.impl.io.InterStorage) - scope-57
--------
{code}
The difference between the current and the previous Spark plan is that the predecessors of SkewedJoin (scope-49) are Load (scope-36) and Filter (scope-42). The fix is to set the operatorKey of the POLoad in SparkCompiler#startNew when SparkCompiler#visitSplit is called (the POLoad in Spark node scope-74 has the same OperatorKey as the POLoad in Spark node scope-51).
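For context, the sampling job in Spark node scope-56 mirrors Pig's skewed-join strategy: sample one input (PoissonSampleSpark), count rows per key (the GetMemNumRows stage), derive a key-to-reducers map (PartitionSkewedKeys), and broadcast it (BroadcastSpark) so the join can split heavy keys across reducers. The following is a toy Python sketch of that idea, not Pig's actual PartitionSkewedKeys implementation; the parallelism of 7 echoes the Constant(7) in the plan, and the function and parameter names are made up for illustration.

```python
import random
from collections import Counter

def build_skew_partition_map(rows, key_index=0, sample_rate=0.1,
                             parallelism=7, threshold=2):
    """Toy analogue of Pig's skewed-join sampling job: sample the input,
    count keys, and assign extra reducers to heavy ("skewed") keys.
    Broadcasting the returned map corresponds to BroadcastSpark."""
    # ~PoissonSampleSpark: take a random sample of the input rows
    sample = [r for r in rows if random.random() < sample_rate]
    # ~GetMemNumRows: estimate per-key weight from the sample
    counts = Counter(r[key_index] for r in sample)
    total = sum(counts.values()) or 1
    # ~PartitionSkewedKeys: heavy keys get a share of the reducers
    partition_map = {}
    for key, cnt in counts.items():
        if cnt >= threshold:
            share = max(1, round(cnt / total * parallelism))
            partition_map[key] = min(share, parallelism)
    return partition_map  # skewed key -> number of reducers to split it over
```

Keys absent from the map are not skewed and hash to a single reducer as usual; only the heavy keys pay the cost of being split and re-merged.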
> SkewedJoin_6 is failing on Spark
> --------------------------------
>
> Key: PIG-5212
> URL: https://issues.apache.org/jira/browse/PIG-5212
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: Nandor Kollar
> Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-5212.patch
>
>
> results are different:
> {code}
> diff <(head -20 SkewedJoin_6_benchmark.out/out_sorted) <(head -20 SkewedJoin_6.out/out_sorted)
> < alice allen 19 1.930 alice allen 27 1.950
> < alice allen 19 1.930 alice allen 34 1.230
> < alice allen 19 1.930 alice allen 36 2.270
> < alice allen 19 1.930 alice allen 38 0.810
> < alice allen 19 1.930 alice allen 38 1.800
> < alice allen 19 1.930 alice allen 42 2.460
> < alice allen 19 1.930 alice allen 43 0.880
> < alice allen 19 1.930 alice allen 45 2.800
> < alice allen 19 1.930 alice allen 46 3.970
> < alice allen 19 1.930 alice allen 51 1.080
> < alice allen 19 1.930 alice allen 68 3.390
> < alice allen 19 1.930 alice allen 68 3.510
> < alice allen 19 1.930 alice allen 72 1.750
> < alice allen 19 1.930 alice allen 72 3.630
> < alice allen 19 1.930 alice allen 74 0.020
> < alice allen 19 1.930 alice allen 74 2.400
> < alice allen 19 1.930 alice allen 77 2.520
> < alice allen 20 2.470 alice allen 27 1.950
> < alice allen 20 2.470 alice allen 34 1.230
> < alice allen 20 2.470 alice allen 36 2.270
> ---
> > alice allen 27 1.950 alice allen 19 1.930
> > alice allen 27 1.950 alice allen 20 2.470
> > alice allen 27 1.950 alice allen 27 1.950
> > alice allen 27 1.950 alice allen 34 1.230
> > alice allen 27 1.950 alice allen 36 2.270
> > alice allen 27 1.950 alice allen 38 0.810
> > alice allen 27 1.950 alice allen 38 1.800
> > alice allen 27 1.950 alice allen 42 2.460
> > alice allen 27 1.950 alice allen 43 0.880
> > alice allen 27 1.950 alice allen 45 2.800
> > alice allen 27 1.950 alice allen 46 3.970
> > alice allen 27 1.950 alice allen 51 1.080
> > alice allen 27 1.950 alice allen 68 3.390
> > alice allen 27 1.950 alice allen 68 3.510
> > alice allen 27 1.950 alice allen 72 1.750
> > alice allen 27 1.950 alice allen 72 3.630
> > alice allen 27 1.950 alice allen 74 0.020
> > alice allen 27 1.950 alice allen 74 2.400
> > alice allen 27 1.950 alice allen 77 2.520
> > alice allen 34 1.230 alice allen 19 1.930
> {code}
> It looks like the two tables are in the wrong order: columns from 'a' should come first, then columns from 'b'. In Spark mode this is inverted.
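The inversion described in the quoted report can be reproduced with a toy nested-loop equi-join (a hedged Python sketch, not Pig's join code): when the join concatenates the right-hand tuple's fields before the left-hand tuple's, the output matches the swapped rows in the diff above.

```python
# Toy illustration of the column-order bug: a correct join emits the
# left relation's fields first, then the right relation's; the buggy
# variant (swap=True) concatenates them in the opposite order.

def join(left, right, key=0, swap=False):
    """Nested-loop equi-join on column `key`; swap=True mimics the bug."""
    out = []
    for l in left:
        for r in right:
            if l[key] == r[key]:
                out.append(r + l if swap else l + r)
    return out

a = [("alice allen", 19, 1.930), ("alice allen", 20, 2.470)]
b = [("alice allen", 27, 1.950)]

correct = join(a, b)             # columns from 'a' first, then from 'b'
buggy = join(a, b, swap=True)    # columns from 'b' first, as seen on Spark

# correct[0] == ("alice allen", 19, 1.93, "alice allen", 27, 1.95)
# buggy[0]   == ("alice allen", 27, 1.95, "alice allen", 19, 1.93)
```

The rows themselves match between the two modes; only the field order is flipped, which is why the sorted outputs diverge from the first line of the diff.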
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)