Posted to dev@pig.apache.org by "Mohit Sabharwal (JIRA)" <ji...@apache.org> on 2016/05/17 04:18:12 UTC

[jira] [Commented] (PIG-4886) Add PigSplit#getLocationInfo to fix the NPE found in log in spark mode

    [ https://issues.apache.org/jira/browse/PIG-4886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15286003#comment-15286003 ] 

Mohit Sabharwal commented on PIG-4886:
--------------------------------------

Thanks, [~kellyzly] - left a couple of comments on RB.

> Add PigSplit#getLocationInfo to fix the NPE found in log in spark mode
> ----------------------------------------------------------------------
>
>                 Key: PIG-4886
>                 URL: https://issues.apache.org/jira/browse/PIG-4886
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4886.patch
>
>
> Using branch code (119f313), test the following pig script in spark mode:
> {code}
> A = load './SkewedJoinInput1.txt' as (id,name,n);
> B = load './SkewedJoinInput2.txt' as (id,name);
> D = join A by (id,name), B by (id,name);
> store D into './testFRJoin.out';
> {code}
> cat bin/SkewedJoinInput1.txt 
> {noformat}
> 100	apple1	aaa
> 200	orange1	bbb
> 300	strawberry	ccc
> {noformat}
> cat bin/SkewedJoinInput2.txt 
> {noformat}
> 100	apple1
> 100	apple2
> 100	apple2
> 200	orange1
> 200	orange2
> 300	strawberry
> 400	pear
> {noformat}
> The following exception is found in the log:
> {noformat}
> [dag-scheduler-event-loop] 2016-05-05 14:21:01,046 DEBUG rdd.NewHadoopRDD (Logging.scala:logDebug(84)) - Failed to use InputSplit#getLocationInfo.
> java.lang.NullPointerException
>         at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
>         at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
>         at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
>         at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>         at org.apache.spark.rdd.HadoopRDD$.convertSplitLocationInfo(HadoopRDD.scala:406)
>         at org.apache.spark.rdd.NewHadoopRDD.getPreferredLocations(NewHadoopRDD.scala:202)
>         at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:231)
>         at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:231)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:230)
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1387)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1397)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1396)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1396)
> {noformat}
> org.apache.spark.rdd.NewHadoopRDD.getPreferredLocations calls PigSplit#getLocationInfo, but PigSplit currently does not override it, so the default InputSplit#getLocationInfo is used, which returns null:
> {code}
>   @Evolving
>   public SplitLocationInfo[] getLocationInfo() throws IOException {
>     return null;
>   }
> {code}
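> For reference, a minimal sketch of how PigSplit could override getLocationInfo (this is not the committed patch; it assumes the Hadoop 2.x org.apache.hadoop.mapred.SplitLocationInfo constructor and builds the info from PigSplit's existing getLocations(), treating every replica as on-disk):
> {code}
> import java.io.IOException;
> import org.apache.hadoop.mapred.SplitLocationInfo;
>
> @Override
> public SplitLocationInfo[] getLocationInfo() throws IOException {
>     // Build one SplitLocationInfo per preferred host so that Spark's
>     // HadoopRDD.convertSplitLocationInfo never iterates over a null array.
>     String[] hosts = getLocations();
>     SplitLocationInfo[] info = new SplitLocationInfo[hosts.length];
>     for (int i = 0; i < hosts.length; i++) {
>         info[i] = new SplitLocationInfo(hosts[i], false); // false: replica is on disk, not in memory
>     }
>     return info;
> }
> {code}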



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)