Posted to dev@pig.apache.org by "liyunzhang_intel (JIRA)" <ji...@apache.org> on 2016/05/05 06:50:12 UTC
[jira] [Updated] (PIG-4886) Add PigSplit#getLocationInfo to fix the NPE found in log in spark mode
[ https://issues.apache.org/jira/browse/PIG-4886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liyunzhang_intel updated PIG-4886:
----------------------------------
Description:
Use branch code (119f313) to test the following Pig script in Spark mode:
{code}
A = load './SkewedJoinInput1.txt' as (id,name,n);
B = load './SkewedJoinInput2.txt' as (id,name);
D = join A by (id,name), B by (id,name);
store D into './testFRJoin.out';
{code}
cat bin/SkewedJoinInput1.txt
{noformat}
100 apple1 aaa
200 orange1 bbb
300 strawberry ccc
{noformat}
cat bin/SkewedJoinInput2.txt
{noformat}
100 apple1
100 apple2
100 apple2
200 orange1
200 orange2
300 strawberry
400 pear
{noformat}
The following exception is found in the log:
{noformat}
[dag-scheduler-event-loop] 2016-05-05 14:21:01,046 DEBUG rdd.NewHadoopRDD (Logging.scala:logDebug(84)) - Failed to use InputSplit#getLocationInfo.
java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.rdd.HadoopRDD$.convertSplitLocationInfo(HadoopRDD.scala:406)
at org.apache.spark.rdd.NewHadoopRDD.getPreferredLocations(NewHadoopRDD.scala:202)
at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:231)
at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:231)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:230)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1387)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1397)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1396)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1396)
{noformat}
org.apache.spark.rdd.NewHadoopRDD.getPreferredLocations calls PigSplit#getLocationInfo, but PigSplit currently inherits this method from InputSplit, and InputSplit#getLocationInfo returns null:
{code}
@Evolving
public SplitLocationInfo[] getLocationInfo() throws IOException {
  return null;
}
{code}
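A possible direction for the fix (a sketch only, not the committed patch) is for PigSplit to override getLocationInfo and build a non-null array from the hosts returned by getLocations(). The classes below are simplified, hypothetical stand-ins for the Hadoop types (InputSplit, SplitLocationInfo) so the idea can be shown self-contained; the real PigSplit wraps underlying splits and may derive richer information from them.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for org.apache.hadoop.mapred.SplitLocationInfo;
// the signature mirrors the Hadoop API but this is not the real class.
class SplitLocationInfo {
    private final String location;
    private final boolean inMemory;

    SplitLocationInfo(String location, boolean inMemory) {
        this.location = location;
        this.inMemory = inMemory;
    }

    String getLocation() { return location; }
    boolean isInMemory() { return inMemory; }
}

// Simplified stand-in for org.apache.hadoop.mapreduce.InputSplit.
abstract class InputSplit {
    abstract String[] getLocations() throws IOException;

    // The base-class default that triggers the NPE in Spark:
    SplitLocationInfo[] getLocationInfo() throws IOException {
        return null;
    }
}

// Sketch of the idea behind PIG-4886: return non-null location info
// derived from getLocations() instead of inheriting the null default.
class PigSplitSketch extends InputSplit {
    private final String[] locations;

    PigSplitSketch(String[] locations) {
        this.locations = locations;
    }

    @Override
    String[] getLocations() {
        return locations;
    }

    @Override
    SplitLocationInfo[] getLocationInfo() throws IOException {
        List<SplitLocationInfo> info = new ArrayList<>();
        for (String host : getLocations()) {
            // Treat every location as on-disk (not in-memory); the real
            // patch may obtain this from the wrapped splits instead.
            info.add(new SplitLocationInfo(host, false));
        }
        return info.toArray(new SplitLocationInfo[0]);
    }
}
```

With a non-null array, Spark's HadoopRDD.convertSplitLocationInfo can iterate over the returned locations safely instead of hitting the NullPointerException shown above.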
was:
Use branch code (119f313) to test the following Pig script in Spark mode:
{code}
A = load './SkewedJoinInput1.txt' as (id,name,n);
B = load './SkewedJoinInput2.txt' as (id,name);
D = join A by (id,name), B by (id,name);
store D into './testFRJoin.out';
{code}
cat bin/SkewedJoinInput1.txt
{noformat}
100 apple1 aaa
200 orange1 bbb
300 strawberry ccc
{noformat}
cat bin/SkewedJoinInput2.txt
{noformat}
100 apple1
100 apple2
100 apple2
200 orange1
200 orange2
300 strawberry
400 pear
{noformat}
The following exception is found in the log:
{noformat}
java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.rdd.HadoopRDD$.convertSplitLocationInfo(HadoopRDD.scala:406)
at org.apache.spark.rdd.NewHadoopRDD.getPreferredLocations(NewHadoopRDD.scala:202)
at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:231)
at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:231)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:230)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1387)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1397)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1396)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1396)
{noformat}
> Add PigSplit#getLocationInfo to fix the NPE found in log in spark mode
> ----------------------------------------------------------------------
>
> Key: PIG-4886
> URL: https://issues.apache.org/jira/browse/PIG-4886
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
> Assignee: liyunzhang_intel
> Fix For: spark-branch
>
>
> Use branch code (119f313) to test the following Pig script in Spark mode:
> {code}
> A = load './SkewedJoinInput1.txt' as (id,name,n);
> B = load './SkewedJoinInput2.txt' as (id,name);
> D = join A by (id,name), B by (id,name);
> store D into './testFRJoin.out';
> {code}
> cat bin/SkewedJoinInput1.txt
> {noformat}
> 100 apple1 aaa
> 200 orange1 bbb
> 300 strawberry ccc
> {noformat}
> cat bin/SkewedJoinInput2.txt
> {noformat}
> 100 apple1
> 100 apple2
> 100 apple2
> 200 orange1
> 200 orange2
> 300 strawberry
> 400 pear
> {noformat}
> The following exception is found in the log:
> {noformat}
> [dag-scheduler-event-loop] 2016-05-05 14:21:01,046 DEBUG rdd.NewHadoopRDD (Logging.scala:logDebug(84)) - Failed to use InputSplit#getLocationInfo.
> java.lang.NullPointerException
> at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
> at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
> at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at org.apache.spark.rdd.HadoopRDD$.convertSplitLocationInfo(HadoopRDD.scala:406)
> at org.apache.spark.rdd.NewHadoopRDD.getPreferredLocations(NewHadoopRDD.scala:202)
> at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:231)
> at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:231)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:230)
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1387)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1397)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1396)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1396)
> {noformat}
> org.apache.spark.rdd.NewHadoopRDD.getPreferredLocations calls PigSplit#getLocationInfo, but PigSplit currently inherits this method from InputSplit, and InputSplit#getLocationInfo returns null:
> {code}
> @Evolving
> public SplitLocationInfo[] getLocationInfo() throws IOException {
>   return null;
> }
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)