Posted to issues@systemml.apache.org by "Mike Dusenberry (JIRA)" <ji...@apache.org> on 2016/09/15 17:55:22 UTC

[jira] [Comment Edited] (SYSTEMML-909) `determineDataFrameDimensionsIfNeeded(...)` is a bottleneck.

    [ https://issues.apache.org/jira/browse/SYSTEMML-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15494078#comment-15494078 ] 

Mike Dusenberry edited comment on SYSTEMML-909 at 9/15/16 5:55 PM:
-------------------------------------------------------------------

[~mboehm7] I tried the latest update, but I ran into the {{ArrayIndexOutOfBoundsException}} failure below while executing a script that previously ran.  The DataFrame is the same as before: it has an index column ("__INDEX") and a vector column.  The index column was created with the usual {{zipWithIndex}} call and then mapped to a column in the DataFrame, so the indexing is 0-based.  The schema is {{DataFrame[__INDEX: int, sample: vector]}}.
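
For reference, a minimal sketch of how such a DataFrame can be constructed (a hypothetical reconstruction, assuming Spark 1.x-era PySpark with an existing {{sc}} and {{sqlContext}}; the sample data is made up):

{code}
from pyspark.mllib.linalg import Vectors

# Hypothetical sample data; each inner list becomes one matrix row.
data = sc.parallelize([[1.0, 2.0], [3.0, 4.0]])

# zipWithIndex() assigns a 0-based index, which becomes the "__INDEX" column.
indexed = data.zipWithIndex().map(lambda kv: (kv[1], Vectors.dense(kv[0])))
df = sqlContext.createDataFrame(indexed, ["__INDEX", "sample"])
{code}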

{code}
org.apache.sysml.api.mlcontext.MLContextException: Exception when executing script
	at org.apache.sysml.api.mlcontext.MLContext.execute(MLContext.java:301)
	at org.apache.sysml.api.mlcontext.MLContext.execute(MLContext.java:271)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
	at py4j.Gateway.invoke(Gateway.java:259)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:209)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.sysml.api.mlcontext.MLContextException: Exception occurred while executing runtime program
	at org.apache.sysml.api.mlcontext.ScriptExecutor.executeRuntimeProgram(ScriptExecutor.java:381)
	at org.apache.sysml.api.mlcontext.ScriptExecutor.execute(ScriptExecutor.java:324)
	at org.apache.sysml.api.mlcontext.MLContext.execute(MLContext.java:293)
	... 12 more
Caused by: org.apache.sysml.runtime.DMLRuntimeException: org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in program block generated from statement block between lines 1 and 26 -- Error evaluating instruction: CP°/°X·MATRIX·DOUBLE°255·SCALAR·INT·true°_mVar127·MATRIX·DOUBLE
	at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:152)
	at org.apache.sysml.api.mlcontext.ScriptExecutor.executeRuntimeProgram(ScriptExecutor.java:379)
	... 14 more
Caused by: org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in program block generated from statement block between lines 1 and 26 -- Error evaluating instruction: CP°/°X·MATRIX·DOUBLE°255·SCALAR·INT·true°_mVar127·MATRIX·DOUBLE
	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:335)
	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:224)
	at org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:168)
	at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:145)
	... 15 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 22.0 failed 4 times, most recent failure: Lost task 0.3 in stage 22.0 (TID 40023, spark-ml-node-8.fyre.ibm.com): java.lang.ArrayIndexOutOfBoundsException: -1000
	at org.apache.sysml.runtime.matrix.data.MatrixBlock.appendValue(MatrixBlock.java:704)
	at org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtils$DataFrameToBinaryBlockFunction.call(RDDConverterUtils.java:921)
	at org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtils$DataFrameToBinaryBlockFunction.call(RDDConverterUtils.java:865)
	at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:192)
	at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:192)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
	at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:339)
	at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:46)
	at org.apache.sysml.runtime.controlprogram.context.SparkExecutionContext.toMatrixBlock(SparkExecutionContext.java:803)
	at org.apache.sysml.runtime.controlprogram.context.SparkExecutionContext.toMatrixBlock(SparkExecutionContext.java:756)
	at org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromRDD(MatrixObject.java:543)
	at org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromRDD(MatrixObject.java:61)
	at org.apache.sysml.runtime.controlprogram.caching.CacheableData.acquireRead(CacheableData.java:464)
	at org.apache.sysml.runtime.controlprogram.context.ExecutionContext.getMatrixInput(ExecutionContext.java:241)
	at org.apache.sysml.runtime.instructions.cp.ScalarMatrixArithmeticCPInstruction.processInstruction(ScalarMatrixArithmeticCPInstruction.java:49)
	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:305)
	... 18 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1000
	at org.apache.sysml.runtime.matrix.data.MatrixBlock.appendValue(MatrixBlock.java:704)
	at org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtils$DataFrameToBinaryBlockFunction.call(RDDConverterUtils.java:921)
	at org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtils$DataFrameToBinaryBlockFunction.call(RDDConverterUtils.java:865)
	at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:192)
	at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:192)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	... 1 more
{code}
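
The {{ArrayIndexOutOfBoundsException: -1000}} in {{MatrixBlock.appendValue}} is a negative offset whose magnitude equals the default SystemML block size, which would be consistent with the 0-based {{__INDEX}} values being treated as 1-based row indices somewhere in the converter.  If that is the cause, one workaround sketch (hypothetical; {{df}} is the DataFrame described above) would be to shift the index to 1-based before handing it to {{MLContext}}:

{code}
from pyspark.sql.functions import col

# Shift the 0-based "__INDEX" column to 1-based before passing the DataFrame
# in, in case the converter expects 1-based row indices.
df_1based = df.withColumn("__INDEX", col("__INDEX") + 1)
{code}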



> `determineDataFrameDimensionsIfNeeded(...)` is a bottleneck.
> ------------------------------------------------------------
>
>                 Key: SYSTEMML-909
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-909
>             Project: SystemML
>          Issue Type: Improvement
>            Reporter: Mike Dusenberry
>
> The {{[determineDataFrameDimensionsIfNeeded(...) | https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/mlcontext/MLContextConversionUtil.java#L585]}} function in {{MLContext}} is a major bottleneck, particularly due to the `javaRDD` call.
> The issue I'm seeing is that the {{javaRDD.count()}} call forces execution of the lazy DataFrames I pass in, which are created from another DataFrame via {{df.randomSplit([0.8, 0.2])}}, so a shuffle occurs.  I know this is going to happen anyway in the internal conversion, but having to do it in this step as well wastes a lot of time.  Assume that I have more data than I can efficiently cache (~7 TB, with the potential for much more), so I need to incur the shuffle step only once on the way into the engine.
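>
> For illustration, a sketch of the access pattern described above (hypothetical names; PySpark):
>
> {code}
> # Both splits are lazy; nothing executes until an action runs.
> train, test = df.randomSplit([0.8, 0.2])
>
> # The dimension check effectively performs a count per input DataFrame,
> # forcing the upstream shuffle once just to learn the row count...
> n_train = train.count()
>
> # ...and the internal DataFrame-to-binary-block conversion then executes
> # the same lineage (and shuffle) a second time.
> {code}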



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)