Posted to commits@systemml.apache.org by de...@apache.org on 2017/04/07 18:58:50 UTC

[46/50] [abbrv] incubator-systemml git commit: [MINOR] Added common errors and troubleshooting tricks

[MINOR] Added common errors and troubleshooting tricks

Closes #428.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/bd232241
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/bd232241
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/bd232241

Branch: refs/heads/gh-pages
Commit: bd232241b432dbe28e952ae36f1dce03f5658e23
Parents: 358cfc9
Author: Niketan Pansare <np...@us.ibm.com>
Authored: Mon Mar 13 13:53:45 2017 -0800
Committer: Niketan Pansare <np...@us.ibm.com>
Committed: Mon Mar 13 14:53:45 2017 -0700

----------------------------------------------------------------------
 troubleshooting-guide.md | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/bd232241/troubleshooting-guide.md
----------------------------------------------------------------------
diff --git a/troubleshooting-guide.md b/troubleshooting-guide.md
index db8f060..629bcf5 100644
--- a/troubleshooting-guide.md
+++ b/troubleshooting-guide.md
@@ -94,3 +94,45 @@ Note: The default `SystemML-config.xml` is located in `<path to SystemML root>/c
     hadoop jar SystemML.jar [-? | -help | -f <filename>] (-config=<config_filename>) ([-args | -nvargs] <args-list>)
     
 See [Invoking SystemML in Hadoop Batch Mode](hadoop-batch-mode.html) for details of the syntax. 
+
+## Total size of serialized results is bigger than spark.driver.maxResultSize
+
+Spark aborts a job if the estimated size of the results to be collected at the driver exceeds `spark.driver.maxResultSize`, in order to avoid out-of-memory errors in the driver.
+However, SystemML's optimizer estimates the memory required for each operator and already guards against these out-of-memory errors in the driver.
+So, we recommend disabling this limit by setting the configuration `--conf spark.driver.maxResultSize=0`.
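+
+For example, when invoking SystemML in Spark batch mode (`myScript.dml` is a placeholder for your DML script; combine with your other options):
+
+	 spark-submit --conf spark.driver.maxResultSize=0 SystemML.jar -f myScript.dml ...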
+
+## File does not exist on HDFS/LFS error from remote parfor
+
+This error usually comes from an incorrect HDFS configuration on the worker nodes. To investigate this, we recommend:
+
+- Testing whether HDFS is accessible from each worker node: `hadoop fs -ls <file path>`
+- Synchronizing the Hadoop configuration across the worker nodes.
+- Setting the environment variable `HADOOP_CONF_DIR` (see the sketch after this list). You may have to restart the cluster manager for the Hadoop configuration to take effect.
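+
+For example, a minimal sketch of the last step (the path `/etc/hadoop/conf` is an assumption; substitute your cluster's Hadoop configuration directory):
+
+	 # Assumed location of the Hadoop configuration directory; adjust for your cluster.
+	 export HADOOP_CONF_DIR=/etc/hadoop/conf
+	 # Then verify that HDFS is reachable from this worker node.
+	 hadoop fs -ls <file path>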
+
+## JVM Garbage Collection related flags
+
+We recommend providing 10% of the maximum memory to the young generation and using the `-server` flag for a robust garbage collection policy.
+For example, if you intend to use a 20G driver and 60G executors, then 10% of 20G is 2G and 10% of 60G is 6G, so please add the following to your configuration:
+
+	 spark-submit --driver-memory 20G --executor-memory 60G --conf "spark.executor.extraJavaOptions=-Xmn6G -server" --conf "spark.driver.extraJavaOptions=-Xmn2G -server" ...
+
+## Memory overhead
+
+Spark sets `spark.yarn.executor.memoryOverhead`, `spark.yarn.driver.memoryOverhead` and `spark.yarn.am.memoryOverhead` to 10% of the memory provided
+to the executor, driver and YARN Application Master respectively (with a minimum of 384 MB). For certain workloads, the user may have to increase this
+overhead to 12-15% of the memory budget.
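+
+For example, a sketch of raising the executor overhead to 15% of a 60G executor (15% of 60G is 9G, i.e. 9216 MB; the property takes a value in megabytes):
+
+	 spark-submit --executor-memory 60G --conf spark.yarn.executor.memoryOverhead=9216 ...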
+
+## Network timeout
+
+In compute-bound scripts, long-running tasks can lead to network timeouts that are misreported as failures. To avoid such false-positive errors, the user may have to increase the timeout `spark.network.timeout` (default: 120s).
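+
+For example, a sketch with a higher timeout (the value `600s` is an assumption; choose one suited to your workload):
+
+	 spark-submit --conf spark.network.timeout=600s ...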
+
+## Advanced developer statistics
+
+A few of our operators (for example, the convolution-related operators) and the GPU backend allow an expert user to collect advanced statistics
+by setting the configurations `systemml.stats.extraGPU` and `systemml.stats.extraDNN` in the file `SystemML-config.xml`.
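+
+A minimal sketch of the corresponding entries, assuming the usual `<name>value</name>` element style of `SystemML-config.xml`:
+
+	 <!-- assumed element form; set to true to enable the extra statistics -->
+	 <systemml.stats.extraGPU>true</systemml.stats.extraGPU>
+	 <systemml.stats.extraDNN>true</systemml.stats.extraDNN>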
+
+## Out-Of-Memory on executors
+
+Out-of-memory errors on executors are often caused by side effects of Spark's lazy evaluation and in-memory caching of input data for large-scale problems.
+Though we are constantly improving our optimizer to address this scenario, a quick workaround is to reduce the number of cores allocated to each executor, since fewer concurrent tasks means less memory pressure per executor.
+We would highly appreciate it if you filed a bug report on our [issue tracker](https://issues.apache.org/jira/browse/SYSTEMML) if and when you encounter an OOM error.
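+
+For example, a sketch that keeps the memory budget but reduces per-executor parallelism (the value `4` is an assumption; tune it for your cluster):
+
+	 spark-submit --executor-memory 60G --executor-cores 4 ...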
\ No newline at end of file