Posted to issues@spark.apache.org by "Luca Canali (JIRA)" <ji...@apache.org> on 2018/11/16 12:39:01 UTC

[jira] [Updated] (SPARK-26087) Spark on K8S jobs fail when returning large result sets in client mode under certain conditions

     [ https://issues.apache.org/jira/browse/SPARK-26087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luca Canali updated SPARK-26087:
--------------------------------
    Description: 
*Short description*: Spark on K8S jobs fail when the following conditions are all met: the driver runs in client mode on a host external to the K8S cluster, the task result size is greater than the value of "maxDirectResultSize", and the executor block managers in the K8S cluster are not routable from the driver.

*Background:* when the result of a task has to be serialized back to the driver, three cases are possible: (1) the job fails if the result size is greater than MAX_RESULT_SIZE; (2) the result is serialized directly to the driver if its size is less than or equal to "maxDirectResultSize"; (3) if the result size falls between "maxDirectResultSize" and MAX_RESULT_SIZE, pointers to the task results in the block manager are sent back to the driver instead.
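
The three cases above can be sketched as a simple decision function. This is an illustrative simplification with hypothetical names, not Spark's actual internals:

```python
def classify_task_result(result_size, max_direct_result_size, max_result_size):
    """Return how a task result reaches the driver (illustrative names).

    Case 1: result too big      -> the job fails.
    Case 2: result small enough -> serialized directly in the task result.
    Case 3: in between          -> only a block-manager pointer is sent back,
                                   and the driver must fetch the data from
                                   the executor's block manager.
    """
    if result_size > max_result_size:
        return "fail"
    if result_size <= max_direct_result_size:
        return "direct"
    return "indirect"
```

Only the "indirect" case requires the driver to open a connection to an executor block manager, which is the step that breaks in this configuration.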

For case 3 to complete successfully (i.e. when the driver is sent pointers into the block manager rather than the data itself), the driver must be able to reach the block manager address advertised by the executor JVM. This currently fails when the Spark driver runs outside the K8S cluster, because the block manager addresses are private K8S pod addresses and therefore not routable from the driver; this is true at least in our configuration, but we think others may be affected as well.

How to *reproduce* the issue:
 * Use Spark on K8S with a driver located externally to the K8S cluster
 * Start the Spark session using --conf *spark.task.maxDirectResultSize*=<testValue>, for example <testValue>=100, which makes the problem appear even for tasks returning small results.

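For example, the session can be started along these lines. The master URL, container image, and instance count below are placeholders for illustration; only spark.task.maxDirectResultSize matters for the reproduction:

```shell
# Driver runs in client mode on a host outside the K8S cluster.
# <k8s-apiserver-host> and <spark-image> are placeholders.
spark-shell \
  --master k8s://https://<k8s-apiserver-host>:6443 \
  --conf spark.kubernetes.container.image=<spark-image> \
  --conf spark.executor.instances=2 \
  --conf spark.task.maxDirectResultSize=100
```

With this setting, any action returning more than 100 bytes per task forces the "indirect" result path and triggers the fetch failure below.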
*Error* stack:

{code:java}
ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Connecting to /10.100.112.3:35973 timed out (30000 ms)
{code}
Existing *workaround*:
 * Set *spark.task.maxDirectResultSize* to the same value as spark.driver.maxResultSize when running in the configuration affected by this bug (K8S with the driver external to the cluster).

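With both limits equal, the intermediate case can no longer occur: any result that passes the maxResultSize check also qualifies for direct return. A quick self-contained check of that reasoning, using the same simplified decision logic as above (not Spark internals):

```python
def classify(result_size, max_direct, max_result):
    # Simplified version of the three-way result-handling decision.
    if result_size > max_result:
        return "fail"
    return "direct" if result_size <= max_direct else "indirect"

# Workaround: set spark.task.maxDirectResultSize equal to
# spark.driver.maxResultSize, making the "indirect" branch unreachable.
LIMIT = 1 << 30  # e.g. 1g for both settings (illustrative value)
assert all(classify(s, LIMIT, LIMIT) != "indirect"
           for s in (1, LIMIT - 1, LIMIT, LIMIT + 1))
```

Every result is therefore either sent directly (no block manager fetch needed) or rejected, so the non-routable block manager addresses are never contacted.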
Additional *notes*:
 * The executors register with routable public IPs, but the executor block managers register with IPs private to K8S. This is why task assignment works, yet the driver cannot connect to an executor's block manager to pull the final result (when the result is bigger than maxDirectResultSize).
 * The Spark documentation correctly states that "spark executors must be able to connect to the Spark driver over a hostname and a port that is routable from the Spark executors"; however, this case is about the Spark driver needing to connect to the executors' block managers.


> Spark on K8S jobs fail when returning large result sets in client mode under certain conditions
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-26087
>                 URL: https://issues.apache.org/jira/browse/SPARK-26087
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.4.0
>            Reporter: Luca Canali
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
