Posted to user@spark.apache.org by Andreas Weise <an...@gmail.com> on 2022/03/16 10:33:23 UTC

spark 3.2.1: Unexpected reuse of dynamic PVC

Hi,

when using dynamic allocation on k8s with dynamic PVC reuse, I am seeing
that only a few executors are running: 2 of 4 are stuck in
'ContainerCreating' with events like:
spark-medium-1x-38b7c47f92340e9e-exec-3 : Multi-Attach error for volume
"pvc-c184e264-4a6d-406f-8d95-c59ff9e074d8" Volume is already used by pod(s)
spark-medium-1x-38b7c47f92340e9e-exec-2
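
For what it's worth, this can be inspected with plain kubectl (names taken
from the event above; 'kubectl describe pvc' lists the mounting pods under
'Used By:'):

  # claim of the still-running executor, incl. access mode and 'Used By:'
  kubectl describe pvc spark-medium-1x-38b7c47f92340e9e-exec-2-pvc-0
  # stuck pod and its Multi-Attach events
  kubectl describe pod spark-medium-1x-38b7c47f92340e9e-exec-3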

According to the documentation, only PVCs of deleted executors should be
reused (hence our ReadOnlyMany storage should be sufficient, right?). But
the executor of the reused PVC is still running. Is this expected?
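
For context: as far as I can tell the on-demand claims are created as
ReadWriteOnce, and an RWO volume can only be attached to one node at a
time, so reusing the claim of a still-running executor fails with exactly
this Multi-Attach error whenever the new pod lands on another node. A
sketch of what such a generated claim presumably looks like (the exact
spec Spark produces may differ):

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: spark-medium-1x-38b7c47f92340e9e-exec-2-pvc-0
  spec:
    accessModes:
      - ReadWriteOnce          # attachable to a single node at a time
    storageClassName: sc-openshift-default
    resources:
      requests:
        storage: 1Gi           # from ...options.sizeLimit below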

Config:

spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.maxExecutors=4
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.executorIdleTimeout=60s
spark.dynamicAllocation.shuffleTracking.enabled=true
spark.kubernetes.driver.ownPersistentVolumeClaim=true
spark.kubernetes.driver.reusePersistentVolumeClaim=true
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass=sc-openshift-default
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit=1Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/tmp/data/
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false

Log excerpt (full log attached):

INFO [2022-03-16 11:09:21,678] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Going to request 1 executors from Kubernetes for ResourceProfile Id: 0, target: 1, known: 0, sharedSlotFromPendingPods: 2147483647.
INFO [2022-03-16 11:09:21,684] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Found 0 reusable PVCs from 0 PVCs
INFO [2022-03-16 11:09:21,686] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Spark configuration files loaded from Some(/opt/conda/lib/python3.9/site-packages/pyspark/conf) : log4j.properties,hive-site.xml
INFO [2022-03-16 11:09:21,687] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Adding decommission script to lifecycle
INFO [2022-03-16 11:09:21,689] ({FIFOScheduler-interpreter_510428346-Worker-1} Logging.scala[logInfo]:57) - Using initial executors = 1, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
WARN [2022-03-16 11:09:21,690] ({FIFOScheduler-interpreter_510428346-Worker-1} Logging.scala[logWarning]:69) - Dynamic allocation without a shuffle service is an experimental feature.
INFO [2022-03-16 11:09:21,775] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Trying to create PersistentVolumeClaim spark-medium-1x-38b7c47f92340e9e-exec-2-pvc-0 with StorageClass sc-openshift-default
INFO [2022-03-16 11:09:37,220] ({dispatcher-CoarseGrainedScheduler} Logging.scala[logInfo]:57) - Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.128.6.67:57144) with ID 2,  ResourceProfileId 0
INFO [2022-03-16 11:09:37,225] ({spark-listener-group-executorManagement} Logging.scala[logInfo]:57) - New executor 2 has registered (new total is 1)

...

INFO [2022-03-16 11:09:51,708] ({kubernetes-executor-snapshots-subscribers-1} Logging.scala[logInfo]:57) - Going to request 1 executors from Kubernetes for ResourceProfile Id: 0, target: 2, known: 1, sharedSlotFromPendingPods: 2147483647.
INFO [2022-03-16 11:09:51,709] ({spark-dynamic-executor-allocation} Logging.scala[logInfo]:57) - Requesting 1 new executor because tasks are backlogged (new desired total will be 2 for resource profile id: 0)
INFO [2022-03-16 11:09:51,717] ({kubernetes-executor-snapshots-subscribers-1} Logging.scala[logInfo]:57) - Found 1 reusable PVCs from 1 PVCs
INFO [2022-03-16 11:09:51,719] ({kubernetes-executor-snapshots-subscribers-1} Logging.scala[logInfo]:57) - Spark configuration files loaded from Some(/opt/conda/lib/python3.9/site-packages/pyspark/conf) : log4j.properties,hive-site.xml
INFO [2022-03-16 11:09:51,721] ({kubernetes-executor-snapshots-subscribers-1} Logging.scala[logInfo]:57) - Adding decommission script to lifecycle
INFO [2022-03-16 11:09:51,726] ({kubernetes-executor-snapshots-subscribers-1} Logging.scala[logInfo]:57) - Reuse PersistentVolumeClaim spark-medium-1x-38b7c47f92340e9e-exec-2-pvc-0
INFO [2022-03-16 11:09:52,713] ({spark-dynamic-executor-allocation} Logging.scala[logInfo]:57) - Requesting 2 new executors because tasks are backlogged (new desired total will be 4 for resource profile id: 0)
INFO [2022-03-16 11:09:52,813] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Going to request 2 executors from Kubernetes for ResourceProfile Id: 0, target: 4, known: 2, sharedSlotFromPendingPods: 2147483646.
INFO [2022-03-16 11:09:52,820] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Found 0 reusable PVCs from 1 PVCs
INFO [2022-03-16 11:09:52,821] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Spark configuration files loaded from Some(/opt/conda/lib/python3.9/site-packages/pyspark/conf) : log4j.properties,hive-site.xml
INFO [2022-03-16 11:09:52,822] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Adding decommission script to lifecycle
INFO [2022-03-16 11:09:52,855] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Trying to create PersistentVolumeClaim spark-medium-1x-38b7c47f92340e9e-exec-4-pvc-0 with StorageClass sc-openshift-default
INFO [2022-03-16 11:09:52,866] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Spark configuration files loaded from Some(/opt/conda/lib/python3.9/site-packages/pyspark/conf) : log4j.properties,hive-site.xml
INFO [2022-03-16 11:09:52,867] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Adding decommission script to lifecycle
INFO [2022-03-16 11:09:52,899] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Trying to create PersistentVolumeClaim spark-medium-1x-38b7c47f92340e9e-exec-5-pvc-0 with StorageClass sc-openshift-default
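
As a next step we will probably retest with the reuse path disabled, which
should make every executor request a fresh claim again (at the cost of the
feature itself):

  spark.kubernetes.driver.reusePersistentVolumeClaim=false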


Best regards

Andreas

...

Re: spark 3.2.1: Unexpected reuse of dynamic PVC

Posted by Andreas Weise <an...@gmail.com>.
Minor correction:
>> (hence our *ReadWriteOnce* storage should be sufficient, right?...
