Posted to dev@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2022/02/11 20:34:03 UTC

Deploying Spark on Google Kubernetes (GKE) autopilot, preliminary findings

The equivalent of Google GKE Autopilot
<https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview>
in AWS is AWS Fargate <https://aws.amazon.com/fargate/>.


I have not used AWS Fargate, so I can only comment on Google's GKE
Autopilot.


This builds on the concepts of containerization and microservices. In the
standard mode of creating a GKE cluster, users can customize their
configuration to their requirements: GKE manages the control plane, while
users provision and manage their own node infrastructure. So you choose the
machine type and memory/CPU where your Spark containers will be running,
and those nodes are shown as VM hosts in your account. In GKE Autopilot
mode, GKE manages the nodes and pre-configures the cluster with add-ons for
auto-scaling, auto-upgrades, maintenance, Day 2 operations and security
hardening. So there is a lot there. You don't choose your nodes or their
sizes; you are effectively paying only for the pods you use.
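
For orientation, creating an Autopilot cluster itself is a one-liner; a
minimal sketch (the cluster name, region and project below are
placeholders, not the ones used for these tests):

gcloud container clusters create-auto spark-autopilot \
    --region europe-west2 \
    --project <project-id>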


With spark-submit, you still need to specify the number of executors, plus
the memory and cores for the driver and for each executor. The theory is
that the k8s cluster will deploy suitable nodes and will create enough pods
on those nodes. With a standard k8s cluster you choose your nodes, and you
should ensure that one core on each node is reserved for the OS itself.
Otherwise, if you allocate all cores to Spark with --conf
spark.executor.cores, you will receive this error:


kubectl describe pods -n spark

...

Events:

  Type     Reason             Age                 From                Message
  ----     ------             ----                ----                -------
  Warning  FailedScheduling   9s (x17 over 15m)   default-scheduler   0/3 nodes are available: 3 Insufficient cpu.
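
For reference, a sketch of the kind of spark-submit invocation being
discussed (the image, service account, application file and sizes are
placeholders rather than the exact values used in this test); note that the
executor cores are kept below the node's total so that a core is left free
for the OS:

spark-submit --verbose \
    --master k8s://https://$KUBERNETES_MASTER_IP:443 \
    --deploy-mode cluster \
    --name sparkBQ \
    --conf spark.kubernetes.namespace=spark \
    --conf spark.kubernetes.container.image=<registry>/spark-py:3.1.1 \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
    --conf spark.executor.instances=6 \
    --conf spark.executor.cores=3 \
    --conf spark.executor.memory=8192m \
    --conf spark.driver.cores=3 \
    --conf spark.driver.memory=8192m \
    local:///opt/spark/work-dir/RandomDataBigQuery.py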

So with standard k8s you have a choice in selecting your core sizes. With
Autopilot, node selection is left to Autopilot to deploy suitable nodes,
and this will involve some trial and error at the start (to get the
configuration right). You may be lucky if the history of executions is
kept current and the same job can be repeated. However, in my experience,
getting the driver pod into a "Running" state is expensive time-wise, and
without an executor in a running state there is no chance of the Spark job
doing anything.
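
(The pod states below were presumably captured with something along the
lines of kubectl get pods -n spark, repeated while the job was starting up.)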


NAME                                         READY   STATUS    RESTARTS   AGE
randomdatabigquery-cebab77eea6de971-exec-1   0/1     Pending   0          31s
randomdatabigquery-cebab77eea6de971-exec-2   0/1     Pending   0          31s
randomdatabigquery-cebab77eea6de971-exec-3   0/1     Pending   0          31s
randomdatabigquery-cebab77eea6de971-exec-4   0/1     Pending   0          31s
randomdatabigquery-cebab77eea6de971-exec-5   0/1     Pending   0          31s
randomdatabigquery-cebab77eea6de971-exec-6   0/1     Pending   0          31s
sparkbq-37405a7eea6b9468-driver              1/1     Running   0          3m4s


NAME                                         READY   STATUS              RESTARTS   AGE
randomdatabigquery-cebab77eea6de971-exec-6   0/1     ContainerCreating   0          112s
sparkbq-37405a7eea6b9468-driver              1/1     Running             0          4m25s


NAME                                         READY   STATUS    RESTARTS   AGE
randomdatabigquery-cebab77eea6de971-exec-6   1/1     Running   0          114s
sparkbq-37405a7eea6b9468-driver              1/1     Running   0          4m27s

Basically, I told Spark to have 6 executors, but only one executor could be
brought into a running state after the driver pod had been spinning for 4
minutes.

22/02/11 20:16:18 INFO SparkKubernetesClientFactory: Auto-configuring K8S
client using current context from users K8S config file

22/02/11 20:16:19 INFO Utils: Using initial executors = 6, max of
spark.dynamicAllocation.initialExecutors,
spark.dynamicAllocation.minExecutors and spark.executor.instances

22/02/11 20:16:19 INFO ExecutorPodsAllocator: Going to request 3 executors
from Kubernetes for ResourceProfile Id: 0, target: 6 running: 0.

22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not
enabled, skipping shutdown script

22/02/11 20:16:20 INFO Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.

22/02/11 20:16:20 INFO NettyBlockTransferService: Server created on
sparkbq-37405a7eea6b9468-driver-svc.spark.svc:7079

22/02/11 20:16:20 INFO BlockManager: Using
org.apache.spark.storage.RandomBlockReplicationPolicy for block replication
policy

22/02/11 20:16:20 INFO BlockManagerMaster: Registering BlockManager
BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079,
None)

22/02/11 20:16:20 INFO BlockManagerMasterEndpoint: Registering block
manager sparkbq-37405a7eea6b9468-driver-svc.spark.svc:7079 with 366.3 MiB
RAM, BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc,
7079, None)

22/02/11 20:16:20 INFO BlockManagerMaster: Registered BlockManager
BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079,
None)

22/02/11 20:16:20 INFO BlockManager: Initialized BlockManager:
BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079,
None)

22/02/11 20:16:20 INFO Utils: Using initial executors = 6, max of
spark.dynamicAllocation.initialExecutors,
spark.dynamicAllocation.minExecutors and spark.executor.instances

22/02/11 20:16:20 WARN ExecutorAllocationManager: Dynamic allocation
without a shuffle service is an experimental feature.

22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not
enabled, skipping shutdown script

22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not
enabled, skipping shutdown script

22/02/11 20:16:20 INFO ExecutorPodsAllocator: Going to request 3 executors
from Kubernetes for ResourceProfile Id: 0, target: 6 running: 3.

22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not
enabled, skipping shutdown script

22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not
enabled, skipping shutdown script

22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not
enabled, skipping shutdown script

22/02/11 20:16:49 INFO KubernetesClusterSchedulerBackend: SchedulerBackend
is ready for scheduling beginning after waiting
maxRegisteredResourcesWaitingTime: 30000000000(ns)

22/02/11 20:16:49 INFO SharedState: Setting hive.metastore.warehouse.dir
('null') to the value of spark.sql.warehouse.dir
('file:/opt/spark/work-dir/spark-warehouse').

22/02/11 20:16:49 INFO SharedState: Warehouse path is
'file:/opt/spark/work-dir/spark-warehouse'.
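
For completeness, the dynamic allocation messages above correspond to
settings of roughly this shape (a sketch only; the exact values used in
this test may differ):

    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
    --conf spark.dynamicAllocation.initialExecutors=6 \
    --conf spark.dynamicAllocation.minExecutors=1 \
    --conf spark.dynamicAllocation.maxExecutors=6

The WARN about dynamic allocation without a shuffle service is expected on
k8s, where shuffle tracking is used instead of an external shuffle service.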

OK, there is a lot to digest here, and I would appreciate feedback from
other members who have experimented with GKE Autopilot or AWS Fargate, or
who are familiar with k8s.

Thanks


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

Re: Deploying Spark on Google Kubernetes (GKE) autopilot, preliminary findings

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

I would still not build any custom solution; if on GCP, I would use
serverless Dataproc. I think it is always better to be hands-on with AWS
Glue before commenting on it.

Regards,
Gourav Sengupta

On Mon, Feb 14, 2022 at 11:18 AM Mich Talebzadeh <mi...@gmail.com>
wrote:


Re: Deploying Spark on Google Kubernetes (GKE) autopilot, preliminary findings

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

Sorry in case it appeared otherwise; Mich's takes are super interesting.
It is just that when applying solutions in commercial undertakings, things
are quite different from research/development scenarios.



Regards,
Gourav Sengupta





On Mon, Feb 14, 2022 at 5:02 PM ashok34668@yahoo.com.INVALID
<as...@yahoo.com.invalid> wrote:


Re: Deploying Spark on Google Kubernetes (GKE) autopilot, preliminary findings

Posted by "ashok34668@yahoo.com.INVALID" <as...@yahoo.com.INVALID>.
 Thanks Mich. Very insightful.

AK

On Monday, 14 February 2022, 11:18:19 GMT, Mich Talebzadeh <mi...@gmail.com> wrote:
 

Re: Deploying Spark on Google Kubernetes (GKE) autopilot, preliminary findings

Posted by Mich Talebzadeh <mi...@gmail.com>.
Good question. However, we ought to look at what options we have so to
speak.

Let us consider Spark on Dataproc, Spark on Kubernetes and Spark on Dataflow


Spark on Dataproc <https://cloud.google.com/dataproc> is proven and in use
at many organizations; I have deployed it extensively. It is infrastructure
as a service, provided with Spark, Hadoop and other artefacts. You have to
manage cluster creation, automate cluster creation and tear-down, submit
jobs, etc. However, it is another stack that needs to be managed. It now
has an autoscaling policy
<https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling>
(which enables cluster worker VM autoscaling) as well.
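
As a rough illustration of the operational side (cluster name, region,
policy and machine types below are placeholders), cluster creation and job
submission look something like:

gcloud dataproc clusters create my-cluster \
    --region europe-west2 \
    --autoscaling-policy my-autoscaling-policy \
    --num-workers 2 \
    --worker-machine-type n1-standard-4

gcloud dataproc jobs submit pyspark gs://my-bucket/my_job.py \
    --cluster my-cluster --region europe-west2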

Spark on GKE
<https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview>
is something newer. Worth adding that the Spark DEV team are working hard
to improve the performance of Spark on Kubernetes, for example, through Support
for Customized Kubernetes Scheduler
<https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg>.
As I explained in the first thread, Spark on Kubernetes relies on
containerisation. Containers make applications more portable. Moreover,
they simplify the packaging of dependencies, especially with PySpark, and
enable repeatable and reliable build workflows, which is cost effective.
They also reduce the overall devops load and allow one to iterate on the
code faster. From a purely cost perspective it would be cheaper with Docker *as
you can share resources* with your other services. You can create Spark
Docker images with different versions of Spark, Scala, Java, OS etc. That
Dockerfile is portable: it can be used on-prem or on AWS, GCP etc. in
container registries, and devops and data science people can share it as
well. Built once, used by many. Kubernetes with autopilot
<https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview#:~:text=Autopilot%20is%20a%20new%20mode,and%20yield%20higher%20workload%20availability.>
helps scale the nodes of the Kubernetes cluster depending on the load. *That
is what I am currently looking into*.
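
As a sketch of what building such an image can look like, Spark ships a
helper script for this (registry and tag below are placeholders):

./bin/docker-image-tool.sh -r <registry> -t 3.1.1-custom \
    -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
./bin/docker-image-tool.sh -r <registry> -t 3.1.1-custom push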

With regard to Dataflow <https://cloud.google.com/dataflow/docs>, which I
believe is similar to AWS Glue
<https://aws.amazon.com/glue/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc>,
it is a managed service for executing data processing patterns. Patterns or
pipelines are built with the Apache Beam SDK
<https://beam.apache.org/documentation/runners/spark/>, which is an open
source programming model that supports Java, Python and Go. It enables
batch and streaming pipelines. You create your pipelines with an Apache
Beam program and then run them on the Dataflow service. The Apache Spark
Runner
<https://beam.apache.org/documentation/runners/spark/#:~:text=The%20Apache%20Spark%20Runner%20can,Beam%20pipelines%20using%20Apache%20Spark.&text=The%20Spark%20Runner%20executes%20Beam,same%20security%20features%20Spark%20provides.>
can be used to execute Beam pipelines using Spark. When you run a job on
Dataflow, it spins up a cluster of virtual machines, distributes the tasks
in the job to the VMs, and dynamically scales the cluster based on how the
job is performing. As I understand it, both iterative processing and
notebooks, plus machine learning with Spark ML, are not currently supported
by Dataflow.
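
For a flavour of what that looks like in practice, the standard Beam
wordcount example can be launched on Dataflow along these lines (assuming
apache-beam[gcp] is installed; project, region and bucket are placeholders):

python -m apache_beam.examples.wordcount \
    --runner DataflowRunner \
    --project <project-id> \
    --region europe-west2 \
    --temp_location gs://<bucket>/tmp/ \
    --output gs://<bucket>/results/output

The same pipeline can target the Spark Runner instead by changing the
--runner option.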

So we have three choices here. If you are migrating from an on-prem
Hadoop/Spark/YARN set-up, you may go for Dataproc, which will provide the
same look and feel. If you want to use microservices and containers in your
event-driven architecture, you can adopt Docker images that run on
Kubernetes clusters, including multi-cloud Kubernetes clusters. Dataflow is
probably best suited for green-field projects: less operational overhead
and a unified approach for batch and streaming pipelines.

*So as ever your mileage varies*. If you want to migrate from your existing
Hadoop/Spark cluster to GCP, or take advantage of your existing workforce,
choose Dataproc or GKE. In many cases, a big consideration is that one
already has a codebase written against a particular framework, and one just
wants to deploy it on GCP, so even if, say, the Beam programming
model/Dataflow is superior to Hadoop, someone with a lot of Hadoop code
might still choose Dataproc or GKE for the time being, rather than
rewriting their code on Beam to run on Dataflow.

 HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 14 Feb 2022 at 05:46, Gourav Sengupta <go...@gmail.com>
wrote:

> Hi,
> maybe this is useful in case someone is testing SPARK in containers for
> developing SPARK.
>
> *From a production scale work point of view:*
> But if I am in AWS, I will just use GLUE if I want to use containers for
> SPARK, without massively increasing my costs for operations unnecessarily.
>
> Also, in case I am not wrong, GCP already has SPARK running in serverless
> mode.  Personally I would never create the overhead of additional costs and
> issues to my clients of deploying SPARK when those solutions are already
> available from cloud vendors. In fact, that is one of the precise reasons why
> people use cloud - to reduce operational costs.
>
> Sorry, just trying to understand what is the scope of this work.
>
>
> Regards,
> Gourav Sengupta
>

Re: Deploying Spark on Google Kubernetes (GKE) autopilot, preliminary findings

Posted by Mich Talebzadeh <mi...@gmail.com>.
Good question. However, we ought to look at what options we have, so to speak.

Let us consider Spark on Dataproc, Spark on Kubernetes (GKE) and Dataflow.


Spark on Dataproc <https://cloud.google.com/dataproc> is proven and in use at
many organizations; I have deployed it extensively. It is an
infrastructure-as-a-service offering that bundles Spark, Hadoop and other
artefacts. You still have to manage cluster creation, automate cluster
creation and tear-down, submit jobs and so on, so it is another stack that
needs to be managed. It now supports autoscaling
<https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling>
policies as well, which enable cluster worker VM autoscaling.
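
As an illustrative sketch only (the policy, cluster and region names below
are hypothetical, not taken from this thread), an autoscaling policy is
defined in a YAML file, imported, and then attached at cluster creation time:

# Import a policy defined in a local YAML file (worker min/max counts,
# scale-up/down factors etc.), then reference it when creating the cluster
gcloud dataproc autoscaling-policies import spark-autoscaling-policy \
    --source=autoscaling-policy.yaml \
    --region=europe-west2

gcloud dataproc clusters create my-dataproc-cluster \
    --region=europe-west2 \
    --autoscaling-policy=spark-autoscaling-policy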

Spark on GKE
<https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview>
is something newer. It is worth adding that the Spark dev team are working
hard to improve the performance of Spark on Kubernetes, for example through
Support for Customized Kubernetes Scheduler
<https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg>.
As I explained in the first thread, Spark on Kubernetes relies on
containerisation. Containers make applications more portable. Moreover, they
simplify the packaging of dependencies, especially with PySpark, and enable
repeatable and reliable build workflows, which is cost effective. They also
reduce the overall DevOps load and allow one to iterate on the code faster.
From a purely cost perspective it would be cheaper with Docker, *as you can
share resources* with your other services. You can create Spark Docker
images with different versions of Spark, Scala, Java, the OS and so on. The
image is portable: it can be used on-prem, on AWS, on GCP etc. via container
registries, and DevOps and data science people can share it as well. Build
once, use many times. Kubernetes with Autopilot
<https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview#:~:text=Autopilot%20is%20a%20new%20mode,and%20yield%20higher%20workload%20availability.>
helps scale the nodes of the Kubernetes cluster depending on the load. *That
is what I am currently looking into*.
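
To make that concrete, here is a rough sketch (the registry, project, tag,
API server address and script name are all hypothetical) of building a
PySpark image with the docker-image-tool.sh script that ships with the Spark
distribution and then submitting against a GKE cluster:

# Build and push a PySpark image from the root of the Spark distribution
./bin/docker-image-tool.sh -r eu.gcr.io/my-project -t 3.1.2 \
    -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
./bin/docker-image-tool.sh -r eu.gcr.io/my-project -t 3.1.2 push

# Submit to the GKE cluster's API server; as noted earlier, executor count,
# memory and cores still have to be specified explicitly
spark-submit --master k8s://https://<gke-api-server-ip>:443 \
    --deploy-mode cluster \
    --name randomdatabigquery \
    --conf spark.kubernetes.namespace=spark \
    --conf spark.kubernetes.container.image=eu.gcr.io/my-project/spark-py:3.1.2 \
    --conf spark.executor.instances=6 \
    --conf spark.driver.memory=4g \
    --conf spark.executor.memory=8g \
    --conf spark.executor.cores=3 \
    RandomDataBigQuery.py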

With regard to Dataflow <https://cloud.google.com/dataflow/docs>, which I
believe is similar to AWS Glue
<https://aws.amazon.com/glue/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc>,
it is a managed service for executing data processing pipelines. Pipelines
are built with the Apache Beam SDK
<https://beam.apache.org/documentation/runners/spark/>, an open source
programming model that supports Java, Python and Go, and it handles both
batch and streaming. You create your pipelines with an Apache Beam program
and then run them on the Dataflow service. The Apache Spark Runner
<https://beam.apache.org/documentation/runners/spark/#:~:text=The%20Apache%20Spark%20Runner%20can,Beam%20pipelines%20using%20Apache%20Spark.&text=The%20Spark%20Runner%20executes%20Beam,same%20security%20features%20Spark%20provides.>
can be used to execute Beam pipelines on Spark instead. When you run a job
on Dataflow, it spins up a cluster of virtual machines, distributes the
tasks in the job to the VMs, and dynamically scales the cluster based on how
the job is performing. As I understand it, iterative processing, notebooks
and machine learning with Spark ML are not currently supported by Dataflow.
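
Purely for illustration (the project, region and bucket names are
hypothetical), the stock Beam wordcount example can be submitted to Dataflow
from the command line like this:

# Requires the Beam SDK with the GCP extras: pip install "apache-beam[gcp]"
python -m apache_beam.examples.wordcount \
    --runner DataflowRunner \
    --project my-gcp-project \
    --region europe-west2 \
    --temp_location gs://my-bucket/tmp \
    --input gs://dataflow-samples/shakespeare/kinglear.txt \
    --output gs://my-bucket/results/output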

So we have three choices here. If you are migrating from an on-prem
Hadoop/Spark/YARN set-up, you may go for Dataproc, which will provide the
same look and feel. If you want to use microservices and containers in your
event-driven architecture, you can adopt Docker images that run on
Kubernetes clusters, including multi-cloud Kubernetes clusters. Dataflow is
probably best suited for green-field projects: less operational overhead and
a unified approach for batch and streaming pipelines.

*So as ever your mileage varies*. If you want to migrate your existing
Hadoop/Spark cluster to GCP, or take advantage of your existing workforce,
choose Dataproc or GKE. In many cases, a big consideration is that one
already has a codebase written against a particular framework, and one just
wants to deploy it on GCP, so even if, say, the Beam programming
model/Dataflow is superior to Hadoop, someone with a lot of Hadoop code
might still choose Dataproc or GKE for the time being, rather than rewriting
their code in Beam to run on Dataflow.

 HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.





Re: Deploying Spark on Google Kubernetes (GKE) autopilot, preliminary findings

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,
maybe this is useful in case someone is testing Spark in containers for
developing Spark.

*From a production-scale point of view:*
But if I am in AWS, I will just use Glue if I want to use containers for
Spark, without unnecessarily increasing my operational costs.

Also, in case I am not wrong, GCP already has Spark running in serverless
mode. Personally, I would never impose on my clients the additional costs
and issues of deploying Spark myself when those solutions are already
available from the cloud vendors. In fact, that is one of the precise
reasons why people use the cloud - to reduce operational costs.
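
If I am not mistaken, that is Dataproc Serverless, where a PySpark batch is
submitted with a single gcloud call along these lines (the bucket, script
and region names here are hypothetical):

# Dataproc Serverless for Spark: no cluster to create or manage
gcloud dataproc batches submit pyspark gs://my-bucket/my_spark_job.py \
    --region=europe-west2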

Sorry, just trying to understand what is the scope of this work.


Regards,
Gourav Sengupta
