Posted to user@spark.apache.org by JHI Star <jh...@gmail.com> on 2021/11/23 14:08:00 UTC

Choosing architecture for on-premise Spark & HDFS on Kubernetes cluster

We are going to deploy 20 physical Linux servers for use as an on-premise
Spark & HDFS on Kubernetes cluster. My question is: within this
architecture, is it best to have the pods run directly on bare metal,
under VMs or system containers like LXC, or under an on-premise instance
of something like OpenStack - or something else altogether?

I am looking to garner any experience around this question relating
directly to the specific use case of Spark & HDFS on Kubernetes - I know
there are also general points to consider regardless of the use case.

Re: Choosing architecture for on-premise Spark & HDFS on Kubernetes cluster

Posted by JHI Star <jh...@gmail.com>.
Thanks, I'll have a closer look at GKE and compare it with what some other
sites running something similar to us have used (OpenStack).

Well, no, I don't envisage any public cloud integration. There is no plan
to use Hive, just PySpark using HDFS!
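
For what it's worth, the jobs we have in mind are roughly of this shape - a
minimal PySpark sketch reading and writing HDFS directly, no metastore
involved (the namenode address, paths and column name below are only
placeholders):

from pyspark.sql import SparkSession

# Plain PySpark against HDFS, no Hive metastore. All names here are
# illustrative only.
spark = SparkSession.builder.appName("hdfs-only-example").getOrCreate()

events = spark.read.parquet("hdfs://namenode:8020/data/events/2021/11")
daily = events.groupBy("event_date").count()
daily.write.mode("overwrite").parquet("hdfs://namenode:8020/out/daily_counts")

spark.stop()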

On Wed, Nov 24, 2021 at 10:31 AM Mich Talebzadeh <mi...@gmail.com>
wrote:

> Just to clarify it should say  ....The current Spark Kubernetes model ...
>
>
> You will also need to build or get the Spark docker image that you are
> going to use in k8s clusters based on spark version, java version, scala
> version, OS and so forth. Are you going to use Hive as your main storage?
>
>
> HTH
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 23 Nov 2021 at 19:39, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>> OK  to your point below
>>
>> "... We are going to deploy 20 physical Linux servers for use as an
>> on-premise Spark & HDFS on Kubernetes cluster..
>>
>>  Kubernetes is really a cloud-native technology. However, the
>> cloud-native concept does not exclude the use of on-premises infrastructure
>> in cases where it makes sense. So the question is are you going to use a
>> mesh structure to integrate these microservices together, including
>> on-premise and in cloud?
>> Now you have 20 tin boxes on-prem that you want to deploy for
>> building your Spark & HDFS stack on top of them. You will gain benefit from
>> Kubernetes and your microservices by simplifying the deployment by
>> decoupling the dependencies and abstracting your infra-structure away with
>> the ability to port these infrastructures. As you have your hardware
>> (your Linux servers),running k8s on bare metal will give you native
>> hardware performance. However, with 20 linux servers, you may limit your
>> scalability (your number of k8s nodes). If you go this way, you will need
>> to invest in a bare metal automation platform such as platform9
>> <https://platform9.com/bare-metal/> . The likelihood is that  you may
>> decide to move to the public cloud at some point or integrate with the
>> public cloud. My advice would be to look at something like GKE on-prem
>> <https://cloud.google.com/anthos/clusters/docs/on-prem/1.3/overview>
>>
>>
>> Back to Spark, The current Kubernetes model works on the basis of the "one-container-per-Pod"
>> model  <https://kubernetes.io/docs/concepts/workloads/pods/> meaning
>> that for each node of the cluster you will have one node running the driver
>> and each remaining node running one executor each. My question would be
>> will you be integrating with public cloud (AWS, GCP etc) at some point? In
>> that case you should look at mesh technologies like Istio
>> <https://cloud.google.com/learn/what-is-istio>
>>
>>
>> HTH
>>
>>
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 23 Nov 2021 at 14:09, JHI Star <jh...@gmail.com> wrote:
>>
>>> We are going to deploy 20 physical Linux servers for use as an
>>> on-premise Spark & HDFS on Kubernetes cluster. My question is: within this
>>> architecture, is it best to have the pods run directly on bare metal or
>>> under VMs or system containers like LXC and/or under an on-premise instance
>>> of something like OpenStack - or something else altogether ?
>>>
>>> I am looking to garner any experience around this question relating
>>> directly to the specific use case of Spark & HDFS on Kuberenetes - I know
>>> there are also general points to consider regardless of the use case.
>>>
>>

Re: Choosing architecture for on-premise Spark & HDFS on Kubernetes cluster

Posted by Mich Talebzadeh <mi...@gmail.com>.
Just to clarify, it should say "... The current Spark Kubernetes model ..."


You will also need to build or get the Spark Docker image that you are
going to use in the k8s cluster, based on the Spark version, Java version,
Scala version, OS and so forth. Are you going to use Hive as your main
storage?
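
The Spark distribution ships a bin/docker-image-tool.sh script that can
help build and push the JVM and PySpark images. As a rough sketch (the
image name, namespace and service account below are only examples), once an
image is in your registry you would point Spark at it along these lines:

from pyspark.sql import SparkSession

# Sketch only: assumes the driver runs inside the cluster (client mode) and
# that my-registry/spark-py:3.2.0 was built and pushed beforehand. In client
# mode the executors must also be able to reach the driver, e.g. via
# spark.driver.host pointing at a headless service.
spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.default.svc:443")
    .appName("pyspark-on-k8s")
    .config("spark.kubernetes.container.image", "my-registry/spark-py:3.2.0")
    .config("spark.kubernetes.namespace", "spark")
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)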


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: Choosing architecture for on-premise Spark & HDFS on Kubernetes cluster

Posted by Mich Talebzadeh <mi...@gmail.com>.
OK, to your point below:

"... We are going to deploy 20 physical Linux servers for use as an
on-premise Spark & HDFS on Kubernetes cluster..

Kubernetes is really a cloud-native technology. However, the cloud-native
concept does not exclude the use of on-premises infrastructure in cases
where it makes sense. So the question is: are you going to use a mesh
structure to integrate these microservices, both on-premise and in the
cloud?
Now you have 20 tin boxes on-prem that you want to use for building your
Spark & HDFS stack on top of them. You will benefit from Kubernetes and
microservices by simplifying deployment, decoupling dependencies and
abstracting your infrastructure away, with the ability to port that
infrastructure elsewhere. As you already have the hardware (your Linux
servers), running k8s on bare metal will give you native hardware
performance. However, with 20 Linux servers you may limit your scalability
(your number of k8s nodes). If you go this way, you will need to invest in
a bare metal automation platform such as platform9
<https://platform9.com/bare-metal/>. The likelihood is that you may decide
to move to the public cloud at some point or integrate with the public
cloud. My advice would be to look at something like GKE on-prem
<https://cloud.google.com/anthos/clusters/docs/on-prem/1.3/overview>


Back to Spark: the current Kubernetes model works on the basis of the
"one-container-per-Pod" model
<https://kubernetes.io/docs/concepts/workloads/pods/>, meaning that one
node of the cluster runs the driver and each remaining node runs one
executor. My question would be: will you be integrating with a public
cloud (AWS, GCP, etc.) at some point? In that case you should look at mesh
technologies like Istio <https://cloud.google.com/learn/what-is-istio>
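
To make the pod model concrete, executor sizing maps directly onto
executor pods: the driver gets its own pod and every executor requested
below ends up as a separate executor pod (the numbers are placeholders,
not recommendations):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Illustrative sizing only: each executor becomes its own executor pod,
# alongside a dedicated driver pod.
conf = (
    SparkConf()
    .set("spark.executor.instances", "19")  # e.g. one executor pod per remaining node
    .set("spark.executor.cores", "4")       # cores used by each executor
    .set("spark.executor.memory", "8g")     # heap per executor pod
    .set("spark.kubernetes.executor.request.cores", "4")  # CPU each executor pod requests from k8s
)

spark = SparkSession.builder.config(conf=conf).appName("sizing-sketch").getOrCreate()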


HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



