Posted to user@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2023/04/12 16:00:49 UTC

Accessing python runner file in AWS EKS kubernetes cluster as in local://

Hi,

In my spark-submit to eks cluster, I use the standard code to submit to the
cluster as below:

spark-submit --verbose \
   --master k8s://$KUBERNETES_MASTER_IP:443 \
   --deploy-mode cluster \
   --name sparkOnEks \
   --py-files local://$CODE_DIRECTORY/spark_on_eks.zip \
  local:///home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py

In Google Kubernetes Engine (GKE) I simply load them from a gs:// storage
bucket, and it works fine.

I am getting the following error in the driver pod:

 + CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
    + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=192.168.39.251 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner local:///home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py
    23/04/11 23:07:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    /usr/bin/python3: can't open file '/home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py': [Errno 2] No such file or directory
    log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
It says it can't open file '/home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py': [Errno 2] No such file or directory, but the file is there!

ls -l /home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py
    -rw-rw-rw- 1 hduser hadoop 5060 Mar 18 14:16 /home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py
So I am not sure what is going on. My suspicion is that it is looking for
this file inside the Docker container itself, not on the submission host.


Is that a correct assumption?
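One way to check that assumption is to look inside the running driver pod itself. A sketch (the pod name and the spark namespace are placeholders; substitute your own):

```shell
# local:// URLs are resolved inside the container, not on the submission
# host, so list the path from within the driver pod:
kubectl exec -n spark <DRIVER-POD> -- \
  ls -l /home/hduser/dba/bin/python/spark_on_eks/src/
# If the directory is absent there, the file was never baked into the
# image; either COPY it in the Dockerfile or use a remote URL (s3a://, gs://).
```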


Thanks


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

Posted by Mich Talebzadeh <mi...@gmail.com>.
OK, I managed to load the zipped Python package and the runner .py file onto
S3 for AWS EKS to work.

It is a bit of a nightmare compared to the same on the Google SDK, which is
simpler.

Anyhow, you will require additional jar files to be added to
$SPARK_HOME/jars. These two files will be picked up when you build the
docker image and will be available to the pods.


   1. hadoop-aws-3.2.0.jar
   2. aws-java-sdk-bundle-1.11.375.jar

Then build your docker image and push it to the ECR registry on AWS.
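As a sketch of that step (the jar versions are the ones listed above; the account id, region and repository name are placeholders, not values from this thread):

```shell
# Fetch the two AWS jars into the jars/ directory of the Spark image build
# context (standard Maven Central coordinates for these artifacts):
curl -fL -o jars/hadoop-aws-3.2.0.jar \
  https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar
curl -fL -o jars/aws-java-sdk-bundle-1.11.375.jar \
  https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar

# Build, authenticate to ECR, and push (placeholders in angle brackets):
docker build -t <ACCOUNT>.dkr.ecr.<REGION>.amazonaws.com/spark-py:eks .
aws ecr get-login-password --region <REGION> | \
  docker login --username AWS --password-stdin <ACCOUNT>.dkr.ecr.<REGION>.amazonaws.com
docker push <ACCOUNT>.dkr.ecr.<REGION>.amazonaws.com/spark-py:eks
```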

This will allow you to refer to both the zipped package and your source
file as:

     spark-submit --verbose \
           --master k8s://$KUBERNETES_MASTER_IP:443 \
           --deploy-mode cluster \
           --py-files s3a://spark-on-k8s/codes/spark_on_eks.zip \
           s3a://spark-on-k8s/codes/<pyfile>

Note that you refer to the bucket as *s3a* rather than *s3*.
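For the s3a:// scheme to resolve, the S3A connector also needs credentials configured. A sketch of typical spark-submit settings (these are standard hadoop-aws option names, but the right credentials provider depends on how your EKS cluster authenticates, e.g. IAM roles for service accounts):

```shell
# Fragment only -- add to the spark-submit invocation above:
spark-submit ... \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider \
  ...
```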

Output from driver log

kubectl logs <DRIVER-POD> -n spark

Started at
14/04/2023 15:08:11.11
starting at ID =  1 ,ending on =  100
root
 |-- ID: integer (nullable = false)
 |-- CLUSTERED: float (nullable = true)
 |-- SCATTERED: float (nullable = true)
 |-- RANDOMISED: float (nullable = true)
 |-- RANDOM_STRING: string (nullable = true)
 |-- SMALL_VC: string (nullable = true)
 |-- PADDING: string (nullable = true)
 |-- op_type: integer (nullable = false)
 |-- op_time: timestamp (nullable = false)

+---+---------+---------+----------+--------------------------------------------------+----------+----------+-------+-----------------------+
|ID |CLUSTERED|SCATTERED|RANDOMISED|RANDOM_STRING                                     |SMALL_VC  |PADDING   |op_type|op_time                |
+---+---------+---------+----------+--------------------------------------------------+----------+----------+-------+-----------------------+
|1  |0.0      |0.0      |17.0      |KZWeqhFWCEPyYngFbyBMWXaSCrUZoLgubbbPIayRnBUbHoWCFJ|         1|xxxxxxxxxx|1      |2023-04-14 15:08:15.534|
|2  |0.01     |1.0      |7.0       |ffxkVZQtqMnMcLRkBOzZUGxICGrcbxDuyBHkJlpobluliGGxGR|         2|xxxxxxxxxx|1      |2023-04-14 15:08:15.534|
|3  |0.02     |2.0      |30.0      |LIixMEOLeMaEqJomTEIJEzOjoOjHyVaQXekWLctXbrEMUyTYBz|         3|xxxxxxxxxx|1      |2023-04-14 15:08:15.534|
|4  |0.03     |3.0      |30.0      |tgUzEjfebzJsZWdoHIxrXlgqnbPZqZrmktsOUxfMvQyGplpErf|         4|xxxxxxxxxx|1      |2023-04-14 15:08:15.534|
|5  |0.04     |4.0      |79.0      |qVwYSVPHbDXpPdkhxEpyIgKpaUnArlXykWZeiNNCiiaanXnkks|         5|xxxxxxxxxx|1      |2023-04-14 15:08:15.534|
|6  |0.05     |5.0      |73.0      |fFWqcajQLEWVxuXbrFZmUAIIRgmKJSZUqQZNRfBvfxZAZqCSgW|         6|xxxxxxxxxx|1      |2023-04-14 15:08:15.534|
|7  |0.06     |6.0      |41.0      |jzPdeIgxLdGncfBAepfJBdKhoOOLdKLzdocJisAjIhKtJRlgLK|         7|xxxxxxxxxx|1      |2023-04-14 15:08:15.534|
|8  |0.07     |7.0      |29.0      |xyimTcfipZGnzPbDFDyFKmzfFoWbSrHAEyUhQqgeyNygQdvpSf|         8|xxxxxxxxxx|1      |2023-04-14 15:08:15.534|
|9  |0.08     |8.0      |59.0      |NxrilRavGDMfvJNScUykTCUBkkpdhiGLeXSyYVgsnRoUYAfXrn|         9|xxxxxxxxxx|1      |2023-04-14 15:08:15.534|
|10 |0.09     |9.0      |73.0      |cBEKanDFrPZkcHFuepVxcAiMwyAsRqDlRtQxiDXpCNycLapimt|        10|xxxxxxxxxx|1      |2023-04-14 15:08:15.534|
+---+---------+---------+----------+--------------------------------------------------+----------+----------+-------+-----------------------+
only showing top 10 rows

Finished at
14/04/2023 15:08:16.16
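The job itself is not shown in the thread; purely to illustrate the shape of the output above, here is a minimal pure-Python sketch that generates rows of the same form (column names are taken from the printed schema; the value rules are inferred from the sample rows and are assumptions):

```python
import random
import string

def random_row(row_id: int, op_time: str) -> dict:
    """One row shaped like the driver-log output above (illustrative only)."""
    return {
        "ID": row_id,
        "CLUSTERED": round((row_id - 1) / 100.0, 2),  # 0.0, 0.01, 0.02, ...
        "SCATTERED": float(row_id - 1),
        "RANDOMISED": float(random.randint(0, 100)),
        "RANDOM_STRING": "".join(
            random.choice(string.ascii_letters) for _ in range(50)
        ),
        "SMALL_VC": str(row_id).rjust(10),  # right-justified, width 10
        "PADDING": "x" * 10,
        "op_type": 1,
        "op_time": op_time,
    }

rows = [random_row(i, "2023-04-14 15:08:15.534") for i in range(1, 11)]
```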

I will provide the details under the section *spark-on-aws* in
http://sparkcommunitytalk.slack.com/

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.





Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks! I will have a look.

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




>

Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

Posted by Bjørn Jørgensen <bj...@gmail.com>.
Yes, it looks inside the docker container's folders. It will work if you are
using s3 or gs.
