Posted to user@spark.apache.org by John Paul Jayme <jo...@tdcx.com.INVALID> on 2023/06/20 05:55:56 UTC

How to read excel file in PySpark

Good day,

I have a task to read Excel files in Databricks but I cannot seem to proceed. I am referencing the API documentation - read_excel<https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_excel.html> - but I get an error: 'SparkSession' object has no attribute 'read_excel'. Can you advise?

JOHN PAUL JAYME
Data Engineer



Re: How to read excel file in PySpark

Posted by Mich Talebzadeh <mi...@gmail.com>.
OK thanks for the info.

Regards

Mich Talebzadeh





Re: How to read excel file in PySpark

Posted by Bjørn Jørgensen <bj...@gmail.com>.
Yes, p_df = DF.toPandas() gives you THE pandas, the one you know.

Change p_df = DF.toPandas() to
p_df = DF.pandas_on_spark()
or
p_df = DF.to_pandas_on_spark()
or
p_df = DF.pandas_api()
or
p_df = DF.to_koalas()

https://spark.apache.org/docs/latest/api/python/migration_guide/koalas_to_pyspark.html

Then you will have your PySpark DF as a pandas API on Spark DF.
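
A minimal sketch of the difference, assuming Spark 3.2+ where pandas_api()
is available (the example data is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.range(1000)        # a hypothetical Spark DataFrame

p_df = sdf.toPandas()          # plain pandas: every row is collected into driver memory
ps_df = sdf.pandas_api()       # pandas API on Spark: pandas syntax, stays distributed

print(type(p_df))              # <class 'pandas.core.frame.DataFrame'>
print(type(ps_df))             # <class 'pyspark.pandas.frame.DataFrame'>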


Re: How to read excel file in PySpark

Posted by Mich Talebzadeh <mi...@gmail.com>.
OK thanks

So the issue seems to be creating a pandas DF from a Spark DF (I do it for
plotting) with something like:

import matplotlib.pyplot as plt
p_df = DF.toPandas()
p_df.plot(...)

I guess that stays in the driver.
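
A runnable version of that pattern, as a sketch (the data and column names
are made up):

import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
DF = spark.createDataFrame([(1, 2.0), (2, 3.5), (3, 1.8)], ["x", "y"])

p_df = DF.toPandas()           # collects every row into driver memory
p_df.plot(x="x", y="y")        # pandas plotting, runs entirely on the driver
plt.show()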


Mich Talebzadeh





Re: How to read excel file in PySpark

Posted by Sean Owen <sr...@gmail.com>.
No, a pandas on Spark DF is distributed.
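
One way to see that, as a sketch (partition counts depend on your cluster
and defaults):

from pyspark import pandas as ps

psdf = ps.range(10)                     # pandas-on-Spark DataFrame
sdf = psdf.to_spark()                   # its underlying Spark DataFrame
print(sdf.rdd.getNumPartitions())       # typically > 1, i.e. distributed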


Re: How to read excel file in PySpark

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks, but if you create a Spark DF from a pandas DF, that Spark DF is not
distributed and remains on the driver. I recall we had this conversation a
while back. I don't think anything has changed.

Happy to be corrected
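
One way to check that claim, as a sketch (small toy data, so the partition
count will be modest either way):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame({"a": range(100)})
sdf = spark.createDataFrame(pdf)        # Spark DF built from a pandas DF
print(sdf.rdd.getNumPartitions())       # partitions in the resulting distributed plan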

Mich Talebzadeh





Re: How to read excel file in PySpark

Posted by Bjørn Jørgensen <bj...@gmail.com>.
Pandas API on Spark is an API so that users can use Spark as they use
pandas. This was previously known as Koalas.

Is this limitation still valid for Pandas?
For pandas, yes. But what I showed was the pandas API on Spark, so it's
Spark.

Additionally, when we convert from a pandas DF to a Spark DF, what process
is involved under the bonnet?
I guess PyArrow, and dropping the index column.

Have a look at
https://github.com/apache/spark/tree/master/python/pyspark/pandas
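
A sketch of that conversion path (the Arrow flag name is the Spark 3.x one;
the rest is illustrative):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")  # Arrow-based transfer

pdf = pd.DataFrame({"a": [1, 2, 3]})    # the pandas index stays implicit
sdf = spark.createDataFrame(pdf)        # the index column is not carried over
sdf.show()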


Re: How to read excel file in PySpark

Posted by Mich Talebzadeh <mi...@gmail.com>.
Whenever someone mentions Pandas I automatically think of it as an Excel
sheet for Python.

OK, my point below needs some qualification.

Why Spark here? Generally, parallel architecture comes into play when the
data size is significantly large and cannot be handled on a single machine;
hence the use of Spark becomes meaningful. In cases where the (generated)
data size is going to be very large (which is often the norm rather than
the exception these days), the data cannot be processed and stored in
pandas data frames, as these data frames store data in RAM. The whole
dataset cannot then be collected from storage like HDFS or cloud storage,
because it would take significant time and space and probably won't fit in
a single machine's RAM (in this case, the driver memory).

Is this limitation still valid for Pandas? Additionally, when we convert
from a pandas DF to a Spark DF, what process is involved under the bonnet?

Thanks

Mich Talebzadeh





Re: How to read excel file in PySpark

Posted by Bjørn Jørgensen <bj...@gmail.com>.
This is the pandas API on Spark:

from pyspark import pandas as ps
df = ps.read_excel("testexcel.xlsx")
[image: screenshot of the resulting DataFrame]
this will convert it to PySpark:
[image: screenshot of the conversion]
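
Presumably the conversion in the second screenshot is to_spark(); a
self-contained sketch (the file name is from the example above, and
ps.read_excel needs an Excel engine such as openpyxl installed):

from pyspark import pandas as ps

psdf = ps.read_excel("testexcel.xlsx")  # pandas-on-Spark DataFrame
sdf = psdf.to_spark()                   # plain PySpark DataFrame
sdf.printSchema()
sdf.show(5)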


Re: How to read excel file in PySpark

Posted by Sean Owen <sr...@gmail.com>.
It is indeed not part of SparkSession. See the link you cite. It is part of
the PySpark pandas API.
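
In other words, a sketch (the failing line mirrors the error in the
original post):

from pyspark import pandas as ps
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# spark.read_excel("file.xlsx")  # AttributeError: 'SparkSession' object has no attribute 'read_excel'
df = ps.read_excel("file.xlsx")  # read_excel lives in pyspark.pandas, not on SparkSession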
