Posted to user@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2022/03/15 16:48:40 UTC
pivoting panda dataframe
hi,
Is it possible to pivot a pandas dataframe by making a row the column headings?
thanks
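If the goal is to promote one row's values to be the column headings (plain pandas, no Spark), a minimal sketch; the frame and the chosen row here are illustrative:

```python
import pandas as pd

# Illustrative frame: row 0 carries the intended column names.
df = pd.DataFrame([["id", "name", "score"],
                   [1, "a", 10],
                   [2, "b", 20]])

df.columns = df.iloc[0]                  # promote row 0 to the column headings
df = df.iloc[1:].reset_index(drop=True)  # drop the promoted row
print(df.columns.tolist())               # ['id', 'name', 'score']
```

If instead the goal is to swap rows and columns wholesale, df.T (transpose) is the direct answer, which is where the replies below point.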
view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
https://en.everybodywiki.com/Mich_Talebzadeh
*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
Re: pivoting panda dataframe
Posted by Bjørn Jørgensen <bj...@gmail.com>.
Column bind in R is concat in pandas:
https://www.datasciencemadesimple.com/append-concatenate-columns-python-pandas-column-bind/
Please start a new thread for each question.
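The linked page boils down to pd.concat along axis=1; a minimal sketch with made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3]})
df2 = pd.DataFrame({"B": [4, 5, 6]})

# R's cbind(df1, df2): column-wise concatenation, aligned on the index.
wide = pd.concat([df1, df2], axis=1)
print(wide.columns.tolist())  # ['A', 'B']
```

Note that pandas aligns on the index, not on position; call reset_index(drop=True) on both frames first if their indexes differ.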
On Tue, 15 Mar 2022 at 22:59, Andrew Davidson <ae...@ucsc.edu> wrote:
> Many many thanks!
>
>
>
> I have been looking for a pyspark data frame column_bind() solution for
> several months. Hopefully pyspark.pandas works. The only other solution I
> was aware of was to use spark.dataframe.join(). This does not scale for
> obvious reasons.
>
>
>
> Andy
>
>
>
>
>
> *From: *Bjørn Jørgensen <bj...@gmail.com>
> *Date: *Tuesday, March 15, 2022 at 2:19 PM
> *To: *Andrew Davidson <ae...@ucsc.edu>
> *Cc: *Mich Talebzadeh <mi...@gmail.com>, "user @spark" <
> user@spark.apache.org>
> *Subject: *Re: pivoting panda dataframe
>
>
>
> Hi Andrew. Mitch asked, and I answered transpose()
> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transpose.html
> .
>
>
>
> And now you are asking in the same thread about pandas API on spark and
> the transform().
>
>
>
> Apache Spark has a pandas API on Spark.
>
>
>
> This means that Spark has API calls for pandas functions; when you use the
> pandas API on Spark, it is Spark you are using.
>
>
>
> Add this line to your imports:
>
>
>
> from pyspark import pandas as ps
>
>
>
>
>
> Now you can pass your dataframe back and forth to the pandas API on Spark
> by using
>
>
>
> pf01 = f01.to_pandas_on_spark()
>
>
> f01 = pf01.to_spark()
>
>
>
>
>
> Note that I have changed pd to ps here.
>
>
>
> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)})
>
>
>
> df.transform(lambda x: x + 1)
>
>
>
> You will now see that all numbers are +1
>
>
>
> You can find more information about the pandas API on Spark transform at
> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transform.html?highlight=pyspark%20pandas%20dataframe%20transform#pyspark.pandas.DataFrame.transform
>
> or in your notebook:
>
> df.transform?
>
>
>
> Signature:
>
> df.transform(
>
> func: Callable[..., ForwardRef('Series')],
>
> axis: Union[int, str] = 0,
>
> *args: Any,
>
> **kwargs: Any,
>
> ) -> 'DataFrame'
>
> Docstring:
>
> Call ``func`` on self producing a Series with transformed values
>
> and that has the same length as its input.
>
>
>
> See also `Transform and apply a function
>
> <https://koalas.readthedocs.io/en/latest/user_guide/transform_apply.html>`_.
>
>
>
> .. note:: this API executes the function once to infer the type which is
>
> potentially expensive, for instance, when the dataset is created after
>
> aggregations or sorting.
>
>
>
> To avoid this, specify return type in ``func``, for instance, as below:
>
>
>
> >>> def square(x) -> ps.Series[np.int32]:
>
> ... return x ** 2
>
>
>
> pandas-on-Spark uses return type hint and does not try to infer the type.
>
>
>
> .. note:: the series within ``func`` is actually multiple pandas series as the
>
> segments of the whole pandas-on-Spark series; therefore, the length of each series
>
> is not guaranteed. As an example, an aggregation against each series
>
> does work as a global aggregation but an aggregation of each segment. See
>
> below:
>
>
>
> >>> def func(x) -> ps.Series[np.int32]:
>
> ... return x + sum(x)
>
>
>
> Parameters
>
> ----------
>
> func : function
>
> Function to use for transforming the data. It must work when pandas Series
>
> is passed.
>
> axis : int, default 0 or 'index'
>
> Can only be set to 0 at the moment.
>
> *args
>
> Positional arguments to pass to func.
>
> **kwargs
>
> Keyword arguments to pass to func.
>
>
>
> Returns
>
> -------
>
> DataFrame
>
> A DataFrame that must have the same length as self.
>
>
>
> Raises
>
> ------
>
> Exception : If the returned DataFrame has a different length than self.
>
>
>
> See Also
>
> --------
>
> DataFrame.aggregate : Only perform aggregating type operations.
>
> DataFrame.apply : Invoke function on DataFrame.
>
> Series.transform : The equivalent function for Series.
>
>
>
> Examples
>
> --------
>
> >>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)}, columns=['A', 'B'])
>
> >>> df
>
> A B
>
> 0 0 1
>
> 1 1 2
>
> 2 2 3
>
>
>
> >>> def square(x) -> ps.Series[np.int32]:
>
> ... return x ** 2
>
> >>> df.transform(square)
>
> A B
>
> 0 0 1
>
> 1 1 4
>
> 2 4 9
>
>
>
> You can omit the type hint and let pandas-on-Spark infer its type.
>
>
>
> >>> df.transform(lambda x: x ** 2)
>
> A B
>
> 0 0 1
>
> 1 1 4
>
> 2 4 9
>
>
>
> For multi-index columns:
>
>
>
> >>> df.columns = [('X', 'A'), ('X', 'B')]
>
> >>> df.transform(square) # doctest: +NORMALIZE_WHITESPACE
>
> X
>
> A B
>
> 0 0 1
>
> 1 1 4
>
> 2 4 9
>
>
>
> >>> (df * -1).transform(abs) # doctest: +NORMALIZE_WHITESPACE
>
> X
>
> A B
>
> 0 0 1
>
> 1 1 2
>
> 2 2 3
>
>
>
> You can also specify extra arguments.
>
>
>
> >>> def calculation(x, y, z) -> ps.Series[int]:
>
> ... return x ** y + z
>
> >>> df.transform(calculation, y=10, z=20) # doctest: +NORMALIZE_WHITESPACE
>
> X
>
> A B
>
> 0 20 21
>
> 1 21 1044
>
> 2 1044 59069
>
> File: /opt/spark/python/pyspark/pandas/frame.py
>
> Type: method
>
>
>
>
>
>
>
>
>
> On Tue, 15 Mar 2022 at 19:33, Andrew Davidson <ae...@ucsc.edu> wrote:
>
> Hi Bjorn
>
>
>
> I have been looking for spark transform for a while. Can you send me a
> link to the pyspark function?
>
>
>
> I assume pandas transform is not really an option. I think it will try to
> pull the entire dataframe into the driver's memory.
>
>
>
> Kind regards
>
>
>
> Andy
>
>
>
> p.s. My real problem is that spark does not allow you to bind columns. You
> can use union() to bind rows. I could get the equivalent of cbind() using
> union().transform()
>
>
>
> *From: *Bjørn Jørgensen <bj...@gmail.com>
> *Date: *Tuesday, March 15, 2022 at 10:37 AM
> *To: *Mich Talebzadeh <mi...@gmail.com>
> *Cc: *"user @spark" <us...@spark.apache.org>
> *Subject: *Re: pivoting panda dataframe
>
>
>
>
> We have that transpose in the pandas API on Spark too:
> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html
>
>
>
> You also have stack() and multi-level reshaping:
> https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
>
>
>
>
>
>
>
> On Tue, 15 Mar 2022 at 17:50, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>
> hi,
>
>
>
> Is it possible to pivot a pandas dataframe by making a row the column
> headings?
>
>
>
> thanks
>
>
>
>
> --
>
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>
>
Re: pivoting panda dataframe
Posted by Andrew Davidson <ae...@ucsc.edu.INVALID>.
Many many thanks!
I have been looking for a pyspark data frame column_bind() solution for several months. Hopefully pyspark.pandas works. The only other solution I was aware of was to use spark.dataframe.join(). This does not scale for obvious reasons.
Andy
Re: pivoting panda dataframe
Posted by ayan guha <gu...@gmail.com>.
Column bind is called join in the relational world; Spark uses the same.
A pivot in the true sense is harder to achieve because you really don't know
how many columns you will end up with, but Spark has a pivot function.
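The unknown-column-count point can be seen in plain pandas (Spark's groupBy(...).pivot(...) has the same property, which is why it optionally takes the list of expected values up front); the frame here is made up:

```python
import pandas as pd

# Long format: the set of keys, and hence the pivoted columns, comes from the data.
df = pd.DataFrame({"id":  [1, 1, 2, 2],
                   "key": ["x", "y", "x", "z"],
                   "val": [10, 20, 30, 40]})

wide = df.pivot_table(index="id", columns="key", values="val")
print(wide.columns.tolist())  # ['x', 'y', 'z'], known only after scanning the data
```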
On Thu, 17 Mar 2022 at 9:16 am, Mich Talebzadeh <mi...@gmail.com>
wrote:
Best Regards,
Ayan Guha
Re: pivoting panda dataframe
Posted by Mich Talebzadeh <mi...@gmail.com>.
OK, this is the version that works with pandas only, without Spark:
import random
import string
import math
import datetime
import time
import pandas as pd

class UsedFunctions:

    def randomString(self, length):
        letters = string.ascii_letters
        result_str = ''.join(random.choice(letters) for i in range(length))
        return result_str

    def clustered(self, x, numRows):
        return math.floor(x - 1) / numRows

    def scattered(self, x, numRows):
        return abs((x - 1 % numRows)) * 1.0

    def randomised(self, seed, numRows):
        random.seed(seed)
        return abs(random.randint(0, numRows) % numRows) * 1.0

    def padString(self, x, chars, length):
        n = int(math.log10(x) + 1)
        result_str = ''.join(random.choice(chars) for i in range(length - n)) + str(x)
        return result_str

    def padSingleChar(self, chars, length):
        result_str = ''.join(chars for i in range(length))
        return result_str

    def println(self, lst):
        for ll in lst:
            print(ll[0])

    def createSomeChars(self):
        string.ascii_letters = 'ABCDEFGHIJ'
        return random.choice(string.ascii_letters)

usedFunctions = UsedFunctions()

def main():
    appName = "RandomDataGenerator"
    start_time = time.time()
    randomdata = RandomData()
    dfRandom = randomdata.generateRamdomData()

class RandomData:
    def generateRamdomData(self):
        uf = UsedFunctions()
        numRows = 10
        start = 1
        end = start + numRows - 1
        print("starting at ID = ", start, ",ending on = ", end)
        Range = range(start, end)
        df = pd.DataFrame(map(lambda x: (x, usedFunctions.clustered(x, numRows),
                                         usedFunctions.scattered(x, numRows),
                                         usedFunctions.randomised(x, numRows),
                                         usedFunctions.randomString(10),
                                         usedFunctions.padString(x, " ", 20),
                                         usedFunctions.padSingleChar("z", 20),
                                         usedFunctions.createSomeChars()), Range))
        pd.set_option("display.max_rows", None, "display.max_columns", None)
        for col_name in df.columns:
            print(col_name)
        print(df.groupby(7).groups)
        ##print(df)

if __name__ == "__main__":
    main()
and comes back with this
starting at ID = 1 ,ending on = 10
0
1
2
3
4
5
6
7
{'B': [5, 7], 'D': [4], 'F': [1], 'G': [0, 3, 6, 8], 'J': [2]}
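The groupby(7).groups output above is the row-index list per letter; spreading those letters into column headings, the pivot the thread started with, looks like this on a small stand-in frame (the letters here are illustrative, not the random output above):

```python
import pandas as pd

# Stand-in for the generated frame: an id plus a letter column like column 7.
df = pd.DataFrame({"id":     [1, 2, 3, 4, 5],
                   "letter": ["B", "D", "B", "G", "G"]})

# groupby(...).groups maps each letter to the row indexes holding it:
print({k: list(v) for k, v in df.groupby("letter").groups.items()})
# {'B': [0, 2], 'D': [1], 'G': [3, 4]}

# Spreading the letters into one indicator column per distinct value:
wide = pd.get_dummies(df["letter"])
print(wide.columns.tolist())  # ['B', 'D', 'G']
```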
On Tue, 15 Mar 2022 at 22:19, Mich Talebzadeh <mi...@gmail.com>
wrote:
> Thanks, I don't want to use Spark, otherwise I can do this.
>
> p_dfm = df.toPandas() # converting spark DF to Pandas DF
>
>
> Can I do it without using Spark?
>
>
>
>
>
>
> On Tue, 15 Mar 2022 at 22:08, Bjørn Jørgensen <bj...@gmail.com>
> wrote:
>
>> You have a pyspark dataframe and you want to convert it to pandas?
>>
>> Convert it first to the pandas API on Spark
>>
>>
>> pf01 = f01.to_pandas_on_spark()
>>
>>
>> Then convert it (pf01, the pandas-on-Spark frame from above) to plain pandas
>>
>>
>> pdf = pf01.to_pandas()
>>
>> Or?
>>
>> On Tue, 15 Mar 2022 at 22:56, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Thanks everyone.
>>>
>>> I want to do the following in pandas and numpy without using spark.
>>>
>>> This is what I do in spark to generate some random data using class
>>> UsedFunctions (not important).
>>>
>>> class UsedFunctions:
>>>     def randomString(self, length):
>>>         letters = string.ascii_letters
>>>         result_str = ''.join(random.choice(letters) for i in range(length))
>>>         return result_str
>>>     def clustered(self, x, numRows):
>>>         return math.floor(x - 1) / numRows
>>>     def scattered(self, x, numRows):
>>>         return abs((x - 1 % numRows)) * 1.0
>>>     def randomised(self, seed, numRows):
>>>         random.seed(seed)
>>>         return abs(random.randint(0, numRows) % numRows) * 1.0
>>>     def padString(self, x, chars, length):
>>>         n = int(math.log10(x) + 1)
>>>         result_str = ''.join(random.choice(chars) for i in range(length - n)) + str(x)
>>>         return result_str
>>>     def padSingleChar(self, chars, length):
>>>         result_str = ''.join(chars for i in range(length))
>>>         return result_str
>>>     def println(self, lst):
>>>         for ll in lst:
>>>             print(ll[0])
>>>
>>>
>>> usedFunctions = UsedFunctions()
>>>
>>> start = 1
>>> end = start + 9
>>> print ("starting at ID = ",start, ",ending on = ",end)
>>> Range = range(start, end)
>>> rdd = sc.parallelize(Range). \
>>>     map(lambda x: (x, usedFunctions.clustered(x, numRows), \
>>>                    usedFunctions.scattered(x, numRows), \
>>>                    usedFunctions.randomised(x, numRows), \
>>>                    usedFunctions.randomString(50), \
>>>                    usedFunctions.padString(x, " ", 50), \
>>>                    usedFunctions.padSingleChar("x", 4000)))
>>> df = rdd.toDF()
>>>
>>> OK how can I create a panda DataFrame df without using Spark?
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 15 Mar 2022 at 21:19, Bjørn Jørgensen <bj...@gmail.com>
>>> wrote:
>>>
>>>> Hi Andrew. Mich asked, and I answered with transpose()
>>>> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transpose.html
>>>> .
>>>>
>>>> And now you are asking in the same thread about pandas API on spark and
>>>> the transform().
>>>>
>>>> Apache Spark has a pandas API.
>>>>
>>>> This means Spark exposes pandas-style functions, and when you use the
>>>> pandas API on Spark, it is still Spark doing the work underneath.
>>>>
>>>> Add this line to your imports:
>>>>
>>>> from pyspark import pandas as ps
>>>>
>>>>
>>>> Now you can pass your dataframe back and forth to the pandas API on
>>>> Spark by using
>>>>
>>>> pf01 = f01.to_pandas_on_spark()
>>>>
>>>>
>>>> f01 = pf01.to_spark()
>>>>
>>>>
>>>> Note that I have changed pd to ps here.
>>>>
>>>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)})
>>>>
>>>> df.transform(lambda x: x + 1)
>>>>
>>>> You will now see that all numbers have been incremented by 1.
>>>>
>>>> You can find more information about pandas API on spark transform
>>>> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transform.html?highlight=pyspark%20pandas%20dataframe%20transform#pyspark.pandas.DataFrame.transform
>>>> or in your notebook with
>>>> df.transform?
>>>>
>>>> Signature:
>>>> df.transform(
>>>> func: Callable[..., ForwardRef('Series')],
>>>> axis: Union[int, str] = 0,
>>>> *args: Any,
>>>> **kwargs: Any,
>>>> ) -> 'DataFrame'
>>>> Docstring:
>>>> Call ``func`` on self producing a Series with transformed values
>>>> and that has the same length as its input.
>>>>
>>>> See also `Transform and apply a function
>>>> <https://koalas.readthedocs.io/en/latest/user_guide/transform_apply.html>`_.
>>>>
>>>> .. note:: this API executes the function once to infer the type which is
>>>> potentially expensive, for instance, when the dataset is created after
>>>> aggregations or sorting.
>>>>
>>>> To avoid this, specify return type in ``func``, for instance, as below:
>>>>
>>>> >>> def square(x) -> ps.Series[np.int32]:
>>>> ... return x ** 2
>>>>
>>>> pandas-on-Spark uses return type hint and does not try to infer the type.
>>>>
>>>> .. note:: the series within ``func`` is actually multiple pandas series as the
>>>> segments of the whole pandas-on-Spark series; therefore, the length of each series
>>>> is not guaranteed. As an example, an aggregation against each series
>>>> does work as a global aggregation but an aggregation of each segment. See
>>>> below:
>>>>
>>>> >>> def func(x) -> ps.Series[np.int32]:
>>>> ... return x + sum(x)
>>>>
>>>> Parameters
>>>> ----------
>>>> func : function
>>>> Function to use for transforming the data. It must work when pandas Series
>>>> is passed.
>>>> axis : int, default 0 or 'index'
>>>> Can only be set to 0 at the moment.
>>>> *args
>>>> Positional arguments to pass to func.
>>>> **kwargs
>>>> Keyword arguments to pass to func.
>>>>
>>>> Returns
>>>> -------
>>>> DataFrame
>>>> A DataFrame that must have the same length as self.
>>>>
>>>> Raises
>>>> ------
>>>> Exception : If the returned DataFrame has a different length than self.
>>>>
>>>> See Also
>>>> --------
>>>> DataFrame.aggregate : Only perform aggregating type operations.
>>>> DataFrame.apply : Invoke function on DataFrame.
>>>> Series.transform : The equivalent function for Series.
>>>>
>>>> Examples
>>>> --------
>>>> >>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)}, columns=['A', 'B'])
>>>> >>> df
>>>> A B
>>>> 0 0 1
>>>> 1 1 2
>>>> 2 2 3
>>>>
>>>> >>> def square(x) -> ps.Series[np.int32]:
>>>> ... return x ** 2
>>>> >>> df.transform(square)
>>>> A B
>>>> 0 0 1
>>>> 1 1 4
>>>> 2 4 9
>>>>
>>>> You can omit the type hint and let pandas-on-Spark infer its type.
>>>>
>>>> >>> df.transform(lambda x: x ** 2)
>>>> A B
>>>> 0 0 1
>>>> 1 1 4
>>>> 2 4 9
>>>>
>>>> For multi-index columns:
>>>>
>>>> >>> df.columns = [('X', 'A'), ('X', 'B')]
>>>> >>> df.transform(square) # doctest: +NORMALIZE_WHITESPACE
>>>> X
>>>> A B
>>>> 0 0 1
>>>> 1 1 4
>>>> 2 4 9
>>>>
>>>> >>> (df * -1).transform(abs) # doctest: +NORMALIZE_WHITESPACE
>>>> X
>>>> A B
>>>> 0 0 1
>>>> 1 1 2
>>>> 2 2 3
>>>>
>>>> You can also specify extra arguments.
>>>>
>>>> >>> def calculation(x, y, z) -> ps.Series[int]:
>>>> ... return x ** y + z
>>>> >>> df.transform(calculation, y=10, z=20) # doctest: +NORMALIZE_WHITESPACE
>>>> X
>>>> A B
>>>> 0 20 21
>>>> 1 21 1044
>>>> 2 1044 59069
>>>> File: /opt/spark/python/pyspark/pandas/frame.py
>>>> Type: method
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> tir. 15. mar. 2022 kl. 19:33 skrev Andrew Davidson <ae...@ucsc.edu>:
>>>>
>>>>> Hi Bjorn
>>>>>
>>>>>
>>>>>
>>>>> I have been looking for spark transform for a while. Can you send me a
>>>>> link to the pyspark function?
>>>>>
>>>>>
>>>>>
>>>>> I assume pandas transform is not really an option. I think it will try
>>>>> to pull the entire dataframe into the drivers memory.
>>>>>
>>>>>
>>>>>
>>>>> Kind regards
>>>>>
>>>>>
>>>>>
>>>>> Andy
>>>>>
>>>>>
>>>>>
>>>>> p.s. My real problem is that spark does not allow you to bind columns.
>>>>> You can use union() to bind rows. I could get the equivalent of cbind()
>>>>> using union().transform()
>>>>>
>>>>>
>>>>>
>>>>> *From: *Bjørn Jørgensen <bj...@gmail.com>
>>>>> *Date: *Tuesday, March 15, 2022 at 10:37 AM
>>>>> *To: *Mich Talebzadeh <mi...@gmail.com>
>>>>> *Cc: *"user @spark" <us...@spark.apache.org>
>>>>> *Subject: *Re: pivoting panda dataframe
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html we
>>>>> have that transpose in pandas api for spark to.
>>>>>
>>>>>
>>>>>
>>>>> You also have stack() and multilevel
>>>>> https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> tir. 15. mar. 2022 kl. 17:50 skrev Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com>:
>>>>>
>>>>>
>>>>> hi,
>>>>>
>>>>>
>>>>>
>>>>> Is it possible to pivot a panda dataframe by making the row column
>>>>> heading?
>>>>>
>>>>>
>>>>>
>>>>> thanks
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Bjørn Jørgensen
>>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>>> Norge
>>>>>
>>>>> +47 480 94 297
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>
Re: pivoting panda dataframe
Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks, I don't want to use Spark, otherwise I can do this.
p_dfm = df.toPandas() # converting spark DF to Pandas DF
Can I do it without using Spark?
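For the record, yes: both transpose and pivot exist in plain pandas, with no Spark involved. A minimal pandas-only sketch (the small DataFrame and its column names are invented for illustration):

```python
import pandas as pd

# Toy frame; the data and column names are invented for illustration
df = pd.DataFrame({"ID": [1, 2, 3], "value": ["a", "b", "c"]})

# Plain transpose: rows become columns and vice versa
t = df.transpose()  # df.T is equivalent shorthand

# To make one column's row values become the column headings,
# set that column as the index first, then transpose
t2 = df.set_index("ID").transpose()
print(t2.columns.tolist())  # → [1, 2, 3]
```

pandas also offers DataFrame.pivot(index=..., columns=..., values=...) when the reshape is driven by two key columns rather than a straight transpose.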
On Tue, 15 Mar 2022 at 22:08, Bjørn Jørgensen <bj...@gmail.com>
wrote:
> You have a pyspark dataframe and you want to convert it to pandas?
>
> Convert it first to pandas api on spark
>
>
> pf01 = f01.to_pandas_on_spark()
>
>
> Then convert it to pandas
>
>
> pf01 = f01.to_pandas()
>
> Or?
>
> tir. 15. mar. 2022, 22:56 skrev Mich Talebzadeh <mich.talebzadeh@gmail.com
> >:
>
>> Thanks everyone.
>>
>> I want to do the following in pandas and numpy without using spark.
>>
>> This is what I do in spark to generate some random data using class
>> UsedFunctions (not important).
>>
>> class UsedFunctions:
>> def randomString(self,length):
>> letters = string.ascii_letters
>> result_str = ''.join(random.choice(letters) for i in range(length))
>> return result_str
>> def clustered(self,x,numRows):
>> return math.floor(x -1)/numRows
>> def scattered(self,x,numRows):
>> return abs((x -1 % numRows))* 1.0
>> def randomised(self,seed,numRows):
>> random.seed(seed)
>> return abs(random.randint(0, numRows) % numRows) * 1.0
>> def padString(self,x,chars,length):
>> n = int(math.log10(x) + 1)
>> result_str = ''.join(random.choice(chars) for i in range(length-n)) +
>> str(x)
>> return result_str
>> def padSingleChar(self,chars,length):
>> result_str = ''.join(chars for i in range(length))
>> return result_str
>> def println(self,lst):
>> for ll in lst:
>> print(ll[0])
>>
>>
>> usedFunctions = UsedFunctions()
>>
>> start = 1
>> end = start + 9
>> print ("starting at ID = ",start, ",ending on = ",end)
>> Range = range(start, end)
>> rdd = sc.parallelize(Range). \
>> map(lambda x: (x, usedFunctions.clustered(x,numRows), \
>> usedFunctions.scattered(x,numRows), \
>> usedFunctions.randomised(x,numRows), \
>> usedFunctions.randomString(50), \
>> usedFunctions.padString(x," ",50), \
>> usedFunctions.padSingleChar("x",4000)))
>> df = rdd.toDF()
>>
>> OK how can I create a panda DataFrame df without using Spark?
>>
>> Thanks
>>
>>
>>
>>
>>
>>
>> On Tue, 15 Mar 2022 at 21:19, Bjørn Jørgensen <bj...@gmail.com>
>> wrote:
>>
>>> Hi Andrew. Mitch asked, and I answered transpose()
>>> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transpose.html
>>> .
>>>
>>> And now you are asking in the same thread about pandas API on spark and
>>> the transform().
>>>
>>> Apache Spark have pandas API on Spark.
>>>
>>> Which means that spark has an API call for pandas functions, and when
>>> you use pandas API on spark it is spark you are using then.
>>>
>>> Add this line in yours import
>>>
>>> from pyspark import pandas as ps
>>>
>>>
>>> Now you can pass yours dataframe back and forward to pandas API on spark
>>> by using
>>>
>>> pf01 = f01.to_pandas_on_spark()
>>>
>>>
>>> f01 = pf01.to_spark()
>>>
>>>
>>> Note that I have changed pd to ps here.
>>>
>>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)})
>>>
>>> df.transform(lambda x: x + 1)
>>>
>>> You will now see that all numbers are +1
>>>
>>> You can find more information about pandas API on spark transform
>>> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transform.html?highlight=pyspark%20pandas%20dataframe%20transform#pyspark.pandas.DataFrame.transform
>>> or in yours notbook
>>> df.transform?
>>>
>>>
>>>
>>>
>>>
>>>
>>> tir. 15. mar. 2022 kl. 19:33 skrev Andrew Davidson <ae...@ucsc.edu>:
>>>
>>>> Hi Bjorn
>>>>
>>>>
>>>>
>>>> I have been looking for spark transform for a while. Can you send me a
>>>> link to the pyspark function?
>>>>
>>>>
>>>>
>>>> I assume pandas transform is not really an option. I think it will try
>>>> to pull the entire dataframe into the drivers memory.
>>>>
>>>>
>>>>
>>>> Kind regards
>>>>
>>>>
>>>>
>>>> Andy
>>>>
>>>>
>>>>
>>>> p.s. My real problem is that spark does not allow you to bind columns.
>>>> You can use union() to bind rows. I could get the equivalent of cbind()
>>>> using union().transform()
>>>>
>>>>
>>>>
>>>> *From: *Bjørn Jørgensen <bj...@gmail.com>
>>>> *Date: *Tuesday, March 15, 2022 at 10:37 AM
>>>> *To: *Mich Talebzadeh <mi...@gmail.com>
>>>> *Cc: *"user @spark" <us...@spark.apache.org>
>>>> *Subject: *Re: pivoting panda dataframe
>>>>
>>>>
>>>>
>>>>
>>>> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html we
>>>> have that transpose in pandas api for spark to.
>>>>
>>>>
>>>>
>>>> You also have stack() and multilevel
>>>> https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> tir. 15. mar. 2022 kl. 17:50 skrev Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com>:
>>>>
>>>>
>>>> hi,
>>>>
>>>>
>>>>
>>>> Is it possible to pivot a panda dataframe by making the row column
>>>> heading?
>>>>
>>>>
>>>>
>>>> thanks
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>
Re: pivoting panda dataframe
Posted by Bjørn Jørgensen <bj...@gmail.com>.
You have a pyspark dataframe and you want to convert it to pandas?
Convert it first to pandas api on spark
pf01 = f01.to_pandas_on_spark()
Then convert it to pandas
pf01 = f01.to_pandas()
Or?
tir. 15. mar. 2022, 22:56 skrev Mich Talebzadeh <mi...@gmail.com>:
> Thanks everyone.
>
> I want to do the following in pandas and numpy without using spark.
>
> This is what I do in spark to generate some random data using class
> UsedFunctions (not important).
>
> class UsedFunctions:
> def randomString(self,length):
> letters = string.ascii_letters
> result_str = ''.join(random.choice(letters) for i in range(length))
> return result_str
> def clustered(self,x,numRows):
> return math.floor(x -1)/numRows
> def scattered(self,x,numRows):
> return abs((x -1 % numRows))* 1.0
> def randomised(self,seed,numRows):
> random.seed(seed)
> return abs(random.randint(0, numRows) % numRows) * 1.0
> def padString(self,x,chars,length):
> n = int(math.log10(x) + 1)
> result_str = ''.join(random.choice(chars) for i in range(length-n)) +
> str(x)
> return result_str
> def padSingleChar(self,chars,length):
> result_str = ''.join(chars for i in range(length))
> return result_str
> def println(self,lst):
> for ll in lst:
> print(ll[0])
>
>
> usedFunctions = UsedFunctions()
>
> start = 1
> end = start + 9
> print ("starting at ID = ",start, ",ending on = ",end)
> Range = range(start, end)
> rdd = sc.parallelize(Range). \
> map(lambda x: (x, usedFunctions.clustered(x,numRows), \
> usedFunctions.scattered(x,numRows), \
> usedFunctions.randomised(x,numRows), \
> usedFunctions.randomString(50), \
> usedFunctions.padString(x," ",50), \
> usedFunctions.padSingleChar("x",4000)))
> df = rdd.toDF()
>
> OK how can I create a panda DataFrame df without using Spark?
>
> Thanks
>
>
>
>
>
>
> On Tue, 15 Mar 2022 at 21:19, Bjørn Jørgensen <bj...@gmail.com>
> wrote:
>
>> Hi Andrew. Mitch asked, and I answered transpose()
>> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transpose.html
>> .
>>
>> And now you are asking in the same thread about pandas API on spark and
>> the transform().
>>
>> Apache Spark have pandas API on Spark.
>>
>> Which means that spark has an API call for pandas functions, and when you
>> use pandas API on spark it is spark you are using then.
>>
>> Add this line in yours import
>>
>> from pyspark import pandas as ps
>>
>>
>> Now you can pass yours dataframe back and forward to pandas API on spark
>> by using
>>
>> pf01 = f01.to_pandas_on_spark()
>>
>>
>> f01 = pf01.to_spark()
>>
>>
>> Note that I have changed pd to ps here.
>>
>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)})
>>
>> df.transform(lambda x: x + 1)
>>
>> You will now see that all numbers are +1
>>
>> You can find more information about pandas API on spark transform
>> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transform.html?highlight=pyspark%20pandas%20dataframe%20transform#pyspark.pandas.DataFrame.transform
>> or in yours notbook
>> df.transform?
>>
>>
>>
>>
>>
>>
>> tir. 15. mar. 2022 kl. 19:33 skrev Andrew Davidson <ae...@ucsc.edu>:
>>
>>> Hi Bjorn
>>>
>>>
>>>
>>> I have been looking for spark transform for a while. Can you send me a
>>> link to the pyspark function?
>>>
>>>
>>>
>>> I assume pandas transform is not really an option. I think it will try
>>> to pull the entire dataframe into the drivers memory.
>>>
>>>
>>>
>>> Kind regards
>>>
>>>
>>>
>>> Andy
>>>
>>>
>>>
>>> p.s. My real problem is that spark does not allow you to bind columns.
>>> You can use union() to bind rows. I could get the equivalent of cbind()
>>> using union().transform()
>>>
>>>
>>>
>>> *From: *Bjørn Jørgensen <bj...@gmail.com>
>>> *Date: *Tuesday, March 15, 2022 at 10:37 AM
>>> *To: *Mich Talebzadeh <mi...@gmail.com>
>>> *Cc: *"user @spark" <us...@spark.apache.org>
>>> *Subject: *Re: pivoting panda dataframe
>>>
>>>
>>>
>>>
>>> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html we
>>> have that transpose in pandas api for spark to.
>>>
>>>
>>>
>>> You also have stack() and multilevel
>>> https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> tir. 15. mar. 2022 kl. 17:50 skrev Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com>:
>>>
>>>
>>> hi,
>>>
>>>
>>>
>>> Is it possible to pivot a panda dataframe by making the row column
>>> heading?
>>>
>>>
>>>
>>> thanks
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>>
>>
>>
>> --
>>
>
Re: pivoting panda dataframe
Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks everyone.
I want to do the following in pandas and numpy without using spark.
This is what I do in spark to generate some random data using class
UsedFunctions (not important).
class UsedFunctions:
    def randomString(self, length):
        letters = string.ascii_letters
        result_str = ''.join(random.choice(letters) for i in range(length))
        return result_str
    def clustered(self, x, numRows):
        return math.floor(x - 1) / numRows
    def scattered(self, x, numRows):
        return abs((x - 1 % numRows)) * 1.0
    def randomised(self, seed, numRows):
        random.seed(seed)
        return abs(random.randint(0, numRows) % numRows) * 1.0
    def padString(self, x, chars, length):
        n = int(math.log10(x) + 1)
        result_str = ''.join(random.choice(chars) for i in range(length - n)) + str(x)
        return result_str
    def padSingleChar(self, chars, length):
        result_str = ''.join(chars for i in range(length))
        return result_str
    def println(self, lst):
        for ll in lst:
            print(ll[0])
usedFunctions = UsedFunctions()
start = 1
end = start + 9
print ("starting at ID = ",start, ",ending on = ",end)
Range = range(start, end)
rdd = sc.parallelize(Range). \
    map(lambda x: (x, usedFunctions.clustered(x, numRows), \
                   usedFunctions.scattered(x, numRows), \
                   usedFunctions.randomised(x, numRows), \
                   usedFunctions.randomString(50), \
                   usedFunctions.padString(x, " ", 50), \
                   usedFunctions.padSingleChar("x", 4000)))
df = rdd.toDF()
OK how can I create a panda DataFrame df without using Spark?
Thanks
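For reference, the same ten-row frame can be built with plain pandas and the standard library, no Spark at all. This is a sketch under two stated assumptions: numRows is taken to be 10 (the snippet above never defines it), and the column names are invented for illustration:

```python
import math
import random
import string

import pandas as pd

numRows = 10  # assumed value; the original snippet never defines numRows

def randomString(length):
    return ''.join(random.choice(string.ascii_letters) for _ in range(length))

def clustered(x, numRows):
    return math.floor(x - 1) / numRows

def scattered(x, numRows):
    return abs((x - 1 % numRows)) * 1.0

def randomised(seed, numRows):
    random.seed(seed)
    return abs(random.randint(0, numRows) % numRows) * 1.0

def padString(x, chars, length):
    n = int(math.log10(x) + 1)  # number of digits in x
    return ''.join(random.choice(chars) for _ in range(length - n)) + str(x)

def padSingleChar(chars, length):
    return ''.join(chars for _ in range(length))

start = 1
end = start + 9
rows = [
    (x,
     clustered(x, numRows),
     scattered(x, numRows),
     randomised(x, numRows),
     randomString(50),
     padString(x, " ", 50),
     padSingleChar("x", 4000))
    for x in range(start, end)
]
# Column names are invented for illustration
df = pd.DataFrame(rows, columns=["ID", "clustered", "scattered",
                                 "randomised", "random_string",
                                 "padded", "filler"])
```

pd.DataFrame over a list of tuples replaces sc.parallelize(...).map(...).toDF(): the whole thing runs in one local process, which is fine at this scale.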
On Tue, 15 Mar 2022 at 21:19, Bjørn Jørgensen <bj...@gmail.com>
wrote:
> Hi Andrew. Mitch asked, and I answered transpose()
> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transpose.html
> .
>
> And now you are asking in the same thread about pandas API on spark and
> the transform().
>
> Apache Spark have pandas API on Spark.
>
> Which means that spark has an API call for pandas functions, and when you
> use pandas API on spark it is spark you are using then.
>
> Add this line in yours import
>
> from pyspark import pandas as ps
>
>
> Now you can pass yours dataframe back and forward to pandas API on spark
> by using
>
> pf01 = f01.to_pandas_on_spark()
>
>
> f01 = pf01.to_spark()
>
>
> Note that I have changed pd to ps here.
>
> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)})
>
> df.transform(lambda x: x + 1)
>
> You will now see that all numbers are +1
>
> You can find more information about pandas API on spark transform
> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transform.html?highlight=pyspark%20pandas%20dataframe%20transform#pyspark.pandas.DataFrame.transform
> or in yours notbook
> df.transform?
>
>
>
>
>
>
> tir. 15. mar. 2022 kl. 19:33 skrev Andrew Davidson <ae...@ucsc.edu>:
>
>> Hi Bjorn
>>
>>
>>
>> I have been looking for spark transform for a while. Can you send me a
>> link to the pyspark function?
>>
>>
>>
>> I assume pandas transform is not really an option. I think it will try to
>> pull the entire dataframe into the drivers memory.
>>
>>
>>
>> Kind regards
>>
>>
>>
>> Andy
>>
>>
>>
>> p.s. My real problem is that spark does not allow you to bind columns.
>> You can use union() to bind rows. I could get the equivalent of cbind()
>> using union().transform()
>>
>>
>>
>> *From: *Bjørn Jørgensen <bj...@gmail.com>
>> *Date: *Tuesday, March 15, 2022 at 10:37 AM
>> *To: *Mich Talebzadeh <mi...@gmail.com>
>> *Cc: *"user @spark" <us...@spark.apache.org>
>> *Subject: *Re: pivoting panda dataframe
>>
>>
>>
>>
>> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html we
>> have that transpose in pandas api for spark to.
>>
>>
>>
>> You also have stack() and multilevel
>> https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
>>
>>
>>
>>
>>
>>
>>
>> tir. 15. mar. 2022 kl. 17:50 skrev Mich Talebzadeh <
>> mich.talebzadeh@gmail.com>:
>>
>>
>> hi,
>>
>>
>>
>> Is it possible to pivot a panda dataframe by making the row column
>> heading?
>>
>>
>>
>> thanks
>>
>>
>>
>>
>>
>> [image: Image removed by sender.] view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>>
>>
>> --
>>
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>
Re: pivoting panda dataframe
Posted by Bjørn Jørgensen <bj...@gmail.com>.
Hi Andrew. Mitch asked, and I answered with transpose():
https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transpose.html
And now you are asking, in the same thread, about the pandas API on Spark and
transform().
Apache Spark has a pandas API on Spark. This means Spark provides API calls
for pandas functions; when you use the pandas API on Spark, it is Spark you
are using.
Add this line to your imports:
from pyspark import pandas as ps
Now you can pass your dataframe back and forth between Spark and the pandas
API on Spark using
pf01 = f01.to_pandas_on_spark()
f01 = pf01.to_spark()
Note that I have changed pd to ps here.
df = ps.DataFrame({'A': range(3), 'B': range(1, 4)})
df.transform(lambda x: x + 1)
You will now see that all numbers are +1
You can find more information about the pandas API on Spark transform() at
https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transform.html?highlight=pyspark%20pandas%20dataframe%20transform#pyspark.pandas.DataFrame.transform
or in your notebook:
df.transform?
Signature:
df.transform(
    func: Callable[..., ForwardRef('Series')],
    axis: Union[int, str] = 0,
    *args: Any,
    **kwargs: Any,
) -> 'DataFrame'
Docstring:
Call ``func`` on self producing a Series with transformed values
and that has the same length as its input.
See also `Transform and apply a function
<https://koalas.readthedocs.io/en/latest/user_guide/transform_apply.html>`_.
.. note:: this API executes the function once to infer the type which is
potentially expensive, for instance, when the dataset is created after
aggregations or sorting.
To avoid this, specify return type in ``func``, for instance, as below:
>>> def square(x) -> ps.Series[np.int32]:
... return x ** 2
pandas-on-Spark uses return type hint and does not try to infer the type.
.. note:: the series within ``func`` is actually multiple pandas series as the
segments of the whole pandas-on-Spark series; therefore, the length of each
series is not guaranteed. As an example, an aggregation against each series
does work as a global aggregation but an aggregation of each segment. See
below:
>>> def func(x) -> ps.Series[np.int32]:
... return x + sum(x)
Parameters
----------
func : function
Function to use for transforming the data. It must work when pandas Series
is passed.
axis : int, default 0 or 'index'
Can only be set to 0 at the moment.
*args
Positional arguments to pass to func.
**kwargs
Keyword arguments to pass to func.
Returns
-------
DataFrame
A DataFrame that must have the same length as self.
Raises
------
Exception : If the returned DataFrame has a different length than self.
See Also
--------
DataFrame.aggregate : Only perform aggregating type operations.
DataFrame.apply : Invoke function on DataFrame.
Series.transform : The equivalent function for Series.
Examples
--------
>>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)}, columns=['A', 'B'])
>>> df
A B
0 0 1
1 1 2
2 2 3
>>> def square(x) -> ps.Series[np.int32]:
... return x ** 2
>>> df.transform(square)
A B
0 0 1
1 1 4
2 4 9
You can omit the type hint and let pandas-on-Spark infer its type.
>>> df.transform(lambda x: x ** 2)
A B
0 0 1
1 1 4
2 4 9
For multi-index columns:
>>> df.columns = [('X', 'A'), ('X', 'B')]
>>> df.transform(square) # doctest: +NORMALIZE_WHITESPACE
X
A B
0 0 1
1 1 4
2 4 9
>>> (df * -1).transform(abs) # doctest: +NORMALIZE_WHITESPACE
X
A B
0 0 1
1 1 2
2 2 3
You can also specify extra arguments.
>>> def calculation(x, y, z) -> ps.Series[int]:
... return x ** y + z
>>> df.transform(calculation, y=10, z=20) # doctest: +NORMALIZE_WHITESPACE
X
A B
0 20 21
1 21 1044
2 1044 59069
File: /opt/spark/python/pyspark/pandas/frame.py
Type: method
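To see transform() in action without a Spark cluster, here is a minimal sketch in plain pandas; pyspark.pandas deliberately mirrors this API, so the same calls should work on a ps.DataFrame once a Spark session is running. The data and variable names are made up for illustration:

```python
import pandas as pd

# transform() applies a function column by column and must return
# output with the same length as its input.
df = pd.DataFrame({'A': range(3), 'B': range(1, 4)})

shifted = df.transform(lambda x: x + 1)   # every value incremented by 1
squared = df.transform(lambda x: x ** 2)  # column-wise square

print(shifted['A'].tolist())  # [1, 2, 3]
print(squared['B'].tolist())  # [1, 4, 9]
```

On pandas-on-Spark, adding a return-type hint to the function (as the docstring above notes) avoids the extra pass Spark needs to infer the schema.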
Tue, 15 Mar 2022 at 19:33, Andrew Davidson <ae...@ucsc.edu> wrote:
--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge
+47 480 94 297
Re: pivoting panda dataframe
Posted by Andrew Davidson <ae...@ucsc.edu.INVALID>.
Hi Bjorn
I have been looking for spark transform for a while. Can you send me a link to the pyspark function?
I assume plain-pandas transform is not really an option; I think it would try to pull the entire dataframe into the driver's memory.
Kind regards
Andy
p.s. My real problem is that spark does not allow you to bind columns. You can use union() to bind rows. I could get the equivalent of cbind() using union().transform()
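For the cbind() point specifically, Bjørn's earlier answer (column bind in R is concat in pandas) can be sketched as follows in plain pandas; pyspark.pandas exposes the same concat with axis=1, though on Spark the row alignment happens by index and is not free. The data here is invented for illustration:

```python
import pandas as pd

# R-style cbind(): glue dataframes side by side, aligning rows by index.
left = pd.DataFrame({'gene': ['a', 'b'], 's1': [10, 20]})
right = pd.DataFrame({'s2': [30, 40]})

bound = pd.concat([left, right], axis=1)

print(list(bound.columns))   # ['gene', 's1', 's2']
print(bound['s2'].tolist())  # [30, 40]
```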
From: Bjørn Jørgensen <bj...@gmail.com>
Date: Tuesday, March 15, 2022 at 10:37 AM
To: Mich Talebzadeh <mi...@gmail.com>
Cc: "user @spark" <us...@spark.apache.org>
Subject: Re: pivoting panda dataframe
Re: pivoting panda dataframe
Posted by Bjørn Jørgensen <bj...@gmail.com>.
We have that transpose in the pandas API on Spark too:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html
You also have stack() and multi-level reshaping:
https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
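A small sketch of Mich's original question (making a row's values the column headings) in plain pandas; a pandas-on-Spark DataFrame offers the same .transpose()/.T, with the caveat that transposing distributed data can be expensive for wide frames. Column names here are invented:

```python
import pandas as pd

# Promote the labelling column to the index, then transpose so its
# values become the column headings.
df = pd.DataFrame({'metric': ['count', 'mean'],
                   'x': [3.0, 1.5],
                   'y': [4.0, 2.0]})

pivoted = df.set_index('metric').T

print(list(pivoted.columns))      # ['count', 'mean']
print(pivoted.loc['x', 'count'])  # 3.0
```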
Tue, 15 Mar 2022 at 17:50, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge
+47 480 94 297