Posted to user@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2022/03/15 16:48:40 UTC
pivoting panda dataframe
hi,
Is it possible to pivot a pandas dataframe by making a row the column headings?
thanks
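If the goal is to promote one row's values to be the column headings (plain pandas, no Spark), a minimal sketch; the frame and the chosen row here are illustrative:

```python
import pandas as pd

# Illustrative frame: row 0 carries the intended column names.
df = pd.DataFrame([["id", "name", "score"],
                   [1, "a", 10],
                   [2, "b", 20]])

df.columns = df.iloc[0]                  # promote row 0 to the column headings
df = df.iloc[1:].reset_index(drop=True)  # drop the promoted row
print(df.columns.tolist())               # ['id', 'name', 'score']
```

If instead the goal is to swap rows and columns wholesale, df.T (transpose) is the direct answer, which is where the replies below point.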
view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
https://en.everybodywiki.com/Mich_Talebzadeh
*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
Re: pivoting panda dataframe
Posted by Bjørn Jørgensen <bj...@gmail.com>.
Column bind in R is concat in pandas:
https://www.datasciencemadesimple.com/append-concatenate-columns-python-pandas-column-bind/
Please start a new thread for each question.
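The linked page boils down to pd.concat along axis=1; a minimal sketch with made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3]})
df2 = pd.DataFrame({"B": [4, 5, 6]})

# R's cbind(df1, df2): column-wise concatenation, aligned on the index.
wide = pd.concat([df1, df2], axis=1)
print(wide.columns.tolist())  # ['A', 'B']
```

Note that pandas aligns on the index, not on position; call reset_index(drop=True) on both frames first if their indexes differ.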
On Tue, 15 Mar 2022 at 22:59, Andrew Davidson <ae...@ucsc.edu> wrote:
> Many many thanks!
>
>
>
> I have been looking for a pyspark data frame column_bind() solution for
> several months. Hopefully pyspark.pandas works. The only other solution I
> was aware of was to use spark.dataframe.join(). This does not scale for
> obvious reasons.
>
>
>
> Andy
>
>
>
>
>
> *From: *Bjørn Jørgensen <bj...@gmail.com>
> *Date: *Tuesday, March 15, 2022 at 2:19 PM
> *To: *Andrew Davidson <ae...@ucsc.edu>
> *Cc: *Mich Talebzadeh <mi...@gmail.com>, "user @spark" <
> user@spark.apache.org>
> *Subject: *Re: pivoting panda dataframe
>
>
>
> Hi Andrew. Mitch asked, and I answered transpose()
> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transpose.html
> .
>
>
>
> And now you are asking in the same thread about pandas API on spark and
> the transform().
>
>
>
> Apache Spark has a pandas API on Spark.
>
>
>
> This means that Spark has API calls for pandas functions; when you use the
> pandas API on Spark, it is Spark you are using.
>
>
>
> Add this line to your imports:
>
>
>
> from pyspark import pandas as ps
>
>
>
>
>
> Now you can pass your dataframe back and forth to the pandas API on Spark
> by using
>
>
>
> pf01 = f01.to_pandas_on_spark()
>
>
> f01 = pf01.to_spark()
>
>
>
>
>
> Note that I have changed pd to ps here.
>
>
>
> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)})
>
>
>
> df.transform(lambda x: x + 1)
>
>
>
> You will now see that all numbers are +1
>
>
>
> You can find more information about the pandas API on Spark transform at
> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transform.html?highlight=pyspark%20pandas%20dataframe%20transform#pyspark.pandas.DataFrame.transform
>
> or in your notebook:
>
> df.transform?
>
>
>
> Signature:
>
> df.transform(
>
> func: Callable[..., ForwardRef('Series')],
>
> axis: Union[int, str] = 0,
>
> *args: Any,
>
> **kwargs: Any,
>
> ) -> 'DataFrame'
>
> Docstring:
>
> Call ``func`` on self producing a Series with transformed values
>
> and that has the same length as its input.
>
>
>
> See also `Transform and apply a function
>
> <https://koalas.readthedocs.io/en/latest/user_guide/transform_apply.html>`_.
>
>
>
> .. note:: this API executes the function once to infer the type which is
>
> potentially expensive, for instance, when the dataset is created after
>
> aggregations or sorting.
>
>
>
> To avoid this, specify return type in ``func``, for instance, as below:
>
>
>
> >>> def square(x) -> ps.Series[np.int32]:
>
> ... return x ** 2
>
>
>
> pandas-on-Spark uses return type hint and does not try to infer the type.
>
>
>
> .. note:: the series within ``func`` is actually multiple pandas series as the
>
> segments of the whole pandas-on-Spark series; therefore, the length of each series
>
> is not guaranteed. As an example, an aggregation against each series
>
> does work as a global aggregation but an aggregation of each segment. See
>
> below:
>
>
>
> >>> def func(x) -> ps.Series[np.int32]:
>
> ... return x + sum(x)
>
>
>
> Parameters
>
> ----------
>
> func : function
>
> Function to use for transforming the data. It must work when pandas Series
>
> is passed.
>
> axis : int, default 0 or 'index'
>
> Can only be set to 0 at the moment.
>
> *args
>
> Positional arguments to pass to func.
>
> **kwargs
>
> Keyword arguments to pass to func.
>
>
>
> Returns
>
> -------
>
> DataFrame
>
> A DataFrame that must have the same length as self.
>
>
>
> Raises
>
> ------
>
> Exception : If the returned DataFrame has a different length than self.
>
>
>
> See Also
>
> --------
>
> DataFrame.aggregate : Only perform aggregating type operations.
>
> DataFrame.apply : Invoke function on DataFrame.
>
> Series.transform : The equivalent function for Series.
>
>
>
> Examples
>
> --------
>
> >>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)}, columns=['A', 'B'])
>
> >>> df
>
> A B
>
> 0 0 1
>
> 1 1 2
>
> 2 2 3
>
>
>
> >>> def square(x) -> ps.Series[np.int32]:
>
> ... return x ** 2
>
> >>> df.transform(square)
>
> A B
>
> 0 0 1
>
> 1 1 4
>
> 2 4 9
>
>
>
> You can omit the type hint and let pandas-on-Spark infer its type.
>
>
>
> >>> df.transform(lambda x: x ** 2)
>
> A B
>
> 0 0 1
>
> 1 1 4
>
> 2 4 9
>
>
>
> For multi-index columns:
>
>
>
> >>> df.columns = [('X', 'A'), ('X', 'B')]
>
> >>> df.transform(square) # doctest: +NORMALIZE_WHITESPACE
>
> X
>
> A B
>
> 0 0 1
>
> 1 1 4
>
> 2 4 9
>
>
>
> >>> (df * -1).transform(abs) # doctest: +NORMALIZE_WHITESPACE
>
> X
>
> A B
>
> 0 0 1
>
> 1 1 2
>
> 2 2 3
>
>
>
> You can also specify extra arguments.
>
>
>
> >>> def calculation(x, y, z) -> ps.Series[int]:
>
> ... return x ** y + z
>
> >>> df.transform(calculation, y=10, z=20) # doctest: +NORMALIZE_WHITESPACE
>
> X
>
> A B
>
> 0 20 21
>
> 1 21 1044
>
> 2 1044 59069
>
> File: /opt/spark/python/pyspark/pandas/frame.py
>
> Type: method
>
>
>
>
>
>
>
>
>
> On Tue, 15 Mar 2022 at 19:33, Andrew Davidson <ae...@ucsc.edu> wrote:
>
> Hi Bjorn
>
>
>
> I have been looking for spark transform for a while. Can you send me a
> link to the pyspark function?
>
>
>
> I assume pandas transform is not really an option. I think it will try to
> pull the entire dataframe into the driver's memory.
>
>
>
> Kind regards
>
>
>
> Andy
>
>
>
> p.s. My real problem is that spark does not allow you to bind columns. You
> can use union() to bind rows. I could get the equivalent of cbind() using
> union().transform()
>
>
>
> *From: *Bjørn Jørgensen <bj...@gmail.com>
> *Date: *Tuesday, March 15, 2022 at 10:37 AM
> *To: *Mich Talebzadeh <mi...@gmail.com>
> *Cc: *"user @spark" <us...@spark.apache.org>
> *Subject: *Re: pivoting panda dataframe
>
>
>
>
> We have that transpose in the pandas API on Spark too:
> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html
>
>
>
> You also have stack() and multi-level reshaping:
> https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
>
>
>
>
>
>
>
> On Tue, 15 Mar 2022 at 17:50, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>
> hi,
>
>
>
> Is it possible to pivot a pandas dataframe by making a row the column
> headings?
>
>
>
> thanks
>
>
>
>
> --
>
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>
>
Re: pivoting panda dataframe
Posted by Andrew Davidson <ae...@ucsc.edu.INVALID>.
Many many thanks!
I have been looking for a pyspark data frame column_bind() solution for several months. Hopefully pyspark.pandas works. The only other solution I was aware of was to use spark.dataframe.join(). This does not scale for obvious reasons.
Andy
Re: pivoting panda dataframe
Posted by ayan guha <gu...@gmail.com>.
Column bind is called join in the relational world; Spark uses the same.
A pivot in the true sense is harder to achieve because you really don't know
how many columns you will end up with, but Spark has a pivot function.
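The unknown-column-count point can be seen in plain pandas (Spark's groupBy(...).pivot(...) has the same property, which is why it optionally takes the list of expected values up front); the frame here is made up:

```python
import pandas as pd

# Long format: the set of keys, and hence the pivoted columns, comes from the data.
df = pd.DataFrame({"id":  [1, 1, 2, 2],
                   "key": ["x", "y", "x", "z"],
                   "val": [10, 20, 30, 40]})

wide = df.pivot_table(index="id", columns="key", values="val")
print(wide.columns.tolist())  # ['x', 'y', 'z'], known only after scanning the data
```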
On Thu, 17 Mar 2022 at 9:16 am, Mich Talebzadeh <mi...@gmail.com>
wrote:
Best Regards,
Ayan Guha
Re: pivoting panda dataframe
Posted by Mich Talebzadeh <mi...@gmail.com>.
OK, this is the version that works with pandas only, without Spark:
import random
import string
import math
import datetime
import time
import pandas as pd

class UsedFunctions:

    def randomString(self, length):
        letters = string.ascii_letters
        result_str = ''.join(random.choice(letters) for i in range(length))
        return result_str

    def clustered(self, x, numRows):
        return math.floor(x - 1) / numRows

    def scattered(self, x, numRows):
        return abs((x - 1 % numRows)) * 1.0

    def randomised(self, seed, numRows):
        random.seed(seed)
        return abs(random.randint(0, numRows) % numRows) * 1.0

    def padString(self, x, chars, length):
        n = int(math.log10(x) + 1)
        result_str = ''.join(random.choice(chars) for i in range(length - n)) + str(x)
        return result_str

    def padSingleChar(self, chars, length):
        result_str = ''.join(chars for i in range(length))
        return result_str

    def println(self, lst):
        for ll in lst:
            print(ll[0])

    def createSomeChars(self):
        string.ascii_letters = 'ABCDEFGHIJ'
        return random.choice(string.ascii_letters)

usedFunctions = UsedFunctions()

def main():
    appName = "RandomDataGenerator"
    start_time = time.time()
    randomdata = RandomData()
    dfRandom = randomdata.generateRamdomData()

class RandomData:
    def generateRamdomData(self):
        uf = UsedFunctions()
        numRows = 10
        start = 1
        end = start + numRows - 1
        print("starting at ID = ", start, ",ending on = ", end)
        Range = range(start, end)
        df = pd.DataFrame(map(lambda x: (x, usedFunctions.clustered(x, numRows),
                                         usedFunctions.scattered(x, numRows),
                                         usedFunctions.randomised(x, numRows),
                                         usedFunctions.randomString(10),
                                         usedFunctions.padString(x, " ", 20),
                                         usedFunctions.padSingleChar("z", 20),
                                         usedFunctions.createSomeChars()), Range))
        pd.set_option("display.max_rows", None, "display.max_columns", None)
        for col_name in df.columns:
            print(col_name)
        print(df.groupby(7).groups)
        ##print(df)

if __name__ == "__main__":
    main()
and comes back with this
starting at ID = 1 ,ending on = 10
0
1
2
3
4
5
6
7
{'B': [5, 7], 'D': [4], 'F': [1], 'G': [0, 3, 6, 8], 'J': [2]}
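The groupby(7).groups output above is the row-index list per letter; spreading those letters into column headings, the pivot the thread started with, looks like this on a small stand-in frame (the letters here are illustrative, not the random output above):

```python
import pandas as pd

# Stand-in for the generated frame: an id plus a letter column like column 7.
df = pd.DataFrame({"id":     [1, 2, 3, 4, 5],
                   "letter": ["B", "D", "B", "G", "G"]})

# groupby(...).groups maps each letter to the row indexes holding it:
print({k: list(v) for k, v in df.groupby("letter").groups.items()})
# {'B': [0, 2], 'D': [1], 'G': [3, 4]}

# Spreading the letters into one indicator column per distinct value:
wide = pd.get_dummies(df["letter"])
print(wide.columns.tolist())  # ['B', 'D', 'G']
```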
On Tue, 15 Mar 2022 at 22:19, Mich Talebzadeh <mi...@gmail.com>
wrote:
> Thanks, I don't want to use Spark, otherwise I can do this.
>
> p_dfm = df.toPandas() # converting spark DF to Pandas DF
>
>
> Can I do it without using Spark?
>
>
>
>
>
>
> On Tue, 15 Mar 2022 at 22:08, Bjørn Jørgensen <bj...@gmail.com>
> wrote:
>
>> You have a pyspark dataframe and you want to convert it to pandas?
>>
>> Convert it first to the pandas API on Spark
>>
>>
>> pf01 = f01.to_pandas_on_spark()
>>
>>
>> Then convert it (pf01, the pandas-on-Spark frame from above) to plain pandas
>>
>>
>> pdf = pf01.to_pandas()
>>
>> Or?
>>
>> On Tue, 15 Mar 2022 at 22:56, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Thanks everyone.
>>>
>>> I want to do the following in pandas and numpy without using spark.
>>>
>>> This is what I do in spark to generate some random data using class
>>> UsedFunctions (not important).
>>>
>>> class UsedFunctions:
>>>     def randomString(self, length):
>>>         letters = string.ascii_letters
>>>         result_str = ''.join(random.choice(letters) for i in range(length))
>>>         return result_str
>>>     def clustered(self, x, numRows):
>>>         return math.floor(x - 1) / numRows
>>>     def scattered(self, x, numRows):
>>>         return abs((x - 1 % numRows)) * 1.0
>>>     def randomised(self, seed, numRows):
>>>         random.seed(seed)
>>>         return abs(random.randint(0, numRows) % numRows) * 1.0
>>>     def padString(self, x, chars, length):
>>>         n = int(math.log10(x) + 1)
>>>         result_str = ''.join(random.choice(chars) for i in range(length - n)) + str(x)
>>>         return result_str
>>>     def padSingleChar(self, chars, length):
>>>         result_str = ''.join(chars for i in range(length))
>>>         return result_str
>>>     def println(self, lst):
>>>         for ll in lst:
>>>             print(ll[0])
>>>
>>>
>>> usedFunctions = UsedFunctions()
>>>
>>> start = 1
>>> end = start + 9
>>> print ("starting at ID = ",start, ",ending on = ",end)
>>> Range = range(start, end)
>>> rdd = sc.parallelize(Range). \
>>>     map(lambda x: (x, usedFunctions.clustered(x, numRows), \
>>>                    usedFunctions.scattered(x, numRows), \
>>>                    usedFunctions.randomised(x, numRows), \
>>>                    usedFunctions.randomString(50), \
>>>                    usedFunctions.padString(x, " ", 50), \
>>>                    usedFunctions.padSingleChar("x", 4000)))
>>> df = rdd.toDF()
>>>
>>> OK how can I create a panda DataFrame df without using Spark?
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 15 Mar 2022 at 21:19, Bjørn Jørgensen <bj...@gmail.com>
>>> wrote:
>>>
>>>> Hi Andrew. Mich asked, and I answered with transpose()
>>>> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transpose.html
>>>> .
>>>>
>>>> And now you are asking in the same thread about pandas API on spark and
>>>> the transform().
>>>>
>>>> Apache Spark has a pandas API.
>>>>
>>>> This means Spark exposes pandas-style functions, and when you use the
>>>> pandas API on Spark, it is still Spark doing the work underneath.
>>>>
>>>> Add this line to your imports:
>>>>
>>>> from pyspark import pandas as ps
>>>>
>>>>
>>>> Now you can pass your dataframe back and forth to the pandas API on
>>>> Spark by using
>>>>
>>>> pf01 = f01.to_pandas_on_spark()
>>>>
>>>>
>>>> f01 = pf01.to_spark()
>>>>
>>>>
>>>> Note that I have changed pd to ps here.
>>>>
>>>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)})
>>>>
>>>> df.transform(lambda x: x + 1)
>>>>
>>>> You will now see that all numbers have been incremented by 1.
>>>>
>>>> You can find more information about pandas API on spark transform
>>>> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transform.html?highlight=pyspark%20pandas%20dataframe%20transform#pyspark.pandas.DataFrame.transform
>>>> or in your notebook with
>>>> df.transform?
>>>>
>>>> Signature:
>>>> df.transform(
>>>> func: Callable[..., ForwardRef('Series')],
>>>> axis: Union[int, str] = 0,
>>>> *args: Any,
>>>> **kwargs: Any,
>>>> ) -> 'DataFrame'
>>>> Docstring:
>>>> Call ``func`` on self producing a Series with transformed values
>>>> and that has the same length as its input.
>>>>
>>>> See also `Transform and apply a function
>>>> <https://koalas.readthedocs.io/en/latest/user_guide/transform_apply.html>`_.
>>>>
>>>> .. note:: this API executes the function once to infer the type which is
>>>> potentially expensive, for instance, when the dataset is created after
>>>> aggregations or sorting.
>>>>
>>>> To avoid this, specify return type in ``func``, for instance, as below:
>>>>
>>>> >>> def square(x) -> ps.Series[np.int32]:
>>>> ... return x ** 2
>>>>
>>>> pandas-on-Spark uses return type hint and does not try to infer the type.
>>>>
>>>> .. note:: the series within ``func`` is actually multiple pandas series as the
>>>> segments of the whole pandas-on-Spark series; therefore, the length of each series
>>>> is not guaranteed. As an example, an aggregation against each series
>>>> does work as a global aggregation but an aggregation of each segment. See
>>>> below:
>>>>
>>>> >>> def func(x) -> ps.Series[np.int32]:
>>>> ... return x + sum(x)
>>>>
>>>> Parameters
>>>> ----------
>>>> func : function
>>>> Function to use for transforming the data. It must work when pandas Series
>>>> is passed.
>>>> axis : int, default 0 or 'index'
>>>> Can only be set to 0 at the moment.
>>>> *args
>>>> Positional arguments to pass to func.
>>>> **kwargs
>>>> Keyword arguments to pass to func.
>>>>
>>>> Returns
>>>> -------
>>>> DataFrame
>>>> A DataFrame that must have the same length as self.
>>>>
>>>> Raises
>>>> ------
>>>> Exception : If the returned DataFrame has a different length than self.
>>>>
>>>> See Also
>>>> --------
>>>> DataFrame.aggregate : Only perform aggregating type operations.
>>>> DataFrame.apply : Invoke function on DataFrame.
>>>> Series.transform : The equivalent function for Series.
>>>>
>>>> Examples
>>>> --------
>>>> >>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)}, columns=['A', 'B'])
>>>> >>> df
>>>> A B
>>>> 0 0 1
>>>> 1 1 2
>>>> 2 2 3
>>>>
>>>> >>> def square(x) -> ps.Series[np.int32]:
>>>> ... return x ** 2
>>>> >>> df.transform(square)
>>>> A B
>>>> 0 0 1
>>>> 1 1 4
>>>> 2 4 9
>>>>
>>>> You can omit the type hint and let pandas-on-Spark infer its type.
>>>>
>>>> >>> df.transform(lambda x: x ** 2)
>>>> A B
>>>> 0 0 1
>>>> 1 1 4
>>>> 2 4 9
>>>>
>>>> For multi-index columns:
>>>>
>>>> >>> df.columns = [('X', 'A'), ('X', 'B')]
>>>> >>> df.transform(square) # doctest: +NORMALIZE_WHITESPACE
>>>> X
>>>> A B
>>>> 0 0 1
>>>> 1 1 4
>>>> 2 4 9
>>>>
>>>> >>> (df * -1).transform(abs) # doctest: +NORMALIZE_WHITESPACE
>>>> X
>>>> A B
>>>> 0 0 1
>>>> 1 1 2
>>>> 2 2 3
>>>>
>>>> You can also specify extra arguments.
>>>>
>>>> >>> def calculation(x, y, z) -> ps.Series[int]:
>>>> ... return x ** y + z
>>>> >>> df.transform(calculation, y=10, z=20) # doctest: +NORMALIZE_WHITESPACE
>>>> X
>>>> A B
>>>> 0 20 21
>>>> 1 21 1044
>>>> 2 1044 59069
>>>> File: /opt/spark/python/pyspark/pandas/frame.py
>>>> Type: method
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> tir. 15. mar. 2022 kl. 19:33 skrev Andrew Davidson <ae...@ucsc.edu>:
>>>>
>>>>> Hi Bjorn
>>>>>
>>>>>
>>>>>
>>>>> I have been looking for spark transform for a while. Can you send me a
>>>>> link to the pyspark function?
>>>>>
>>>>>
>>>>>
>>>>> I assume pandas transform is not really an option. I think it will try
>>>>> to pull the entire dataframe into the drivers memory.
>>>>>
>>>>>
>>>>>
>>>>> Kind regards
>>>>>
>>>>>
>>>>>
>>>>> Andy
>>>>>
>>>>>
>>>>>
>>>>> p.s. My real problem is that spark does not allow you to bind columns.
>>>>> You can use union() to bind rows. I could get the equivalent of cbind()
>>>>> using union().transform()
>>>>>
>>>>>
>>>>>
>>>>> *From: *Bjørn Jørgensen <bj...@gmail.com>
>>>>> *Date: *Tuesday, March 15, 2022 at 10:37 AM
>>>>> *To: *Mich Talebzadeh <mi...@gmail.com>
>>>>> *Cc: *"user @spark" <us...@spark.apache.org>
>>>>> *Subject: *Re: pivoting panda dataframe
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html we
>>>>> have that transpose in pandas api for spark to.
>>>>>
>>>>>
>>>>>
>>>>> You also have stack() and multilevel
>>>>> https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> tir. 15. mar. 2022 kl. 17:50 skrev Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com>:
>>>>>
>>>>>
>>>>> hi,
>>>>>
>>>>>
>>>>>
>>>>> Is it possible to pivot a panda dataframe by making the row column
>>>>> heading?
>>>>>
>>>>>
>>>>>
>>>>> thanks
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Bjørn Jørgensen
>>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>>> Norge
>>>>>
>>>>> +47 480 94 297
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>
Re: pivoting panda dataframe
Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks, I don't want to use Spark, otherwise I can do this.
p_dfm = df.toPandas() # converting spark DF to Pandas DF
Can I do it without using Spark?
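For the record, yes: both transpose and pivot exist in plain pandas, with no Spark involved. A minimal pandas-only sketch (the small DataFrame and its column names are invented for illustration):

```python
import pandas as pd

# Toy frame; the data and column names are invented for illustration
df = pd.DataFrame({"ID": [1, 2, 3], "value": ["a", "b", "c"]})

# Plain transpose: rows become columns and vice versa
t = df.transpose()  # df.T is equivalent shorthand

# To make one column's row values become the column headings,
# set that column as the index first, then transpose
t2 = df.set_index("ID").transpose()
print(t2.columns.tolist())  # → [1, 2, 3]
```

pandas also offers DataFrame.pivot(index=..., columns=..., values=...) when the reshape is driven by two key columns rather than a straight transpose.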
On Tue, 15 Mar 2022 at 22:08, Bjørn Jørgensen <bj...@gmail.com>
wrote:
> You have a pyspark dataframe and you want to convert it to pandas?
>
> Convert it first to pandas api on spark
>
>
> pf01 = f01.to_pandas_on_spark()
>
>
> Then convert it to pandas
>
>
> pf01 = f01.to_pandas()
>
> Or?
>
> tir. 15. mar. 2022, 22:56 skrev Mich Talebzadeh <mich.talebzadeh@gmail.com
> >:
>
>> Thanks everyone.
>>
>> I want to do the following in pandas and numpy without using spark.
>>
>> This is what I do in spark to generate some random data using class
>> UsedFunctions (not important).
>>
>> class UsedFunctions:
>> def randomString(self,length):
>> letters = string.ascii_letters
>> result_str = ''.join(random.choice(letters) for i in range(length))
>> return result_str
>> def clustered(self,x,numRows):
>> return math.floor(x -1)/numRows
>> def scattered(self,x,numRows):
>> return abs((x -1 % numRows))* 1.0
>> def randomised(self,seed,numRows):
>> random.seed(seed)
>> return abs(random.randint(0, numRows) % numRows) * 1.0
>> def padString(self,x,chars,length):
>> n = int(math.log10(x) + 1)
>> result_str = ''.join(random.choice(chars) for i in range(length-n)) +
>> str(x)
>> return result_str
>> def padSingleChar(self,chars,length):
>> result_str = ''.join(chars for i in range(length))
>> return result_str
>> def println(self,lst):
>> for ll in lst:
>> print(ll[0])
>>
>>
>> usedFunctions = UsedFunctions()
>>
>> start = 1
>> end = start + 9
>> print ("starting at ID = ",start, ",ending on = ",end)
>> Range = range(start, end)
>> rdd = sc.parallelize(Range). \
>> map(lambda x: (x, usedFunctions.clustered(x,numRows), \
>> usedFunctions.scattered(x,numRows), \
>> usedFunctions.randomised(x,numRows), \
>> usedFunctions.randomString(50), \
>> usedFunctions.padString(x," ",50), \
>> usedFunctions.padSingleChar("x",4000)))
>> df = rdd.toDF()
>>
>> OK how can I create a panda DataFrame df without using Spark?
>>
>> Thanks
>>
>>
>>
>>
>>
>>
>> On Tue, 15 Mar 2022 at 21:19, Bjørn Jørgensen <bj...@gmail.com>
>> wrote:
>>
>>> Hi Andrew. Mitch asked, and I answered transpose()
>>> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transpose.html
>>> .
>>>
>>> And now you are asking in the same thread about pandas API on spark and
>>> the transform().
>>>
>>> Apache Spark have pandas API on Spark.
>>>
>>> Which means that spark has an API call for pandas functions, and when
>>> you use pandas API on spark it is spark you are using then.
>>>
>>> Add this line in yours import
>>>
>>> from pyspark import pandas as ps
>>>
>>>
>>> Now you can pass yours dataframe back and forward to pandas API on spark
>>> by using
>>>
>>> pf01 = f01.to_pandas_on_spark()
>>>
>>>
>>> f01 = pf01.to_spark()
>>>
>>>
>>> Note that I have changed pd to ps here.
>>>
>>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)})
>>>
>>> df.transform(lambda x: x + 1)
>>>
>>> You will now see that all numbers are +1
>>>
>>> You can find more information about pandas API on spark transform
>>> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transform.html?highlight=pyspark%20pandas%20dataframe%20transform#pyspark.pandas.DataFrame.transform
>>> or in yours notbook
>>> df.transform?
>>>
>>>
>>>
>>>
>>>
>>>
>>> tir. 15. mar. 2022 kl. 19:33 skrev Andrew Davidson <ae...@ucsc.edu>:
>>>
>>>> Hi Bjorn
>>>>
>>>>
>>>>
>>>> I have been looking for spark transform for a while. Can you send me a
>>>> link to the pyspark function?
>>>>
>>>>
>>>>
>>>> I assume pandas transform is not really an option. I think it will try
>>>> to pull the entire dataframe into the drivers memory.
>>>>
>>>>
>>>>
>>>> Kind regards
>>>>
>>>>
>>>>
>>>> Andy
>>>>
>>>>
>>>>
>>>> p.s. My real problem is that spark does not allow you to bind columns.
>>>> You can use union() to bind rows. I could get the equivalent of cbind()
>>>> using union().transform()
>>>>
>>>>
>>>>
>>>> *From: *Bjørn Jørgensen <bj...@gmail.com>
>>>> *Date: *Tuesday, March 15, 2022 at 10:37 AM
>>>> *To: *Mich Talebzadeh <mi...@gmail.com>
>>>> *Cc: *"user @spark" <us...@spark.apache.org>
>>>> *Subject: *Re: pivoting panda dataframe
>>>>
>>>>
>>>>
>>>>
>>>> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html we
>>>> have that transpose in pandas api for spark to.
>>>>
>>>>
>>>>
>>>> You also have stack() and multilevel
>>>> https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> tir. 15. mar. 2022 kl. 17:50 skrev Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com>:
>>>>
>>>>
>>>> hi,
>>>>
>>>>
>>>>
>>>> Is it possible to pivot a panda dataframe by making the row column
>>>> heading?
>>>>
>>>>
>>>>
>>>> thanks
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>
Re: pivoting panda dataframe
Posted by Bjørn Jørgensen <bj...@gmail.com>.
You have a pyspark dataframe and you want to convert it to pandas?
Convert it first to pandas api on spark
pf01 = f01.to_pandas_on_spark()
Then convert it to pandas
pf01 = f01.to_pandas()
Or?
tir. 15. mar. 2022, 22:56 skrev Mich Talebzadeh <mi...@gmail.com>:
> Thanks everyone.
>
> I want to do the following in pandas and numpy without using spark.
>
> This is what I do in spark to generate some random data using class
> UsedFunctions (not important).
>
> class UsedFunctions:
> def randomString(self,length):
> letters = string.ascii_letters
> result_str = ''.join(random.choice(letters) for i in range(length))
> return result_str
> def clustered(self,x,numRows):
> return math.floor(x -1)/numRows
> def scattered(self,x,numRows):
> return abs((x -1 % numRows))* 1.0
> def randomised(self,seed,numRows):
> random.seed(seed)
> return abs(random.randint(0, numRows) % numRows) * 1.0
> def padString(self,x,chars,length):
> n = int(math.log10(x) + 1)
> result_str = ''.join(random.choice(chars) for i in range(length-n)) +
> str(x)
> return result_str
> def padSingleChar(self,chars,length):
> result_str = ''.join(chars for i in range(length))
> return result_str
> def println(self,lst):
> for ll in lst:
> print(ll[0])
>
>
> usedFunctions = UsedFunctions()
>
> start = 1
> end = start + 9
> print ("starting at ID = ",start, ",ending on = ",end)
> Range = range(start, end)
> rdd = sc.parallelize(Range). \
> map(lambda x: (x, usedFunctions.clustered(x,numRows), \
> usedFunctions.scattered(x,numRows), \
> usedFunctions.randomised(x,numRows), \
> usedFunctions.randomString(50), \
> usedFunctions.padString(x," ",50), \
> usedFunctions.padSingleChar("x",4000)))
> df = rdd.toDF()
>
> OK how can I create a panda DataFrame df without using Spark?
>
> Thanks
>
>
>
>
>
>
> On Tue, 15 Mar 2022 at 21:19, Bjørn Jørgensen <bj...@gmail.com>
> wrote:
>
>> Hi Andrew. Mitch asked, and I answered transpose()
>> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transpose.html
>> .
>>
>> And now you are asking in the same thread about pandas API on spark and
>> the transform().
>>
>> Apache Spark have pandas API on Spark.
>>
>> Which means that spark has an API call for pandas functions, and when you
>> use pandas API on spark it is spark you are using then.
>>
>> Add this line in yours import
>>
>> from pyspark import pandas as ps
>>
>>
>> Now you can pass yours dataframe back and forward to pandas API on spark
>> by using
>>
>> pf01 = f01.to_pandas_on_spark()
>>
>>
>> f01 = pf01.to_spark()
>>
>>
>> Note that I have changed pd to ps here.
>>
>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)})
>>
>> df.transform(lambda x: x + 1)
>>
>> You will now see that all numbers are +1
>>
>> You can find more information about pandas API on spark transform
>> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transform.html?highlight=pyspark%20pandas%20dataframe%20transform#pyspark.pandas.DataFrame.transform
>> or in yours notbook
>> df.transform?
>>
>>
>>
>>
>>
>>
>> tir. 15. mar. 2022 kl. 19:33 skrev Andrew Davidson <ae...@ucsc.edu>:
>>
>>> Hi Bjorn
>>>
>>>
>>>
>>> I have been looking for spark transform for a while. Can you send me a
>>> link to the pyspark function?
>>>
>>>
>>>
>>> I assume pandas transform is not really an option. I think it will try
>>> to pull the entire dataframe into the drivers memory.
>>>
>>>
>>>
>>> Kind regards
>>>
>>>
>>>
>>> Andy
>>>
>>>
>>>
>>> p.s. My real problem is that spark does not allow you to bind columns.
>>> You can use union() to bind rows. I could get the equivalent of cbind()
>>> using union().transform()
>>>
>>>
>>>
>>> *From: *Bjørn Jørgensen <bj...@gmail.com>
>>> *Date: *Tuesday, March 15, 2022 at 10:37 AM
>>> *To: *Mich Talebzadeh <mi...@gmail.com>
>>> *Cc: *"user @spark" <us...@spark.apache.org>
>>> *Subject: *Re: pivoting panda dataframe
>>>
>>>
>>>
>>>
>>> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html we
>>> have that transpose in pandas api for spark to.
>>>
>>>
>>>
>>> You also have stack() and multilevel
>>> https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> tir. 15. mar. 2022 kl. 17:50 skrev Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com>:
>>>
>>>
>>> hi,
>>>
>>>
>>>
>>> Is it possible to pivot a panda dataframe by making the row column
>>> heading?
>>>
>>>
>>>
>>> thanks
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>>
>>
>>
>> --
>>
>
Re: pivoting panda dataframe
Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks everyone.
I want to do the following in pandas and numpy without using spark.
This is what I do in spark to generate some random data using class
UsedFunctions (not important).
class UsedFunctions:
    def randomString(self, length):
        letters = string.ascii_letters
        result_str = ''.join(random.choice(letters) for i in range(length))
        return result_str
    def clustered(self, x, numRows):
        return math.floor(x - 1) / numRows
    def scattered(self, x, numRows):
        return abs((x - 1 % numRows)) * 1.0
    def randomised(self, seed, numRows):
        random.seed(seed)
        return abs(random.randint(0, numRows) % numRows) * 1.0
    def padString(self, x, chars, length):
        n = int(math.log10(x) + 1)
        result_str = ''.join(random.choice(chars) for i in range(length - n)) + str(x)
        return result_str
    def padSingleChar(self, chars, length):
        result_str = ''.join(chars for i in range(length))
        return result_str
    def println(self, lst):
        for ll in lst:
            print(ll[0])
usedFunctions = UsedFunctions()
start = 1
end = start + 9
print ("starting at ID = ",start, ",ending on = ",end)
Range = range(start, end)
rdd = sc.parallelize(Range). \
    map(lambda x: (x, usedFunctions.clustered(x, numRows), \
                   usedFunctions.scattered(x, numRows), \
                   usedFunctions.randomised(x, numRows), \
                   usedFunctions.randomString(50), \
                   usedFunctions.padString(x, " ", 50), \
                   usedFunctions.padSingleChar("x", 4000)))
df = rdd.toDF()
OK how can I create a panda DataFrame df without using Spark?
Thanks
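For reference, the same ten-row frame can be built with plain pandas and the standard library, no Spark at all. This is a sketch under two stated assumptions: numRows is taken to be 10 (the snippet above never defines it), and the column names are invented for illustration:

```python
import math
import random
import string

import pandas as pd

numRows = 10  # assumed value; the original snippet never defines numRows

def randomString(length):
    return ''.join(random.choice(string.ascii_letters) for _ in range(length))

def clustered(x, numRows):
    return math.floor(x - 1) / numRows

def scattered(x, numRows):
    return abs((x - 1 % numRows)) * 1.0

def randomised(seed, numRows):
    random.seed(seed)
    return abs(random.randint(0, numRows) % numRows) * 1.0

def padString(x, chars, length):
    n = int(math.log10(x) + 1)  # number of digits in x
    return ''.join(random.choice(chars) for _ in range(length - n)) + str(x)

def padSingleChar(chars, length):
    return ''.join(chars for _ in range(length))

start = 1
end = start + 9
rows = [
    (x,
     clustered(x, numRows),
     scattered(x, numRows),
     randomised(x, numRows),
     randomString(50),
     padString(x, " ", 50),
     padSingleChar("x", 4000))
    for x in range(start, end)
]
# Column names are invented for illustration
df = pd.DataFrame(rows, columns=["ID", "clustered", "scattered",
                                 "randomised", "random_string",
                                 "padded", "filler"])
```

pd.DataFrame over a list of tuples replaces sc.parallelize(...).map(...).toDF(): the whole thing runs in one local process, which is fine at this scale.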
On Tue, 15 Mar 2022 at 21:19, Bjørn Jørgensen <bj...@gmail.com>
wrote:
> Hi Andrew. Mitch asked, and I answered transpose()
> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transpose.html
> .
>
> And now you are asking in the same thread about pandas API on spark and
> the transform().
>
> Apache Spark have pandas API on Spark.
>
> Which means that spark has an API call for pandas functions, and when you
> use pandas API on spark it is spark you are using then.
>
> Add this line in yours import
>
> from pyspark import pandas as ps
>
>
> Now you can pass yours dataframe back and forward to pandas API on spark
> by using
>
> pf01 = f01.to_pandas_on_spark()
>
>
> f01 = pf01.to_spark()
>
>
> Note that I have changed pd to ps here.
>
> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)})
>
> df.transform(lambda x: x + 1)
>
> You will now see that all numbers are +1
>
> You can find more information about pandas API on spark transform
> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transform.html?highlight=pyspark%20pandas%20dataframe%20transform#pyspark.pandas.DataFrame.transform
> or in yours notbook
> df.transform?
>
>
>
>
>
>
> tir. 15. mar. 2022 kl. 19:33 skrev Andrew Davidson <ae...@ucsc.edu>:
>
>> Hi Bjorn
>>
>>
>>
>> I have been looking for spark transform for a while. Can you send me a
>> link to the pyspark function?
>>
>>
>>
>> I assume pandas transform is not really an option. I think it will try to
>> pull the entire dataframe into the drivers memory.
>>
>>
>>
>> Kind regards
>>
>>
>>
>> Andy
>>
>>
>>
>> p.s. My real problem is that spark does not allow you to bind columns.
>> You can use union() to bind rows. I could get the equivalent of cbind()
>> using union().transform()
>>
>>
>>
>> *From: *Bjørn Jørgensen <bj...@gmail.com>
>> *Date: *Tuesday, March 15, 2022 at 10:37 AM
>> *To: *Mich Talebzadeh <mi...@gmail.com>
>> *Cc: *"user @spark" <us...@spark.apache.org>
>> *Subject: *Re: pivoting panda dataframe
>>
>>
>>
>>
>> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html we
>> have that transpose in pandas api for spark to.
>>
>>
>>
>> You also have stack() and multilevel
>> https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
>>
>>
>>
>>
>>
>>
>>
>> tir. 15. mar. 2022 kl. 17:50 skrev Mich Talebzadeh <
>> mich.talebzadeh@gmail.com>:
>>
>>
>> hi,
>>
>>
>>
>> Is it possible to pivot a panda dataframe by making the row column
>> heading?
>>
>>
>>
>> thanks
>>
>>
>>
>>
>>
>> [image: Image removed by sender.] view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>>
>>
>> --
>>
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>
Re: pivoting panda dataframe
Posted by Bjørn Jørgensen <bj...@gmail.com>.
Hi Andrew. Mitch asked, and I answered with transpose():
https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transpose.html
And now you are asking, in the same thread, about the pandas API on Spark and
transform().
Apache Spark has a pandas API on Spark. This means Spark provides API calls
for pandas functions; when you use the pandas API on Spark, it is Spark you
are using.
Add this line to your imports:
from pyspark import pandas as ps
Now you can pass your dataframe back and forth between Spark and the pandas
API on Spark using
pf01 = f01.to_pandas_on_spark()
f01 = pf01.to_spark()
Note that I have changed pd to ps here.
df = ps.DataFrame({'A': range(3), 'B': range(1, 4)})
df.transform(lambda x: x + 1)
You will now see that all numbers are +1
You can find more information about the pandas API on Spark transform() at
https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transform.html?highlight=pyspark%20pandas%20dataframe%20transform#pyspark.pandas.DataFrame.transform
or in your notebook:
df.transform?
Signature:
df.transform(
    func: Callable[..., ForwardRef('Series')],
    axis: Union[int, str] = 0,
    *args: Any,
    **kwargs: Any,
) -> 'DataFrame'
Docstring:
Call ``func`` on self producing a Series with transformed values
and that has the same length as its input.
See also `Transform and apply a function
<https://koalas.readthedocs.io/en/latest/user_guide/transform_apply.html>`_.
.. note:: this API executes the function once to infer the type which is
potentially expensive, for instance, when the dataset is created after
aggregations or sorting.
To avoid this, specify return type in ``func``, for instance, as below:
>>> def square(x) -> ps.Series[np.int32]:
... return x ** 2
pandas-on-Spark uses return type hint and does not try to infer the type.
.. note:: the series within ``func`` is actually multiple pandas series as the
segments of the whole pandas-on-Spark series; therefore, the length of each
series is not guaranteed. As an example, an aggregation against each series
does work as a global aggregation but an aggregation of each segment. See
below:
>>> def func(x) -> ps.Series[np.int32]:
... return x + sum(x)
Parameters
----------
func : function
Function to use for transforming the data. It must work when pandas Series
is passed.
axis : int, default 0 or 'index'
Can only be set to 0 at the moment.
*args
Positional arguments to pass to func.
**kwargs
Keyword arguments to pass to func.
Returns
-------
DataFrame
A DataFrame that must have the same length as self.
Raises
------
Exception : If the returned DataFrame has a different length than self.
See Also
--------
DataFrame.aggregate : Only perform aggregating type operations.
DataFrame.apply : Invoke function on DataFrame.
Series.transform : The equivalent function for Series.
Examples
--------
>>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)}, columns=['A', 'B'])
>>> df
A B
0 0 1
1 1 2
2 2 3
>>> def square(x) -> ps.Series[np.int32]:
... return x ** 2
>>> df.transform(square)
A B
0 0 1
1 1 4
2 4 9
You can omit the type hint and let pandas-on-Spark infer its type.
>>> df.transform(lambda x: x ** 2)
A B
0 0 1
1 1 4
2 4 9
For multi-index columns:
>>> df.columns = [('X', 'A'), ('X', 'B')]
>>> df.transform(square) # doctest: +NORMALIZE_WHITESPACE
X
A B
0 0 1
1 1 4
2 4 9
>>> (df * -1).transform(abs) # doctest: +NORMALIZE_WHITESPACE
X
A B
0 0 1
1 1 2
2 2 3
You can also specify extra arguments.
>>> def calculation(x, y, z) -> ps.Series[int]:
... return x ** y + z
>>> df.transform(calculation, y=10, z=20) # doctest: +NORMALIZE_WHITESPACE
X
A B
0 20 21
1 21 1044
2 1044 59069
File: /opt/spark/python/pyspark/pandas/frame.py
Type: method
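To see transform() in action without a Spark cluster, here is a minimal sketch in plain pandas; pyspark.pandas deliberately mirrors this API, so the same calls should work on a ps.DataFrame once a Spark session is running. The data and variable names are made up for illustration:

```python
import pandas as pd

# transform() applies a function column by column and must return
# output with the same length as its input.
df = pd.DataFrame({'A': range(3), 'B': range(1, 4)})

shifted = df.transform(lambda x: x + 1)   # every value incremented by 1
squared = df.transform(lambda x: x ** 2)  # column-wise square

print(shifted['A'].tolist())  # [1, 2, 3]
print(squared['B'].tolist())  # [1, 4, 9]
```

On pandas-on-Spark, adding a return-type hint to the function (as the docstring above notes) avoids the extra pass Spark needs to infer the schema.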
Tue, 15 Mar 2022 at 19:33, Andrew Davidson <ae...@ucsc.edu> wrote:
--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge
+47 480 94 297
Re: pivoting panda dataframe
Posted by Andrew Davidson <ae...@ucsc.edu.INVALID>.
Hi Bjorn
I have been looking for spark transform for a while. Can you send me a link to the pyspark function?
I assume plain-pandas transform is not really an option; I think it would try to pull the entire dataframe into the driver's memory.
Kind regards
Andy
p.s. My real problem is that spark does not allow you to bind columns. You can use union() to bind rows. I could get the equivalent of cbind() using union().transform()
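For the cbind() point specifically, Bjørn's earlier answer (column bind in R is concat in pandas) can be sketched as follows in plain pandas; pyspark.pandas exposes the same concat with axis=1, though on Spark the row alignment happens by index and is not free. The data here is invented for illustration:

```python
import pandas as pd

# R-style cbind(): glue dataframes side by side, aligning rows by index.
left = pd.DataFrame({'gene': ['a', 'b'], 's1': [10, 20]})
right = pd.DataFrame({'s2': [30, 40]})

bound = pd.concat([left, right], axis=1)

print(list(bound.columns))   # ['gene', 's1', 's2']
print(bound['s2'].tolist())  # [30, 40]
```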
From: Bjørn Jørgensen <bj...@gmail.com>
Date: Tuesday, March 15, 2022 at 10:37 AM
To: Mich Talebzadeh <mi...@gmail.com>
Cc: "user @spark" <us...@spark.apache.org>
Subject: Re: pivoting panda dataframe
Re: pivoting panda dataframe
Posted by Bjørn Jørgensen <bj...@gmail.com>.
We have that transpose in the pandas API on Spark too:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html
You also have stack() and multi-level reshaping:
https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
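A small sketch of Mich's original question (making a row's values the column headings) in plain pandas; a pandas-on-Spark DataFrame offers the same .transpose()/.T, with the caveat that transposing distributed data can be expensive for wide frames. Column names here are invented:

```python
import pandas as pd

# Promote the labelling column to the index, then transpose so its
# values become the column headings.
df = pd.DataFrame({'metric': ['count', 'mean'],
                   'x': [3.0, 1.5],
                   'y': [4.0, 2.0]})

pivoted = df.set_index('metric').T

print(list(pivoted.columns))      # ['count', 'mean']
print(pivoted.loc['x', 'count'])  # 3.0
```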
Tue, 15 Mar 2022 at 17:50, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge
+47 480 94 297