Posted to user@spark.apache.org by Aakash Basu <aa...@gmail.com> on 2018/05/31 10:10:49 UTC

[Help] PySpark Dynamic mean calculation

Hi,

Using -
Python 3.6
Spark 2.3

Original DF -
key  a_fold_0  b_fold_0  a_fold_1  b_fold_1  a_fold_2  b_fold_2
1    1         2         3         4         5         6
2    7         5         3         5         2         1


I want to calculate means from the above dataframe as follows (showing
only the first three mean columns here; the same pattern continues for
all columns and all folds) -

key  a_fold_0  b_fold_0  a_fold_1  b_fold_1  a_fold_2  b_fold_2  a_fold_0_mean  b_fold_0_mean  a_fold_1_mean  ...
1    1         2         3         4         5         6         (3 + 5) / 2    (4 + 6) / 2    (1 + 5) / 2
2    7         5         3         5         2         1         (3 + 2) / 2    (5 + 1) / 2    (7 + 2) / 2

Process -

For fold_0 my mean should be (fold_1 + fold_2) / 2
For fold_1 my mean should be (fold_0 + fold_2) / 2
For fold_2 my mean should be (fold_0 + fold_1) / 2

For each column.
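
For example, for key 1, a_fold_0_mean = (3 + 5) / 2 = 4.0, averaging
a_fold_1 and a_fold_2.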

And the number of columns, the number of folds, everything would be dynamic.

How do I go about this on a PySpark DataFrame?

Thanks,
Aakash.

Fwd: [Help] PySpark Dynamic mean calculation

Posted by Aakash Basu <aa...@gmail.com>.
Solved it myself. Posting the code below in case anyone needs to reuse it.

orig_list = ['Married-spouse-absent', 'Married-AF-spouse', 'Separated',
             'Married-civ-spouse', 'Widowed', 'Divorced', 'Never-married']
k_folds = 3

cols = df.columns
# ['fnlwgt_bucketed',
#  'Married-spouse-absent_fold_0', 'Married-AF-spouse_fold_0',
#  'Separated_fold_0', 'Married-civ-spouse_fold_0', 'Widowed_fold_0',
#  'Divorced_fold_0', 'Never-married_fold_0',
#  'Married-spouse-absent_fold_1', 'Married-AF-spouse_fold_1',
#  'Separated_fold_1', 'Married-civ-spouse_fold_1', 'Widowed_fold_1',
#  'Divorced_fold_1', 'Never-married_fold_1',
#  'Married-spouse-absent_fold_2', 'Married-AF-spouse_fold_2',
#  'Separated_fold_2', 'Married-civ-spouse_fold_2', 'Widowed_fold_2',
#  'Divorced_fold_2', 'Never-married_fold_2']
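
If you would rather not hardcode orig_list and k_folds, they can also be
derived from the column names themselves - a small sketch, assuming every
fold column follows the '<name>_fold_<n>' pattern (that pattern is my
reading of the schema above, not something stated in the original mail):

import re

# Keep only the fold columns, then strip the suffix to get the base names
# and count the distinct fold indices.
fold_cols = [c for c in df.columns if re.search(r'_fold_\d+$', c)]
orig_list = sorted({re.sub(r'_fold_\d+$', '', c) for c in fold_cols})
k_folds = len({c.rsplit('_', 1)[1] for c in fold_cols})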

for target_fold in range(k_folds):  # the fold whose mean columns we add
    for column in orig_list:
        # This column's names for every fold except the target fold.
        col_namer = [column + '_fold_' + str(fold)
                     for fold in range(k_folds) if fold != target_fold]
        # Average the other folds' values into a new *_mean column.
        # Python's builtin sum works here because Column supports +.
        df = df.withColumn(column + '_fold_' + str(target_fold) + '_mean',
                           sum(df[col] for col in col_namer) / (k_folds - 1))
        print(col_namer)
df.show(1)
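
And for anyone who wants to try it end to end, here is a minimal,
self-contained sketch on the toy DataFrame from the original mail (the
local[*] session setup is just my assumption for a local test):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()

# Toy DataFrame from the original mail: columns a and b, three folds.
df = spark.createDataFrame(
    [(1, 1, 2, 3, 4, 5, 6),
     (2, 7, 5, 3, 5, 2, 1)],
    ['key', 'a_fold_0', 'b_fold_0', 'a_fold_1', 'b_fold_1',
     'a_fold_2', 'b_fold_2'])

orig_list = ['a', 'b']
k_folds = 3

for target_fold in range(k_folds):
    for column in orig_list:
        others = [column + '_fold_' + str(fold)
                  for fold in range(k_folds) if fold != target_fold]
        df = df.withColumn(column + '_fold_' + str(target_fold) + '_mean',
                           sum(df[col] for col in others) / (k_folds - 1))

df.show()  # key 1 gets a_fold_0_mean = (3 + 5) / 2 = 4.0, and so on

One design note: each withColumn call adds another projection to the query
plan, so with very wide schemas it is usually cheaper to build all the mean
expressions first and add them in one pass with df.select('*', *mean_exprs).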


