Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2022/01/18 01:59:00 UTC
[jira] [Commented] (SPARK-37926) toPandas precision doesn't match pandas, and can cause errors in some cases
[ https://issues.apache.org/jira/browse/SPARK-37926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477517#comment-17477517 ]
Hyukjin Kwon commented on SPARK-37926:
--------------------------------------
[~mithril], the results might be slightly different between Spark and pandas. What exactly is the difference between them? Also, Spark 2.x is EOL; this should be checked against Spark 3+.
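A quick way to quantify the discrepancy (a sketch; it assumes the two frames are aligned on the same index, and the values below are made up):

```python
import numpy as np
import pandas as pd

def max_abs_diff(a: pd.Series, b: pd.Series) -> float:
    """Largest element-wise absolute difference between two aligned Series."""
    return float(np.abs(a.to_numpy() - b.to_numpy()).max())

# Example: values that differ only in the last few bits.
s1 = pd.Series([426.45, 1464.15, 388.65])
s2 = pd.Series([426.45 + 1e-10, 1464.15 - 1e-10, 388.65])
print(max_abs_diff(s1, s2))  # on the order of 1e-10
```

If the largest difference is near machine epsilon relative to the magnitudes involved, the two pipelines agree as well as float64 allows.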
> toPandas precision doesn't match pandas, and can cause errors in some cases
> ---------------------------------------------------------------------------
>
> Key: SPARK-37926
> URL: https://issues.apache.org/jira/browse/SPARK-37926
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, ML
> Affects Versions: 2.4.7
> Reporter: kasim
> Priority: Major
>
> # The background:
> I have two copies of one dataset: one on the local filesystem and one on HDFS.
> I transformed both with the same logic:
> - df: read from HDFS, transformed with Spark SQL, then converted from a Spark DataFrame to a pandas DataFrame
> - df1: read from the filesystem, transformed with pandas
> I fed each into the BetaGeoFitter model (https://lifetimes.readthedocs.io/en/latest/): df1 fits fine, but df raises a ConvergenceError.
> # First, the summaries of df and df1 are nearly identical
> ```
> In [17]: df.describe()
> Out[17]:
> frequency recency T monetary_value
> count 68878.000000 68878.000000 68878.000000 68878.000000
> mean 0.210198 1.364253 69.407097 66.740974
> std 1.094161 7.460129 44.604855 351.516145
> min 0.000000 0.000000 0.000000 0.000000
> 25% 0.000000 0.000000 31.000000 0.000000
> 50% 0.000000 0.000000 64.000000 0.000000
> 75% 0.000000 0.000000 108.000000 0.000000
> max 59.000000 155.000000 157.000000 18975.360000
>
> In [18]: df1.describe()
> Out[18]:
> frequency recency T monetary_value
> count 68878.000000 68878.000000 68878.000000 68878.000000
> mean 0.210198 1.364253 69.407097 66.740974
> std 1.094161 7.460129 44.604856 351.516145
> min 0.000000 0.000000 0.000000 0.000000
> 25% 0.000000 0.000000 31.000000 0.000000
> 50% 0.000000 0.000000 64.000000 0.000000
> 75% 0.000000 0.000000 108.000000 0.000000
> max 59.000000 155.000000 157.000000 18975.360000
>
> In [19]: bgf = BetaGeoFitter(penalizer_coef=penalizer_coef)
> ...: bgf.fit(df1['frequency'], df1['recency'], df1['T'])
> Out[19]: <lifetimes.BetaGeoFitter: fitted with 68878 subjects, a: 1.08, alpha: 0.74, b: 0.65, r: 0.03>
>
> In [20]: bgf = BetaGeoFitter(penalizer_coef=penalizer_coef)
> ...: bgf.fit(df['frequency'], df['recency'], df['T'])
> fun: -0.03513675395757231
> hess_inv: array([[ 13.30839758, 17.8546921 , -0.17820442, 0.31872313],
> [ 17.8546921 , 73.49152334, -1.06609042, 0.96429223],
> [ -0.17820442, -1.06609042, 65.85101032, 67.62388159],
> [ 0.31872313, 0.96429223, 67.62388159, 109.01577057]])
> jac: array([ 1.17874160e-06, -6.62967570e-07, 1.06154732e-06, 1.56458773e-06])
> message: 'Desired error not necessarily achieved due to precision loss.'
> nfev: 130
> nit: 29
> njev: 117
> status: 2
> success: False
> x: array([-3.59592079, -5.36183489, 0.07652525, -0.4253566 ])
> ---------------------------------------------------------------------------
> ConvergenceError Traceback (most recent call last)
> /data/modou/python/clv.py in <module>
> 1 bgf = BetaGeoFitter(penalizer_coef=penalizer_coef)
> ----> 2 bgf.fit(df['frequency'], df['recency'], df['T'])
>
> ```
> # Second, the float values differ slightly between df and df1
> They diverge after rounding:
> ```python
> idx = ~np.isclose(df.round(1)['monetary_value'], df1.round(1)['monetary_value'])
> In [71]: np.isclose(df[idx]['monetary_value'], df1[idx]['monetary_value'])
> Out[71]:
> array([ True, True, True, True, True, True, True, True, True,
> True, True, True, True, True, True, True, True, True,
> True, True, True, True, True])
>
> In [72]: np.isclose(df[idx].round(1)['monetary_value'], df1[idx].round(1)['monetary_value'])
> Out[72]:
> array([False, False, False, False, False, False, False, False, False,
> False, False, False, False, False, False, False, False, False,
> False, False, False, False, False])
>
> ```
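> This pattern is expected when two values agree to within np.isclose's default tolerance but sit on opposite sides of a rounding boundary; a minimal sketch (the values are made up, not taken from the dataset):
>
> ```python
> import numpy as np
>
> # Two values ~2e-10 apart, but on opposite sides of the .x5 boundary.
> a = 426.4499999999
> b = 426.4500000001
>
> print(np.isclose(a, b))          # True: far inside the default tolerance
> print(round(a, 1), round(b, 1))  # 426.4 426.5: rounding amplifies the gap
> ```
>
> So a last-bit difference between the Spark and pandas pipelines is enough to flip every one of these rounded values.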
> The differing rows:
> ```
>
> In [67]: df[idx].round(1)['monetary_value']
> Out[67]:
> 11498 426.4
> 17791 1464.1
> 18037 1309.1
> 19800 426.4
> 22464 134.3
> 24717 29.7
> 26202 881.6
> 26729 426.4
> 29519 1464.1
> 35798 1464.1
> 36034 388.7
> 39156 1464.1
> 39566 194.1
> 39687 426.4
> 39737 388.7
> 44185 1464.1
> 45628 1574.9
> 48241 4325.3
> 49841 1464.1
> 54789 129.5
> 57159 3289.6
> 66517 426.4
> 67991 388.7
> Name: monetary_value, dtype: float64
>
> In [68]: df1[idx].round(1)['monetary_value']
> Out[68]:
> 11498 426.5
> 17791 1464.2
> 18037 1309.2
> 19800 426.5
> 22464 134.2
> 24717 29.8
> 26202 881.7
> 26729 426.5
> 29519 1464.2
> 35798 1464.2
> 36034 388.6
> 39156 1464.2
> 39566 194.2
> 39687 426.5
> 39737 388.6
> 44185 1464.2
> 45628 1574.8
> 48241 4325.2
> 49841 1464.2
> 54789 129.6
> 57159 3289.7
> 66517 426.5
> 67991 388.6
> Name: monetary_value, dtype: float64
> ```
> # Third, set the monetary_value of the idx rows to zero in both df and df1, then test again
> Fitting df1 still converges:
> ```
> In [88]: df2 = df1.copy()
> ...: df2.loc[idx, "monetary_value"] = 0
>
> In [89]: df2[idx]
> Out[89]:
> frequency recency T monetary_value
> 11498 6.0 16.0 124.0 0.0
> 17791 1.0 1.0 109.0 0.0
> 18037 1.0 1.0 109.0 0.0
> 19800 2.0 3.0 104.0 0.0
> 22464 6.0 36.0 69.0 0.0
> 24717 11.0 11.0 93.0 0.0
> 26202 1.0 12.0 88.0 0.0
> 26729 2.0 14.0 34.0 0.0
> 29519 1.0 5.0 79.0 0.0
> 35798 1.0 1.0 63.0 0.0
> 36034 1.0 1.0 63.0 0.0
> 39156 1.0 1.0 54.0 0.0
> 39566 1.0 2.0 53.0 0.0
> 39687 2.0 3.0 53.0 0.0
> 39737 1.0 1.0 53.0 0.0
> 44185 1.0 6.0 45.0 0.0
> 45628 1.0 1.0 43.0 0.0
> 48241 3.0 17.0 39.0 0.0
> 49841 1.0 2.0 36.0 0.0
> 54789 3.0 3.0 27.0 0.0
> 57159 9.0 9.0 22.0 0.0
> 66517 2.0 2.0 4.0 0.0
> 67991 1.0 1.0 1.0 0.0
>
> In [90]: bgf = BetaGeoFitter(penalizer_coef=penalizer_coef)
> ...: bgf.fit(df2['frequency'], df2['recency'], df2['T'])
> Out[90]: <lifetimes.BetaGeoFitter: fitted with 68878 subjects, a: 1.08, alpha: 0.74, b: 0.65, r: 0.03>
> ```
> Fitting df still throws a ConvergenceError:
> ```
> In [92]: df2 = df.copy()
> ...: df2.loc[idx, "monetary_value"] = 0
>
> In [93]: df2[idx]
> Out[93]:
> user_id frequency recency T monetary_value
> 11498 1515915625531317256 6.0 16.0 124.0 0.0
> 17791 1515915625538189543 1.0 1.0 109.0 0.0
> 18037 1515915625538353966 1.0 1.0 109.0 0.0
> 19800 1515915625539864468 2.0 3.0 104.0 0.0
> 22464 1515915625542102075 6.0 36.0 69.0 0.0
> 24717 1515915625545486890 11.0 11.0 93.0 0.0
> 26202 1515915625547164014 1.0 12.0 88.0 0.0
> 26729 1515915625547973880 2.0 14.0 34.0 0.0
> 29519 1515915625561317292 1.0 5.0 79.0 0.0
> 35798 1515915625569444951 1.0 1.0 63.0 0.0
> 36034 1515915625569751989 1.0 1.0 63.0 0.0
> 39156 1515915625573167676 1.0 1.0 54.0 0.0
> 39566 1515915625573482744 1.0 2.0 53.0 0.0
> 39687 1515915625573575950 2.0 3.0 53.0 0.0
> 39737 1515915625573629519 1.0 1.0 53.0 0.0
> 44185 1515915625592904652 1.0 6.0 45.0 0.0
> 45628 1515915625593770495 1.0 1.0 43.0 0.0
> 48241 1515915625595271558 3.0 17.0 39.0 0.0
> 49841 1515915625596215381 1.0 2.0 36.0 0.0
> 54789 1515915625599473044 3.0 3.0 27.0 0.0
> 57159 1515915625601113987 9.0 9.0 22.0 0.0
> 66517 1515915625609072139 2.0 2.0 4.0 0.0
> 67991 1515915625610224305 1.0 1.0 1.0 0.0
>
> In [94]: bgf = BetaGeoFitter(penalizer_coef=penalizer_coef)
> ...: bgf.fit(df2['frequency'], df2['recency'], df2['T'])
> fun: -0.03513675395757231
> hess_inv: array([[ 13.30839758, 17.8546921 , -0.17820442, 0.31872313],
> [ 17.8546921 , 73.49152334, -1.06609042, 0.96429223],
> [ -0.17820442, -1.06609042, 65.85101032, 67.62388159],
> [ 0.31872313, 0.96429223, 67.62388159, 109.01577057]])
> jac: array([ 1.17874160e-06, -6.62967570e-07, 1.06154732e-06, 1.56458773e-06])
> message: 'Desired error not necessarily achieved due to precision loss.'
> nfev: 130
> nit: 29
> njev: 117
> status: 2
> success: False
> x: array([-3.59592079, -5.36183489, 0.07652525, -0.4253566 ])
> ---------------------------------------------------------------------------
> ConvergenceError Traceback (most recent call last)
> /data/modou/python/clv.py in <module>
> 1 bgf = BetaGeoFitter(penalizer_coef=penalizer_coef)
> ----> 2 bgf.fit(df2['frequency'], df2['recency'], df2['T'])
> /data/modou/conda/envs/py36/lib/python3.6/site-packages/lifetimes/fitters/beta_geo_fitter.py in fit(self, frequency, recency, T, weights, initial_params, verb
> ose, tol, index, **kwargs)
> 141 verbose,
> 142 tol,
> --> 143 **kwargs
> 144 )
> 145
>
> /data/modou/conda/envs/py36/lib/python3.6/site-packages/lifetimes/fitters/__init__.py in _fit(self, minimizing_function_args, initial_params, params_size, dis
> p, tol, bounds, **kwargs)
> 117 """
> 118 The model did not converge. Try adding a larger penalizer to see if that helps convergence.
> --> 119 """
> 120 )
> 121 )
>
> ConvergenceError:
> The model did not converge. Try adding a larger penalizer to see if that helps convergence.
> ```
> ## As a result, df still fails
> There must be something strange about df (the frame transformed in Spark): how can it still fail even after the monetary_value of the idx rows has been zeroed?
> I just want to figure this out.
> ## Update
> Writing df out to CSV and reading it back makes the fit succeed!? Very strange.
> ```
> In [108]: df["monetary_value"].sum()
> Out[108]: 4596984.839164658
>
> In [109]: df1["monetary_value"].sum()
> Out[109]: 4596984.8391646575
>
>
> In [111]: df.to_csv('e.csv', index=False, header=True)
>
> In [112]: x = pd.read_csv('e.csv')
>
> In [113]: x["monetary_value"].sum()
> Out[113]: 4596984.8391646575
> In [114]: bgf = BetaGeoFitter(penalizer_coef=penalizer_coef)
> ...: bgf.fit(x['frequency'], x['recency'], x['T'])
> ...:
> Out[114]: <lifetimes.BetaGeoFitter: fitted with 68878 subjects, a: 1.08, alpha: 0.74, b: 0.65, r: 0.03>
> ```
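> One possible explanation for the round trip changing the values (an assumption, not verified against this dataset): if the CSV is written with fewer than the 17 significant digits needed to round-trip a float64, e.g. via float_format, reading it back snaps every value to the printed precision, discarding the trailing bits where df and df1 disagreed. A made-up illustration:
>
> ```python
> import io
> import pandas as pd
>
> df = pd.DataFrame({"monetary_value": [426.4499999999]})
>
> buf = io.StringIO()
> # Writing with a limited float_format drops the trailing bits.
> df.to_csv(buf, index=False, float_format="%.6f")
> buf.seek(0)
> x = pd.read_csv(buf)
>
> print(x["monetary_value"].iloc[0])  # 426.45, no longer bit-identical to the input
> ```
>
> That would also explain why the reread frame behaves like df1: both have lost the extra bits that the Spark path introduced.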
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org