Posted to issues@spark.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2020/08/05 21:29:00 UTC
[jira] [Commented] (SPARK-32522) Using pyspark with a MultiLayerPerceptron model given inconsistent outputs if a large amount of data is fed into it and at least one of the model outputs is fed to a Python UDF.

    [ https://issues.apache.org/jira/browse/SPARK-32522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171773#comment-17171773 ] 

Dongjoon Hyun commented on SPARK-32522:
---------------------------------------

Is this a correctness issue?

> Using pyspark with a MultiLayerPerceptron model given inconsistent outputs if a large amount of data is fed into it and at least one of the model outputs is fed to a Python UDF.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32522
>                 URL: https://issues.apache.org/jira/browse/SPARK-32522
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3, 3.1.0
>         Environment: CentOS 7.6 with Python 3.6.3 and Spark 2.4.3
> or
> CentOS 7.6 with Python 3.6.3 and Spark built from master
>            Reporter: Ben Smith
>            Priority: Major
>              Labels: correctness
>         Attachments: model.zip, pyspark-script.py
>
>
> Using pyspark with a MultiLayerPerceptron model gives inconsistent outputs if a large amount of data is fed into it and at least one of the model outputs is fed to a Python UDF.
> This data correctness issue impacts both the Spark 2.4 releases and the latest master branch.
> I do not understand the root cause and cannot recreate the problem 100% of the time, but I have a simplified code sample (attached) that triggers the bug regularly. I raised an inquiry on the mailing list as a Spark 2.4 issue, but nobody suggested a root cause. I have since recreated the problem on master, so I am now raising a bug here.
> During debugging I have narrowed the problem down somewhat. Some observations I made while doing this are:
>  * I can recreate the problem with a very simple MultilayerPerceptron with no hidden layers (just 14 features and 2 outputs). I also see it with a more complex MultilayerPerceptron model, so I don't think the model details are important.
>  * I cannot recreate the problem unless the model output is fed to a Python UDF. Removing the UDF leads to good model outputs; keeping it makes the model outputs inconsistent (note that it is not just the Python UDF outputs that are inconsistent).
>  * I cannot recreate the problem on minuscule amounts of data or when my data is partitioned heavily. With 100,000 rows of input in 2 partitions, the issue happens most of the time.
>  * Some of the bad outputs I get could be explained if certain features were zero when they came into the model (they are not zero in my actual feature data).
>  * I can recreate the problem on several different servers.
> My environment is CentOS 7.6 with Python 3.6.3 and Spark 2.4.3. I can also recreate the issue with the code on the Spark master branch, but strangely I cannot recreate it with Spark 2.4.3 and Python 2.7. I'm not sure why the version of Python would matter.
> The attached code sample triggers the problem for me the vast majority of the time when pasted into a pyspark shell. This code generates a dataframe containing 100,000 identical rows, transforms it with a MultiLayerPerceptron model, and feeds one of the model output columns to a simple Python UDF to generate an additional column. It then selects the distinct rows of the resulting dataframe. Since all the inputs are identical, I would expect to get 1 row back; instead I get many unique rows, with the number returned varying each time I run the code. To run the code you will need the model files locally. I have attached the model as a zip archive, and unzipping this to /tmp should be all you need to do to get the code to run.
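
For readers without the attachment, the steps described above might look roughly like the sketch below. This is not the attached pyspark-script.py. The model path, the placeholder feature values, the "features" and "prediction" column names, the model class being loaded, and the UDF body are all assumptions made for illustration; only the row count, partition count, and feature count come from the report.

```python
# Sketch only: an approximation of the reproduction described in the report,
# NOT the attached pyspark-script.py.

N_ROWS = 100_000            # from the report
N_PARTITIONS = 2            # from the report
N_FEATURES = 14             # from the report
MODEL_PATH = "/tmp/model"   # assumption: where model.zip was unpacked


def make_rows(n_rows=N_ROWS, n_features=N_FEATURES):
    """Build n_rows identical feature rows (the values are placeholders)."""
    features = tuple(float(i + 1) for i in range(n_features))
    return [features] * n_rows


def count_distinct_rows(spark, model_path=MODEL_PATH):
    """Score identical rows, add a Python UDF column, and return the number
    of distinct rows in the result (1 if the outputs are consistent)."""
    # Imported lazily so make_rows() can be used without Spark installed.
    from pyspark.ml.classification import MultilayerPerceptronClassificationModel
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # Assumption: the saved model reads its input from a "features" column.
    data = [(Vectors.dense(row),) for row in make_rows()]
    df = spark.createDataFrame(data, ["features"]).repartition(N_PARTITIONS)

    # Assumption: the archive holds a bare MLP classification model.
    model = MultilayerPerceptronClassificationModel.load(model_path)
    scored = model.transform(df)

    # Hypothetical trivial UDF fed by one model output column.
    plus_one = F.udf(lambda p: float(p) + 1.0, DoubleType())
    result = scored.withColumn("udf_out", plus_one(F.col("prediction")))

    return result.distinct().count()
```

In a pyspark shell, calling count_distinct_rows(spark) would be expected to return 1 if the outputs were consistent; the report says many distinct rows come back instead.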


