Posted to issues@spark.apache.org by "Ben Smith (Jira)" <ji...@apache.org> on 2020/08/03 16:35:00 UTC

[jira] [Created] (SPARK-32522) Using pyspark with a MultiLayerPerceptron model gives inconsistent outputs if a large amount of data is fed into it and at least one of the model outputs is fed to a Python UDF.

Ben Smith created SPARK-32522:
---------------------------------

             Summary: Using pyspark with a MultiLayerPerceptron model gives inconsistent outputs if a large amount of data is fed into it and at least one of the model outputs is fed to a Python UDF.
                 Key: SPARK-32522
                 URL: https://issues.apache.org/jira/browse/SPARK-32522
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.3, 3.1.0
         Environment: CentOS 7.6 with Python 3.6.3 and Spark 2.4.3

or

CentOS 7.6 with Python 3.6.3 and Spark built from master
            Reporter: Ben Smith
         Attachments: model.zip

Using pyspark with a MultiLayerPerceptron model gives inconsistent outputs if a large amount of data is fed into it and at least one of the model outputs is fed to a Python UDF.

This data correctness issue impacts both the Spark 2.4 releases and the latest master branch.

I do not understand the root cause and cannot recreate it 100% of the time, but I have a simplified code sample (attached) that triggers the bug regularly. I raised an inquiry on the mailing list as a Spark 2.4 issue, but nobody suggested a root cause, and I have since recreated the problem on master, so I am now raising a bug here.

During debugging I have narrowed the problem down somewhat; some observations I have made while doing this are:
 * I can recreate the problem with a very simple MultilayerPerceptron with no hidden layers (just 14 features and 2 outputs); I also see it with a more complex MultilayerPerceptron model, so I don't think the model details are important (see the sketch after this list).
 * I cannot recreate the problem unless the model output is fed to a Python UDF. Removing the UDF yields correct model outputs, while adding it makes the model outputs themselves inconsistent (note that it is not just the Python UDF outputs that are inconsistent).
 * I cannot recreate the problem on minuscule amounts of data or when my data is partitioned heavily. 100,000 rows of input with 2 partitions sees the issue happen most of the time.
 * Some of the bad outputs I get could be explained if certain features were zero when they came into the model (they are not zero in my actual feature data).
 * I can recreate the problem on several different servers.
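
For reference, a model of the shape described in the first bullet can be constructed as below. This is only an illustrative sketch (the training rows, seed, and save path are made up for the example; the model I actually used is the one attached as model.zip). It assumes a pyspark shell, so "spark" is already in scope:

from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.linalg import Vectors

# Illustrative only: an MLP with no hidden layers -- 14 input
# features mapped straight to 2 output classes, as described above.
train = spark.createDataFrame(
    [(Vectors.dense([0.5] * 14), 0.0),
     (Vectors.dense([1.5] * 14), 1.0)],
    ["features", "label"])

mlp = MultilayerPerceptronClassifier(layers=[14, 2], seed=42)
model = mlp.fit(train)
model.write().overwrite().save("/tmp/model")  # made-up path for this sketch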

My environment is CentOS 7.6 with Python 3.6.3 and Spark 2.4.3. I can also recreate the issue from the code on the Spark master branch, but strangely I cannot recreate it with Spark 2.4.3 and Python 2.7. I'm not sure why the version of Python would matter.

The attached code sample triggers the problem for me the vast majority of the time when pasted into a pyspark shell. This code generates a dataframe containing 100,000 identical rows, transforms it with a MultiLayerPerceptron model, and feeds one of the model output columns to a simple Python UDF to generate an additional column. The distinct rows of the resulting dataframe are then selected. Since all the inputs are identical I would expect to get 1 row back; instead I get many unique rows, with the number returned varying each time I run the code. To run the code you will need the model files locally: I have attached the model as a zip archive, and unzipping it to /tmp should be all you need to do to get the code to run.
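
In case the attachment is inconvenient, here is a sketch of the repro. It assumes a pyspark shell (so "spark" is in scope) and that the unzipped model lands at /tmp/model; the feature values and output column names below are illustrative rather than copied verbatim from the attachment:

from pyspark.ml.classification import MultilayerPerceptronClassificationModel
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Load the attached model (unzip model.zip to /tmp first;
# /tmp/model is an assumed location).
model = MultilayerPerceptronClassificationModel.load("/tmp/model")

# 100,000 identical rows in only 2 partitions -- few, large
# partitions make the inconsistency far more likely.
row = (Vectors.dense([0.1] * 14),)
df = spark.createDataFrame([row] * 100000, ["features"]).repartition(2)

scored = model.transform(df)

# A trivial Python UDF over one of the model's output columns.
first_prob = udf(lambda v: float(v[0]), DoubleType())
result = scored.withColumn("udf_out", first_prob(scored["probability"]))

# Every input row is identical, so this should print 1;
# on affected setups it prints a varying, larger number.
print(result.select("prediction", "udf_out").distinct().count())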



