Posted to reviews@spark.apache.org by logannc <gi...@git.apache.org> on 2017/08/15 01:48:09 UTC

[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...

GitHub user logannc opened a pull request:

    https://github.com/apache/spark/pull/18945

    Add option to convert nullable int columns to float columns in toPand…

    …as to prevent needless Exceptions during routine use.
    
    Add the `strict=True` kwarg to DataFrame.toPandas to allow for a non-strict interpretation of the schema of a dataframe. This is currently limited to allowing a nullable int column to be interpreted as a float column (a float column is the only way Pandas can represent a nullable int column, and `toPandas` actually crashes without this).
    
    I consider this small change a massive quality-of-life improvement for DataFrames with lots of nullable int columns, which would otherwise need a litany of `df.withColumn(name, F.col(name).cast(DoubleType()))` calls, etc., just to view them easily or interact with them in memory.
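    
    A minimal sketch of the failure and the manual workaround (assuming a local `SparkSession` bound to `spark`; the column name `x` is illustrative, not from the patch):
    
    ```python
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType, IntegerType, StructField, StructType
    
    # A nullable int column that actually contains a null.
    schema = StructType([StructField("x", IntegerType(), nullable=True)])
    df = spark.createDataFrame([(1,), (None,)], schema)
    
    df.toPandas()  # raises: a None cannot be stored in the int dtype derived from the schema
    
    # The workaround this change targets: cast every nullable int column by hand.
    df.withColumn("x", F.col("x").cast(DoubleType())).toPandas()  # works
    ```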
    
    **Possible Objections**
    * I foresee concerns with the name of the kwarg, of which I am open to suggestions.
    * I also foresee possible objections due to the potential for needless conversion of nullable int columns to floats when there are actually no null values. I would counter those objections by noting that it only occurs when strict=False, which is not the default, and can be avoided on a per-column basis by setting the `nullable` property of the schema to False. 
    
    **Alternatives**
    * Rename the kwarg to be specific to the current change. i.e., `nullable_int_to_float` instead of `strict` or some other, similar name.
    * Fix Pandas to allow nullable int columns. (Very difficult, per Wes McKinney, due to lack of NumPy support; see the short illustration after this list. https://stackoverflow.com/questions/11548005/numpy-or-pandas-keeping-array-type-as-integer-while-having-a-nan-value)
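    
    A quick illustration of that NumPy/Pandas limitation (a sketch of well-known library behaviour, nothing PR-specific):
    
    ```python
    import numpy as np
    import pandas as pd
    
    pd.Series([1, 2, 3]).dtype     # int64
    pd.Series([1, 2, None]).dtype  # float64: pandas falls back to float to hold NaN
    np.array([1, 2, None]).dtype   # object: NumPy integer arrays cannot hold NaN
    ```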


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/logannc/spark nullable_int_pandas

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18945.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18945
    
----
commit bceeefca77dd3414e4ec97ad3570043ec3ce3059
Author: Logan Collins <lo...@gmail.com>
Date:   2017-08-15T01:30:08Z

    Add option to convert nullable int columns to float columns in toPandas to prevent needless crashes.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    **[Test build #82061 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82061/testReport)** for PR 18945 at commit [`14f36c3`](https://github.com/apache/spark/commit/14f36c354f65a34e3e06cd4d35029e5f8f2b79f0).


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    BTW, I think it'd be nicer if we can go with the approach above ^ (checking for nulls in the data and setting the correct type). I am okay with any form of the approach above, as we now have a decent Arrow optimization for the performance aspect.


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    > Another rough thought for a feasible way I could think to keep the current behaviour (to be more specific, to match the types with / without Arrow optimization, IIUC) is, to make a generator wrapper to check if None exists in the results for columns where the type is int and nullable.
    
    I'm not sure I follow @HyukjinKwon; we can just look at the `pdf` when it is first created, before any type conversions, and simply not do anything if the column is a nullable int with null values. Is this similar to what you're suggesting?


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    @logannc, mind adding the JIRA number to this PR title as described in the guidelines?
    I definitely also think this needs a JIRA, as the behaviour before and after is not virtually the same and it looks like a non-trivial change - this could even be a regression if SPARK-21163 was resolved in 2.2.0. I believe tests are required here too.


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    **[Test build #82024 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82024/testReport)** for PR 18945 at commit [`18d8d72`](https://github.com/apache/spark/commit/18d8d72d8d81e92e03ff66be15fb0bf323e3997d).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82060/
    Test FAILed.


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by logannc <gi...@git.apache.org>.
Github user logannc commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    I've continued with @HyukjinKwon's suggestion because it should be more performant and handles the conversion without loss of precision. I believe I've addressed your concerns by only changing the type when we encounter a null (duh).


---



[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...

Posted by a10y <gi...@git.apache.org>.
Github user a10y commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140142458
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1761,12 +1761,37 @@ def toPandas(self):
                     raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 dtype = {}
    +            columns_with_null_int = {}
    +            def null_handler(rows, columns_with_null_int):
    +                for row in rows:
    +                    row = row.asDict()
    +                    for column in columns_with_null_int:
    +                        val = row[column]
    +                        dt = dtype[column]
    +                        if val is not None:
    +                            if abs(val) > 16777216: # Max value before np.float32 loses precision.
    +                                val = np.float64(val)
    +                                if np.float64 != dt:
    --- End diff --
    
    (Also thanks for solving the precision issue)!


---



[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140412632
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1761,12 +1761,37 @@ def toPandas(self):
                     raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 dtype = {}
    +            columns_with_null_int = set()
    +            def null_handler(rows, columns_with_null_int):
    +                for row in rows:
    +                    row = row.asDict()
    +                    for column in columns_with_null_int:
    +                        val = row[column]
    +                        dt = dtype[column]
    +                        if val is not None:
    +                            if abs(val) > 16777216: # Max value before np.float32 loses precision.
    +                                val = np.float64(val)
    +                                dt = np.float64
    +                                dtype[column] = np.float64
    +                            else:
    +                                val = np.float32(val)
    +                                if dt not in (np.float32, np.float64):
    +                                    dt = np.float32
    +                                    dtype[column] = np.float32
    +                            row[column] = val
    +                    row = Row(**row)
    +                    yield row
    +            row_handler = lambda x,y: x
                 for field in self.schema:
                     pandas_type = _to_corrected_pandas_type(field.dataType)
    +                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
    +                    columns_with_null_int.add(field.name)
    +                    row_handler = null_handler
    +                    pandas_type = np.float32
    --- End diff --
    
    I don't think this is a correct fix.


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    **[Test build #82067 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82067/testReport)** for PR 18945 at commit [`6e16cd8`](https://github.com/apache/spark/commit/6e16cd82434c82cd7213ae8ef2b52e1c42e607cf).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by logannc <gi...@git.apache.org>.
Github user logannc commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Sorry for the delay. Things got busy and now there is the storm in Houston. Will update this per these suggestions soon.


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    **[Test build #82061 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82061/testReport)** for PR 18945 at commit [`14f36c3`](https://github.com/apache/spark/commit/14f36c354f65a34e3e06cd4d35029e5f8f2b79f0).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140166654
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1761,12 +1761,37 @@ def toPandas(self):
                     raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 dtype = {}
    +            columns_with_null_int = {}
    +            def null_handler(rows, columns_with_null_int):
    +                for row in rows:
    +                    row = row.asDict()
    +                    for column in columns_with_null_int:
    +                        val = row[column]
    +                        dt = dtype[column]
    +                        if val is not None:
    --- End diff --
    
    Don't we want to fix the issue where, when the pandas type is in (np.int8, np.int16, np.int32) and the field is nullable, the `dtype` we get will cause an exception later when converting a `None` to an int type such as np.int16?
    



---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    @logannc, thanks for this. You bring up a big issue here that I think was overlooked when this code was added to Spark. I filed a JIRA for this, [SPARK-21766](https://issues.apache.org/jira/browse/SPARK-21766), which generally comes before the PR. Please see the contributing guide [here](http://spark.apache.org/contributing.html).


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82024/
    Test FAILed.


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Hey @logannc, have you had some time to work on this? I want to fix this issue ASAP. Otherwise, would anyone here be interested in submitting another PR for the other approach?


---



[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140415073
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1761,12 +1761,37 @@ def toPandas(self):
                     raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 dtype = {}
    +            columns_with_null_int = set()
    +            def null_handler(rows, columns_with_null_int):
    +                for row in rows:
    +                    row = row.asDict()
    +                    for column in columns_with_null_int:
    +                        val = row[column]
    +                        dt = dtype[column]
    +                        if val is not None:
    +                            if abs(val) > 16777216: # Max value before np.float32 loses precision.
    +                                val = np.float64(val)
    +                                dt = np.float64
    +                                dtype[column] = np.float64
    +                            else:
    +                                val = np.float32(val)
    +                                if dt not in (np.float32, np.float64):
    +                                    dt = np.float32
    +                                    dtype[column] = np.float32
    +                            row[column] = val
    +                    row = Row(**row)
    +                    yield row
    +            row_handler = lambda x,y: x
                 for field in self.schema:
                     pandas_type = _to_corrected_pandas_type(field.dataType)
    +                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
    +                    columns_with_null_int.add(field.name)
    +                    row_handler = null_handler
    +                    pandas_type = np.float32
    --- End diff --
    
    I will take my suggestion back. I think their suggestions are better than mine.


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82065/
    Test FAILed.


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    ok to test


---



[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140144263
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1761,12 +1761,37 @@ def toPandas(self):
                     raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 dtype = {}
    +            columns_with_null_int = {}
    +            def null_handler(rows, columns_with_null_int):
    +                for row in rows:
    +                    row = row.asDict()
    +                    for column in columns_with_null_int:
    +                        val = row[column]
    +                        dt = dtype[column]
    +                        if val is not None:
    --- End diff --
    
    Don't we want to change the data type for `None` values? I don't see you doing that.


---



[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r133921465
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1731,7 +1731,7 @@ def toDF(self, *cols):
             return DataFrame(jdf, self.sql_ctx)
     
         @since(1.3)
    -    def toPandas(self):
    +    def toPandas(self, strict=True):
    --- End diff --
    
    BTW, this change looks useless when the optimization is enabled, if I understood correctly. I wouldn't add an option and make this complicated.


---



[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...

Posted by logannc <gi...@git.apache.org>.
Github user logannc commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140148777
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1761,12 +1761,37 @@ def toPandas(self):
                     raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 dtype = {}
    +            columns_with_null_int = {}
    +            def null_handler(rows, columns_with_null_int):
    +                for row in rows:
    +                    row = row.asDict()
    +                    for column in columns_with_null_int:
    +                        val = row[column]
    +                        dt = dtype[column]
    +                        if val is not None:
    +                            if abs(val) > 16777216: # Max value before np.float32 loses precision.
    +                                val = np.float64(val)
    +                                if np.float64 != dt:
    --- End diff --
    
    No, it's not strictly necessary, but it's also hardly harmful and may future-proof things a bit...? Anyway, it can be removed if you think it should be.


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Ah, I should have been clearer. I thought of something like ...
    
    ```python
    dtype = {}
    for field in self.schema:
        pandas_type = _to_corrected_pandas_type(field.dataType)
        if pandas_type is not None:
            dtype[field.name] = pandas_type
    
    # Columns with int + nullable from the schema.
    int_null_cols = [...]
    
    # Columns with int + nullable that contain an actual None.
    int_null_cols_with_none = [...]
    
    # This generator checks each row for None values as the rows are consumed.
    def check_nulls(rows):
        for row in rows:
            # Check against int_null_cols and add to int_null_cols_with_none
            # if there is a None.
            yield row
    
    # Don't check anything if there are no int + nullable columns.
    if len(int_null_cols) > 0:
        check_func = check_nulls
    else:
        check_func = lambda rows: rows
    
    pdf = pd.DataFrame.from_records(check_func(self.collect()), columns=self.columns)
    
    # Replace int32 -> float for the columns recorded in int_null_cols_with_none.
    dtype = ...
    
    for f, t in dtype.items():
        pdf[f] = pdf[f].astype(t, copy=False)
    return pdf
    ```
    
    So, I was thinking that checking the actual values in the data might be a way forward if we can't resolve this with the schema alone.


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    **[Test build #82024 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82024/testReport)** for PR 18945 at commit [`18d8d72`](https://github.com/apache/spark/commit/18d8d72d8d81e92e03ff66be15fb0bf323e3997d).


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82063/
    Test FAILed.


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    @BryanCutler, @a10y and @viirya, would you guys be interested in this and have some time to take it over with the different approach we discussed above - https://github.com/apache/spark/pull/18945#issuecomment-323917328 and https://github.com/apache/spark/pull/18945#discussion_r134033952? I could take it over too if you guys are currently busy.


---



[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

Posted by logannc <gi...@git.apache.org>.
Github user logannc commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140419964
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1760,13 +1760,39 @@ def toPandas(self):
                           "if using spark.sql.execution.arrow.enable=true"
                     raise ImportError("%s\n%s" % (e.message, msg))
             else:
    +            import numpy as np
                 dtype = {}
    +            nullable_int_columns = set()
    +
    +            def null_handler(rows, nullable_int_columns):
    +                requires_double_precision = set()
    +                for row in rows:
    +                    row = row.asDict()
    +                    for column in nullable_int_columns:
    +                        val = row[column]
    +                        dt = dtype[column]
    +                        if val is None and dt not in (np.float32, np.float64):
    +                            dt = np.float64 if column in requires_double_precision else np.float32
    +                            dtype[column] = dt
    +                        elif val is not None:
    +                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
    --- End diff --
    
    Values above this cannot be represented losslessly as a `np.float32`.
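    
    A tiny check of that boundary (an illustrative sketch, not part of the patch):
    
    ```python
    import numpy as np
    
    # float32 has a 24-bit significand, so 2**24 = 16777216 is the last point
    # up to which every integer is exactly representable.
    assert np.float32(16777216) == 16777216
    assert np.float32(16777217) == np.float32(16777216)  # 16777217 rounds back down
    ```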


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Yea, I think it is basically a similar idea to https://github.com/apache/spark/pull/18945#discussion_r134033952.


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82025/
    Test FAILed.


---



[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r134031415
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1731,7 +1731,7 @@ def toDF(self, *cols):
             return DataFrame(jdf, self.sql_ctx)
     
         @since(1.3)
    -    def toPandas(self):
    +    def toPandas(self, strict=True):
    --- End diff --
    
    You're referring to the Arrow optimization, right? I also agree that we should not add this option, but rather just handle all of this automatically.


---



[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r134033952
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1762,7 +1762,7 @@ def toPandas(self):
             else:
    --- End diff --
    
    If we wanted to check that a nullable int field actually has null values, we could do it here and then not change the type if there are null values. We would have to construct the pandas DataFrame first, though.
    
    ```python
    pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
    
    dtype = {}
    for field in self.schema:
        # Skip the type correction for a nullable int column that really contains nulls.
        if not (field.dataType == IntegerType() and field.nullable and pdf[field.name].isnull().any()):
            pandas_type = _to_corrected_pandas_type(field.dataType)
            if pandas_type is not None:
                dtype[field.name] = pandas_type
    
    for f, t in dtype.items():
        pdf[f] = pdf[f].astype(t, copy=False)
    return pdf
    ```
    This does make a pass over the data to check, though.


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    @logannc, mind adding the JIRA number to this PR title as described in the guidelines? Please take a look - http://spark.apache.org/contributing.html.
    
    I'd carefully read the comments above, e.g., adding a test (https://github.com/apache/spark/pull/18945#issuecomment-323307343), fixing the PR title (https://github.com/apache/spark/pull/18945#issuecomment-323162796), and following the suggestion (https://github.com/apache/spark/pull/18945#issuecomment-331008594).


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82067/
    Test PASSed.


---



[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

Posted by logannc <gi...@git.apache.org>.
Github user logannc commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140412857
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1761,12 +1761,37 @@ def toPandas(self):
                     raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 dtype = {}
    +            columns_with_null_int = {}
    +            def null_handler(rows, columns_with_null_int):
    +                for row in rows:
    +                    row = row.asDict()
    +                    for column in columns_with_null_int:
    +                        val = row[column]
    +                        dt = dtype[column]
    +                        if val is not None:
    --- End diff --
    
    If `pandas_type in (np.int8, np.int16, np.int32) and field.nullable` and there are ANY non-null values, the dtype of the column is changed to `np.float32` or `np.float64`, both of which properly handle `None` values.
    
    That said, if the entire column were `None`, it would fail. Therefore I have preemptively changed the type on line 1787 to `np.float32`. Per `null_handler`, it may still change to `np.float64` if needed.
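    
    For reference, a small pandas-only sketch of why the float dtypes are the safe target (illustrative, not from the patch):
    
    ```python
    import numpy as np
    import pandas as pd
    
    s = pd.Series([1, None])  # pandas already stores this as float64 with a NaN
    s.astype(np.float32)      # fine: NaN is representable in any float dtype
    s.astype(np.int32)        # raises ValueError: cannot convert non-finite values to integer
    ```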


---



[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140167087
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1761,12 +1761,37 @@ def toPandas(self):
                     raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 dtype = {}
    +            columns_with_null_int = {}
    +            def null_handler(rows, columns_with_null_int):
    +                for row in rows:
    +                    row = row.asDict()
    +                    for column in columns_with_null_int:
    +                        val = row[column]
    +                        dt = dtype[column]
    +                        if val is not None:
    --- End diff --
    
    You can follow the https://github.com/apache/spark/pull/18945/files#r134033952 suggested.


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82066/
    Test FAILed.


---



[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140414042
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1761,12 +1761,37 @@ def toPandas(self):
                     raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 dtype = {}
    +            columns_with_null_int = set()
    +            def null_handler(rows, columns_with_null_int):
    +                for row in rows:
    +                    row = row.asDict()
    +                    for column in columns_with_null_int:
    +                        val = row[column]
    +                        dt = dtype[column]
    +                        if val is not None:
    +                            if abs(val) > 16777216: # Max value before np.float32 loses precision.
    +                                val = np.float64(val)
    +                                dt = np.float64
    +                                dtype[column] = np.float64
    +                            else:
    +                                val = np.float32(val)
    +                                if dt not in (np.float32, np.float64):
    +                                    dt = np.float32
    +                                    dtype[column] = np.float32
    +                            row[column] = val
    +                    row = Row(**row)
    +                    yield row
    +            row_handler = lambda x,y: x
                 for field in self.schema:
                     pandas_type = _to_corrected_pandas_type(field.dataType)
    +                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
    +                    columns_with_null_int.add(field.name)
    +                    row_handler = null_handler
    +                    pandas_type = np.float32
    --- End diff --
    
    Have you carefully read the comments in https://github.com/apache/spark/pull/18945#discussion_r134033952 and https://github.com/apache/spark/pull/18945#discussion_r134925269? They are good suggestions for this issue. I don't know why you don't want to follow them and check for null values with Pandas...


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    **[Test build #82062 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82062/testReport)** for PR 18945 at commit [`6e248dd`](https://github.com/apache/spark/commit/6e248ddf96122910468a3f20125ff4fc9f32f299).


---



[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...

Posted by logannc <gi...@git.apache.org>.
Github user logannc commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140153330
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1761,12 +1761,37 @@ def toPandas(self):
                     raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 dtype = {}
    +            columns_with_null_int = {}
    +            def null_handler(rows, columns_with_null_int):
    +                for row in rows:
    +                    row = row.asDict()
    +                    for column in columns_with_null_int:
    +                        val = row[column]
    +                        dt = dtype[column]
    +                        if val is not None:
    +                            if abs(val) > 16777216: # Max value before np.float32 loses precision.
    +                                val = np.float64(val)
    +                                if np.float64 != dt:
    +                                    dt = np.float64
    +                                    dtype[column] = np.float64
    +                            else:
    +                                val = np.float32(val)
    +                                if dt not in (np.float32, np.float64):
    +                                    dt = np.float32
    +                                    dtype[column] = np.float32
    +                            row[column] = val
    +                    row = Row(**row)
    +                    yield row
    +            row_handler = lambda x,y: x
                 for field in self.schema:
                     pandas_type = _to_corrected_pandas_type(field.dataType)
    +                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
    +                    columns_with_null_int.add(field.name)
    --- End diff --
    
    Fixed...


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    There should be no way an error like this is raised during a `toPandas()` call, so I am thinking that if there is a nullable int column, we should not try to change the type in `_to_corrected_pandas_type`. cc @HyukjinKwon @ueshin


---



[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r134925269
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1762,7 +1762,7 @@ def toPandas(self):
             else:
    --- End diff --
    
    If we use this approach, how about the following to check if the type corrections are needed:
    
    ```python
    dtype = {}
    for field in self.schema:
        pandas_type = _to_corrected_pandas_type(field.dataType)
        if pandas_type is not None and not (field.nullable and pdf[field.name].isnull().any()):
            dtype[field.name] = pandas_type
    ```


---



[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140414202
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1761,12 +1761,37 @@ def toPandas(self):
                     raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 dtype = {}
    +            columns_with_null_int = set()
    +            def null_handler(rows, columns_with_null_int):
    +                for row in rows:
    +                    row = row.asDict()
    +                    for column in columns_with_null_int:
    +                        val = row[column]
    +                        dt = dtype[column]
    +                        if val is not None:
    +                            if abs(val) > 16777216: # Max value before np.float32 loses precision.
    +                                val = np.float64(val)
    +                                dt = np.float64
    +                                dtype[column] = np.float64
    +                            else:
    +                                val = np.float32(val)
    +                                if dt not in (np.float32, np.float64):
    +                                    dt = np.float32
    +                                    dtype[column] = np.float32
    +                            row[column] = val
    +                    row = Row(**row)
    +                    yield row
    +            row_handler = lambda x,y: x
                 for field in self.schema:
                     pandas_type = _to_corrected_pandas_type(field.dataType)
    +                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
    +                    columns_with_null_int.add(field.name)
    +                    row_handler = null_handler
    +                    pandas_type = np.float32
    --- End diff --
    
    A simple problem with this line: even if this condition is met, that doesn't necessarily mean there are null values in the column. But you forcibly set the type to np.float32.


---



[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

Posted by logannc <gi...@git.apache.org>.
Github user logannc commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140414783
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1761,12 +1761,37 @@ def toPandas(self):
                     raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 dtype = {}
    +            columns_with_null_int = set()
    +            def null_handler(rows, columns_with_null_int):
    +                for row in rows:
    +                    row = row.asDict()
    +                    for column in columns_with_null_int:
    +                        val = row[column]
    +                        dt = dtype[column]
    +                        if val is not None:
    +                            if abs(val) > 16777216: # Max value before np.float32 loses precision.
    +                                val = np.float64(val)
    +                                dt = np.float64
    +                                dtype[column] = np.float64
    +                            else:
    +                                val = np.float32(val)
    +                                if dt not in (np.float32, np.float64):
    +                                    dt = np.float32
    +                                    dtype[column] = np.float32
    +                            row[column] = val
    +                    row = Row(**row)
    +                    yield row
    +            row_handler = lambda x,y: x
                 for field in self.schema:
                     pandas_type = _to_corrected_pandas_type(field.dataType)
    +                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
    +                    columns_with_null_int.add(field.name)
    +                    row_handler = null_handler
    +                    pandas_type = np.float32
    --- End diff --
    
    Ah, I see where I got confused. I had started with @ueshin's suggestion but abandoned it because I didn't want to create the DataFrame before the type correction, since I was also looking at @HyukjinKwon's suggestion. I somehow ended up combining them incorrectly.


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    **[Test build #82025 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82025/testReport)** for PR 18945 at commit [`18d8d72`](https://github.com/apache/spark/commit/18d8d72d8d81e92e03ff66be15fb0bf323e3997d).


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by logannc <gi...@git.apache.org>.
Github user logannc commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Hm. Where would I add tests?


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

Posted by logannc <gi...@git.apache.org>.
Github user logannc commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140413579
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1761,12 +1761,37 @@ def toPandas(self):
                     raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 dtype = {}
    +            columns_with_null_int = set()
    +            def null_handler(rows, columns_with_null_int):
    +                for row in rows:
    +                    row = row.asDict()
    +                    for column in columns_with_null_int:
    +                        val = row[column]
    +                        dt = dtype[column]
    +                        if val is not None:
    +                            if abs(val) > 16777216: # Max value before np.float32 loses precision.
    +                                val = np.float64(val)
    +                                dt = np.float64
    +                                dtype[column] = np.float64
    +                            else:
    +                                val = np.float32(val)
    +                                if dt not in (np.float32, np.float64):
    +                                    dt = np.float32
    +                                    dtype[column] = np.float32
    +                            row[column] = val
    +                    row = Row(**row)
    +                    yield row
    +            row_handler = lambda x,y: x
                 for field in self.schema:
                     pandas_type = _to_corrected_pandas_type(field.dataType)
    +                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
    +                    columns_with_null_int.add(field.name)
    +                    row_handler = null_handler
    +                    pandas_type = np.float32
    --- End diff --
    
    Can you elaborate? I believe it is, per my reply to your comment in the `null_handler`.


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Hey @logannc, let's not make it complicated for now and go with their suggestions first - https://github.com/apache/spark/pull/18945#discussion_r134033952 and https://github.com/apache/spark/pull/18945#discussion_r134925269.
    
    Maybe we can do a follow-up later with some small benchmark results for the performance and precision concerns (I guess the precision one is not a regression, BTW?). I think we should first match the behaviour with the case where `spark.sql.execution.arrow.enable` is enabled.


---



[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

Posted by logannc <gi...@git.apache.org>.
Github user logannc commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140412745
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1761,12 +1761,37 @@ def toPandas(self):
                     raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 dtype = {}
    +            columns_with_null_int = {}
    +            def null_handler(rows, columns_with_null_int):
    +                for row in rows:
    +                    row = row.asDict()
    +                    for column in columns_with_null_int:
    +                        val = row[column]
    +                        dt = dtype[column]
    +                        if val is not None:
    --- End diff --
    
    If `pandas_type in (np.int8, np.int16, np.int32) and field.nullable` and there are ANY non-null values, the dtype of the column is changed to `np.float32` or `np.float64`, both of which properly handle `None` values.
    
    That said, if the entire column was `None`, it would fail. Therefore I have preemptively changed the type on line 1787 to `np.float32`. Per `null_handler`, it may still change to `np.float64` if needed.
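    To make the all-`None` failure concrete (a minimal sketch; the exact exception type can vary by pandas version):
    
    ```python
    import numpy as np
    import pandas as pd
    
    # An all-None column cannot be materialized with an integer dtype:
    try:
        pd.Series([None, None], dtype=np.int32)
    except (TypeError, ValueError) as e:
        print("int dtype fails:", e)
    
    # A float dtype works, representing the missing values as NaN:
    print(pd.Series([None, None], dtype=np.float32))
    ```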


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by logannc <gi...@git.apache.org>.
Github user logannc commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    I read the contributing guide. It said that simple changes don't need a JIRA. This code change is certainly quite simple; I just wasn't sure whether there would be enough discussion to warrant a JIRA. Now I know.
    
    So, rather than return np.float32, return None? That would probably also work, though the bug might get reintroduced by someone unfamiliar with the problem. That is why I preferred the explicitness of returning a type.
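    For comparison, a minimal sketch of that "return None" alternative (a hypothetical variant of `_to_corrected_pandas_type`, not the committed code):
    
    ```python
    import numpy as np
    from pyspark.sql.types import ByteType, ShortType, IntegerType
    
    def _to_corrected_pandas_type_sketch(field, strict=True):
        """Return a corrected NumPy dtype for a field, or None to let pandas infer."""
        dt = field.dataType
        if type(dt) == ByteType:
            return np.int8
        if type(dt) == ShortType:
            return np.int16
        if type(dt) == IntegerType:
            if not strict and field.nullable:
                return None  # leave the dtype to pandas instead of forcing a float
            return np.int32
        return None
    ```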


---


[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    @logannc There are pandas related tests in `python/pyspark/sql/tests.py`.


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    **[Test build #82060 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82060/testReport)** for PR 18945 at commit [`b313a3b`](https://github.com/apache/spark/commit/b313a3b8fc88898423940f195ab16bd3a57c0061).


---



[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140420898
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1760,13 +1760,39 @@ def toPandas(self):
                           "if using spark.sql.execution.arrow.enable=true"
                     raise ImportError("%s\n%s" % (e.message, msg))
             else:
    +            import numpy as np
                 dtype = {}
    +            nullable_int_columns = set()
    +
    +            def null_handler(rows, nullable_int_columns):
    +                requires_double_precision = set()
    +                for row in rows:
    +                    row = row.asDict()
    +                    for column in nullable_int_columns:
    +                        val = row[column]
    +                        dt = dtype[column]
    +                        if val is None and dt not in (np.float32, np.float64):
    +                            dt = np.float64 if column in requires_double_precision else np.float32
    +                            dtype[column] = dt
    +                        elif val is not None:
    +                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
    --- End diff --
    
    I think they are represented as np.float64. I added a test in #19319.


---



[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...

Posted by logannc <gi...@git.apache.org>.
Github user logannc commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140147933
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1761,12 +1761,37 @@ def toPandas(self):
                     raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 dtype = {}
    +            columns_with_null_int = {}
    +            def null_handler(rows, columns_with_null_int):
    +                for row in rows:
    +                    row = row.asDict()
    +                    for column in columns_with_null_int:
    +                        val = row[column]
    +                        dt = dtype[column]
    +                        if val is not None:
    --- End diff --
    
    They are handled by Pandas already, so I am just letting them pass through.


---



[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...

Posted by a10y <gi...@git.apache.org>.
Github user a10y commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140141889
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1761,12 +1761,37 @@ def toPandas(self):
                     raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 dtype = {}
    +            columns_with_null_int = {}
    +            def null_handler(rows, columns_with_null_int):
    +                for row in rows:
    +                    row = row.asDict()
    +                    for column in columns_with_null_int:
    +                        val = row[column]
    +                        dt = dtype[column]
    +                        if val is not None:
    +                            if abs(val) > 16777216: # Max value before np.float32 loses precision.
    +                                val = np.float64(val)
    +                                if np.float64 != dt:
    --- End diff --
    
    Is this `if` totally necessary or can we just move the two assignments up?


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    **[Test build #82060 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82060/testReport)** for PR 18945 at commit [`b313a3b`](https://github.com/apache/spark/commit/b313a3b8fc88898423940f195ab16bd3a57c0061).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82061/
    Test FAILed.


---



[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...

Posted by a10y <gi...@git.apache.org>.
Github user a10y commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r139450187
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1810,17 +1810,20 @@ def _to_scala_map(sc, jm):
         return sc._jvm.PythonUtils.toScalaMap(jm)
     
     
    -def _to_corrected_pandas_type(dt):
    +def _to_corrected_pandas_type(field, strict=True):
         """
         When converting Spark SQL records to Pandas DataFrame, the inferred data type may be wrong.
         This method gets the corrected data type for Pandas if that type may be inferred uncorrectly.
         """
         import numpy as np
    +    dt = field.dataType
         if type(dt) == ByteType:
             return np.int8
         elif type(dt) == ShortType:
             return np.int16
         elif type(dt) == IntegerType:
    +        if not strict and field.nullable:
    +            return np.float32
    --- End diff --
    
    Is loss of precision a concern here? Some integers from the original dataset will now be rounded to the nearest representable float32 if I'm not mistaken.
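    For reference, a quick demonstration of the cutoff (plain NumPy, nothing assumed beyond `numpy` itself):
    
    ```python
    import numpy as np
    
    # float32 has a 24-bit significand, so every integer up to 2**24 = 16777216
    # is exactly representable; 16777217 is the first one that is not.
    print(int(np.float32(16777216)))                     # 16777216 -- exact
    print(int(np.float32(16777217)))                     # 16777216 -- rounded
    print(np.float32(16777217) == np.float32(16777216))  # True
    ```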


---



[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140419255
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1760,13 +1760,39 @@ def toPandas(self):
                           "if using spark.sql.execution.arrow.enable=true"
                     raise ImportError("%s\n%s" % (e.message, msg))
             else:
    +            import numpy as np
                 dtype = {}
    +            nullable_int_columns = set()
    +
    +            def null_handler(rows, nullable_int_columns):
    +                requires_double_precision = set()
    +                for row in rows:
    +                    row = row.asDict()
    +                    for column in nullable_int_columns:
    +                        val = row[column]
    +                        dt = dtype[column]
    +                        if val is None and dt not in (np.float32, np.float64):
    +                            dt = np.float64 if column in requires_double_precision else np.float32
    +                            dtype[column] = dt
    +                        elif val is not None:
    +                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
    --- End diff --
    
    Why do we need this?


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Can one of the admins verify this patch?


---


[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    We also need a proper test for this.


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by logannc <gi...@git.apache.org>.
Github user logannc commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Sorry I fell off the face of the earth. I finally had some time to sit down and do this. I took your suggestions but implemented them a little differently. Unless I've made a dumb mistake, I think I improved on it a bit.


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Thanks for clarifying, @HyukjinKwon, I see what you mean now. Since pandas will iterate over `self.collect()` anyway, I don't think your solution would impact performance at all, right? So your way might be better, but it is slightly more complicated.
    
    Just to sum things up - @logannc, does this still meet your requirements?
    Instead of having the `strict = True` option, we do the following:
    ```
    for each nullable int32 column:
        if there are null values:
            change column type to float32
        else:
            change column type to int32
    ```
    
    I'm also guessing we will have the same problem with nullable ShortType - maybe others? 
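    A minimal pandas-side sketch of that loop (hypothetical; assume `df` is the Spark DataFrame and `pdf` is the pandas DataFrame built from the collected rows):
    
    ```python
    import numpy as np
    from pyspark.sql.types import IntegerType
    
    for field in df.schema:
        if isinstance(field.dataType, IntegerType) and field.nullable:
            if pdf[field.name].isnull().any():
                # Nulls present: fall back to a float dtype, since NumPy
                # integer dtypes cannot represent missing values.
                pdf[field.name] = pdf[field.name].astype(np.float32)
            else:
                pdf[field.name] = pdf[field.name].astype(np.int32)
    ```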


---


[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82062/
    Test FAILed.


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    **[Test build #82066 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82066/testReport)** for PR 18945 at commit [`d93a203`](https://github.com/apache/spark/commit/d93a2030d366bf1eb5ae2d6cc335894eddbc48dd).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    **[Test build #82025 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82025/testReport)** for PR 18945 at commit [`18d8d72`](https://github.com/apache/spark/commit/18d8d72d8d81e92e03ff66be15fb0bf323e3997d).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    **[Test build #82067 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82067/testReport)** for PR 18945 at commit [`6e16cd8`](https://github.com/apache/spark/commit/6e16cd82434c82cd7213ae8ef2b52e1c42e607cf).


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    **[Test build #82063 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82063/testReport)** for PR 18945 at commit [`bd25923`](https://github.com/apache/spark/commit/bd259239c550b0b19311968aff9a69da29a6a05e).


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    **[Test build #82063 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82063/testReport)** for PR 18945 at commit [`bd25923`](https://github.com/apache/spark/commit/bd259239c550b0b19311968aff9a69da29a6a05e).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    gentle ping @logannc.


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    I agree with @BryanCutler in general. `None` makes more sense to me for inferring the type in this case.
    
    Another rough thought on a feasible way to keep the current behaviour (more specifically, to match the types with and without Arrow optimization, IIUC) is to make a generator wrapper that checks whether `None` exists in the results for columns whose type is int and nullable. A rough sketch follows.
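    Something like this (a hypothetical helper, not part of the patch):
    
    ```python
    def watch_for_nulls(rows, nullable_int_columns, columns_seen_null):
        """Yield rows unchanged while recording which nullable int
        columns actually contain a None."""
        for row in rows:
            for column in nullable_int_columns:
                if row[column] is None:
                    columns_seen_null.add(column)
            yield row
    ```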


---


[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    @HyukjinKwon I can take over this if @logannc can't find time to continue it.


---



[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...

Posted by logannc <gi...@git.apache.org>.
Github user logannc commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140148164
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1761,12 +1761,37 @@ def toPandas(self):
                     raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 dtype = {}
    +            columns_with_null_int = {}
    +            def null_handler(rows, columns_with_null_int):
    +                for row in rows:
    +                    row = row.asDict()
    +                    for column in columns_with_null_int:
    +                        val = row[column]
    +                        dt = dtype[column]
    +                        if val is not None:
    +                            if abs(val) > 16777216: # Max value before np.float32 loses precision.
    +                                val = np.float64(val)
    +                                if np.float64 != dt:
    +                                    dt = np.float64
    +                                    dtype[column] = np.float64
    +                            else:
    +                                val = np.float32(val)
    +                                if dt not in (np.float32, np.float64):
    +                                    dt = np.float32
    +                                    dtype[column] = np.float32
    +                            row[column] = val
    +                    row = Row(**row)
    +                    yield row
    +            row_handler = lambda x,y: x
                 for field in self.schema:
                     pandas_type = _to_corrected_pandas_type(field.dataType)
    +                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
    +                    columns_with_null_int.add(field.name)
    --- End diff --
    
    Ack, I noticed I did that and then forgot to change it. On it...


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    **[Test build #82062 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82062/testReport)** for PR 18945 at commit [`6e248dd`](https://github.com/apache/spark/commit/6e248ddf96122910468a3f20125ff4fc9f32f299).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18945
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18945


---



[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140143875
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1761,12 +1761,37 @@ def toPandas(self):
                     raise ImportError("%s\n%s" % (e.message, msg))
             else:
                 dtype = {}
    +            columns_with_null_int = {}
    +            def null_handler(rows, columns_with_null_int):
    +                for row in rows:
    +                    row = row.asDict()
    +                    for column in columns_with_null_int:
    +                        val = row[column]
    +                        dt = dtype[column]
    +                        if val is not None:
    +                            if abs(val) > 16777216: # Max value before np.float32 loses precision.
    +                                val = np.float64(val)
    +                                if np.float64 != dt:
    +                                    dt = np.float64
    +                                    dtype[column] = np.float64
    +                            else:
    +                                val = np.float32(val)
    +                                if dt not in (np.float32, np.float64):
    +                                    dt = np.float32
    +                                    dtype[column] = np.float32
    +                            row[column] = val
    +                    row = Row(**row)
    +                    yield row
    +            row_handler = lambda x,y: x
                 for field in self.schema:
                     pandas_type = _to_corrected_pandas_type(field.dataType)
    +                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
    +                    columns_with_null_int.add(field.name)
    --- End diff --
    
        >>> columns_with_null_int = {}
        >>> columns_with_null_int.add("test")                                                                              
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        AttributeError: 'dict' object has no attribute 'add'
    
    Am I missing something?


---
