Posted to reviews@spark.apache.org by zasdfgbnm <gi...@git.apache.org> on 2016/07/14 08:58:06 UTC

[GitHub] spark pull request #14198: Fix bugs about types that result an array of null...

GitHub user zasdfgbnm opened a pull request:

    https://github.com/apache/spark/pull/14198

    Fix bugs about types that result an array of null when creating dataframe using python

    ## What changes were proposed in this pull request?
    
    Fix bugs in type handling that produce an array of nulls when creating a DataFrame from Python.
    Python's `array.array` supports more element types than Python itself, e.g. both `array('f',[1,2,3])` (single-precision float) and `array('d',[1,2,3])` (double-precision float) are valid. The code in spark-sql does not take this into account, so rows containing an `array('f')` can yield an array of null values.
    
    A minimal script that reproduces the problem:
    
    `from pyspark import SparkContext`
    `from pyspark.sql import SQLContext,Row,DataFrame`
    `from array import array`
    
    `sc = SparkContext()`
    `sqlContext = SQLContext(sc)`
    
    `row1 = Row(floatarray=array('f',[1,2,3]), doublearray=array('d',[1,2,3]))`
    `rows = sc.parallelize([ row1 ])`
    `df = sqlContext.createDataFrame(rows)`
    `df.show()`
    
    which produces the output:
    `+---------------+------------------+`
    `|    doublearray|        floatarray|`
    `+---------------+------------------+`
    `|[1.0, 2.0, 3.0]|[null, null, null]|`
    `+---------------+------------------+`
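
The distinction the description relies on can be checked with the standard library alone. The following is a minimal sketch, not part of the patch; the variable names are illustrative. It shows that both arrays hand back plain Python floats when indexed, so only the typecode records which C type is stored.

```python
from array import array

# Both arrays are built from the same Python value.
single = array('f', [0.1])   # C float (single precision)
double = array('d', [0.1])   # C double (double precision)

# Indexing either array returns a plain Python float, so the Python type
# of an element cannot tell the two arrays apart.
same_python_type = type(single[0]) is type(double[0]) is float

# The typecode still records the difference.
typecodes = (single.typecode, double.typecode)

# 0.1 is rounded to single precision when stored in the 'f' array,
# so it no longer compares equal to the Python float 0.1.
round_trips = (single[0] == 0.1, double[0] == 0.1)
```

Any inference that looks only at element values would therefore treat both columns as doubles, which is consistent with the mismatch reported above.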
    
    
    ## How was this patch tested?
    Tested manually.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zasdfgbnm/spark fix_array_infer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14198.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14198
    
----
commit a127486d59528eae452dcbcc2ccfb68fdd7769b7
Author: Xiang Gao <qa...@gmail.com>
Date:   2016-07-09T00:58:14Z

    use array.typecode to infer type
    
    Python's array supports more types than Python itself: for example,
    Python only has float, while array supports 'f' (float) and 'd' (double).
    Switching to array.typecode helps Spark make a better inference.
    
    For example, for the code:
    
    from pyspark.sql.types import _infer_type
    from array import array
    a = array('f',[1,2,3,4,5,6])
    _infer_type(a)
    
    We get ArrayType(DoubleType,true) before the change,
    but ArrayType(FloatType,true) after the change
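
The idea in this commit message can be sketched without Spark. The two functions below are hypothetical stand-ins, not pyspark's `_infer_type`, and the returned strings merely name the Spark types mentioned above. One function looks at an element's Python type; the other consults `array.typecode`.

```python
from array import array

def infer_from_first_element(arr):
    """Naive inference: look only at the Python type of an element."""
    return "DoubleType" if isinstance(arr[0], float) else "unknown"

def infer_from_typecode(arr):
    """Typecode-aware inference for the two floating-point typecodes."""
    return {"f": "FloatType", "d": "DoubleType"}.get(arr.typecode, "unknown")

a = array('f', [1, 2, 3, 4, 5, 6])
naive = infer_from_first_element(a)   # element is a Python float
aware = infer_from_typecode(a)        # typecode is 'f'
```

Here `naive` names the double type and `aware` names the float type, mirroring the before and after behaviour the commit describes.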

commit 70131f3b81575edf9073d5be72553730d6316bd6
Author: Xiang Gao <qa...@gmail.com>
Date:   2016-07-09T06:21:31Z

    Merge branch 'master' into fix_array_infer

commit 505e819f415c2f754b5147908516ace6f6ddfe78
Author: Xiang Gao <qa...@gmail.com>
Date:   2016-07-13T12:53:18Z

    sync with upstream

commit 05979ca6eabf723cf3849ec2bf6f6e9de26cb138
Author: Xiang Gao <qa...@gmail.com>
Date:   2016-07-14T08:07:12Z

    add case (c: Float, FloatType) to fromJava

commit 5cd817a4e7ec68a693ee2a878a2e36b09b1965b6
Author: Xiang Gao <qa...@gmail.com>
Date:   2016-07-14T08:09:25Z

    sync with upstream

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14198: [SPARK-16542] Fix bugs about types that result an array ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14198
  
    Can one of the admins verify this patch?




[GitHub] spark issue #14198: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/14198
  
    @holdenk Are you still working on this? If so, could you rebase or merge master to fix conflicts please?




[GitHub] spark issue #14198: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14198
  
    We are closing it due to inactivity. Please reopen it if you want to push it forward. Thanks!




[GitHub] spark issue #14198: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

Posted by zasdfgbnm <gi...@git.apache.org>.
Github user zasdfgbnm commented on the issue:

    https://github.com/apache/spark/pull/14198
  
    One thing to mention: there is still one problem that I'm not sure I have solved correctly. Python's array supports unsigned types, but the JVM does not. The solution in this PR is to convert unsigned types to a larger type, e.g. unsigned int -> long. I'm not sure whether it would be better to reject the unsigned types in Python and throw an exception.
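
The widening trade-off raised in this comment can be illustrated with plain Python. This is a sketch under the assumption that the target is a two's-complement signed integer, as on the JVM; the helper names are hypothetical and nothing here is taken from the patch.

```python
from array import array

def signed_bounds(bits):
    """Inclusive range of a two's-complement signed integer of this width."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

def fits_signed(value, bits):
    """True if value is representable as a signed integer of this width."""
    low, high = signed_bounds(bits)
    return low <= value <= high

unsigned = array('I', [0])
bits = unsigned.itemsize * 8     # width of C unsigned int on this platform
unsigned[0] = 2 ** bits - 1      # largest value the array can hold

# The largest unsigned value overflows a signed type of the same width
# but fits once the width is doubled.
fits_same_width = fits_signed(unsigned[0], bits)
fits_double_width = fits_signed(unsigned[0], bits * 2)
```

Doubling the width only helps while a wider signed type exists. An 8-byte unsigned typecode has no wider primitive integer on the JVM to widen into, which is one argument for rejecting unsigned typecodes instead.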




[GitHub] spark issue #14198: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

Posted by zasdfgbnm <gi...@git.apache.org>.
Github user zasdfgbnm commented on the issue:

    https://github.com/apache/spark/pull/14198
  
    Hi @holdenk, I think I'm done. I created a test for this issue, and the test shows that Spark has the same problem not only for float but also for byte and short. After several commits, `./python/run-tests --modules=pyspark-sql` passes on my computer.
    
    To be clear, only arrays with typecodes `b,h,i,l,f,d` are supported. Arrays with typecode `u` are not supported because it "corresponds to Python's obsolete unicode character". Arrays with typecodes `B,H,I,L` are not supported because there are no unsigned types on the JVM. Arrays with typecodes `q,Q` are not supported because they "are available only if the platform C compiler used to build Python supports C long long", which makes supporting them complicated. For the unsupported typecodes, a TypeError is raised if the user tries to create a DataFrame from such an array.
    
    Would you, or any other developer, review my code and get it merged?
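
The support matrix described in this comment could be expressed roughly as follows. This is a hypothetical sketch, not the code in the PR: the table names, the function name, and the choice to size signed integers by `itemsize` are all assumptions made for illustration, and the strings only name Spark types.

```python
from array import array

# Hypothetical lookup tables; not Spark's actual implementation.
_FLOAT_TYPES = {"f": "FloatType", "d": "DoubleType"}
_SIGNED_INT_CODES = ("b", "h", "i", "l")
# Assumption: pick the integer type from the platform's item size in bytes.
_INT_TYPE_BY_SIZE = {1: "ByteType", 2: "ShortType", 4: "IntegerType", 8: "LongType"}

def element_type_name(arr):
    """Return a type name for a supported typecode, else raise TypeError."""
    code = arr.typecode
    if code in _FLOAT_TYPES:
        return _FLOAT_TYPES[code]
    if code in _SIGNED_INT_CODES:
        return _INT_TYPE_BY_SIZE[arr.itemsize]
    # Covers 'u', the unsigned codes 'B', 'H', 'I', 'L', and 'q', 'Q'.
    raise TypeError("unsupported array typecode: %r" % code)
```

Sizing by `itemsize` rather than by typecode is one way to cope with `'l'` being 4 bytes on some platforms and 8 on others. Failing with a TypeError for unsupported typecodes surfaces the problem to the user instead of yielding nulls.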




[GitHub] spark issue #14198: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14198
  
    retest this please




[GitHub] spark issue #14198: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/14198
  
    @zasdfgbnm Are you still working on this? If so, could you rebase or merge master to fix conflicts please?




[GitHub] spark issue #14198: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

Posted by zasdfgbnm <gi...@git.apache.org>.
Github user zasdfgbnm commented on the issue:

    https://github.com/apache/spark/pull/14198
  
    reopened at https://github.com/apache/spark/pull/18444




[GitHub] spark pull request #14198: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/14198




[GitHub] spark issue #14198: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

Posted by zasdfgbnm <gi...@git.apache.org>.
Github user zasdfgbnm commented on the issue:

    https://github.com/apache/spark/pull/14198
  
    @ueshin @gatorsmile I'm happy to resolve the conflicts IF AND ONLY IF a developer will work on the code review for this. This PR was opened more than a year ago, and I have been waiting for a review ever since. If it is guaranteed that a reviewer will be assigned soon, I will resolve the conflicts. Otherwise, I don't want to maintain a PR forever just to wait for review.




[GitHub] spark issue #14198: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14198
  
    @zasdfgbnm Please reopen the PR and @ueshin can help review your PR. Thanks!




[GitHub] spark issue #14198: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/14198
  
    Oh interesting - thanks for working on this @zasdfgbnm and sorry it's sort of fallen through the cracks. Is this something you are still working on? For PRs to get in you generally need some form of automated tests. Let me know if you would like some help adding tests for this issue.




[GitHub] spark issue #14198: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

Posted by zasdfgbnm <gi...@git.apache.org>.
Github user zasdfgbnm commented on the issue:

    https://github.com/apache/spark/pull/14198
  
    I'd love to help




[GitHub] spark issue #14198: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/14198
  
    @zasdfgbnm I think you can ping @ueshin to review.
    Sounds important to me to have. Ping me if it falls through.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14198: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14198
  
    cc @ueshin 


