
Posted to reviews@spark.apache.org by BryanCutler <gi...@git.apache.org> on 2017/12/04 23:36:13 UTC

[GitHub] spark pull request #19884: [WIP][SPARK-22324][SQL][PYTHON] Upgrade Arrow to ...

GitHub user BryanCutler opened a pull request:

    https://github.com/apache/spark/pull/19884

    [WIP][SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

    ## What changes were proposed in this pull request?
    
    Upgrade Spark to Arrow 0.8.0 for Java and Python
    
    ## How was this patch tested?
    
    Existing tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/BryanCutler/spark arrow-upgrade-080-SPARK-22324

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19884.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19884
    


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19884: [WIP][SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by wesm <gi...@git.apache.org>.

Github user wesm commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    The Arrow 0.8.0 release vote just started today. Assuming it passes, the earliest you could see packages pushed to PyPI or conda-forge would be sometime on Thursday evening or Friday. 


---

[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19884#discussion_r158206051
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2141,22 +2141,22 @@ def pandas_udf(f=None, returnType=None, functionType=None):
     
            >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
            >>> from pyspark.sql.types import IntegerType, StringType
    -       >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
    -       >>> @pandas_udf(StringType())
    +       >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())  # doctest: +SKIP
    +       >>> @pandas_udf(StringType())  # doctest: +SKIP
            ... def to_upper(s):
            ...     return s.str.upper()
            ...
    -       >>> @pandas_udf("integer", PandasUDFType.SCALAR)
    +       >>> @pandas_udf("integer", PandasUDFType.SCALAR)  # doctest: +SKIP
            ... def add_one(x):
            ...     return x + 1
            ...
    -       >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age"))
    +       >>> df = spark.createDataFrame([(1, "John", 21)], ("id", "name", "age"))  # doctest: +SKIP
    --- End diff --
    
    Why change `John Doe` to `John`? And are we going to re-enable these doctests later?
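
For reference, a minimal sketch of what the `# doctest: +SKIP` directive does, using only the standard `doctest` module. The `to_upper` function and the `undefined_helper` name are made up for illustration and are not part of the PySpark docstring in the diff; the skipped example would fail if it were actually run.

```python
import doctest


def to_upper(s):
    """Upper-case a string.

    >>> to_upper("John Doe")
    'JOHN DOE'
    >>> undefined_helper("John Doe")  # doctest: +SKIP
    'never evaluated'
    """
    return s.upper()


def count_doctest_failures():
    """Run the examples in the `to_upper` docstring and return the failure count."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        to_upper.__doc__, {"to_upper": to_upper}, "to_upper", None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    # The example marked +SKIP is never evaluated, so it cannot fail.
    return runner.run(test).failed
```

Calling `count_doctest_failures()` reports no failures even though `undefined_helper` does not exist, because the skipped example is never executed.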


---

[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85043/
    Test FAILed.


---

[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85044/
    Test FAILed.


---

[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19884#discussion_r157957465
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala ---
    @@ -126,18 +121,14 @@ class ArrowPythonRunner(
           private var schema: StructType = _
           private var vectors: Array[ColumnVector] = _
     
    -      private var closed = false
    -
           context.addTaskCompletionListener { _ =>
             // todo: we need something like `reader.end()`, which release all the resources, but leave
    --- End diff --
    
    ok done


---

[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    Merged build finished. Test FAILed.


---

[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19884#discussion_r157747561
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1679,6 +1678,15 @@ def from_arrow_schema(arrow_schema):
              for field in arrow_schema])
     
     
    +def _require_minimum_pyarrow_version():
    +    """ Raise ImportError if minimum version of pyarrow is not installed
    +    """
    +    from distutils.version import LooseVersion
    +    import pyarrow
    +    if pyarrow.__version__ < LooseVersion('0.8.0'):
    --- End diff --
    
    I just quickly checked other code in a few places I know. Let's use `LooseVersion` on both sides, as @ueshin suggested, to reduce possible confusion, if you don't mind.
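
To illustrate why both sides should be parsed versions rather than raw strings, here is a minimal sketch. It does not use `distutils.version.LooseVersion` itself (distutils is no longer shipped with Python 3.12 and later); `parse_version` and `require_minimum_version` are hypothetical stand-ins, and the sketch assumes purely numeric dotted versions such as `0.8.0`.

```python
def parse_version(version):
    """Hypothetical helper: turn '0.10.0' into (0, 10, 0).

    Assumes purely numeric, dot-separated components.
    """
    return tuple(int(part) for part in version.split("."))


def require_minimum_version(installed, minimum="0.8.0"):
    """Raise ImportError if `installed` is older than `minimum`.

    Both sides are parsed before comparing, mirroring the suggestion
    to wrap both sides of the comparison in LooseVersion.
    """
    if parse_version(installed) < parse_version(minimum):
        raise ImportError(
            "version %s or newer is required; found %s" % (minimum, installed)
        )


# Plain string comparison is lexicographic, so a newer release
# can look older than the minimum.
string_says_older = "0.10.0" < "0.8.0"
parsed_says_older = parse_version("0.10.0") < parse_version("0.8.0")
```

With plain strings, `"0.10.0"` sorts before `"0.8.0"`, so a string-only check would wrongly reject a newer release; the parsed comparison does not.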


---

[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19884#discussion_r157738702
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1679,6 +1678,15 @@ def from_arrow_schema(arrow_schema):
              for field in arrow_schema])
     
     
    +def _require_minimum_pyarrow_version():
    --- End diff --
    
    >  don't we need to use LooseVersion for pyarrow.__version__, too?
    
    Seems fine by 
    
    https://github.com/python/cpython/blob/6f0eb93183519024cb360162bdd81b9faec97ba6/Lib/distutils/version.py#L331-L340
    



---

[GitHub] spark issue #19884: [WIP][SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by ueshin <gi...@git.apache.org>.

Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    @wesm I was able to install pyarrow 0.8.0 to my local environment via conda. Thanks!


---

[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19884#discussion_r158112470
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -3356,6 +3356,7 @@ def test_schema_conversion_roundtrip(self):
             self.assertEquals(self.schema, schema_rt)
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    --- End diff --
    
    @ueshin @HyukjinKwon just confirming: this test should be conditional on pandas/pyarrow being installed, since we will check for a minimum pyarrow version when using `pandas_udf`?


---

[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    Hi @shaneknapp, I think we are all ready here to try updating to pyarrow 0.8.0.  The build here should pass once this version is available, if you want to try updating a single worker first to get an idea of whether all is well.  Also, in case you didn't see https://github.com/apache/spark/pull/19884#issuecomment-351916074, I believe there are some workers without Pandas 0.19.2 and some without pyarrow already installed.  Thanks!!!


---

[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by shaneknapp <gi...@git.apache.org>.

Github user shaneknapp commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    we should be good to go:
    
    ```
    $ pssh -h jenkins_workers.txt -t 0 "export PATH=/home/anaconda/envs/py3k/bin:$PATH; pip install pyarrow==0.8.0"
    [1] 05:55:00 [SUCCESS] amp-jenkins-worker-01
    [2] 05:55:00 [SUCCESS] amp-jenkins-worker-03
    [3] 05:55:00 [SUCCESS] amp-jenkins-worker-08
    [4] 05:55:00 [SUCCESS] amp-jenkins-worker-07
    [5] 05:55:00 [SUCCESS] amp-jenkins-worker-05
    [6] 05:55:00 [SUCCESS] amp-jenkins-worker-04
    [7] 05:55:00 [SUCCESS] amp-jenkins-worker-06
    [8] 05:55:00 [SUCCESS] amp-jenkins-worker-02
    ```
    
    ...and...
    
    ```
    $ pssh -h jenkins_workers.txt -t 0 -i "export PATH=/home/anaconda/envs/py3k/bin:$PATH; pip show pyarrow | grep ^Version"
    [1] 05:56:28 [SUCCESS] amp-jenkins-worker-02
    Version: 0.8.0
    [2] 05:56:28 [SUCCESS] amp-jenkins-worker-06
    Version: 0.8.0
    [3] 05:56:28 [SUCCESS] amp-jenkins-worker-03
    Version: 0.8.0
    [4] 05:56:28 [SUCCESS] amp-jenkins-worker-05
    Version: 0.8.0
    [5] 05:56:28 [SUCCESS] amp-jenkins-worker-08
    Version: 0.8.0
    [6] 05:56:28 [SUCCESS] amp-jenkins-worker-04
    Version: 0.8.0
    [7] 05:56:28 [SUCCESS] amp-jenkins-worker-07
    Version: 0.8.0
    [8] 05:56:28 [SUCCESS] amp-jenkins-worker-01
    Version: 0.8.0
    ```


---

[GitHub] spark issue #19884: [WIP][SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    **[Test build #84738 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84738/testReport)** for PR 19884 at commit [`46ad595`](https://github.com/apache/spark/commit/46ad5951652c40de3c2c108c9b952b16dfcc3ad5).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark issue #19884: [WIP][SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by wesm <gi...@git.apache.org>.

Github user wesm commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    If you want to install pyarrow 0.8.0 via conda it's available now from the `-c conda-forge` channel (https://anaconda.org/conda-forge/pyarrow). I am not sure where we are at on PyPI / pip packages -- I will start the update process later today if no one else does cc @BryanCutler @siddharthteotia @xhochy 


---

[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    Merged build finished. Test FAILed.


---

[GitHub] spark issue #19884: [WIP][SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    Just as a reminder about Jenkins: roughly a month ago I happened to check what we have in Jenkins on one specific machine (simply by printing out the versions within the PySpark tests):
    
    ```
    PyPy - No Pandas
    Python 2.7 Pandas [0.16.0]
    Python 3.4 Pandas [0.19.2]
    ```
    
    ```
    PyPy - No PyArrow
    Python 2.7 - No PyArrow
    Python 3.4 PyArrow [0.4.1]
    ```
    
    I think we should also check which Python has which Pandas and PyArrow versions installed.
    
    Also, we dropped support for Pandas older than 0.19.2 per http://apache-spark-developers-list.1001551.n3.nabble.com/discuss-PySpark-Can-we-drop-support-old-Pandas-lt-0-19-2-or-what-version-should-we-support-td22834.html and https://github.com/apache/spark/pull/19607.
    I think each Python should also have Pandas 0.19.2 now, if I haven't missed something.
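
A rough sketch of how such a per-interpreter version listing could be produced. This is not the code that was used on Jenkins; it assumes Python 3.8 or newer (for `importlib.metadata`), and `installed_version` and `describe_environment` are hypothetical helper names.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def describe_environment(packages=("pandas", "pyarrow")):
    """Build one summary line per package for the running interpreter."""
    python = "%s %d.%d" % (
        sys.implementation.name, sys.version_info[0], sys.version_info[1]
    )
    lines = []
    for package in packages:
        version = installed_version(package)
        status = "[%s]" % version if version is not None else "not installed"
        lines.append("%s %s %s" % (python, package, status))
    return lines


if __name__ == "__main__":
    print("\n".join(describe_environment()))
```

Running the script once under each interpreter on a worker would show which Python has which Pandas and PyArrow versions.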


---

[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19884#discussion_r157737242
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1679,6 +1678,15 @@ def from_arrow_schema(arrow_schema):
              for field in arrow_schema])
     
     
    +def _require_minimum_pyarrow_version():
    --- End diff --
    
    Seems fine. All I know about `LooseVersion` is that it compares version strings correctly. Sure, we should add it there too.


---

[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19884#discussion_r161391208
  
    --- Diff: common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java ---
    @@ -91,7 +91,7 @@ public long position() {
       }
     
       @Override
    -  public long transfered() {
    +  public long transferred() {
    --- End diff --
    
    It doesn't. The old method is implemented in `AbstractFileRegion.transfered`. In addition, the whole network module is private, so we don't need to maintain compatibility.


---

[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    Merged build finished. Test FAILed.


---

[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    **[Test build #85246 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85246/testReport)** for PR 19884 at commit [`b0200ef`](https://github.com/apache/spark/commit/b0200efd30c6fe77ec6e57d65f3bc828be0e1802).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19884#discussion_r158173623
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -3356,6 +3356,7 @@ def test_schema_conversion_roundtrip(self):
             self.assertEquals(self.schema, schema_rt)
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    --- End diff --
    
    I can't take a closer look now, but let's do this if it passes the tests. cc @ueshin


---

[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19884#discussion_r161397435
  
    --- Diff: common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java ---
    @@ -91,7 +91,7 @@ public long position() {
       }
     
       @Override
    -  public long transfered() {
    +  public long transferred() {
    --- End diff --
    
    I see. Thanks!


---

[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85100/
    Test FAILed.


---

[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    @BryanCutler, did we resolve https://github.com/apache/spark/pull/19884#issuecomment-353276931? If not, shall we file a JIRA?


---

[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85099/
    Test FAILed.


---

[GitHub] spark issue #19884: [WIP][SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    **[Test build #84663 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84663/testReport)** for PR 19884 at commit [`fdba406`](https://github.com/apache/spark/commit/fdba406f29216b8ef592de45dc36461217113410).


---

[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19884#discussion_r158212056
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2141,22 +2141,23 @@ def pandas_udf(f=None, returnType=None, functionType=None):
     
            >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
            >>> from pyspark.sql.types import IntegerType, StringType
    -       >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
    -       >>> @pandas_udf(StringType())
    +       >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())  # doctest: +SKIP
    +       >>> @pandas_udf(StringType())  # doctest: +SKIP
            ... def to_upper(s):
            ...     return s.str.upper()
            ...
    -       >>> @pandas_udf("integer", PandasUDFType.SCALAR)
    +       >>> @pandas_udf("integer", PandasUDFType.SCALAR)  # doctest: +SKIP
            ... def add_one(x):
            ...     return x + 1
            ...
    -       >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age"))
    +       >>> df = spark.createDataFrame([(1, "John Doe", 21)],
    +       ...                            ("id", "name", "age"))  # doctest: +SKIP
            >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\
            ...     .show()  # doctest: +SKIP
            +----------+--------------+------------+
            |slen(name)|to_upper(name)|add_one(age)|
            +----------+--------------+------------+
    -       |         8|      JOHN DOE|          22|
    +       |         8|          JOHN|          22|
    --- End diff --
    
    oops, done!


---

[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by shaneknapp <gi...@git.apache.org>.

Github user shaneknapp commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    @HyukjinKwon @wesm @BryanCutler 
    
    alright.  here's my plan for right now:
    * python 3.4.5 -- upgrade pyarrow --> 0.8.0  (confirmed working on my staging environment)
    
    what i'm not going to do today:
    * install pyarrow for python 2.7 
    * mess with the pypy installation
    
    i should have pyarrow updated across all workers in ~15 mins, tops.
    
    and please note that spark is only built on centos and ubuntu *nix distros @ RISELab (née AMPLab).  we do not have, nor plan on having, any windows build nodes in the immediate future.


---

[GitHub] spark issue #19884: [WIP][SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    cc @zsxwing as well, I saw you opened a JIRA about this - SPARK-22656



---

[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19884#discussion_r157960643
  
    --- Diff: python/pyspark/sql/udf.py ---
    @@ -33,6 +33,10 @@ def _wrap_function(sc, func, returnType):
     
     
     def _create_udf(f, returnType, evalType):
    +    from pyspark.sql.utils import _require_minimum_pyarrow_version
    +
    +    _require_minimum_pyarrow_version()
    --- End diff --
    
    Yeah, that is not good!  I was a little hesitant to put it in `def pandas_udf` because things are a little different when used as a decorator.  How about leaving it in `_create_udf`, but only when the eval type is a Pandas one?
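
A minimal sketch of that proposal, under stated assumptions: `create_udf`, the `EVAL_TYPE_*` constants, and the `version_check` parameter are hypothetical stand-ins, not PySpark's actual names or values. The point is only that the pyarrow check runs for Pandas eval types and is skipped for plain UDFs.

```python
# Hypothetical eval-type markers, for illustration only.
EVAL_TYPE_PLAIN = "plain"
EVAL_TYPE_PANDAS_SCALAR = "pandas_scalar"
EVAL_TYPE_PANDAS_GROUP_MAP = "pandas_group_map"

PANDAS_EVAL_TYPES = (EVAL_TYPE_PANDAS_SCALAR, EVAL_TYPE_PANDAS_GROUP_MAP)


def create_udf(f, return_type, eval_type, version_check):
    """Hypothetical stand-in for `_create_udf`.

    `version_check` is a callable that raises ImportError when pyarrow is
    missing or too old; it is invoked only for Pandas eval types.
    """
    if eval_type in PANDAS_EVAL_TYPES:
        version_check()
    # Return a plain dict instead of a real UDF object to keep the sketch small.
    return {"func": f, "returnType": return_type, "evalType": eval_type}
```

With this shape, creating a regular UDF never touches pyarrow, so users without Arrow installed are unaffected unless they ask for a Pandas UDF.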


---

[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    LGTM
    
    Merged to master.


---

[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85195/
    Test FAILed.


---

[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19884#discussion_r158206309
  
    --- Diff: python/pyspark/sql/utils.py ---
    @@ -110,3 +110,12 @@ def toJArray(gateway, jtype, arr):
         for i in range(0, len(arr)):
             jarr[i] = arr[i]
         return jarr
    +
    +
    +def _require_minimum_pyarrow_version():
    --- End diff --
    
    @ueshin did we do the same thing for pandas?


---

[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    **[Test build #85195 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85195/testReport)** for PR 19884 at commit [`d92ae90`](https://github.com/apache/spark/commit/d92ae90e05f55955eaad8e7f55e6324bf333a6bc).


---

[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by shaneknapp <gi...@git.apache.org>.

Github user shaneknapp commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    pypy is 2.5.1, no pandas or pyarrow   (/usr/bin/pypy -- hand-rolled dist i put together ~3 years ago)
    
    python 3.4.5:  pyarrow 0.4.1, pandas 0.19.2  (managed by anaconda)
    
    python 2.7.13:  no pyarrow, pandas 0.16.0  (managed by anaconda)
    
    please correct me if i'm wrong, but i was under the impression that we're only supporting pyarrow w/python3.



---

[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    Merged build finished. Test FAILed.


---

[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19884
  
    **[Test build #85220 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85220/testReport)** for PR 19884 at commit [`423b68c`](https://github.com/apache/spark/commit/423b68cc2831106bcd7d59e84c86c4511e6fb347).


---
