Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/10/15 18:55:12 UTC

[GitHub] [spark] BryanCutler commented on issue #26133: [WIP][SPARK-29376][SQL][PYTHON] Upgrade Apache Arrow to version 0.15.0

URL: https://github.com/apache/spark/pull/26133#issuecomment-542355414

I'm marking this as a WIP because we need to decide if we will increase the minimum required version of PyArrow to 0.15.0 as well. The main issue is that there is a change in the Arrow binary IPC format.

If we do not increase the minimum pyarrow version, it will not work by default. Some configuration will need to be set so that Python and Java use the same IPC format, which can be done in two ways:

1. Have both Python and Java write the legacy IPC format by default. This would not require many additional changes, but it is not ideal: at some point the ability to write the legacy format will be dropped from pyarrow, and this approach will stop working.

2. Detect the version of pyarrow being used and configure the Java writers accordingly. This keeps compatibility with all pyarrow <= 0.14.1 and with future versions too. The problem is that it could be messy to dynamically configure the Java writers for `pandas_udf`s, because the data is written before we even import pyarrow and can tell the version.
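
A minimal sketch of the decision behind both options, not the PR's implementation. The helper names are hypothetical, and `ARROW_PRE_0_15_IPC_FORMAT` is assumed to be the environment variable that pyarrow >= 0.15.0 reads to fall back to the legacy format; the only thing taken from the discussion is the 0.15.0 cutoff.

```python
# Arrow 0.15.0 is the release that changed the binary IPC format.
NEW_IPC_FORMAT_VERSION = (0, 15, 0)


def parse_version(version):
    """Turn a string such as '0.14.1' or '0.15.0.dev5' into (major, minor, patch)."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if ch not in "0123456789":
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    # Pad short strings such as '0.15' so tuple comparison behaves as expected.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def needs_legacy_ipc_format(pyarrow_version):
    """Option 2: True if the Java writers would have to emit the legacy format."""
    return parse_version(pyarrow_version) < NEW_IPC_FORMAT_VERSION


def legacy_format_worker_env(pyarrow_version):
    """Option 1: extra environment so Python workers write the legacy format.

    ARROW_PRE_0_15_IPC_FORMAT is an assumed variable name (see the note above);
    older pyarrow versions already write the legacy format and need nothing.
    """
    if needs_legacy_ipc_format(pyarrow_version):
        return {}
    return {"ARROW_PRE_0_15_IPC_FORMAT": "1"}
```

In practice the version string would come from `pyarrow.__version__`, which is only available once pyarrow has been imported on the Python side. That is exactly the ordering problem described in option 2.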

If we increase the minimum required version of PyArrow, then all Python and Java writers will use the default settings, which is the cleanest change and will be compatible with future versions of PyArrow. I also believe SparkR will require Arrow 0.15.0, so this would keep Python and R in line.

NOTE: this change in the Arrow binary IPC format only affects temporary binary data exchanged between Python and Java; it is not present in any long-term storage.
