Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/09/10 04:29:36 UTC

[GitHub] [spark] HyukjinKwon opened a new pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

HyukjinKwon opened a new pull request #29703:
URL: https://github.com/apache/spark/pull/29703


   ### What changes were proposed in this pull request?
   
   This PR proposes to add a way to select Hadoop and Hive versions in pip installation.
   Users can select Hive or Hadoop versions as below:
   
   ```bash
   HADOOP_VERSION=3.2 pip install pyspark
   HIVE_VERSION=1.2 pip install pyspark
   HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
   ```
   
   When the environment variables are set, the installation internally downloads the corresponding Spark distribution and then sets the Spark home to it.
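
   Conceptually, the install-time hook boils down to the sketch below. The helper here is hypothetical and heavily simplified; the real logic lives in `python/pyspark/install.py` added by this PR, which also validates version combinations and normalizes the extracted directory layout.

   ```python
   import os
   import tarfile
   import urllib.request

   def install_spark(dest, spark_version="spark-3.0.1",
                     hadoop_version="hadoop3.2", hive_version="hive2.3"):
       # Hive 1.2 builds carry an extra suffix, e.g. spark-3.0.1-bin-hadoop2.7-hive1.2.
       if hive_version == "hive1.2":
           package = "%s-bin-%s-%s" % (spark_version, hadoop_version, hive_version)
       else:
           package = "%s-bin-%s" % (spark_version, hadoop_version)
       url = "https://archive.apache.org/dist/spark/%s/%s.tgz" % (spark_version, package)
       os.makedirs(dest, exist_ok=True)
       tar_path = os.path.join(dest, "%s.tgz" % package)
       urllib.request.urlretrieve(url, tar_path)  # download the chosen distribution
       with tarfile.open(tar_path) as tar:
           tar.extractall(dest)  # this directory later serves as the Spark home
       os.remove(tar_path)

   # setup.py consults the environment variables at pip-install time
   # ('without' is special-cased in the real code; omitted here).
   hadoop = os.environ.get("HADOOP_VERSION")
   hive = os.environ.get("HIVE_VERSION")
   if hadoop or hive:
       install_spark(dest=os.path.join("pyspark", "spark-distribution"),
                     hadoop_version="hadoop" + (hadoop or "3.2"),
                     hive_version="hive" + (hive or "2.3"))
   ```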
   
   **Please NOTE that:**
   - We cannot currently leverage pip's native installation option, for example:
   
       ```bash
       pip install pyspark --install-option="hadoop3.2"
       ```
   
       This is because of a limitation and a bug in pip itself. Once pip fixes the issue, we can switch from the environment variables to proper installation options; see SPARK-32837.
   
        It IS possible to work around this, but it is very ugly and hacky and requires a big change. See [this PR](https://github.com/microsoft/nni/pull/139/files) as an example.
   
   - In the pip installation, we pack the relevant jars together. This PR _does not touch the existing packaging_ in order to prevent any behaviour changes.
   
     Once this experimental way is proven to be safe, we can stop packing the relevant jars together (keeping only the relevant Python scripts) and instead download the Spark distribution, as this PR proposes.
   
   - This way is sort of consistent with SparkR:
   
     SparkR provides a method `SparkR::install_spark` to support CRAN installation. This is fine because SparkR is provided purely as an R library; for example, the `sparkr` script is not packed together.
   
     PySpark cannot take this approach because the PySpark packaging ships the relevant executable scripts together, e.g., the `pyspark` shell.
   
     If PySpark had a method such as `pyspark.install_spark`, users could not call it from `pyspark`, because `pyspark` already assumes the relevant Spark is installed, the JVM is launched, etc.
   
   - There appears to be no way to publish a release containing a different Hadoop or Hive to PyPI due to [the version semantics](https://www.python.org/dev/peps/pep-0440/), so this is not an option (see the version-string sketch after this list).
   
     Given my investigation, the usual way is either `--install-option` above (with hacks) or environment variables.
   
   - I am going to document this as a followup once https://github.com/apache/spark/pull/29640 is merged.
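
   As a concrete illustration of the version-semantics point above, a quick check with the third-party `packaging` library (pip's reference implementation of PEP 440; not part of this PR):

   ```python
   from packaging.version import InvalidVersion, Version

   try:
       Version("3.1.0-hadoop3.2")  # a variant suffix is not a valid PEP 440 version
   except InvalidVersion as e:
       print("rejected:", e)

   # A "local version" like this parses, but PyPI refuses uploads that use one.
   print(Version("3.1.0+hadoop3.2"))
   ```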
   
   ### Why are the changes needed?
   
   To provide users with the option to select Hadoop and Hive versions.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, users will be able to select Hive and Hadoop versions as below when they install PySpark from `pip`:
   
   ```bash
   HADOOP_VERSION=3.2 pip install pyspark
   HIVE_VERSION=1.2 pip install pyspark
   HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
   ```
   
   ### How was this patch tested?
   
   Unit tests were added. I also manually tested on macOS and Windows (after building the `python/dist/pyspark-3.1.0.dev0.tar.gz` distribution from Spark):
   
   Mac:
   
   ```bash
   SPARK_VERSION=3.0.1 HADOOP_VERSION=3.2 pip install pyspark-3.1.0.dev0.tar.gz
   ```
   
   Windows:
   
   ```bash
   set HADOOP_VERSION=3.2
   set SPARK_VERSION=3.0.1
   pip install pyspark-3.1.0.dev0.tar.gz
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-690035463


   retest this please




[GitHub] [spark] HyukjinKwon edited a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-690820353


   > 1. How does this interact with the ability to specify dependencies in a requirements file? It seems odd to have to do something like HADOOP_VERSION=3.2 pip install -r requirements.txt, because, kinda like with --install-option, we've now modified pip's behavior across all the libraries it's going to install.
   >
   >    I also wonder if this plays well with tools like pip-tools that compile down requirements files into the full list of their transitive dependencies. I'm guessing users will need to manually preserve the environment variables, because they will not be reflected in the compiled requirements.
   
   I agree that it doesn't look very pip-friendly. That's why I had to investigate a lot and write down what I checked in the PR description.
   
   `--install-option` is supported via `requirements.txt`, so once pip provides a proper way to configure this, we will switch to it (at SPARK-32837). We can't use this option for now due to https://github.com/pypa/pip/issues/1883 (see also https://github.com/pypa/pip/issues/5771). There seem to be no other ways possible, given my investigation.
   
   We can just keep this as an experimental mode for the time being, and switch to the proper pip installation option once pip supports one in the future.
   
   > 2. Have you considered publishing these alternate builds under different package names? e.g. pyspark-hadoop3.2. This avoids the need to mess with environment variables, and delivers a more vanilla install experience. But it will also push us to define upfront what combinations to publish builds for to PyPI.
   
   I have thought about this option too, but:
   - I think we'll end up having multiple packages, one per profile we support.
   - I still think using pip's native configuration is the ideal way. By using environment variables, we can easily switch to pip's option in the future.
   - Minor, but it will also be difficult to track the usage (https://pypistats.org/packages/pyspark).
   
   > 3. Are you sure it's OK to point at archive.apache.org? Everyone installing a non-current version of PySpark with alternate versions of Hadoop / Hive specified will hit the archive. Unlike PyPI, the Apache archive is not backed by a generous CDN:
   >
   >     Do note that a daily limit of 5GB per IP is being enforced on archive.apache.org, to prevent abuse.
   >
   >     In Flintrock, I never touch the archive out of fear of being an "abusive user". This is another argument for publishing alternate packages to PyPI.
   
   Yeah, I understand this can be a valid concern. But the archive is already available and people use it. It's also used in our own CI:
   
   https://github.com/apache/spark/blob/b84ed4146d93b37adb2b83ca642c7978a1ac853e/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala#L82
   
   This PR just makes it easier to download old versions from it. We can also make the download location configurable by exposing an environment variable, as sketched below.
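
   For illustration only, such a hook could look like the sketch below; the `PYSPARK_RELEASE_MIRROR` variable name is a hypothetical example here, not something this PR defines:

   ```python
   import os

   # Fall back to the Apache archive only when no mirror is configured,
   # so heavy users can point installations at a closer or private mirror.
   DEFAULT_SITE = "https://archive.apache.org/dist"
   site = os.environ.get("PYSPARK_RELEASE_MIRROR", DEFAULT_SITE)
   url = "%s/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz" % site
   print("would download:", url)
   ```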






[GitHub] [spark] viirya commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r489919482



##########
File path: python/docs/source/getting_started/installation.rst
##########
@@ -38,8 +38,36 @@ PySpark installation using `PyPI <https://pypi.org/project/pyspark/>`_
 .. code-block:: bash
 
     pip install pyspark
-	
-Using Conda  
+
+For PySpark with different Hadoop and/or Hive, you can install it by using ``HIVE_VERSION`` and ``HADOOP_VERSION`` environment variables as below:
+
+.. code-block:: bash
+
+    HIVE_VERSION=2.3 pip install pyspark
+    HADOOP_VERSION=2.7 pip install pyspark
+    HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
+
+The default distribution has built-in Hadoop 3.2 and Hive 2.3. If users specify different versions, the pip installation automatically
+downloads a different version and use it in PySpark. Downloading it can take a while depending on the network and the mirror chosen.
+It is recommended to use `-v` option in `pip` to track the installation and download status.
+
+.. code-block:: bash
+
+    HADOOP_VERSION=2.7 pip install pyspark -v
+
+Supported versions are as below:
+
+====================================== ====================================== ======================================
+``HADOOP_VERSION`` \\ ``HIVE_VERSION`` 1.2                                    2.3 (default)
+====================================== ====================================== ======================================
+**2.7**                                O                                      O
+**3.2 (default)**                      X                                      O
+**without**                            X                                      O
+====================================== ====================================== ======================================
+
+Note that this installation of PySpark with different versions of Hadoop and Hive is experimental. It can change or be removed betweem minor releases.

Review comment:
       betweem -> between

##########
File path: python/pyspark/install.py
##########
@@ -0,0 +1,170 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import os
+import re
+import tarfile
+import traceback
+import urllib.request
+from shutil import rmtree
+# NOTE that we shouldn't import pyspark here because this is used in
+# setup.py, and assume there's no PySpark imported.
+
+DEFAULT_HADOOP = "hadoop3.2"
+DEFAULT_HIVE = "hive2.3"
+SUPPORTED_HADOOP_VERSIONS = ["hadoop2.7", "hadoop3.2", "without-hadoop"]
+SUPPORTED_HIVE_VERSIONS = ["hive1.2", "hive2.3"]
+UNSUPPORTED_COMBINATIONS = [
+    ("without-hadoop", "hive1.2"),
+    ("hadoop3.2", "hive1.2"),
+]
+
+
+def checked_package_name(spark_version, hadoop_version, hive_version):
+    if hive_version == "hive1.2":
+        return "%s-bin-%s-%s" % (spark_version, hadoop_version, hive_version)
+    else:
+        return "%s-bin-%s" % (spark_version, hadoop_version)
+
+
+def checked_versions(spark_version, hadoop_version, hive_version):
+    """
+    Check the valid combinations of supported versions in Spark distributions.
+
+    :param spark_version: Spark version. It should be X.X.X such as '3.0.0' or spark-3.0.0.
+    :param hadoop_version: Hadoop version. It should be X.X such as '2.7' or 'hadoop2.7'.
+        'without' and 'without-hadoop' are supported as special keywords for Hadoop free
+        distribution.
+    :param hive_version: Hive version. It should be X.X such as '1.2' or 'hive1.2'.
+
+    :return it returns fully-qualified versions of Spark, Hadoop and Hive in a tuple.
+        For example, spark-3.0.0, hadoop3.2 and hive2.3.
+    """
+    if re.match("^[0-9]+\\.[0-9]+\\.[0-9]+$", spark_version):
+        spark_version = "spark-%s" % spark_version
+    if not spark_version.startswith("spark-"):
+        raise RuntimeError(
+            "Spark version should start with 'spark-' prefix; however, "
+            "got %s" % spark_version)
+
+    if hadoop_version == "without":
+        hadoop_version = "without-hadoop"

Review comment:
       Is "without-hadoop" also supported as a special keyword? It doesn't seem to be matched here.
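
   For reference, a usage sketch of how `checked_versions` is expected to behave per its docstring; the expansion of the short Hadoop/Hive forms sits in the truncated part of the hunk, so the expected values below are assumptions:

   ```python
   from pyspark.install import checked_versions  # module added by this PR

   # Short forms expand to fully-qualified names, per the docstring above.
   assert checked_versions("3.0.0", "2.7", "1.2") == \
       ("spark-3.0.0", "hadoop2.7", "hive1.2")
   # 'without' normalizes to the Hadoop-free distribution keyword.
   assert checked_versions("3.0.0", "without", "2.3") == \
       ("spark-3.0.0", "without-hadoop", "hive2.3")
   ```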






[GitHub] [spark] HyukjinKwon commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-692390497


   I documented it. I believe this is ready for a review.




[GitHub] [spark] HyukjinKwon commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r489952888



##########
File path: python/docs/source/getting_started/installation.rst
##########
@@ -38,8 +38,36 @@ PySpark installation using `PyPI <https://pypi.org/project/pyspark/>`_
 .. code-block:: bash
 
     pip install pyspark
-	
-Using Conda  
+
+For PySpark with different Hadoop and/or Hive, you can install it by using ``HIVE_VERSION`` and ``HADOOP_VERSION`` environment variables as below:
+
+.. code-block:: bash
+
+    HIVE_VERSION=2.3 pip install pyspark
+    HADOOP_VERSION=2.7 pip install pyspark
+    HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
+
+The default distribution has built-in Hadoop 3.2 and Hive 2.3. If users specify different versions, the pip installation automatically
+downloads a different version and use it in PySpark. Downloading it can take a while depending on the network and the mirror chosen.
+It is recommended to use `-v` option in `pip` to track the installation and download status.
+
+.. code-block:: bash
+
+    HADOOP_VERSION=2.7 pip install pyspark -v
+
+Supported versions are as below:
+
+====================================== ====================================== ======================================
+``HADOOP_VERSION`` \\ ``HIVE_VERSION`` 1.2                                    2.3 (default)
+====================================== ====================================== ======================================
+**2.7**                                O                                      O
+**3.2 (default)**                      X                                      O
+**without**                            X                                      O
+====================================== ====================================== ======================================
+
+Note that this installation of PySpark with different versions of Hadoop and Hive is experimental. It can change or be removed betweem minor releases.

Review comment:
       ```suggestion
   Note that this installation of PySpark with different versions of Hadoop and Hive is experimental. It can change or be removed between minor releases.
   ```






[GitHub] [spark] HyukjinKwon commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r489909606



##########
File path: python/docs/source/getting_started/installation.rst
##########
@@ -38,8 +38,36 @@ PySpark installation using `PyPI <https://pypi.org/project/pyspark/>`_
 .. code-block:: bash

Review comment:
       I am going to rewrite this page after this PR gets merged.






[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-695502727


   **[Test build #128903 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128903/testReport)** for PR 29703 at commit [`058e61a`](https://github.com/apache/spark/commit/058e61ae143a3619fc13fb0c316044a75f7bc15f).




[GitHub] [spark] HyukjinKwon commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-696600742


   retest this please




[GitHub] [spark] SparkQA removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-693796417


   **[Test build #128793 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128793/testReport)** for PR 29703 at commit [`09997b7`](https://github.com/apache/spark/commit/09997b7c92d608ea675d86d9d6d28e641654dc9f).








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-696547400


   Merged build finished. Test FAILed.






[GitHub] [spark] dongjoon-hyun commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-689978218


   Thank you for pinging me, @HyukjinKwon .
   
   cc @gatorsmile . Do you have any opinion on Hive 1.2?










[GitHub] [spark] HyukjinKwon commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r492425304



##########
File path: python/pyspark/find_spark_home.py
##########
@@ -42,7 +42,11 @@ def is_spark_home(path):
     import_error_raised = False
     from importlib.util import find_spec
     try:
+        # Spark distribution can be downloaded when HADOOP_VERSION environment variable is set.
+        # We should look up this directory first, see also SPARK-32017.
+        spark_dist_dir = "spark-distribution"
         module_home = os.path.dirname(find_spec("pyspark").origin)
+        paths.append(os.path.join(module_home, spark_dist_dir))

Review comment:
       `os.path.dirname(os.path.realpath(__file__))` is usually in a script directory because `find_spark_home.py` is included as a script at `setup.py`: https://github.com/apache/spark/blob/058e61ae143a3619fc13fb0c316044a75f7bc15f/python/setup.py#L187
   
   for example, it will be the user's `bin`.
   
   I thought `os.path.dirname(os.path.realpath(__file__))` is more just for dev purposes or non-pip-installed PySpark. I checked that it correctly points to the newly installed Spark home.
   
   However, sure, why not be safer :-) I will update.
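
   For context, the kind of declaration being referenced is sketched below; the exact `setup.py` contents are assumed from the linked line:

   ```python
   from setuptools import setup

   # Files listed in scripts= are copied into the environment's bin/ at
   # install time, which is why realpath(__file__) for find_spark_home.py
   # resolves to the user's bin rather than the pyspark module directory.
   setup(
       name="pyspark",
       scripts=["pyspark/find_spark_home.py"],
   )
   ```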








[GitHub] [spark] SparkQA removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-696469772


   **[Test build #128956 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128956/testReport)** for PR 29703 at commit [`33594e1`](https://github.com/apache/spark/commit/33594e147263cedcefffae3441a7cfdaf3619a5c).




[GitHub] [spark] ueshin commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
ueshin commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r492417508



##########
File path: python/pyspark/find_spark_home.py
##########
@@ -42,7 +42,11 @@ def is_spark_home(path):
     import_error_raised = False
     from importlib.util import find_spec
     try:
+        # Spark distribution can be downloaded when HADOOP_VERSION environment variable is set.
+        # We should look up this directory first, see also SPARK-32017.
+        spark_dist_dir = "spark-distribution"
         module_home = os.path.dirname(find_spec("pyspark").origin)
+        paths.append(os.path.join(module_home, spark_dist_dir))

Review comment:
       We should insert this entry between `"../"` and `os.path.dirname(os.path.realpath(__file__))`, which is the module home; otherwise, `os.path.dirname(os.path.realpath(__file__))` will be the spark home.












[GitHub] [spark] ueshin commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
ueshin commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r492417508



##########
File path: python/pyspark/find_spark_home.py
##########
@@ -42,7 +42,11 @@ def is_spark_home(path):
     import_error_raised = False
     from importlib.util import find_spec
     try:
+        # Spark distribution can be downloaded when HADOOP_VERSION environment variable is set.
+        # We should look up this directory first, see also SPARK-32017.
+        spark_dist_dir = "spark-distribution"
         module_home = os.path.dirname(find_spec("pyspark").origin)
+        paths.append(os.path.join(module_home, spark_dist_dir))

Review comment:
       We should insert this entry before `os.path.dirname(os.path.realpath(__file__))`, which is the same as the module home if pip is used for the installation; otherwise, the module home will be the spark home and the distribution under `spark-distribution` is not used.

##########
File path: python/pyspark/find_spark_home.py
##########
@@ -42,7 +42,11 @@ def is_spark_home(path):
     import_error_raised = False
     from importlib.util import find_spec
     try:
+        # Spark distribution can be downloaded when HADOOP_VERSION environment variable is set.
+        # We should look up this directory first, see also SPARK-32017.
+        spark_dist_dir = "spark-distribution"
         module_home = os.path.dirname(find_spec("pyspark").origin)
+        paths.append(os.path.join(module_home, spark_dist_dir))

Review comment:
       The function in the file `find_spark_home.py` is called from `launch_gateway`, so if the users initialize Spark by themselves in their Python REPL, the path will be the module home?
   
   ```py
   $ python
   >>> from pyspark.sql import SparkSession
   >>> spark = SparkSession.builder.getOrCreate()
   ```
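
   Putting the proposed ordering together, a minimal sketch; the list contents are assumed and `is_spark_home` below is a stand-in for the real check in `find_spark_home.py`:

   ```python
   import os
   from importlib.util import find_spec

   def is_spark_home(path):
       # Stand-in for the real check: a Spark home ships a jars/ directory.
       return os.path.isdir(os.path.join(path, "jars"))

   module_home = os.path.dirname(find_spec("pyspark").origin)
   paths = [
       "../",                                            # source-checkout layout
       os.path.join(module_home, "spark-distribution"),  # pip-downloaded Spark first
       os.path.dirname(os.path.realpath(__file__)),      # script dir, e.g. the user's bin/
       module_home,                                      # plain pip install (jars packed in)
   ]
   spark_home = next((p for p in paths if is_spark_home(p)), None)
   ```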










[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-689978218


   Thank you for pinging me, @HyukjinKwon .
   
   cc @gatorsmile . Do you have any opinion on Hive 1.2 at Apache Spark 3.1.0?










[GitHub] [spark] SparkQA removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-696601021


   **[Test build #128973 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128973/testReport)** for PR 29703 at commit [`20491e0`](https://github.com/apache/spark/commit/20491e0cdd5b1fae207cf20d8091d4c456728b39).










[GitHub] [spark] SparkQA removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-691960448


   **[Test build #128639 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128639/testReport)** for PR 29703 at commit [`83815e0`](https://github.com/apache/spark/commit/83815e03ee18813b5ee9f41ee48c7408529f259b).




[GitHub] [spark] SparkQA removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-693778396


   **[Test build #128789 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128789/testReport)** for PR 29703 at commit [`033a33e`](https://github.com/apache/spark/commit/033a33ee515b95342e8c5a74e63054d915661579).




[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-693796417


   **[Test build #128793 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128793/testReport)** for PR 29703 at commit [`09997b7`](https://github.com/apache/spark/commit/09997b7c92d608ea675d86d9d6d28e641654dc9f).




[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-695746290


   **[Test build #128903 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128903/testReport)** for PR 29703 at commit [`058e61a`](https://github.com/apache/spark/commit/058e61ae143a3619fc13fb0c316044a75f7bc15f).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-696601021


   **[Test build #128973 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128973/testReport)** for PR 29703 at commit [`20491e0`](https://github.com/apache/spark/commit/20491e0cdd5b1fae207cf20d8091d4c456728b39).




[GitHub] [spark] HyukjinKwon closed pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #29703:
URL: https://github.com/apache/spark/pull/29703


   




[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-691960448


   **[Test build #128639 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128639/testReport)** for PR 29703 at commit [`83815e0`](https://github.com/apache/spark/commit/83815e03ee18813b5ee9f41ee48c7408529f259b).




[GitHub] [spark] AmplabJenkins commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-693796763








[GitHub] [spark] HyukjinKwon commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r486062434



##########
File path: python/pyspark/install.py
##########
@@ -0,0 +1,170 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import os
+import re
+import tarfile
+import traceback
+import urllib.request
+from shutil import rmtree
+# NOTE that we shouldn't import pyspark here because this is used in
+# setup.py, which assumes no PySpark is importable yet.
+
+DEFAULT_HADOOP = "hadoop3.2"
+DEFAULT_HIVE = "hive2.3"
+SUPPORTED_HADOOP_VERSIONS = ["hadoop2.7", "hadoop3.2", "without-hadoop"]
+SUPPORTED_HIVE_VERSIONS = ["hive1.2", "hive2.3"]
+UNSUPPORTED_COMBINATIONS = [
+    ("without-hadoop", "hive1.2"),
+    ("hadoop3.2", "hive1.2"),
+]
+
+
+def checked_package_name(spark_version, hadoop_version, hive_version):
+    if hive_version == "hive1.2":
+        return "%s-bin-%s-%s" % (spark_version, hadoop_version, hive_version)
+    else:
+        return "%s-bin-%s" % (spark_version, hadoop_version)
+
+
+def checked_versions(spark_version, hadoop_version, hive_version):
+    """
+    Check the valid combinations of supported versions in Spark distributions.
+
+    :param spark_version: Spark version. It should be X.X.X such as '3.0.0' or spark-3.0.0.
+    :param hadoop_version: Hadoop version. It should be X.X such as '2.7' or 'hadoop2.7'.
+        'without' and 'without-hadoop' are supported as special keywords for Hadoop free
+        distribution.
+    :param hive_version: Hive version. It should be X.X such as '1.2' or 'hive1.2'.
+
+    :return: the fully-qualified versions of Spark, Hadoop and Hive in a tuple,
+        for example, spark-3.0.0, hadoop3.2 and hive2.3.
+    """
+    if re.match("^[0-9]+\\.[0-9]+\\.[0-9]+$", spark_version):
+        spark_version = "spark-%s" % spark_version
+    if not spark_version.startswith("spark-"):
+        raise RuntimeError(
+            "Spark version should start with 'spark-' prefix; however, "
+            "got %s" % spark_version)
+
+    if hadoop_version == "without":
+        hadoop_version = "without-hadoop"
+    elif re.match("^[0-9]+\\.[0-9]+$", hadoop_version):
+        hadoop_version = "hadoop%s" % hadoop_version
+
+    if hadoop_version not in SUPPORTED_HADOOP_VERSIONS:
+        raise RuntimeError(
+            "Spark distribution of %s is not supported. Hadoop version should be "
+            "one of [%s]" % (hadoop_version, ", ".join(
+                SUPPORTED_HADOOP_VERSIONS)))
+
+    if re.match("^[0-9]+\\.[0-9]+$", hive_version):
+        hive_version = "hive%s" % hive_version
+
+    if hive_version not in SUPPORTED_HIVE_VERSIONS:
+        raise RuntimeError(
+            "Spark distribution of %s is not supported. Hive version should be "
+            "one of [%s]" % (hive_version, ", ".join(
+                SUPPORTED_HIVE_VERSIONS)))
+
+    if (hadoop_version, hive_version) in UNSUPPORTED_COMBINATIONS:
+        raise RuntimeError("Hive 1.2 should only be used with Hadoop 2.7.")
+
+    return spark_version, hadoop_version, hive_version
+
+
+def install_spark(dest, spark_version, hadoop_version, hive_version):
+    """
+    Installs Spark that corresponds to the given Hadoop version in the current
+    library directory.
+
+    :param dest: The location to download and install the Spark.
+    :param spark_version: Spark version. It should be spark-X.X.X form.
+    :param hadoop_version: Hadoop version. It should be hadoopX.X
+        such as 'hadoop2.7' or 'without-hadoop'.
+    :param hive_version: Hive version. It should be hiveX.X such as 'hive1.2'.
+    """
+
+    package_name = checked_package_name(spark_version, hadoop_version, hive_version)
+    package_local_path = os.path.join(dest, "%s.tgz" % package_name)
+    sites = get_preferred_mirrors()
+    print("Trying to download Spark %s from [%s]" % (spark_version, ", ".join(sites)))
+
+    pretty_pkg_name = "%s for Hadoop %s" % (
+        spark_version,
+        "Free build" if hadoop_version == "without-hadoop" else hadoop_version)
+
+    for site in sites:
+        os.makedirs(dest, exist_ok=True)
+        url = "%s/spark/%s/%s.tgz" % (site, spark_version, package_name)
+
+        tar = None
+        try:
+            print("Downloading %s from:\n- %s" % (pretty_pkg_name, url))
+            download_to_file(urllib.request.urlopen(url), package_local_path)
+
+            print("Installing to %s" % dest)
+            tar = tarfile.open(package_local_path, "r:gz")
+            for member in tar.getmembers():
+                if member.name == package_name:
+                    # Skip the root directory.
+                    continue
+                member.name = os.path.relpath(member.name, package_name + os.path.sep)
+                tar.extract(member, dest)
+            return
+        except Exception:
+            print("Failed to download %s from %s:" % (pretty_pkg_name, url))
+            traceback.print_exc()
+            rmtree(dest, ignore_errors=True)
+        finally:
+            if tar is not None:
+                tar.close()
+            if os.path.exists(package_local_path):
+                os.remove(package_local_path)
+    raise IOError("Unable to download %s." % pretty_pkg_name)
+
+
+def get_preferred_mirrors():
+    mirror_urls = []
+    for _ in range(3):
+        try:
+            response = urllib.request.urlopen(
+                "https://www.apache.org/dyn/closer.lua?preferred=true")
+            mirror_urls.append(response.read().decode('utf-8'))
+        except Exception:
+            # If we can't get a mirror URL, skip it. No retry.
+            pass
+
+    default_sites = [
+        "https://archive.apache.org/dist", "https://dist.apache.org/repos/dist/release"]
+    return list(set(mirror_urls)) + default_sites
+
+
+def download_to_file(response, path, chunk_size=1024 * 1024):
+    total_size = int(response.info().get('Content-Length').strip())
+    bytes_so_far = 0
+
+    with open(path, mode="wb") as dest:
+        while True:
+            chunk = response.read(chunk_size)
+            bytes_so_far += len(chunk)
+            if not chunk:
+                break
+            dest.write(chunk)
+            print("Downloaded %d of %d bytes (%0.2f%%)" % (
+                bytes_so_far,
+                total_size,
+                round(float(bytes_so_far) / total_size * 100, 2)))

Review comment:
       The purpose of showing the progress is twofold:
   
   - The printout will be seen when running `pip install ... -v`.
   - With plain `pip` (no `-v`), the spinner in the output keeps moving while we print, showing that the build is in progress (otherwise it looks like the installation hangs). For example:
   
       ```
         Building wheel for pyspark (setup.py) ... -
         Building wheel for pyspark (setup.py) ... \
         Building wheel for pyspark (setup.py) ... |
       ``` 
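
   (As an aside, a concrete way to see these progress lines, using the env-var based selection this PR adds:)

    ```bash
    # Verbose mode passes through the progress printed by install.py.
    HADOOP_VERSION=3.2 pip install -v pyspark
    ```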
   






[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-690038912


   **[Test build #128495 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128495/testReport)** for PR 29703 at commit [`75dbcaf`](https://github.com/apache/spark/commit/75dbcafacee9574575c1c317aa8b4e294e43af3d).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-692045260








[GitHub] [spark] SparkQA removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-696503296








[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-696469772


   **[Test build #128956 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128956/testReport)** for PR 29703 at commit [`33594e1`](https://github.com/apache/spark/commit/33594e147263cedcefffae3441a7cfdaf3619a5c).




[GitHub] [spark] HyukjinKwon commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-696500598


   I tested again on both Windows and Mac.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-696470026








[GitHub] [spark] HyukjinKwon commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-690820353


   > 1. How does this interact with the ability to specify dependencies in a requirements file? It seems odd to have to do something like HADOOP_VERSION=3.2 pip install -r requirements.txt, because, kinda like with --install-option, we've now modified pip's behavior across all the libraries it's going to install.
   >
   >    I also wonder if this plays well with tools like pip-tools that compile down requirements files into the full list of their transitive dependencies. I'm guessing users will need to manually preserve the environment variables, because they will not be reflected in the compiled requirements.
   
   I agree that it doesn't look very pip-friendly. That's why I had to investigate a lot and write down what I checked in the PR description.
   
   `--install-option` is supported via `requirements.txt`, so once pip provides a proper way to configure this, we will switch to it (at SPARK-32837). We can't use this option for now due to https://github.com/pypa/pip/issues/1883. There seem to be no other viable ways given my investigation.
   
   We can keep this as an experimental mode for the time being, and switch to the proper pip installation option once pip supports it.
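
   (A minimal sketch of the interaction in point 1: the environment variable scopes to the whole pip invocation, so any other package that pip builds from source in the same run would see it too, not only PySpark.)

    ```bash
    # requirements.txt pins pyspark among other dependencies, e.g.:
    #   pyspark==3.1.0
    # The variable applies to the entire run, not to pyspark alone:
    HADOOP_VERSION=3.2 pip install -r requirements.txt
    ```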
   
   > 2. Have you considered publishing these alternate builds under different package names? e.g. pyspark-hadoop3.2. This avoids the need to mess with environment variables, and delivers a more vanilla install experience. But it will also push us to define upfront what combinations to publish builds for to PyPI.
   
   I have thought about this option too, but:
   - I think we'll end up having multiple packages, one per profile we support.
   - I still think using pip's native configuration is the ideal way. By using environment variables, we can easily switch to pip's option in the future.
   - Minor, but it will be difficult to track usage (https://pypistats.org/packages/pyspark).
   
   > 3. Are you sure it's OK to point at archive.apache.org? Everyone installing a non-current version of PySpark with alternate versions of Hadoop / Hive specified will hit the archive. Unlike PyPI, the Apache archive is not backed by a generous CDN:
   >
   >     Do note that a daily limit of 5GB per IP is being enforced on archive.apache.org, to prevent abuse.
   >
   >     In Flintrock, I never touch the archive out of fear of being an "abusive user". This is another argument for publishing alternate packages to PyPI.
   
   Yeah, I understand this can be a valid concern. But the archive is already publicly available and people use it. It's also used in our own CI:
   
   https://github.com/apache/spark/blob/b84ed4146d93b37adb2b83ca642c7978a1ac853e/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala#L82
   
   The PR just makes it easier to download old versions from these sites. We could also make the download location configurable by exposing an environment variable (see the sketch below).
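
   (A minimal sketch of that idea, using a hypothetical `PYSPARK_RELEASE_MIRROR` variable that is not part of this PR; the default sites are the ones from the diff above:)

    ```python
    import os

    DEFAULT_SITES = [
        "https://archive.apache.org/dist",
        "https://dist.apache.org/repos/dist/release",
    ]

    def get_sites():
        # Hypothetical override: let users point at their own mirror
        # before falling back to the Apache defaults.
        override = os.environ.get("PYSPARK_RELEASE_MIRROR")
        if override:
            return [override]
        return DEFAULT_SITES
    ```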




[GitHub] [spark] ueshin commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
ueshin commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r492417508



##########
File path: python/pyspark/find_spark_home.py
##########
@@ -42,7 +42,11 @@ def is_spark_home(path):
     import_error_raised = False
     from importlib.util import find_spec
     try:
+        # Spark distribution can be downloaded when HADOOP_VERSION environment variable is set.
+        # We should look up this directory first, see also SPARK-32017.
+        spark_dist_dir = "spark-distribution"
         module_home = os.path.dirname(find_spec("pyspark").origin)
+        paths.append(os.path.join(module_home, spark_dist_dir))

Review comment:
       We should insert this entry between `"../"` and `os.path.dirname(os.path.realpath(__file__))`, which is the same as the module home if pip is used for the installation; otherwise, `os.path.dirname(os.path.realpath(__file__))` will be the spark home.
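
   (A sketch of the resulting search order under this suggestion, with `module_home` resolved as in the diff above; this is illustrative, not the exact patch:)

    ```python
    import os
    from importlib.util import find_spec

    module_home = os.path.dirname(find_spec("pyspark").origin)

    # Check "../" first, then the downloaded distribution, and only then
    # the module directory itself, so a downloaded distribution takes
    # precedence over the plain pip-installed module directory.
    paths = [
        "../",
        os.path.join(module_home, "spark-distribution"),
        os.path.dirname(os.path.realpath(__file__)),
    ]
    ```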






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-693950638








[GitHub] [spark] HyukjinKwon commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r486056702



##########
File path: python/pyspark/install.py
##########
@@ -0,0 +1,170 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import os
+import re
+import tarfile
+import traceback
+import urllib.request
+from shutil import rmtree
+# NOTE that we shouldn't import pyspark here because this is used in
+# setup.py, which assumes no PySpark is importable yet.
+
+DEFAULT_HADOOP = "hadoop3.2"
+DEFAULT_HIVE = "hive2.3"
+SUPPORTED_HADOOP_VERSIONS = ["hadoop2.7", "hadoop3.2", "without-hadoop"]
+SUPPORTED_HIVE_VERSIONS = ["hive1.2", "hive2.3"]
+UNSUPPORTED_COMBINATIONS = [
+    ("without-hadoop", "hive1.2"),
+    ("hadoop3.2", "hive1.2"),
+]
+
+
+def checked_package_name(spark_version, hadoop_version, hive_version):
+    if hive_version == "hive1.2":
+        return "%s-bin-%s-%s" % (spark_version, hadoop_version, hive_version)
+    else:
+        return "%s-bin-%s" % (spark_version, hadoop_version)
+
+
+def checked_versions(spark_version, hadoop_version, hive_version):
+    """
+    Check the valid combinations of supported versions in Spark distributions.
+
+    :param spark_version: Spark version. It should be X.X.X such as '3.0.0' or spark-3.0.0.
+    :param hadoop_version: Hadoop version. It should be X.X such as '2.7' or 'hadoop2.7'.
+        'without' and 'without-hadoop' are supported as special keywords for Hadoop free
+        distribution.
+    :param hive_version: Hive version. It should be X.X such as '1.2' or 'hive1.2'.
+
+    :return: the fully-qualified versions of Spark, Hadoop and Hive in a tuple,
+        for example, spark-3.0.0, hadoop3.2 and hive2.3.
+    """
+    if re.match("^[0-9]+\\.[0-9]+\\.[0-9]+$", spark_version):
+        spark_version = "spark-%s" % spark_version
+    if not spark_version.startswith("spark-"):
+        raise RuntimeError(
+            "Spark version should start with 'spark-' prefix; however, "
+            "got %s" % spark_version)
+
+    if hadoop_version == "without":
+        hadoop_version = "without-hadoop"
+    elif re.match("^[0-9]+\\.[0-9]+$", hadoop_version):
+        hadoop_version = "hadoop%s" % hadoop_version
+
+    if hadoop_version not in SUPPORTED_HADOOP_VERSIONS:
+        raise RuntimeError(
+            "Spark distribution of %s is not supported. Hadoop version should be "
+            "one of [%s]" % (hadoop_version, ", ".join(
+                SUPPORTED_HADOOP_VERSIONS)))
+
+    if re.match("^[0-9]+\\.[0-9]+$", hive_version):
+        hive_version = "hive%s" % hive_version
+
+    if hive_version not in SUPPORTED_HIVE_VERSIONS:
+        raise RuntimeError(
+            "Spark distribution of %s is not supported. Hive version should be "
+            "one of [%s]" % (hive_version, ", ".join(
+                SUPPORTED_HIVE_VERSIONS)))
+
+    if (hadoop_version, hive_version) in UNSUPPORTED_COMBINATIONS:
+        raise RuntimeError("Hive 1.2 should only be used with Hadoop 2.7.")
+
+    return spark_version, hadoop_version, hive_version
+
+
+def install_spark(dest, spark_version, hadoop_version, hive_version):

Review comment:
       I basically referred to https://github.com/apache/spark/blob/b84ed4146d93b37adb2b83ca642c7978a1ac853e/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala#L70-L111
   and
   https://github.com/apache/spark/blob/f53d8c63e80172295e2fbc805c0c391bdececcaa/R/pkg/R/install.R#L68-L161






[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-689975060


   **[Test build #128485 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128485/testReport)** for PR 29703 at commit [`75dbcaf`](https://github.com/apache/spark/commit/75dbcafacee9574575c1c317aa8b4e294e43af3d).




[GitHub] [spark] HyukjinKwon commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-693777672


   I proofread, tested again and fixed some docs.




[GitHub] [spark] HyukjinKwon commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-697052017


   Merged to master.




[GitHub] [spark] HyukjinKwon commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r489909606



##########
File path: python/docs/source/getting_started/installation.rst
##########
@@ -38,8 +38,36 @@ PySpark installation using `PyPI <https://pypi.org/project/pyspark/>`_
 .. code-block:: bash

Review comment:
       I am going to rewrite this page after this PR gets merged.






[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-690035157








[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-694006704


   **[Test build #128793 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128793/testReport)** for PR 29703 at commit [`09997b7`](https://github.com/apache/spark/commit/09997b7c92d608ea675d86d9d6d28e641654dc9f).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-694008850








[GitHub] [spark] HyukjinKwon commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r489937250



##########
File path: python/pyspark/install.py
##########
@@ -0,0 +1,170 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import os
+import re
+import tarfile
+import traceback
+import urllib.request
+from shutil import rmtree
+# NOTE that we shouldn't import pyspark here because this is used in
+# setup.py, which assumes no PySpark is importable yet.
+
+DEFAULT_HADOOP = "hadoop3.2"
+DEFAULT_HIVE = "hive2.3"
+SUPPORTED_HADOOP_VERSIONS = ["hadoop2.7", "hadoop3.2", "without-hadoop"]
+SUPPORTED_HIVE_VERSIONS = ["hive1.2", "hive2.3"]
+UNSUPPORTED_COMBINATIONS = [
+    ("without-hadoop", "hive1.2"),
+    ("hadoop3.2", "hive1.2"),
+]
+
+
+def checked_package_name(spark_version, hadoop_version, hive_version):
+    if hive_version == "hive1.2":
+        return "%s-bin-%s-%s" % (spark_version, hadoop_version, hive_version)
+    else:
+        return "%s-bin-%s" % (spark_version, hadoop_version)
+
+
+def checked_versions(spark_version, hadoop_version, hive_version):
+    """
+    Check the valid combinations of supported versions in Spark distributions.
+
+    :param spark_version: Spark version. It should be X.X.X such as '3.0.0' or spark-3.0.0.
+    :param hadoop_version: Hadoop version. It should be X.X such as '2.7' or 'hadoop2.7'.
+        'without' and 'without-hadoop' are supported as special keywords for Hadoop free
+        distribution.
+    :param hive_version: Hive version. It should be X.X such as '1.2' or 'hive1.2'.
+
+    :return: the fully-qualified versions of Spark, Hadoop and Hive in a tuple,
+        for example, spark-3.0.0, hadoop3.2 and hive2.3.
+    """
+    if re.match("^[0-9]+\\.[0-9]+\\.[0-9]+$", spark_version):
+        spark_version = "spark-%s" % spark_version
+    if not spark_version.startswith("spark-"):
+        raise RuntimeError(
+            "Spark version should start with 'spark-' prefix; however, "
+            "got %s" % spark_version)
+
+    if hadoop_version == "without":
+        hadoop_version = "without-hadoop"

Review comment:
       It is verified later, at `if hadoop_version not in SUPPORTED_HADOOP_VERSIONS:` below. There are test cases here: https://github.com/apache/spark/pull/29703/files/033a33ee515b95342e8c5a74e63054d915661579#diff-e23af4eb5cc3bf6af4bc26cb801b7e84R69 and https://github.com/apache/spark/pull/29703/files/033a33ee515b95342e8c5a74e63054d915661579#diff-e23af4eb5cc3bf6af4bc26cb801b7e84R88
   
   Users can also specify the fully-qualified Hadoop and Hive versions, such as `hadoop3.2` and `hive2.3`, but I didn't document this. These keywords are actually ported from SparkR's `SparkR::install.spark`. A quick illustration follows below.
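
   (Assuming `python/pyspark/install.py` is importable as `pyspark.install` per the diff above:)

    ```python
    from pyspark.install import checked_versions

    # Short forms are expanded to the fully-qualified names.
    assert checked_versions("3.0.0", "2.7", "1.2") == \
        ("spark-3.0.0", "hadoop2.7", "hive1.2")
    # Fully-qualified forms pass through unchanged.
    assert checked_versions("spark-3.0.0", "hadoop2.7", "hive1.2") == \
        ("spark-3.0.0", "hadoop2.7", "hive1.2")
    # "without" is a keyword for the Hadoop-free distribution.
    assert checked_versions("3.0.0", "without", "2.3") == \
        ("spark-3.0.0", "without-hadoop", "hive2.3")
    ```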






[GitHub] [spark] AmplabJenkins commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-696470026








[GitHub] [spark] HyukjinKwon commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r486718812



##########
File path: python/pyspark/install.py
##########
@@ -0,0 +1,170 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import os
+import re
+import tarfile
+import traceback
+import urllib.request
+from shutil import rmtree
+# NOTE that we shouldn't import pyspark here because this is used in
+# setup.py, which assumes no PySpark is importable yet.
+
+DEFAULT_HADOOP = "hadoop3.2"
+DEFAULT_HIVE = "hive2.3"
+SUPPORTED_HADOOP_VERSIONS = ["hadoop2.7", "hadoop3.2", "without-hadoop"]
+SUPPORTED_HIVE_VERSIONS = ["hive1.2", "hive2.3"]
+UNSUPPORTED_COMBINATIONS = [
+    ("without-hadoop", "hive1.2"),
+    ("hadoop3.2", "hive1.2"),
+]
+
+
+def checked_package_name(spark_version, hadoop_version, hive_version):
+    if hive_version == "hive1.2":
+        return "%s-bin-%s-%s" % (spark_version, hadoop_version, hive_version)
+    else:
+        return "%s-bin-%s" % (spark_version, hadoop_version)
+
+
+def checked_versions(spark_version, hadoop_version, hive_version):
+    """
+    Check the valid combinations of supported versions in Spark distributions.
+
+    :param spark_version: Spark version. It should be X.X.X such as '3.0.0' or spark-3.0.0.
+    :param hadoop_version: Hadoop version. It should be X.X such as '2.7' or 'hadoop2.7'.
+        'without' and 'without-hadoop' are supported as special keywords for Hadoop free
+        distribution.
+    :param hive_version: Hive version. It should be X.X such as '1.2' or 'hive1.2'.
+
+    :return: the fully-qualified versions of Spark, Hadoop and Hive in a tuple,
+        for example, spark-3.0.0, hadoop3.2 and hive2.3.
+    """
+    if re.match("^[0-9]+\\.[0-9]+\\.[0-9]+$", spark_version):
+        spark_version = "spark-%s" % spark_version
+    if not spark_version.startswith("spark-"):
+        raise RuntimeError(
+            "Spark version should start with 'spark-' prefix; however, "
+            "got %s" % spark_version)
+
+    if hadoop_version == "without":
+        hadoop_version = "without-hadoop"
+    elif re.match("^[0-9]+\\.[0-9]+$", hadoop_version):
+        hadoop_version = "hadoop%s" % hadoop_version
+
+    if hadoop_version not in SUPPORTED_HADOOP_VERSIONS:
+        raise RuntimeError(
+            "Spark distribution of %s is not supported. Hadoop version should be "
+            "one of [%s]" % (hadoop_version, ", ".join(
+                SUPPORTED_HADOOP_VERSIONS)))
+
+    if re.match("^[0-9]+\\.[0-9]+$", hive_version):
+        hive_version = "hive%s" % hive_version
+
+    if hive_version not in SUPPORTED_HIVE_VERSIONS:
+        raise RuntimeError(
+            "Spark distribution of %s is not supported. Hive version should be "
+            "one of [%s]" % (hive_version, ", ".join(
+                SUPPORTED_HIVE_VERSIONS)))
+
+    if (hadoop_version, hive_version) in UNSUPPORTED_COMBINATIONS:
+        raise RuntimeError("Hive 1.2 should only be used with Hadoop 2.7.")
+
+    return spark_version, hadoop_version, hive_version
+
+
+def install_spark(dest, spark_version, hadoop_version, hive_version):
+    """
+    Installs Spark that corresponds to the given Hadoop version in the current
+    library directory.
+
+    :param dest: The location to download and install the Spark.
+    :param spark_version: Spark version. It should be spark-X.X.X form.
+    :param hadoop_version: Hadoop version. It should be hadoopX.X
+        such as 'hadoop2.7' or 'without-hadoop'.
+    :param hive_version: Hive version. It should be hiveX.X such as 'hive1.2'.
+    """
+
+    package_name = checked_package_name(spark_version, hadoop_version, hive_version)
+    package_local_path = os.path.join(dest, "%s.tgz" % package_name)
+    sites = get_preferred_mirrors()
+    print("Trying to download Spark %s from [%s]" % (spark_version, ", ".join(sites)))
+
+    pretty_pkg_name = "%s for Hadoop %s" % (
+        spark_version,
+        "Free build" if hadoop_version == "without-hadoop" else hadoop_version)
+
+    for site in sites:
+        os.makedirs(dest, exist_ok=True)
+        url = "%s/spark/%s/%s.tgz" % (site, spark_version, package_name)
+
+        tar = None
+        try:
+            print("Downloading %s from:\n- %s" % (pretty_pkg_name, url))
+            download_to_file(urllib.request.urlopen(url), package_local_path)
+
+            print("Installing to %s" % dest)
+            tar = tarfile.open(package_local_path, "r:gz")
+            for member in tar.getmembers():
+                if member.name == package_name:
+                    # Skip the root directory.
+                    continue
+                member.name = os.path.relpath(member.name, package_name + os.path.sep)
+                tar.extract(member, dest)
+            return
+        except Exception:
+            print("Failed to download %s from %s:" % (pretty_pkg_name, url))
+            traceback.print_exc()
+            rmtree(dest, ignore_errors=True)
+        finally:
+            if tar is not None:
+                tar.close()
+            if os.path.exists(package_local_path):
+                os.remove(package_local_path)
+    raise IOError("Unable to download %s." % pretty_pkg_name)
+
+
+def get_preferred_mirrors():
+    mirror_urls = []
+    for _ in range(3):
+        try:
+            response = urllib.request.urlopen(
+                "https://www.apache.org/dyn/closer.lua?preferred=true")
+            mirror_urls.append(response.read().decode('utf-8'))
+        except Exception:
+            # If we can't get a mirror URL, skip it. No retry.
+            pass
+
+    default_sites = [
+        "https://archive.apache.org/dist", "https://dist.apache.org/repos/dist/release"]

Review comment:
       When users install, it will most likely be the latest version. I guess that's the reason why we moved the old versions into the archive and keep only the latest versions on the mirrors.
   
   People are already using the archive to download old versions or to set up CI. This PR just makes that easier.






[GitHub] [spark] SparkQA removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-690038912


   **[Test build #128495 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128495/testReport)** for PR 29703 at commit [`75dbcaf`](https://github.com/apache/spark/commit/75dbcafacee9574575c1c317aa8b4e294e43af3d).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-695504335








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-694008850


   Merged build finished. Test FAILed.




[GitHub] [spark] AmplabJenkins commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-689973252








[GitHub] [spark] AmplabJenkins commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-696547712








[GitHub] [spark] AmplabJenkins commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-695504335








[GitHub] [spark] HyukjinKwon closed pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #29703:
URL: https://github.com/apache/spark/pull/29703


   




[GitHub] [spark] HyukjinKwon commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r486064585



##########
File path: dev/create-release/release-build.sh
##########
@@ -275,6 +275,8 @@ if [[ "$1" == "package" ]]; then
   # In dry run mode, only build the first one. The keys in BINARY_PKGS_ARGS are used as the
   # list of packages to be built, so it's ok for things to be missing in BINARY_PKGS_EXTRA.
 
+  # NOTE: Don't forget to update the valid combinations of distributions at
+  #   'python/pyspark/install.py' if you're changing them.
   declare -A BINARY_PKGS_ARGS
   BINARY_PKGS_ARGS["hadoop3.2"]="-Phadoop-3.2 $HIVE_PROFILES"

Review comment:
       If we happen to drop Hive 1.2 (or add other combinations of profiles in the distributions), we'll have to change this and [here](https://github.com/apache/spark/pull/29703/files#diff-87e663b6bc59c82beaf09ead1840ac4aR26-R41). I believe this could be done separately later.
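   For reference, a minimal sketch of the tables in `python/pyspark/install.py` that this note asks to keep in sync (copied from the diff quoted later in this thread):
   
   ```py
   # Distribution combinations published by release-build.sh, as mirrored in
   # python/pyspark/install.py in this PR.
   SUPPORTED_HADOOP_VERSIONS = ["hadoop2.7", "hadoop3.2", "without-hadoop"]
   SUPPORTED_HIVE_VERSIONS = ["hive1.2", "hive2.3"]
   UNSUPPORTED_COMBINATIONS = [
       ("without-hadoop", "hive1.2"),
       ("hadoop3.2", "hive1.2"),
   ]
   ```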







[GitHub] [spark] ueshin commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
ueshin commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r492417508



##########
File path: python/pyspark/find_spark_home.py
##########
@@ -42,7 +42,11 @@ def is_spark_home(path):
     import_error_raised = False
     from importlib.util import find_spec
     try:
+        # Spark distribution can be downloaded when HADOOP_VERSION environment variable is set.
+        # We should look up this directory first, see also SPARK-32017.
+        spark_dist_dir = "spark-distribution"
         module_home = os.path.dirname(find_spec("pyspark").origin)
+        paths.append(os.path.join(module_home, spark_dist_dir))

Review comment:
       We should insert this entry before `os.path.dirname(os.path.realpath(__file__))`, which is the same as the module home if pip is used for the installation; otherwise, the module home will be the spark home and the distribution under `spark-distribution` is not used.
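   
   A minimal sketch of the lookup order being suggested here, assuming the candidate list used by `_find_spark_home` (abbreviated and illustrative, not the final merged code):
   
   ```py
   import os
   from importlib.util import find_spec
   
   module_home = os.path.dirname(find_spec("pyspark").origin)
   paths = [
       "../",                                            # dev checkout (spark/python)
       os.path.join(module_home, "spark-distribution"),  # downloaded distribution, checked first
       os.path.dirname(os.path.realpath(__file__)),      # == module home for pip installs
   ]
   ```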






[GitHub] [spark] HyukjinKwon commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-690820353


   > 1. How does this interact with the ability to specify dependencies in a requirements file? It seems odd to have to do something like HADOOP_VERSION=3.2 pip install -r requirements.txt, because, kinda like with --install-option, we've now modified pip's behavior across all the libraries it's going to install.
   >
   >    I also wonder if this plays well with tools like pip-tools that compile down requirements files into the full list of their transitive dependencies. I'm guessing users will need to manually preserve the environment variables, because they will not be reflected in the compiled requirements.
   
   I agree that it doesn't look very pip-friendly. That's why I had to investigate a lot and write down what I checked in the PR description.
   
   `--install-option` is supported via `requirements.txt`, so once pip provides a proper way to configure this, we will switch to it (at SPARK-32837). We can't use this option for now due to https://github.com/pypa/pip/issues/1883. There seem to be no other ways possible, given my investigation.
   
   We can keep this as an experimental mode for the time being, and switch to the proper pip installation option once they support it in the future.
   
   > 2. Have you considered publishing these alternate builds under different package names? e.g. pyspark-hadoop3.2. This avoids the need to mess with environment variables, and delivers a more vanilla install experience. But it will also push us to define upfront what combinations to publish builds for to PyPI.
   
   I have thought about this option too, but:
   - I think we'll end up having multiple packages, one per combination of profiles we support.
   - I still think using pip's native configuration is the ideal way. By using environment variables, we can easily switch to pip's option in the future.
   - Minor, but it will be difficult to track the usage (https://pypistats.org/packages/pyspark).
   
   > 3. Are you sure it's OK to point at archive.apache.org? Everyone installing a non-current version of PySpark with alternate versions of Hadoop / Hive specified will hit the archive. Unlike PyPI, the Apache archive is not backed by a generous CDN:
   >
   >     Do note that a daily limit of 5GB per IP is being enforced on archive.apache.org, to prevent abuse.
   >
   >     In Flintrock, I never touch the archive out of fear of being an "abusive user". This is another argument for publishing alternate packages to PyPI.
   
   Yeah, I understand this can be a valid concern. But this is already available to use and people use it. Also, it's used in our own CI:
   
   https://github.com/apache/spark/blob/b84ed4146d93b37adb2b83ca642c7978a1ac853e/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala#L82
   
   This PR just makes it easier to use the archive to download old versions. We can also make the download site configurable by exposing an environment variable.
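   
   As a rough illustration of the environment-variable flow discussed above, a sketch using the helpers from this PR's `python/pyspark/install.py`; it is shown as a post-install call rather than the actual `setup.py` wiring:
   
   ```py
   import os
   from pyspark.install import (
       DEFAULT_HADOOP, DEFAULT_HIVE, checked_versions, install_spark)
   
   # Read the same selectors that the PR documents for pip installs.
   spark, hadoop, hive = checked_versions(
       spark_version="3.0.1",  # example version string, not hard-coded in the PR
       hadoop_version=os.environ.get("HADOOP_VERSION", DEFAULT_HADOOP),
       hive_version=os.environ.get("HIVE_VERSION", DEFAULT_HIVE))
   
   # Download and unpack the matching distribution; the PR itself targets a
   # "spark-distribution" directory inside the installed pyspark package.
   install_spark(dest="./spark-distribution", spark_version=spark,
                 hadoop_version=hadoop, hive_version=hive)
   ```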








[GitHub] [spark] nchammas commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
nchammas commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r486712211



##########
File path: python/pyspark/install.py
##########
@@ -0,0 +1,170 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import os
+import re
+import tarfile
+import traceback
+import urllib.request
+from shutil import rmtree
+# NOTE that we shouldn't import pyspark here because this is used in
+# setup.py, and it assumes PySpark is not yet importable.
+
+DEFAULT_HADOOP = "hadoop3.2"
+DEFAULT_HIVE = "hive2.3"
+SUPPORTED_HADOOP_VERSIONS = ["hadoop2.7", "hadoop3.2", "without-hadoop"]
+SUPPORTED_HIVE_VERSIONS = ["hive1.2", "hive2.3"]
+UNSUPPORTED_COMBINATIONS = [
+    ("without-hadoop", "hive1.2"),
+    ("hadoop3.2", "hive1.2"),
+]
+
+
+def checked_package_name(spark_version, hadoop_version, hive_version):
+    if hive_version == "hive1.2":
+        return "%s-bin-%s-%s" % (spark_version, hadoop_version, hive_version)
+    else:
+        return "%s-bin-%s" % (spark_version, hadoop_version)
+
+
+def checked_versions(spark_version, hadoop_version, hive_version):
+    """
+    Check the valid combinations of supported versions in Spark distributions.
+
+    :param spark_version: Spark version. It should be X.X.X such as '3.0.0' or spark-3.0.0.
+    :param hadoop_version: Hadoop version. It should be X.X such as '2.7' or 'hadoop2.7'.
+        'without' and 'without-hadoop' are supported as special keywords for Hadoop free
+        distribution.
+    :param hive_version: Hive version. It should be X.X such as '1.2' or 'hive1.2'.
+
+    :return: fully-qualified versions of Spark, Hadoop and Hive in a tuple.
+        For example, spark-3.0.0, hadoop3.2 and hive2.3.
+    """
+    if re.match("^[0-9]+\\.[0-9]+\\.[0-9]+$", spark_version):
+        spark_version = "spark-%s" % spark_version
+    if not spark_version.startswith("spark-"):
+        raise RuntimeError(
+            "Spark version should start with 'spark-' prefix; however, "
+            "got %s" % spark_version)
+
+    if hadoop_version == "without":
+        hadoop_version = "without-hadoop"
+    elif re.match("^[0-9]+\\.[0-9]+$", hadoop_version):
+        hadoop_version = "hadoop%s" % hadoop_version
+
+    if hadoop_version not in SUPPORTED_HADOOP_VERSIONS:
+        raise RuntimeError(
+            "Spark distribution of %s is not supported. Hadoop version should be "
+            "one of [%s]" % (hadoop_version, ", ".join(
+                SUPPORTED_HADOOP_VERSIONS)))
+
+    if re.match("^[0-9]+\\.[0-9]+$", hive_version):
+        hive_version = "hive%s" % hive_version
+
+    if hive_version not in SUPPORTED_HIVE_VERSIONS:
+        raise RuntimeError(
+            "Spark distribution of %s is not supported. Hive version should be "
+            "one of [%s]" % (hive_version, ", ".join(
+                SUPPORTED_HIVE_VERSIONS)))
+
+    if (hadoop_version, hive_version) in UNSUPPORTED_COMBINATIONS:
+        raise RuntimeError("Hive 1.2 should only be with Hadoop 2.7.")
+
+    return spark_version, hadoop_version, hive_version
+
+
+def install_spark(dest, spark_version, hadoop_version, hive_version):
+    """
+    Installs Spark that corresponds to the given Hadoop version in the current
+    library directory.
+
+    :param dest: The location to download and install the Spark.
+    :param spark_version: Spark version. It should be spark-X.X.X form.
+    :param hadoop_version: Hadoop version. It should be hadoopX.X
+        such as 'hadoop2.7' or 'without-hadoop'.
+    :param hive_version: Hive version. It should be hiveX.X such as 'hive1.2'.
+    """
+
+    package_name = checked_package_name(spark_version, hadoop_version, hive_version)
+    package_local_path = os.path.join(dest, "%s.tgz" % package_name)
+    sites = get_preferred_mirrors()
+    print("Trying to download Spark %s from [%s]" % (spark_version, ", ".join(sites)))
+
+    pretty_pkg_name = "%s for Hadoop %s" % (
+        spark_version,
+        "Free build" if hadoop_version == "without" else hadoop_version)
+
+    for site in sites:
+        os.makedirs(dest, exist_ok=True)
+        url = "%s/spark/%s/%s.tgz" % (site, spark_version, package_name)
+
+        tar = None
+        try:
+            print("Downloading %s from:\n- %s" % (pretty_pkg_name, url))
+            download_to_file(urllib.request.urlopen(url), package_local_path)
+
+            print("Installing to %s" % dest)
+            tar = tarfile.open(package_local_path, "r:gz")
+            for member in tar.getmembers():
+                if member.name == package_name:
+                    # Skip the root directory.
+                    continue
+                member.name = os.path.relpath(member.name, package_name + os.path.sep)
+                tar.extract(member, dest)
+            return
+        except Exception:
+            print("Failed to download %s from %s:" % (pretty_pkg_name, url))
+            traceback.print_exc()
+            rmtree(dest, ignore_errors=True)
+        finally:
+            if tar is not None:
+                tar.close()
+            if os.path.exists(package_local_path):
+                os.remove(package_local_path)
+    raise IOError("Unable to download %s." % pretty_pkg_name)
+
+
+def get_preferred_mirrors():
+    mirror_urls = []
+    for _ in range(3):
+        try:
+            response = urllib.request.urlopen(
+                "https://www.apache.org/dyn/closer.lua?preferred=true")
+            mirror_urls.append(response.read().decode('utf-8'))
+        except Exception:
+            # If we can't get a mirror URL, skip it. No retry.
+            pass
+
+    default_sites = [
+        "https://archive.apache.org/dist", "https://dist.apache.org/repos/dist/release"]

Review comment:
       All non-current versions of Spark will hit the archive, since the mirrors only maintain the latest version. I don't think the archive will be able to handle the volume of traffic that will eventually come its way from various people downloading (and re-downloading) Spark, e.g. as part of CI setup.
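   
   For readers following the diff above, a small usage sketch of the normalization helpers (expected values inferred from the code, shown as comments):
   
   ```py
   from pyspark.install import checked_versions, checked_package_name
   
   versions = checked_versions(
       spark_version="3.0.0", hadoop_version="2.7", hive_version="1.2")
   # -> ("spark-3.0.0", "hadoop2.7", "hive1.2")
   
   print(checked_package_name(*versions))
   # -> spark-3.0.0-bin-hadoop2.7-hive1.2  (Hive 1.2 builds carry a hive suffix)
   
   print(checked_package_name("spark-3.0.0", "hadoop3.2", "hive2.3"))
   # -> spark-3.0.0-bin-hadoop3.2          (default Hive 2.3 builds do not)
   ```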






[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-690153942


   **[Test build #128495 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128495/testReport)** for PR 29703 at commit [`75dbcaf`](https://github.com/apache/spark/commit/75dbcafacee9574575c1c317aa8b4e294e43af3d).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.





[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-692043314


   **[Test build #128639 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128639/testReport)** for PR 29703 at commit [`83815e0`](https://github.com/apache/spark/commit/83815e03ee18813b5ee9f41ee48c7408529f259b).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-696668412


   **[Test build #128973 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128973/testReport)** for PR 29703 at commit [`20491e0`](https://github.com/apache/spark/commit/20491e0cdd5b1fae207cf20d8091d4c456728b39).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.





[GitHub] [spark] HyukjinKwon commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-689973644


   cc @srowen, @dongjoon-hyun, @holdenk, @BryanCutler, @viirya, @ueshin FYI





[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-696501755


   **[Test build #128956 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128956/testReport)** for PR 29703 at commit [`33594e1`](https://github.com/apache/spark/commit/33594e147263cedcefffae3441a7cfdaf3619a5c).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-696503296


   **[Test build #128961 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128961/testReport)** for PR 29703 at commit [`20491e0`](https://github.com/apache/spark/commit/20491e0cdd5b1fae207cf20d8091d4c456728b39).




[GitHub] [spark] srowen commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-690270759


   I think exposing the option is fine.








[GitHub] [spark] HyukjinKwon commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r492425304



##########
File path: python/pyspark/find_spark_home.py
##########
@@ -42,7 +42,11 @@ def is_spark_home(path):
     import_error_raised = False
     from importlib.util import find_spec
     try:
+        # Spark distribution can be downloaded when HADOOP_VERSION environment variable is set.
+        # We should look up this directory first, see also SPARK-32017.
+        spark_dist_dir = "spark-distribution"
         module_home = os.path.dirname(find_spec("pyspark").origin)
+        paths.append(os.path.join(module_home, spark_dist_dir))

Review comment:
       `os.path.dirname(os.path.realpath(__file__))` is usually in a script directory because `find_spark_home.py` is included as a script in `setup.py`: https://github.com/apache/spark/blob/058e61ae143a3619fc13fb0c316044a75f7bc15f/python/setup.py#L187
   
   For example, it will be the user's `bin`.
   
   I thought `os.path.dirname(os.path.realpath(__file__))` was more just for dev purposes or for non-pip-installed PySpark. I checked that it correctly points to the newly installed Spark home.
   
   However, sure, why not be safer :-) I will update.

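   As a quick sanity check of which directory wins after this change, one can call the helper this diff modifies (a sketch; `_find_spark_home` is the function defined in `python/pyspark/find_spark_home.py`):
   
   ```py
   # Prints the resolved Spark home; with HADOOP_VERSION set at pip-install time,
   # this should point at .../site-packages/pyspark/spark-distribution.
   from pyspark.find_spark_home import _find_spark_home
   
   print(_find_spark_home())
   ```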







[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-696469772


   **[Test build #128956 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128956/testReport)** for PR 29703 at commit [`33594e1`](https://github.com/apache/spark/commit/33594e147263cedcefffae3441a7cfdaf3619a5c).





[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-693778396


   **[Test build #128789 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128789/testReport)** for PR 29703 at commit [`033a33e`](https://github.com/apache/spark/commit/033a33ee515b95342e8c5a74e63054d915661579).




[GitHub] [spark] HyukjinKwon commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r486718812



##########
File path: python/pyspark/install.py
##########
+def get_preferred_mirrors():
+    mirror_urls = []
+    for _ in range(3):
+        try:
+            response = urllib.request.urlopen(
+                "https://www.apache.org/dyn/closer.lua?preferred=true")
+            mirror_urls.append(response.read().decode('utf-8'))
+        except Exception:
+            # If we can't get a mirror URL, skip it. No retry.
+            pass
+
+    default_sites = [
+        "https://archive.apache.org/dist", "https://dist.apache.org/repos/dist/release"]

Review comment:
       When they install, it will likely be the latest version in most cases. I guess that's why we moved the old versions into the archive and keep only the latest versions on the mirrors.
   
   People are already using the archive to download old versions or to set up CI. This PR just makes that easier.
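   
   The "configurable via an environment variable" idea could look roughly like the sketch below; the variable name `PYSPARK_RELEASE_MIRROR` and the final return line are illustrative, not part of this diff:
   
   ```py
   import os
   import urllib.request
   
   def get_preferred_mirrors():
       # Hypothetical override: let users or CI pin a mirror/CDN and skip the
       # lookup, keeping traffic away from archive.apache.org.
       pinned = os.environ.get("PYSPARK_RELEASE_MIRROR")
       if pinned:
           return [pinned]
       mirror_urls = []
       for _ in range(3):
           try:
               response = urllib.request.urlopen(
                   "https://www.apache.org/dyn/closer.lua?preferred=true")
               mirror_urls.append(response.read().decode('utf-8'))
           except Exception:
               # If we can't get a mirror URL, skip it. No retry.
               pass
       default_sites = [
           "https://archive.apache.org/dist", "https://dist.apache.org/repos/dist/release"]
       # Deduplicate the mirrors, then fall back to the defaults (illustrative tail).
       return list(dict.fromkeys(mirror_urls)) + default_sites
   ```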






[GitHub] [spark] ueshin commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
ueshin commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r492428187



##########
File path: python/pyspark/find_spark_home.py
##########
@@ -42,7 +42,11 @@ def is_spark_home(path):
     import_error_raised = False
     from importlib.util import find_spec
     try:
+        # Spark distribution can be downloaded when HADOOP_VERSION environment variable is set.
+        # We should look up this directory first, see also SPARK-32017.
+        spark_dist_dir = "spark-distribution"
         module_home = os.path.dirname(find_spec("pyspark").origin)
+        paths.append(os.path.join(module_home, spark_dist_dir))

Review comment:
       The function in the file `find_spark_home.py` is called from `launch_gateway`, so if users initialize Spark by themselves in their Python REPL, the path will be the module home?
   
   ```py
   $ python
   >>> from pyspark.sql import SparkSession
   >>> spark = SparkSession.builder.getOrCreate()
   ```
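   
   To make the call chain concrete, a simplified sketch of what happens in that session (module names from this PR; illustrative only):
   
   ```py
   # SparkSession.builder.getOrCreate()
   #   -> SparkContext._ensure_initialized()   # pyspark/context.py
   #   -> launch_gateway()                     # pyspark/java_gateway.py
   #   -> _find_spark_home()                   # pyspark/find_spark_home.py
   # _find_spark_home() returns SPARK_HOME from the environment if set;
   # otherwise it walks the candidate paths, which with this PR include
   # pyspark/spark-distribution.
   ```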






[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-689972908


   **[Test build #128484 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128484/testReport)** for PR 29703 at commit [`3da776e`](https://github.com/apache/spark/commit/3da776e48d131e75360599fc26b447d01386a351).







[GitHub] [spark] AmplabJenkins removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-690036275








[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-693948094


   **[Test build #128789 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128789/testReport)** for PR 29703 at commit [`033a33e`](https://github.com/apache/spark/commit/033a33ee515b95342e8c5a74e63054d915661579).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-696469772








[GitHub] [spark] HyukjinKwon edited a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-690820353


   > 1. How does this interact with the ability to specify dependencies in a requirements file? It seems odd to have to do something like HADOOP_VERSION=3.2 pip install -r requirements.txt, because, kinda like with --install-option, we've now modified pip's behavior across all the libraries it's going to install.
   >
   >    I also wonder if this plays well with tools like pip-tools that compile down requirements files into the full list of their transitive dependencies. I'm guessing users will need to manually preserve the environment variables, because they will not be reflected in the compiled requirements.
   
   I agree that it doesn't look very pip-friendly. That's why I had to investigate a lot and write down what I checked in the PR description.
   
   `--install-option` is supported via `requirements.txt`, so once pip provides a proper way to configure this, we will switch to it (at SPARK-32837). We can't use this option for now due to https://github.com/pypa/pip/issues/1883 (see also https://github.com/pypa/pip/issues/5771). There seem to be no other viable ways given my investigation.
   
   We can keep this as an experimental mode for the time being, and switch to the proper pip installation option once pip supports it.
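   
   For illustration, a rough sketch of the mechanism (not the PR's exact code; the hook would be wired in via `cmdclass={"install": ...}` in `setup.py`):
   
   ```py
   import os
   from setuptools.command.install import install
   
   class InstallCommand(install):
       def run(self):
           install.run(self)
           # Illustrative only: pick up the env vars described above and, when
           # set, fetch the matching Spark distribution next to the package.
           hadoop_version = os.environ.get("HADOOP_VERSION")
           hive_version = os.environ.get("HIVE_VERSION")
           if hadoop_version or hive_version:
               print("Would download Spark for Hadoop %s / Hive %s"
                     % (hadoop_version or "(default)", hive_version or "(default)"))
   ```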
   
   > 2. Have you considered publishing these alternate builds under different package names? e.g. pyspark-hadoop3.2. This avoids the need to mess with environment variables, and delivers a more vanilla install experience. But it will also push us to define upfront what combinations to publish builds for to PyPI.
   
   I have thought about this option too, but:
   - I think we'll end up having multiple packages, one per profile we support.
   - I still think using pip's native configuration is the ideal way. By using environment variables, we can easily switch to pip's option in the future.
   - Minor, but it will be difficult to track the usage (https://pypistats.org/packages/pyspark).
   
   > 3. Are you sure it's OK to point at archive.apache.org? Everyone installing a non-current version of PySpark with alternate versions of Hadoop / Hive specified will hit the archive. Unlike PyPI, the Apache archive is not backed by a generous CDN:
   >
   >     Do note that a daily limit of 5GB per IP is being enforced on archive.apache.org, to prevent abuse.
   >
   >     In Flintrock, I never touch the archive out of fear of being an "abusive user". This is another argument for publishing alternate packages to PyPI.
   
   Yeah, I understand this is a valid concern. But the archive is already available and people already use it. It's also used in our own CI:
   
   https://github.com/apache/spark/blob/b84ed4146d93b37adb2b83ca642c7978a1ac853e/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala#L82
   
   This PR just makes it easier to download old versions from there. We can also make the mirror configurable by exposing an environment variable.
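   
   For example, a hedged sketch of such a hook on top of `get_preferred_mirrors` (the `PYSPARK_RELEASE_MIRROR` name is hypothetical; the rest mirrors the helper added in this PR):
   
   ```py
   import os
   import urllib.request
   
   def get_preferred_mirrors():
       # Hypothetical override: let users or CI point at their own mirror or
       # cache instead of hitting archive.apache.org on every install.
       custom_mirror = os.environ.get("PYSPARK_RELEASE_MIRROR")
       if custom_mirror:
           return [custom_mirror]
       mirror_urls = []
       for _ in range(3):
           try:
               response = urllib.request.urlopen(
                   "https://www.apache.org/dyn/closer.lua?preferred=true")
               mirror_urls.append(response.read().decode('utf-8'))
           except Exception:
               # If we can't get a mirror URL, skip it. No retry.
               pass
       default_sites = [
           "https://archive.apache.org/dist",
           "https://dist.apache.org/repos/dist/release"]
       return list(set(mirror_urls)) + default_sites
   ```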




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-690155745








[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-696546734








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-696547717


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/128960/
   Test FAILed.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-689973252








[GitHub] [spark] HyukjinKwon commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-697052285


   Thanks @viirya, @ueshin, @nchammas, @srowen and @dongjoon-hyun for reviewing this.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-696470026








[GitHub] [spark] SparkQA removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-695502727


   **[Test build #128903 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128903/testReport)** for PR 29703 at commit [`058e61a`](https://github.com/apache/spark/commit/058e61ae143a3619fc13fb0c316044a75f7bc15f).




[GitHub] [spark] AmplabJenkins commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-690039716








[GitHub] [spark] viirya commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r486078357



##########
File path: python/setup.py
##########
@@ -16,14 +16,19 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+import importlib.util
 import glob
 import os
 import sys
 from setuptools import setup
+from setuptools.command.install import install
 from shutil import copyfile, copytree, rmtree
 
 try:
     exec(open('pyspark/version.py').read())
+    spec = importlib.util.spec_from_file_location("install", "pyspark/install.py")
+    install_module = importlib.util.module_from_spec(spec)
+    spec.loader.exec_module(install_module)
 except IOError:
     print("Failed to load PySpark version file for packaging. You must be in Spark's python dir.",

Review comment:
       When we do packaging, do we also need to exec `install_module`? Or do we need to update this message?






[GitHub] [spark] HyukjinKwon commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-690035463


   retest this please




[GitHub] [spark] AmplabJenkins commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-696502086








[GitHub] [spark] nchammas commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
nchammas commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r486712211



##########
File path: python/pyspark/install.py
##########
@@ -0,0 +1,170 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import os
+import re
+import tarfile
+import traceback
+import urllib.request
+from shutil import rmtree
+# NOTE that we shouldn't import pyspark here because this is used in
+# setup.py, and assume there's no PySpark imported.
+
+DEFAULT_HADOOP = "hadoop3.2"
+DEFAULT_HIVE = "hive2.3"
+SUPPORTED_HADOOP_VERSIONS = ["hadoop2.7", "hadoop3.2", "without-hadoop"]
+SUPPORTED_HIVE_VERSIONS = ["hive1.2", "hive2.3"]
+UNSUPPORTED_COMBINATIONS = [
+    ("without-hadoop", "hive1.2"),
+    ("hadoop3.2", "hive1.2"),
+]
+
+
+def checked_package_name(spark_version, hadoop_version, hive_version):
+    if hive_version == "hive1.2":
+        return "%s-bin-%s-%s" % (spark_version, hadoop_version, hive_version)
+    else:
+        return "%s-bin-%s" % (spark_version, hadoop_version)
+
+
+def checked_versions(spark_version, hadoop_version, hive_version):
+    """
+    Check the valid combinations of supported versions in Spark distributions.
+
+    :param spark_version: Spark version. It should be X.X.X such as '3.0.0' or spark-3.0.0.
+    :param hadoop_version: Hadoop version. It should be X.X such as '2.7' or 'hadoop2.7'.
+        'without' and 'without-hadoop' are supported as special keywords for Hadoop free
+        distribution.
+    :param hive_version: Hive version. It should be X.X such as '1.2' or 'hive1.2'.
+
+    :return: fully-qualified versions of Spark, Hadoop and Hive in a tuple,
+        for example: spark-3.0.0, hadoop3.2 and hive2.3.
+    """
+    if re.match("^[0-9]+\\.[0-9]+\\.[0-9]+$", spark_version):
+        spark_version = "spark-%s" % spark_version
+    if not spark_version.startswith("spark-"):
+        raise RuntimeError(
+            "Spark version should start with 'spark-' prefix; however, "
+            "got %s" % spark_version)
+
+    if hadoop_version == "without":
+        hadoop_version = "without-hadoop"
+    elif re.match("^[0-9]+\\.[0-9]+$", hadoop_version):
+        hadoop_version = "hadoop%s" % hadoop_version
+
+    if hadoop_version not in SUPPORTED_HADOOP_VERSIONS:
+        raise RuntimeError(
+            "Spark distribution of %s is not supported. Hadoop version should be "
+            "one of [%s]" % (hadoop_version, ", ".join(
+                SUPPORTED_HADOOP_VERSIONS)))
+
+    if re.match("^[0-9]+\\.[0-9]+$", hive_version):
+        hive_version = "hive%s" % hive_version
+
+    if hive_version not in SUPPORTED_HIVE_VERSIONS:
+        raise RuntimeError(
+            "Spark distribution of %s is not supported. Hive version should be "
+            "one of [%s]" % (hive_version, ", ".join(
+                SUPPORTED_HIVE_VERSIONS)))
+
+    if (hadoop_version, hive_version) in UNSUPPORTED_COMBINATIONS:
+        raise RuntimeError("Hive 1.2 should only be with Hadoop 2.7.")
+
+    return spark_version, hadoop_version, hive_version
+
+
+def install_spark(dest, spark_version, hadoop_version, hive_version):
+    """
+    Installs the Spark distribution that corresponds to the given Hadoop and
+    Hive versions under the given directory.
+
+    :param dest: The location to download and install Spark into.
+    :param spark_version: Spark version. It should be spark-X.X.X form.
+    :param hadoop_version: Hadoop version. It should be hadoopX.X
+        such as 'hadoop2.7' or 'without-hadoop'.
+    :param hive_version: Hive version. It should be hiveX.X such as 'hive1.2'.
+    """
+
+    package_name = checked_package_name(spark_version, hadoop_version, hive_version)
+    package_local_path = os.path.join(dest, "%s.tgz" % package_name)
+    sites = get_preferred_mirrors()
+    print("Trying to download Spark %s from [%s]" % (spark_version, ", ".join(sites)))
+
+    pretty_pkg_name = "%s for Hadoop %s" % (
+        spark_version,
+        "Free build" if hadoop_version == "without" else hadoop_version)
+
+    for site in sites:
+        os.makedirs(dest, exist_ok=True)
+        url = "%s/spark/%s/%s.tgz" % (site, spark_version, package_name)
+
+        tar = None
+        try:
+            print("Downloading %s from:\n- %s" % (pretty_pkg_name, url))
+            download_to_file(urllib.request.urlopen(url), package_local_path)
+
+            print("Installing to %s" % dest)
+            tar = tarfile.open(package_local_path, "r:gz")
+            for member in tar.getmembers():
+                if member.name == package_name:
+                    # Skip the root directory.
+                    continue
+                member.name = os.path.relpath(member.name, package_name + os.path.sep)
+                tar.extract(member, dest)
+            return
+        except Exception:
+            print("Failed to download %s from %s:" % (pretty_pkg_name, url))
+            traceback.print_exc()
+            rmtree(dest, ignore_errors=True)
+        finally:
+            if tar is not None:
+                tar.close()
+            if os.path.exists(package_local_path):
+                os.remove(package_local_path)
+    raise IOError("Unable to download %s." % pretty_pkg_name)
+
+
+def get_preferred_mirrors():
+    mirror_urls = []
+    for _ in range(3):
+        try:
+            response = urllib.request.urlopen(
+                "https://www.apache.org/dyn/closer.lua?preferred=true")
+            mirror_urls.append(response.read().decode('utf-8'))
+        except Exception:
+            # If we can't get a mirror URL, skip it. No retry.
+            pass
+
+    default_sites = [
+        "https://archive.apache.org/dist", "https://dist.apache.org/repos/dist/release"]

Review comment:
       All non-current versions of Spark will hit the archive, since the mirrors only maintain the latest version. I don't think the archive will be able to handle the volume of traffic that will eventually come its way from various people downloading (and re-downloading) Spark, e.g. as part of CI setup.
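   
   For context, a usage sketch of the `checked_versions` helper shown above, based only on the code in this diff (assuming `from pyspark.install import checked_versions`):
   
   ```py
   from pyspark.install import checked_versions
   
   # Versions may be given bare or fully qualified; both normalize the same way.
   checked_versions("3.0.1", "3.2", "2.3")
   # -> ('spark-3.0.1', 'hadoop3.2', 'hive2.3')
   
   checked_versions("spark-3.0.1", "without", "2.3")
   # -> ('spark-3.0.1', 'without-hadoop', 'hive2.3')
   
   # Unsupported combinations raise RuntimeError, e.g. Hive 1.2 with Hadoop 3.2.
   checked_versions("3.0.1", "3.2", "1.2")
   ```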






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-694008876


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/128793/
   Test FAILed.




[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-696501892


   **[Test build #128960 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128960/testReport)** for PR 29703 at commit [`1e3507f`](https://github.com/apache/spark/commit/1e3507fcfc3a90ab9e32e179e271ae0f56b3bcea).




[GitHub] [spark] ueshin commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

Posted by GitBox <gi...@apache.org>.
ueshin commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r492417508



##########
File path: python/pyspark/find_spark_home.py
##########
@@ -42,7 +42,11 @@ def is_spark_home(path):
     import_error_raised = False
     from importlib.util import find_spec
     try:
+        # Spark distribution can be downloaded when HADOOP_VERSION environment variable is set.
+        # We should look up this directory first, see also SPARK-32017.
+        spark_dist_dir = "spark-distribution"
         module_home = os.path.dirname(find_spec("pyspark").origin)
+        paths.append(os.path.join(module_home, spark_dist_dir))

Review comment:
       We should insert this entry before `os.path.dirname(os.path.realpath(__file__))`, which is the same as the module home if pip is used for the installation; otherwise, the module home will be the spark home and the distribution under `spark-distribution` is not used.
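   
   A minimal sketch of the ordering being suggested (a hypothetical simplification of the candidate list in `find_spark_home.py`):
   
   ```py
   import os
   from importlib.util import find_spec
   
   module_home = os.path.dirname(find_spec("pyspark").origin)
   
   # Candidate Spark homes, checked in order. The downloaded distribution has
   # to come before the script's own directory; otherwise the module home
   # matches first and the files under spark-distribution are never used.
   paths = [
       "../",                                            # dev checkout layout
       os.path.join(module_home, "spark-distribution"),  # pip-downloaded dist
       os.path.dirname(os.path.realpath(__file__)),      # module home under pip
   ]
   ```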

##########
File path: python/pyspark/find_spark_home.py
##########
@@ -42,7 +42,11 @@ def is_spark_home(path):
     import_error_raised = False
     from importlib.util import find_spec
     try:
+        # Spark distribution can be downloaded when HADOOP_VERSION environment variable is set.
+        # We should look up this directory first, see also SPARK-32017.
+        spark_dist_dir = "spark-distribution"
         module_home = os.path.dirname(find_spec("pyspark").origin)
+        paths.append(os.path.join(module_home, spark_dist_dir))

Review comment:
       The function in the file `find_spark_home.py` is called from `launch_gateway`, so if users initialize Spark by themselves in their Python REPL, the path will be the module home?
   
   ```py
   $ python
   >>> from pyspark.sql import SparkSession
   >>> spark = SparkSession.builder.getOrCreate()
   ```



