Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/10/15 17:30:28 UTC

[GitHub] [spark] dongjoon-hyun opened a new pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

dongjoon-hyun opened a new pull request #30059:
URL: https://github.com/apache/spark/pull/30059


   
   ### What changes were proposed in this pull request?
   
   
   ### Why are the changes needed?
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   ### How was this patch tested?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709634856


   BTW, you don't need to put everything into `Dockerfile`.
   Let's say it's the same as the current `ubuntu:20.04`. As you described, contributors don't need to know about the `pre-built image`. Instead, they can add new commands to the GitHub workflow file only, and those will be installed on the fly. We can then move them into the `Dockerfile` at every release preparation. What do you think about this?




[GitHub] [spark] AmplabJenkins commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709865453








[GitHub] [spark] HyukjinKwon commented on a change in pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #30059:
URL: https://github.com/apache/spark/pull/30059#discussion_r505945707



##########
File path: .github/workflows/build_and_test.yml
##########
@@ -201,6 +173,95 @@ jobs:
         name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}-${{ matrix.hadoop }}-${{ matrix.hive }}
         path: "**/target/unit-tests.log"
 
+  pyspark:
+    name: "Build modules: ${{ matrix.modules }}"
+    runs-on: ubuntu-20.04
+    container:
+      image: dongjoon/apache-spark-github-action-image:20201015
+    strategy:
+      fail-fast: false
+      matrix:
+        modules:
+          - >-
+            pyspark-sql, pyspark-mllib, pyspark-resource
+          - >-
+            pyspark-core, pyspark-streaming, pyspark-ml
+    env:
+      MODULES_TO_TEST: ${{ matrix.modules }}
+      HADOOP_PROFILE: hadoop3.2
+      HIVE_PROFILE: hive2.3
+      # GitHub Actions' default miniconda to use in pip packaging test.
+      CONDA_PREFIX: /usr/share/miniconda
+      GITHUB_PREV_SHA: ${{ github.event.before }}
+      GITHUB_INPUT_BRANCH: ${{ github.event.inputs.target }}
+    steps:
+    - name: Checkout Spark repository
+      uses: actions/checkout@v2
+      # In order to fetch changed files
+      with:
+        fetch-depth: 0
+    - name: Merge dispatched input branch
+      if: ${{ github.event.inputs.target != '' }}
+      run: git merge --progress --ff-only origin/${{ github.event.inputs.target }}
+    # Cache local repositories. Note that GitHub Actions cache has a 2G limit.
+    - name: Cache Scala, SBT, Maven and Zinc
+      uses: actions/cache@v2
+      with:
+        path: |
+          build/apache-maven-*
+          build/zinc-*
+          build/scala-*
+          build/*.jar
+        key: build-${{ hashFiles('**/pom.xml', 'project/build.properties', 'build/mvn', 'build/sbt', 'build/sbt-launch-lib.bash', 'build/spark-build-info') }}
+        restore-keys: |
+          build-
+    - name: Cache Maven local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.m2/repository
+        key: pyspark-maven-${{ hashFiles('**/pom.xml') }}
+        restore-keys: |
+          pyspark-maven-
+    - name: Cache Ivy local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.ivy2/cache
+        key: pyspark-ivy-${{ hashFiles('**/pom.xml', '**/plugins.sbt') }}
+        restore-keys: |
+          pyspark-ivy-
+    - name: Install Python 3.6
+      uses: actions/setup-python@v2
+      if: contains(matrix.modules, 'pyspark')
+      with:

Review comment:
       Nit, but I think we can add this condition when we actually add combinations with other modules. We don't know yet whether we'll fix this job or fold it into the main job above, and the current job is already named `pyspark`.
   
   ```suggestion
         with:
   ```
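
The point about the condition can be checked with a small sketch. It models, in Python, the documented behaviour of the GitHub Actions `contains` expression on strings (a case-insensitive substring test). The helper `contains` and the list `PYSPARK_MATRIX` are illustrative names and are not part of the PR. Under that assumption, every matrix entry of the `pyspark` job already includes `pyspark`, so the `if:` is true for each of them.

```python
# Hypothetical model of GitHub Actions' contains(search, item) for string
# inputs: a case-insensitive substring check.
def contains(search: str, item: str) -> bool:
    return item.lower() in search.lower()


# The two matrix.modules entries of the pyspark job, as written in the diff
# above (each `>-` folded scalar becomes a single-line string).
PYSPARK_MATRIX = [
    "pyspark-sql, pyspark-mllib, pyspark-resource",
    "pyspark-core, pyspark-streaming, pyspark-ml",
]

# Evaluate `if: contains(matrix.modules, 'pyspark')` for each entry.
condition_results = [contains(modules, "pyspark") for modules in PYSPARK_MATRIX]

if __name__ == "__main__":
    print(condition_results)
```

Since the condition never filters anything in this job as it stands, dropping it does not change which steps run.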

##########
File path: .github/workflows/build_and_test.yml
##########
@@ -201,6 +173,95 @@ jobs:
         name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}-${{ matrix.hadoop }}-${{ matrix.hive }}
         path: "**/target/unit-tests.log"
 
+  pyspark:
+    name: "Build modules: ${{ matrix.modules }}"
+    runs-on: ubuntu-20.04
+    container:
+      image: dongjoon/apache-spark-github-action-image:20201015
+    strategy:
+      fail-fast: false
+      matrix:
+        modules:
+          - >-
+            pyspark-sql, pyspark-mllib, pyspark-resource
+          - >-
+            pyspark-core, pyspark-streaming, pyspark-ml
+    env:
+      MODULES_TO_TEST: ${{ matrix.modules }}
+      HADOOP_PROFILE: hadoop3.2
+      HIVE_PROFILE: hive2.3
+      # GitHub Actions' default miniconda to use in pip packaging test.
+      CONDA_PREFIX: /usr/share/miniconda
+      GITHUB_PREV_SHA: ${{ github.event.before }}
+      GITHUB_INPUT_BRANCH: ${{ github.event.inputs.target }}
+    steps:
+    - name: Checkout Spark repository
+      uses: actions/checkout@v2
+      # In order to fetch changed files
+      with:
+        fetch-depth: 0
+    - name: Merge dispatched input branch
+      if: ${{ github.event.inputs.target != '' }}
+      run: git merge --progress --ff-only origin/${{ github.event.inputs.target }}
+    # Cache local repositories. Note that GitHub Actions cache has a 2G limit.
+    - name: Cache Scala, SBT, Maven and Zinc
+      uses: actions/cache@v2
+      with:
+        path: |
+          build/apache-maven-*
+          build/zinc-*
+          build/scala-*
+          build/*.jar
+        key: build-${{ hashFiles('**/pom.xml', 'project/build.properties', 'build/mvn', 'build/sbt', 'build/sbt-launch-lib.bash', 'build/spark-build-info') }}
+        restore-keys: |
+          build-
+    - name: Cache Maven local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.m2/repository
+        key: pyspark-maven-${{ hashFiles('**/pom.xml') }}
+        restore-keys: |
+          pyspark-maven-
+    - name: Cache Ivy local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.ivy2/cache
+        key: pyspark-ivy-${{ hashFiles('**/pom.xml', '**/plugins.sbt') }}
+        restore-keys: |
+          pyspark-ivy-
+    - name: Install Python 3.6
+      uses: actions/setup-python@v2
+      if: contains(matrix.modules, 'pyspark')
+      with:
+        python-version: 3.6
+        architecture: x64
+    # This step takes much less time (~30s) than other Python versions so it is not included
+    # in the Docker image being used. There is also a technical issue to install Python 3.6 on
+    # Ubuntu 20.04. See also SPARK-33162.
+    - name: Install Python packages (Python 3.6)
+      if: contains(matrix.modules, 'pyspark')
+      run: |

Review comment:
       ```suggestion
         run: |
   ```






[GitHub] [spark] dongjoon-hyun commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709668136


   Ah.. GitHub Suggestion is recording you as the committer instead of the author.. Weird..




[GitHub] [spark] dongjoon-hyun edited a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709562898


   @holdenk . 
   - This is a migration of the as-is `GitHub Action` jobs. We don't test `Python 3.7` at all.
   - I already tried `USER`, but it's non-trivial work in GitHub Actions. We are still using `GitHub Actions` like `actions/checkout@v2`, which assume `root`. If you search for that, there are multiple threads about it.




[GitHub] [spark] dongjoon-hyun commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709634856


   BTW, you don't need to put everything into `Dockerfile`.
   Let's say it's the same as the one we are currently using, `ubuntu:20.04`. As you described, contributors don't need to know about the `pre-built image`. Instead, they can add new commands to the GitHub workflow file only, and those will be installed on the fly. We can then move them into the `Dockerfile` at every release preparation. What do you think about this?




[GitHub] [spark] dongjoon-hyun edited a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709555763


   Ya. In addition to that, I want to move in the following direction in the near future.
   1. Apply the official Apache dockerhub location for Apache Spark
   2. Make a new branch inside Apache Spark repo and move `Dockerfile` and `Github Action build file` by moving `https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage`'s main branch to that Apache Spark branch.
   
   Currently, `GitHub Action` runs the job as `root`, so some Scala UTs fail because they assume a `non-root` user. Also, one of the `sparkr` UTs fails because of the following. I'm trying to move forward one by one.
   ```
   root@b2aa77a38c8c:/# Rscript -e "Sys.timezone()"
   System has not been booted with systemd as init system (PID 1). Can't operate.
   Failed to create bus connection: Host is down
   ```




[GitHub] [spark] viirya commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709631983


   Because currently you can easily change the workflow yml files and see the test result immediately. By switching to a pre-built image, we might need to change the Dockerfile first, publish the updated image, and then verify that the updated image works as expected with the GitHub Actions workflow.
   
   So when contributors want to change something in the workflow, they might need to change the Dockerfile, update the image at their own Docker Hub location, and then use their custom image in the workflow file.
   
   This should be a minor concern because I believe we won't change the workflow frequently, but I just want to raise it explicitly.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709865465


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129864/
   Test FAILed.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709595064








[GitHub] [spark] dongjoon-hyun commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709624882


   Ur, why do you think so? Our plan is [here](url).
   
   1. If you are talking about the `repo` or `branch`: the project contributors can change the PySpark job inside the Apache Spark repo in the future. It costs the same as changing `branch-2.4` or `branch-3.0`, and it's easier than changing `spark-website`. 
   > As the pre-built image cannot be easily changed like modifying workflow yml file, the project contributors might be a bit harder to change Github Action PySpark job.
   
   2. If you are talking about the `Dockerfile`, we already have 7 `Dockerfile`s which are open to contributors. They consist of easy Linux commands.
   ```
   $ find . -name Dockerfile
   ./resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
   ./resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile
   ./resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile
   ./dev/create-release/spark-rm/Dockerfile
   ./external/docker/spark-test/master/Dockerfile
   ./external/docker/spark-test/worker/Dockerfile
   ./external/docker/spark-test/base/Dockerfile
   ```
   
   BTW, I want to focus on the improvement we can get. As you know, I already tried various approaches in the following PR, but nothing is better than this. This PR reduces the time from `3 hours 14 minutes` to `2 hours 23 minutes`.
   - https://github.com/apache/spark/pull/30012
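
To put the reported numbers in perspective, the arithmetic is sketched below in Python. Only the two durations come from the comment above; the variable names are illustrative.

```python
from datetime import timedelta

# Durations quoted in the comment above.
before = timedelta(hours=3, minutes=14)
after = timedelta(hours=2, minutes=23)

# Absolute and relative saving.
saved = before - after
saved_minutes = saved.total_seconds() / 60
saved_percent = 100 * (saved / before)  # timedelta / timedelta yields a float

if __name__ == "__main__":
    print(f"saved {saved_minutes:.0f} minutes ({saved_percent:.1f}%)")
```

That is a saving of 51 minutes, or roughly a quarter of the original duration.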




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #30059:
URL: https://github.com/apache/spark/pull/30059#discussion_r505916958



##########
File path: .github/workflows/build_and_test.yml
##########
@@ -201,6 +173,92 @@ jobs:
         name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}-${{ matrix.hadoop }}-${{ matrix.hive }}
         path: "**/target/unit-tests.log"
 
+  pyspark:
+    name: "Build modules: ${{ matrix.modules }}"
+    runs-on: ubuntu-20.04
+    container:
+      image: dongjoon/apache-spark-github-action-image:20201015

Review comment:
       Yes. It's required. Many people complain about that. :)
   - https://github.community/t/confused-with-runs-on-and-container-options/16258






[GitHub] [spark] SparkQA commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709595045


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34460/
   




[GitHub] [spark] dongjoon-hyun edited a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709562898


   @holdenk . 
   - This is a migration of the as-is `GitHub Action` jobs. We don't test `Python 3.7` at all.
   - I already tried `USER`, but it's non-trivial work in GitHub Actions. We are still using `GitHub Actions` like `actions/checkout@v2`, which assume `root`. If you search for that, there are multiple threads about it.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709682603


   Merged build finished. Test FAILed.




[GitHub] [spark] dongjoon-hyun commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709662984


   Thanks, @HyukjinKwon . I applied your suggestion.




[GitHub] [spark] AmplabJenkins commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709562886








[GitHub] [spark] viirya commented on a change in pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #30059:
URL: https://github.com/apache/spark/pull/30059#discussion_r505880625



##########
File path: .github/workflows/build_and_test.yml
##########
@@ -128,41 +124,17 @@ jobs:
       uses: actions/setup-java@v1
       with:
         java-version: ${{ matrix.java }}
-    # PySpark
-    - name: Install PyPy3
-      # Note that order of Python installations here matters because default python3 is
-      # overridden by pypy3.
-      uses: actions/setup-python@v2
-      if: contains(matrix.modules, 'pyspark')
-      with:
-        python-version: pypy3
-        architecture: x64
-    - name: Install Python 3.6
-      uses: actions/setup-python@v2
-      if: contains(matrix.modules, 'pyspark')
-      with:
-        python-version: 3.6
-        architecture: x64
     - name: Install Python 3.8
       uses: actions/setup-python@v2
       # We should install one Python that is higher then 3+ for SQL and Yarn because:
       # - SQL component also has Python related tests, for example, IntegratedUDFTestUtils.
       # - Yarn has a Python specific test too, for example, YarnClusterSuite.
-      if: contains(matrix.modules, 'yarn') || contains(matrix.modules, 'pyspark') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
+      if: contains(matrix.modules, 'yarn') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
       with:
         python-version: 3.8
         architecture: x64
-    - name: Install Python packages (Python 3.6 and PyPy3)
-      if: contains(matrix.modules, 'pyspark')
-      # PyArrow is not supported in PyPy yet, see ARROW-2651.
-      run: |
-        python3.6 -m pip install numpy pyarrow pandas scipy xmlrunner
-        python3.6 -m pip list
-        # PyPy does not have xmlrunner
-        pypy3 -m pip install numpy pandas scipy
-        pypy3 -m pip list
     - name: Install Python packages (Python 3.8)
-      if: contains(matrix.modules, 'pyspark') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
+      if: (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
       run: |
         python3.8 -m pip install numpy pyarrow pandas scipy xmlrunner

Review comment:
       Are these packages still necessary for sql related tests?
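
For context, the narrowed conditions in the diff above can be modelled as below. This is a sketch in Python, assuming `contains` on strings is a case-insensitive substring test as documented for GitHub Actions expressions. The function names and the sample module strings used afterwards are illustrative only and are not taken from the job matrix.

```python
def contains(search: str, item: str) -> bool:
    """Illustrative stand-in for the GitHub Actions contains() on strings."""
    return item.lower() in search.lower()


def installs_python38_packages(modules: str) -> bool:
    # Mirrors: contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-')
    return contains(modules, "sql") and not contains(modules, "sql-")


def installs_python38(modules: str) -> bool:
    # Mirrors: contains(matrix.modules, 'yarn') || (the condition above)
    return contains(modules, "yarn") or installs_python38_packages(modules)


if __name__ == "__main__":
    # Hypothetical module strings, chosen only to exercise each branch.
    for modules in ["sql", "sql-kafka-0-10", "yarn", "core"]:
        print(modules, installs_python38(modules), installs_python38_packages(modules))
```

Under this model, an entry containing `yarn` still gets Python 3.8 but skips the package installation step, while a plain `sql` entry gets both.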






[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #30059:
URL: https://github.com/apache/spark/pull/30059#discussion_r505936731



##########
File path: .github/workflows/build_and_test.yml
##########
@@ -201,6 +173,92 @@ jobs:
         name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}-${{ matrix.hadoop }}-${{ matrix.hive }}
         path: "**/target/unit-tests.log"
 
+  pyspark:
+    name: "Build modules: ${{ matrix.modules }}"
+    runs-on: ubuntu-20.04
+    container:
+      image: dongjoon/apache-spark-github-action-image:20201015

Review comment:
       That's why I refocused on this one only, since it saves the most time.
   We can revisit the others later.






[GitHub] [spark] SparkQA commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709505880


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34456/
   




[GitHub] [spark] AmplabJenkins commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709629882








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #30059:
URL: https://github.com/apache/spark/pull/30059#discussion_r505936498



##########
File path: .github/workflows/build_and_test.yml
##########
@@ -201,6 +173,92 @@ jobs:
         name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}-${{ matrix.hadoop }}-${{ matrix.hive }}
         path: "**/target/unit-tests.log"
 
+  pyspark:
+    name: "Build modules: ${{ matrix.modules }}"
+    runs-on: ubuntu-20.04
+    container:
+      image: dongjoon/apache-spark-github-action-image:20201015

Review comment:
       @HyukjinKwon, I already tried that. Please see https://github.com/apache/spark/pull/30059#issuecomment-709555763 .






[GitHub] [spark] HyukjinKwon commented on a change in pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #30059:
URL: https://github.com/apache/spark/pull/30059#discussion_r505941601



##########
File path: .github/workflows/build_and_test.yml
##########
@@ -201,6 +173,92 @@ jobs:
         name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}-${{ matrix.hadoop }}-${{ matrix.hive }}
         path: "**/target/unit-tests.log"
 
+  pyspark:
+    name: "Build modules: ${{ matrix.modules }}"
+    runs-on: ubuntu-20.04
+    container:
+      image: dongjoon/apache-spark-github-action-image:20201015
+    strategy:
+      fail-fast: false
+      matrix:
+        modules:
+          - >-
+            pyspark-sql, pyspark-mllib, pyspark-resource
+          - >-
+            pyspark-core, pyspark-streaming, pyspark-ml
+    env:
+      MODULES_TO_TEST: ${{ matrix.modules }}
+      HADOOP_PROFILE: hadoop3.2
+      HIVE_PROFILE: hive2.3
+      # GitHub Actions' default miniconda to use in pip packaging test.
+      CONDA_PREFIX: /usr/share/miniconda
+      GITHUB_PREV_SHA: ${{ github.event.before }}
+      GITHUB_INPUT_BRANCH: ${{ github.event.inputs.target }}
+    steps:
+    - name: Checkout Spark repository
+      uses: actions/checkout@v2
+      # In order to fetch changed files
+      with:
+        fetch-depth: 0
+    - name: Merge dispatched input branch
+      if: ${{ github.event.inputs.target != '' }}
+      run: git merge --progress --ff-only origin/${{ github.event.inputs.target }}
+    # Cache local repositories. Note that GitHub Actions cache has a 2G limit.
+    - name: Cache Scala, SBT, Maven and Zinc
+      uses: actions/cache@v2
+      with:
+        path: |
+          build/apache-maven-*
+          build/zinc-*
+          build/scala-*
+          build/*.jar
+        key: build-${{ hashFiles('**/pom.xml', 'project/build.properties', 'build/mvn', 'build/sbt', 'build/sbt-launch-lib.bash', 'build/spark-build-info') }}
+        restore-keys: |
+          build-
+    - name: Cache Maven local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.m2/repository
+        key: pyspark-maven-${{ hashFiles('**/pom.xml') }}
+        restore-keys: |
+          pyspark-maven-
+    - name: Cache Ivy local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.ivy2/cache
+        key: pyspark-ivy-${{ hashFiles('**/pom.xml', '**/plugins.sbt') }}
+        restore-keys: |
+          pyspark-ivy-
+    - name: Install Python 3.6
+      uses: actions/setup-python@v2
+      if: contains(matrix.modules, 'pyspark')
+      with:
+        python-version: 3.6
+        architecture: x64
+    - name: Install Python packages (Python 3.6)

Review comment:
       ```suggestion
       # This step takes much less time (~30s) than other Python versions so it is not included
       # in the Docker image being used. There is also a technical issue with installing Python 3.6
       # on Ubuntu 20.04. See also SPARK-33162.
       - name: Install Python packages (Python 3.6)
   ```






[GitHub] [spark] AmplabJenkins commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709682603








[GitHub] [spark] dongjoon-hyun commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709667935


   Hmm, it's weird. The co-authorship seems to be messed up.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709520735








[GitHub] [spark] SparkQA commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709664186


   **[Test build #129864 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129864/testReport)** for PR 30059 at commit [`f39ac87`](https://github.com/apache/spark/commit/f39ac871fc38e8ec8c02b7f6661748e2c7d431e9).




[GitHub] [spark] dongjoon-hyun edited a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709634856


   BTW, you don't need to put everything into the `Dockerfile`.
   Let's say the image is treated the same as the current `ubuntu:20.04`. As you described, contributors don't need to know about the pre-built image. Instead, they can add new commands to the GitHub workflow file only, and those will be installed on the fly during GitHub Actions execution. Then we can move them into the `Dockerfile` at every release preparation. What do you think?
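
   As a rough sketch of that idea (the step name and the `some-new-package` placeholder are hypothetical and not part of this PR, and it assumes `python3.8` is available in the pre-built image), a contributor could add a step like this to the `pyspark` job in the workflow file:

   ```yaml
   # Hypothetical step: install an extra dependency on the fly inside the
   # container, without rebuilding the pre-built image.
   - name: Install additional Python packages (not yet in the image)
     run: |
       python3.8 -m pip install some-new-package
       python3.8 -m pip list
   ```

   Steps like this could later be folded into the `Dockerfile` and published under a new image tag.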




[GitHub] [spark] HyukjinKwon commented on a change in pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #30059:
URL: https://github.com/apache/spark/pull/30059#discussion_r505935473



##########
File path: .github/workflows/build_and_test.yml
##########
@@ -201,6 +173,92 @@ jobs:
         name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}-${{ matrix.hadoop }}-${{ matrix.hive }}
         path: "**/target/unit-tests.log"
 
+  pyspark:
+    name: "Build modules: ${{ matrix.modules }}"
+    runs-on: ubuntu-20.04
+    container:
+      image: dongjoon/apache-spark-github-action-image:20201015

Review comment:
       @dongjoon-hyun, quick question: what if we just set this image for the main job above (meaning we use it for all tests and builds) and remove this new job here? I think that would make it easier to move to using the Docker image for all test cases.
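
       To illustrate the alternative being asked about, here is a minimal sketch only. The job id `build` is an assumption (the main job is not shown in this diff), and the remaining steps are elided:

       ```yaml
       jobs:
         # Assumed id of the existing main job; not visible in this diff.
         build:
           runs-on: ubuntu-20.04
           # Every step of the job would run inside the pre-built image,
           # so no separate `pyspark` job would be needed.
           container:
             image: dongjoon/apache-spark-github-action-image:20201015
           steps:
           - name: Checkout Spark repository
             uses: actions/checkout@v2
       ```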






[GitHub] [spark] HyukjinKwon commented on a change in pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #30059:
URL: https://github.com/apache/spark/pull/30059#discussion_r505940317



##########
File path: .github/workflows/build_and_test.yml
##########
@@ -201,6 +173,92 @@ jobs:
         name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}-${{ matrix.hadoop }}-${{ matrix.hive }}
         path: "**/target/unit-tests.log"
 
+  pyspark:
+    name: "Build modules: ${{ matrix.modules }}"
+    runs-on: ubuntu-20.04
+    container:
+      image: dongjoon/apache-spark-github-action-image:20201015

Review comment:
       👌 






[GitHub] [spark] SparkQA commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709481869


   **[Test build #129850 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129850/testReport)** for PR 30059 at commit [`7c7a748`](https://github.com/apache/spark/commit/7c7a748e0444c8f2a20ba53e2416995d40da1945).




[GitHub] [spark] dongjoon-hyun commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709635817


   This PR doesn't block users from using `.github/workflows/build_and_test.yml` directly. The main benefit is simply better performance from removing repeated and flaky installation tasks.




[GitHub] [spark] HyukjinKwon commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709668029


   That's no problem at all :-)




[GitHub] [spark] dongjoon-hyun commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709634063


   Got it. I agree with you on those points.




[GitHub] [spark] dongjoon-hyun commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709562898


   @holdenk,
   - This is a migration of the AS-IS GitHub Actions jobs. We don't test `Python 3.7` at all.
   - I already tried `USER`, but it's non-trivial work in GitHub Actions. We are still using actions like `actions/checkout@v2`, which assume `root`. If you search for that, there are multiple threads about it.
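
   For context, a sketch of one commonly used workaround when an image does declare a non-root `USER` (this is an illustration only; this PR does not do it, since the image here simply runs as `root`):

   ```yaml
   pyspark:
     runs-on: ubuntu-20.04
     container:
       image: dongjoon/apache-spark-github-action-image:20201015
       # `options` is passed through to `docker create`. Forcing root here is
       # an assumption for illustration, so that actions such as
       # actions/checkout@v2 can write to the workspace.
       options: --user root
   ```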




[GitHub] [spark] HyukjinKwon commented on a change in pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #30059:
URL: https://github.com/apache/spark/pull/30059#discussion_r505939807



##########
File path: .github/workflows/build_and_test.yml
##########
@@ -201,6 +173,92 @@ jobs:
         name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}-${{ matrix.hadoop }}-${{ matrix.hive }}
         path: "**/target/unit-tests.log"
 
+  pyspark:
+    name: "Build modules: ${{ matrix.modules }}"
+    runs-on: ubuntu-20.04
+    container:
+      image: dongjoon/apache-spark-github-action-image:20201015
+    strategy:
+      fail-fast: false
+      matrix:
+        modules:
+          - >-
+            pyspark-sql, pyspark-mllib, pyspark-resource
+          - >-
+            pyspark-core, pyspark-streaming, pyspark-ml
+    env:
+      MODULES_TO_TEST: ${{ matrix.modules }}
+      HADOOP_PROFILE: hadoop3.2
+      HIVE_PROFILE: hive2.3
+      # GitHub Actions' default miniconda to use in pip packaging test.
+      CONDA_PREFIX: /usr/share/miniconda
+      GITHUB_PREV_SHA: ${{ github.event.before }}
+      GITHUB_INPUT_BRANCH: ${{ github.event.inputs.target }}
+    steps:
+    - name: Checkout Spark repository
+      uses: actions/checkout@v2
+      # In order to fetch changed files
+      with:
+        fetch-depth: 0
+    - name: Merge dispatched input branch
+      if: ${{ github.event.inputs.target != '' }}
+      run: git merge --progress --ff-only origin/${{ github.event.inputs.target }}
+    # Cache local repositories. Note that GitHub Actions cache has a 2G limit.
+    - name: Cache Scala, SBT, Maven and Zinc
+      uses: actions/cache@v2
+      with:
+        path: |
+          build/apache-maven-*
+          build/zinc-*
+          build/scala-*
+          build/*.jar
+        key: build-${{ hashFiles('**/pom.xml', 'project/build.properties', 'build/mvn', 'build/sbt', 'build/sbt-launch-lib.bash', 'build/spark-build-info') }}
+        restore-keys: |
+          build-
+    - name: Cache Maven local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.m2/repository
+        key: pyspark-maven-${{ hashFiles('**/pom.xml') }}
+        restore-keys: |
+          pyspark-maven-
+    - name: Cache Ivy local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.ivy2/cache
+        key: pyspark-ivy-${{ hashFiles('**/pom.xml', '**/plugins.sbt') }}
+        restore-keys: |
+          pyspark-ivy-
+    - name: Install Python 3.6
+      uses: actions/setup-python@v2
+      if: contains(matrix.modules, 'pyspark')
+      with:
+        python-version: 3.6
+        architecture: x64
+    - name: Install Python packages (Python 3.6)

Review comment:
       Gotcha.






[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #30059:
URL: https://github.com/apache/spark/pull/30059#discussion_r505948397



##########
File path: .github/workflows/build_and_test.yml
##########
@@ -201,6 +173,95 @@ jobs:
         name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}-${{ matrix.hadoop }}-${{ matrix.hive }}
         path: "**/target/unit-tests.log"
 
+  pyspark:
+    name: "Build modules: ${{ matrix.modules }}"
+    runs-on: ubuntu-20.04
+    container:
+      image: dongjoon/apache-spark-github-action-image:20201015
+    strategy:
+      fail-fast: false
+      matrix:
+        modules:
+          - >-
+            pyspark-sql, pyspark-mllib, pyspark-resource
+          - >-
+            pyspark-core, pyspark-streaming, pyspark-ml
+    env:
+      MODULES_TO_TEST: ${{ matrix.modules }}
+      HADOOP_PROFILE: hadoop3.2
+      HIVE_PROFILE: hive2.3
+      # GitHub Actions' default miniconda to use in pip packaging test.
+      CONDA_PREFIX: /usr/share/miniconda
+      GITHUB_PREV_SHA: ${{ github.event.before }}
+      GITHUB_INPUT_BRANCH: ${{ github.event.inputs.target }}
+    steps:
+    - name: Checkout Spark repository
+      uses: actions/checkout@v2
+      # In order to fetch changed files
+      with:
+        fetch-depth: 0
+    - name: Merge dispatched input branch
+      if: ${{ github.event.inputs.target != '' }}
+      run: git merge --progress --ff-only origin/${{ github.event.inputs.target }}
+    # Cache local repositories. Note that GitHub Actions cache has a 2G limit.
+    - name: Cache Scala, SBT, Maven and Zinc
+      uses: actions/cache@v2
+      with:
+        path: |
+          build/apache-maven-*
+          build/zinc-*
+          build/scala-*
+          build/*.jar
+        key: build-${{ hashFiles('**/pom.xml', 'project/build.properties', 'build/mvn', 'build/sbt', 'build/sbt-launch-lib.bash', 'build/spark-build-info') }}
+        restore-keys: |
+          build-
+    - name: Cache Maven local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.m2/repository
+        key: pyspark-maven-${{ hashFiles('**/pom.xml') }}
+        restore-keys: |
+          pyspark-maven-
+    - name: Cache Ivy local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.ivy2/cache
+        key: pyspark-ivy-${{ hashFiles('**/pom.xml', '**/plugins.sbt') }}
+        restore-keys: |
+          pyspark-ivy-
+    - name: Install Python 3.6
+      uses: actions/setup-python@v2
+      if: contains(matrix.modules, 'pyspark')
+      with:

Review comment:
       Sure. Thanks.






[GitHub] [spark] SparkQA removed a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709664186


   **[Test build #129864 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129864/testReport)** for PR 30059 at commit [`f39ac87`](https://github.com/apache/spark/commit/f39ac871fc38e8ec8c02b7f6661748e2c7d431e9).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709865453


   Merged build finished. Test FAILed.




[GitHub] [spark] SparkQA commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709675437


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34470/
   




[GitHub] [spark] viirya commented on a change in pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #30059:
URL: https://github.com/apache/spark/pull/30059#discussion_r505917941



##########
File path: .github/workflows/build_and_test.yml
##########
@@ -201,6 +173,92 @@ jobs:
         name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}-${{ matrix.hadoop }}-${{ matrix.hive }}
         path: "**/target/unit-tests.log"
 
+  pyspark:
+    name: "Build modules: ${{ matrix.modules }}"
+    runs-on: ubuntu-20.04
+    container:
+      image: dongjoon/apache-spark-github-action-image:20201015

Review comment:
       interesting... :)






[GitHub] [spark] HyukjinKwon commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709667482


   Oh btw, make sure the author is you @dongjoon-hyun.




[GitHub] [spark] SparkQA commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709564708


   **[Test build #129854 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129854/testReport)** for PR 30059 at commit [`5ce11dc`](https://github.com/apache/spark/commit/5ce11dc1dae967494809fc2a8b0f9464f2c07dae).




[GitHub] [spark] viirya edited a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
viirya edited a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709636133


   Yes, I thought about it too. Except for the very few things that necessarily require a change to the Dockerfile, other changes can go into the workflow file first. We need to collect these changes and move them to the Dockerfile occasionally. So it is just a minor concern.




[GitHub] [spark] dongjoon-hyun commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709667407


   Thank you, @HyukjinKwon and all.
   Merged to master!




[GitHub] [spark] HyukjinKwon commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709667567


   The number of my commits will make the merge script suggest me as the main author of this PR.




[GitHub] [spark] dongjoon-hyun edited a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709635817


   This PR doesn't block users from using `.github/workflows/build_and_test.yml` directly. The main benefit is just boosting the performance by removing repeated and flaky installation tasks.




[GitHub] [spark] SparkQA commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709520711


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34456/
   




[GitHub] [spark] SparkQA commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709584156


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34460/
   




[GitHub] [spark] SparkQA removed a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709564708


   **[Test build #129854 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129854/testReport)** for PR 30059 at commit [`5ce11dc`](https://github.com/apache/spark/commit/5ce11dc1dae967494809fc2a8b0f9464f2c07dae).




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #30059:
URL: https://github.com/apache/spark/pull/30059#discussion_r505937095



##########
File path: .github/workflows/build_and_test.yml
##########
@@ -201,6 +173,92 @@ jobs:
         name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}-${{ matrix.hadoop }}-${{ matrix.hive }}
         path: "**/target/unit-tests.log"
 
+  pyspark:
+    name: "Build modules: ${{ matrix.modules }}"
+    runs-on: ubuntu-20.04
+    container:
+      image: dongjoon/apache-spark-github-action-image:20201015
+    strategy:
+      fail-fast: false
+      matrix:
+        modules:
+          - >-
+            pyspark-sql, pyspark-mllib, pyspark-resource
+          - >-
+            pyspark-core, pyspark-streaming, pyspark-ml
+    env:
+      MODULES_TO_TEST: ${{ matrix.modules }}
+      HADOOP_PROFILE: hadoop3.2
+      HIVE_PROFILE: hive2.3
+      # GitHub Actions' default miniconda to use in pip packaging test.
+      CONDA_PREFIX: /usr/share/miniconda
+      GITHUB_PREV_SHA: ${{ github.event.before }}
+      GITHUB_INPUT_BRANCH: ${{ github.event.inputs.target }}
+    steps:
+    - name: Checkout Spark repository
+      uses: actions/checkout@v2
+      # In order to fetch changed files
+      with:
+        fetch-depth: 0
+    - name: Merge dispatched input branch
+      if: ${{ github.event.inputs.target != '' }}
+      run: git merge --progress --ff-only origin/${{ github.event.inputs.target }}
+    # Cache local repositories. Note that GitHub Actions cache has a 2G limit.
+    - name: Cache Scala, SBT, Maven and Zinc
+      uses: actions/cache@v2
+      with:
+        path: |
+          build/apache-maven-*
+          build/zinc-*
+          build/scala-*
+          build/*.jar
+        key: build-${{ hashFiles('**/pom.xml', 'project/build.properties', 'build/mvn', 'build/sbt', 'build/sbt-launch-lib.bash', 'build/spark-build-info') }}
+        restore-keys: |
+          build-
+    - name: Cache Maven local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.m2/repository
+        key: pyspark-maven-${{ hashFiles('**/pom.xml') }}
+        restore-keys: |
+          pyspark-maven-
+    - name: Cache Ivy local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.ivy2/cache
+        key: pyspark-ivy-${{ hashFiles('**/pom.xml', '**/plugins.sbt') }}
+        restore-keys: |
+          pyspark-ivy-
+    - name: Install Python 3.6
+      uses: actions/setup-python@v2
+      if: contains(matrix.modules, 'pyspark')
+      with:
+        python-version: 3.6
+        architecture: x64
+    - name: Install Python packages (Python 3.6)

Review comment:
       There is a technical issue with installing `Python 3.6` on `Ubuntu 20.04`. The following is the Dockerfile for the image.
   - https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/blob/main/Dockerfile
   
   Also, GitHub Actions' `Python 3.6` installation and package installation take less than 30 seconds, so the cost is small.






[GitHub] [spark] HyukjinKwon commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709668366


   Ah, it looks like they changed their behaviour on that. No problem!






[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #30059:
URL: https://github.com/apache/spark/pull/30059#discussion_r505898511



##########
File path: .github/workflows/build_and_test.yml
##########
@@ -128,41 +124,17 @@ jobs:
       uses: actions/setup-java@v1
       with:
         java-version: ${{ matrix.java }}
-    # PySpark
-    - name: Install PyPy3
-      # Note that order of Python installations here matters because default python3 is
-      # overridden by pypy3.
-      uses: actions/setup-python@v2
-      if: contains(matrix.modules, 'pyspark')
-      with:
-        python-version: pypy3
-        architecture: x64
-    - name: Install Python 3.6
-      uses: actions/setup-python@v2
-      if: contains(matrix.modules, 'pyspark')
-      with:
-        python-version: 3.6
-        architecture: x64
     - name: Install Python 3.8
       uses: actions/setup-python@v2
       # We should install one Python that is higher then 3+ for SQL and Yarn because:
       # - SQL component also has Python related tests, for example, IntegratedUDFTestUtils.
       # - Yarn has a Python specific test too, for example, YarnClusterSuite.
-      if: contains(matrix.modules, 'yarn') || contains(matrix.modules, 'pyspark') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
+      if: contains(matrix.modules, 'yarn') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
       with:
         python-version: 3.8
         architecture: x64
-    - name: Install Python packages (Python 3.6 and PyPy3)
-      if: contains(matrix.modules, 'pyspark')
-      # PyArrow is not supported in PyPy yet, see ARROW-2651.
-      run: |
-        python3.6 -m pip install numpy pyarrow pandas scipy xmlrunner
-        python3.6 -m pip list
-        # PyPy does not have xmlrunner
-        pypy3 -m pip install numpy pandas scipy
-        pypy3 -m pip list
     - name: Install Python packages (Python 3.8)
-      if: contains(matrix.modules, 'pyspark') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
+      if: (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
       run: |
         python3.8 -m pip install numpy pyarrow pandas scipy xmlrunner

Review comment:
       Hi, @viirya 
   `Python 3.8` is not the target of this PR because it doesn't consume any time (it's zero seconds).
   
   ![Screen Shot 2020-10-15 at 3 22 48 PM](https://user-images.githubusercontent.com/9700541/96192180-55db9980-0efa-11eb-9b14-393ed7055a7e.png)
   
   In addition to that, I want to keep it here without touching it. When we switch the `Scala/Java` tests to the `pre-built image`, they will use the shared Python libraries on the image anyway.






[GitHub] [spark] dongjoon-hyun closed pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #30059:
URL: https://github.com/apache/spark/pull/30059


   




[GitHub] [spark] HyukjinKwon commented on a change in pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #30059:
URL: https://github.com/apache/spark/pull/30059#discussion_r505936205



##########
File path: .github/workflows/build_and_test.yml
##########
@@ -201,6 +173,92 @@ jobs:
         name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}-${{ matrix.hadoop }}-${{ matrix.hive }}
         path: "**/target/unit-tests.log"
 
+  pyspark:
+    name: "Build modules: ${{ matrix.modules }}"
+    runs-on: ubuntu-20.04
+    container:
+      image: dongjoon/apache-spark-github-action-image:20201015
+    strategy:
+      fail-fast: false
+      matrix:
+        modules:
+          - >-
+            pyspark-sql, pyspark-mllib, pyspark-resource
+          - >-
+            pyspark-core, pyspark-streaming, pyspark-ml
+    env:
+      MODULES_TO_TEST: ${{ matrix.modules }}
+      HADOOP_PROFILE: hadoop3.2
+      HIVE_PROFILE: hive2.3
+      # GitHub Actions' default miniconda to use in pip packaging test.
+      CONDA_PREFIX: /usr/share/miniconda
+      GITHUB_PREV_SHA: ${{ github.event.before }}
+      GITHUB_INPUT_BRANCH: ${{ github.event.inputs.target }}
+    steps:
+    - name: Checkout Spark repository
+      uses: actions/checkout@v2
+      # In order to fetch changed files
+      with:
+        fetch-depth: 0
+    - name: Merge dispatched input branch
+      if: ${{ github.event.inputs.target != '' }}
+      run: git merge --progress --ff-only origin/${{ github.event.inputs.target }}
+    # Cache local repositories. Note that GitHub Actions cache has a 2G limit.
+    - name: Cache Scala, SBT, Maven and Zinc
+      uses: actions/cache@v2
+      with:
+        path: |
+          build/apache-maven-*
+          build/zinc-*
+          build/scala-*
+          build/*.jar
+        key: build-${{ hashFiles('**/pom.xml', 'project/build.properties', 'build/mvn', 'build/sbt', 'build/sbt-launch-lib.bash', 'build/spark-build-info') }}
+        restore-keys: |
+          build-
+    - name: Cache Maven local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.m2/repository
+        key: pyspark-maven-${{ hashFiles('**/pom.xml') }}
+        restore-keys: |
+          pyspark-maven-
+    - name: Cache Ivy local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.ivy2/cache
+        key: pyspark-ivy-${{ hashFiles('**/pom.xml', '**/plugins.sbt') }}
+        restore-keys: |
+          pyspark-ivy-
+    - name: Install Python 3.6
+      uses: actions/setup-python@v2
+      if: contains(matrix.modules, 'pyspark')
+      with:
+        python-version: 3.6
+        architecture: x64
+    - name: Install Python packages (Python 3.6)

Review comment:
       Another quick question. So does the Docker image have Python 3.8 and PyPy3 with the packages installed? I wonder if we can install them for Python 3.6 as well.






[GitHub] [spark] holdenk commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
holdenk commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709557255


   One quick question: the docker image doesn't seem to install python3.7?
   Can you add a USER to the dockerfile to specify a non-root user?




[GitHub] [spark] HyukjinKwon commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709667640


   Nice!




[GitHub] [spark] SparkQA commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709561878


   **[Test build #129850 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129850/testReport)** for PR 30059 at commit [`7c7a748`](https://github.com/apache/spark/commit/7c7a748e0444c8f2a20ba53e2416995d40da1945).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] dongjoon-hyun commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709566982


   BTW, this PR is unrelated to the `root` user issue, as you can see here.




[GitHub] [spark] dongjoon-hyun edited a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709634856


   BTW, you don't need to put everything into the `Dockerfile`.
   Let's say it's the same as the current `ubuntu:20.04`. As you described, contributors don't need to know about the `pre-built image`. Instead, they can add new commands to the GitHub workflow file only. Then, they will be installed on the fly. And we can move them at every release preparation. What do you think about this?




[GitHub] [spark] dongjoon-hyun commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709555763


   Ya. In addition to that, I want to move in the following direction in the near future.
   1. Apply the official Apache dockerhub location for Apache Spark
   2. Make a new branch inside Apache Spark repo and move `Dockerfile` and `Github Action build file` by moving `https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage`'s main branch to that Apache Spark branch.




[GitHub] [spark] SparkQA commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709864009


   **[Test build #129864 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129864/testReport)** for PR 30059 at commit [`f39ac87`](https://github.com/apache/spark/commit/f39ac871fc38e8ec8c02b7f6661748e2c7d431e9).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA removed a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709481869


   **[Test build #129850 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129850/testReport)** for PR 30059 at commit [`7c7a748`](https://github.com/apache/spark/commit/7c7a748e0444c8f2a20ba53e2416995d40da1945).






[GitHub] [spark] dongjoon-hyun edited a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709634856


   BTW, you don't need to put everything into the `Dockerfile`.
   Let's say it's the same as the current `ubuntu:20.04`. As you described, contributors don't need to know about the `pre-built image`. Instead, they can add new commands to the GitHub workflow file only. Then, they will be installed on the fly during GitHub Actions execution time. And we can move some of them to the `Dockerfile` at every release preparation. What do you think about this?
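   
   A minimal sketch of that flow, assuming a hypothetical package called `some-new-package` (a placeholder, not something this PR adds): a contributor could append a step like the one below to the `pyspark` job in `.github/workflows/build_and_test.yml`. It relies on the `Python 3.6` interpreter that the job already installs through `actions/setup-python`.
   
   ```yaml
   # Hypothetical step for illustration; `some-new-package` is a placeholder name.
   # The package is installed at job execution time, so the pre-built image
   # does not need to be rebuilt for this change.
   - name: Install extra Python packages (not yet in the pre-built image)
     run: |
       python3.6 -m pip install some-new-package
       python3.6 -m pip list
   ```
   
   Once such a step has proven stable, the same `pip install` line could be moved into the `Dockerfile` and the step dropped from the workflow.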




[GitHub] [spark] dongjoon-hyun commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709669104


   Sorry, I'll adjust by myself next time. :) Thank you for your suggestion!




[GitHub] [spark] HyukjinKwon commented on a change in pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #30059:
URL: https://github.com/apache/spark/pull/30059#discussion_r505936661



##########
File path: .github/workflows/build_and_test.yml
##########
@@ -201,6 +173,92 @@ jobs:
         name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}-${{ matrix.hadoop }}-${{ matrix.hive }}
         path: "**/target/unit-tests.log"
 
+  pyspark:
+    name: "Build modules: ${{ matrix.modules }}"
+    runs-on: ubuntu-20.04
+    container:
+      image: dongjoon/apache-spark-github-action-image:20201015
+    strategy:
+      fail-fast: false
+      matrix:
+        modules:
+          - >-
+            pyspark-sql, pyspark-mllib, pyspark-resource
+          - >-
+            pyspark-core, pyspark-streaming, pyspark-ml
+    env:
+      MODULES_TO_TEST: ${{ matrix.modules }}
+      HADOOP_PROFILE: hadoop3.2
+      HIVE_PROFILE: hive2.3
+      # GitHub Actions' default miniconda to use in pip packaging test.
+      CONDA_PREFIX: /usr/share/miniconda
+      GITHUB_PREV_SHA: ${{ github.event.before }}
+      GITHUB_INPUT_BRANCH: ${{ github.event.inputs.target }}
+    steps:
+    - name: Checkout Spark repository
+      uses: actions/checkout@v2
+      # In order to fetch changed files
+      with:
+        fetch-depth: 0
+    - name: Merge dispatched input branch
+      if: ${{ github.event.inputs.target != '' }}
+      run: git merge --progress --ff-only origin/${{ github.event.inputs.target }}
+    # Cache local repositories. Note that GitHub Actions cache has a 2G limit.
+    - name: Cache Scala, SBT, Maven and Zinc
+      uses: actions/cache@v2
+      with:
+        path: |
+          build/apache-maven-*
+          build/zinc-*
+          build/scala-*
+          build/*.jar
+        key: build-${{ hashFiles('**/pom.xml', 'project/build.properties', 'build/mvn', 'build/sbt', 'build/sbt-launch-lib.bash', 'build/spark-build-info') }}
+        restore-keys: |
+          build-
+    - name: Cache Maven local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.m2/repository
+        key: pyspark-maven-${{ hashFiles('**/pom.xml') }}
+        restore-keys: |
+          pyspark-maven-
+    - name: Cache Ivy local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.ivy2/cache
+        key: pyspark-ivy-${{ hashFiles('**/pom.xml', '**/plugins.sbt') }}
+        restore-keys: |
+          pyspark-ivy-
+    - name: Install Python 3.6
+      uses: actions/setup-python@v2
+      if: contains(matrix.modules, 'pyspark')
+      with:
+        python-version: 3.6
+        architecture: x64
+    - name: Install Python packages (Python 3.6)

Review comment:
       I mean, I get that it takes less time, but I'm wondering if we can just pre-install it to keep things consistent.
   BTW, it's interesting that installing `numpy pyarrow pandas scipy xmlrunner` only takes 20 seconds.






[GitHub] [spark] HyukjinKwon removed a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon removed a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709667567








[GitHub] [spark] viirya commented on a change in pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #30059:
URL: https://github.com/apache/spark/pull/30059#discussion_r505916127



##########
File path: .github/workflows/build_and_test.yml
##########
@@ -201,6 +173,92 @@ jobs:
         name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}-${{ matrix.hadoop }}-${{ matrix.hive }}
         path: "**/target/unit-tests.log"
 
+  pyspark:
+    name: "Build modules: ${{ matrix.modules }}"
+    runs-on: ubuntu-20.04
+    container:
+      image: dongjoon/apache-spark-github-action-image:20201015

Review comment:
       As we have `container`, do we still need `runs-on`?
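   
   For context, a minimal sketch of how the two keys relate in GitHub Actions workflow syntax: `runs-on` selects the host runner and is still required, while `container` selects the image in which the job's steps execute on that runner. The job name and the `cat` step are hypothetical, and the step assumes the image is Linux-based.
   
   ```yaml
   # Hypothetical sketch to illustrate `runs-on` together with `container`.
   jobs:
     example:                   # hypothetical job name
       runs-on: ubuntu-20.04    # host runner that GitHub provisions
       container:
         # Image from this PR's workflow diff.
         image: dongjoon/apache-spark-github-action-image:20201015
       steps:
         # This command runs inside the container, not directly on the host.
         - run: cat /etc/os-release
   ```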






[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #30059:
URL: https://github.com/apache/spark/pull/30059#discussion_r505903745



##########
File path: .github/workflows/build_and_test.yml
##########
@@ -128,41 +124,17 @@ jobs:
       uses: actions/setup-java@v1
       with:
         java-version: ${{ matrix.java }}
-    # PySpark
-    - name: Install PyPy3
-      # Note that order of Python installations here matters because default python3 is
-      # overridden by pypy3.
-      uses: actions/setup-python@v2
-      if: contains(matrix.modules, 'pyspark')
-      with:
-        python-version: pypy3
-        architecture: x64
-    - name: Install Python 3.6
-      uses: actions/setup-python@v2
-      if: contains(matrix.modules, 'pyspark')
-      with:
-        python-version: 3.6
-        architecture: x64
     - name: Install Python 3.8
       uses: actions/setup-python@v2
       # We should install one Python that is higher then 3+ for SQL and Yarn because:
       # - SQL component also has Python related tests, for example, IntegratedUDFTestUtils.
       # - Yarn has a Python specific test too, for example, YarnClusterSuite.
-      if: contains(matrix.modules, 'yarn') || contains(matrix.modules, 'pyspark') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
+      if: contains(matrix.modules, 'yarn') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
       with:
         python-version: 3.8
         architecture: x64
-    - name: Install Python packages (Python 3.6 and PyPy3)
-      if: contains(matrix.modules, 'pyspark')
-      # PyArrow is not supported in PyPy yet, see ARROW-2651.
-      run: |
-        python3.6 -m pip install numpy pyarrow pandas scipy xmlrunner
-        python3.6 -m pip list
-        # PyPy does not have xmlrunner
-        pypy3 -m pip install numpy pandas scipy
-        pypy3 -m pip list
     - name: Install Python packages (Python 3.8)
-      if: contains(matrix.modules, 'pyspark') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
+      if: (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
       run: |
         python3.8 -m pip install numpy pyarrow pandas scipy xmlrunner

Review comment:
       Correction: It can take `20s`.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709682606


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/34470/
   Test FAILed.




[GitHub] [spark] dongjoon-hyun edited a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709566982


   BTW, as you can see, this PR is unrelated to the `root` user issue.




[GitHub] [spark] dongjoon-hyun commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709660507


   BTW, @HyukjinKwon , in case you are interested:
   - SparkR pre-built image testing is in progress here (https://github.com/dongjoon-hyun/spark/pull/37).
   - The initial container attempt was done here (https://github.com/dongjoon-hyun/spark/pull/35).




[GitHub] [spark] HyukjinKwon commented on a change in pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #30059:
URL: https://github.com/apache/spark/pull/30059#discussion_r505935473



##########
File path: .github/workflows/build_and_test.yml
##########
@@ -201,6 +173,92 @@ jobs:
         name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}-${{ matrix.hadoop }}-${{ matrix.hive }}
         path: "**/target/unit-tests.log"
 
+  pyspark:
+    name: "Build modules: ${{ matrix.modules }}"
+    runs-on: ubuntu-20.04
+    container:
+      image: dongjoon/apache-spark-github-action-image:20201015

Review comment:
       @dongjoon-hyun, quick question: what if we just set this image for the main job above (meaning we use it for all tests and builds) and remove this new job here? I think that would make it easier to move to the Docker image for all test cases in the near future.






[GitHub] [spark] SparkQA commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709682594


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34470/
   




[GitHub] [spark] AmplabJenkins commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709520735








[GitHub] [spark] viirya commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709636133


   Yes, I thought about that too. Except for the very few things that necessarily require changing the Dockerfile, other changes can go into the workflow file first. So it is just a minor concern.







[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #30059:
URL: https://github.com/apache/spark/pull/30059#discussion_r505937095



##########
File path: .github/workflows/build_and_test.yml
##########
@@ -201,6 +173,92 @@ jobs:
         name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}-${{ matrix.hadoop }}-${{ matrix.hive }}
         path: "**/target/unit-tests.log"
 
+  pyspark:
+    name: "Build modules: ${{ matrix.modules }}"
+    runs-on: ubuntu-20.04
+    container:
+      image: dongjoon/apache-spark-github-action-image:20201015
+    strategy:
+      fail-fast: false
+      matrix:
+        modules:
+          - >-
+            pyspark-sql, pyspark-mllib, pyspark-resource
+          - >-
+            pyspark-core, pyspark-streaming, pyspark-ml
+    env:
+      MODULES_TO_TEST: ${{ matrix.modules }}
+      HADOOP_PROFILE: hadoop3.2
+      HIVE_PROFILE: hive2.3
+      # GitHub Actions' default miniconda to use in pip packaging test.
+      CONDA_PREFIX: /usr/share/miniconda
+      GITHUB_PREV_SHA: ${{ github.event.before }}
+      GITHUB_INPUT_BRANCH: ${{ github.event.inputs.target }}
+    steps:
+    - name: Checkout Spark repository
+      uses: actions/checkout@v2
+      # In order to fetch changed files
+      with:
+        fetch-depth: 0
+    - name: Merge dispatched input branch
+      if: ${{ github.event.inputs.target != '' }}
+      run: git merge --progress --ff-only origin/${{ github.event.inputs.target }}
+    # Cache local repositories. Note that GitHub Actions cache has a 2G limit.
+    - name: Cache Scala, SBT, Maven and Zinc
+      uses: actions/cache@v2
+      with:
+        path: |
+          build/apache-maven-*
+          build/zinc-*
+          build/scala-*
+          build/*.jar
+        key: build-${{ hashFiles('**/pom.xml', 'project/build.properties', 'build/mvn', 'build/sbt', 'build/sbt-launch-lib.bash', 'build/spark-build-info') }}
+        restore-keys: |
+          build-
+    - name: Cache Maven local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.m2/repository
+        key: pyspark-maven-${{ hashFiles('**/pom.xml') }}
+        restore-keys: |
+          pyspark-maven-
+    - name: Cache Ivy local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.ivy2/cache
+        key: pyspark-ivy-${{ hashFiles('**/pom.xml', '**/plugins.sbt') }}
+        restore-keys: |
+          pyspark-ivy-
+    - name: Install Python 3.6
+      uses: actions/setup-python@v2
+      if: contains(matrix.modules, 'pyspark')
+      with:
+        python-version: 3.6
+        architecture: x64
+    - name: Install Python packages (Python 3.6)

Review comment:
       There is a technical issue with installing `Python 3.6` on `Ubuntu 20.04`. The `Dockerfile` for the image is here:
   - https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/blob/main/Dockerfile
   
   Also, installing `Python 3.6` and its packages only takes 30 seconds, which is not significant.









[GitHub] [spark] dongjoon-hyun edited a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709624882


   Er, why do you think so, @viirya ? Our plan is [here](https://github.com/apache/spark/pull/30059#issuecomment-709555763).
   
   1. If you are talking about the `repo` or `branch`: project contributors will be able to change the PySpark job inside the Apache Spark repo in the future. It costs the same as changing `branch-2.4` or `branch-3.0`, and it's easier than changing `spark-website`.
       > As the pre-built image cannot be easily changed like modifying workflow yml file, the project contributors might be a bit harder to change Github Action PySpark job.
   
   2. If you are talking about the `Dockerfile`: we already have 7 `Dockerfile`s that are open to contributors. They consist of simple Linux commands.
   ```
   $ find . -name Dockerfile
   ./resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
   ./resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile
   ./resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile
   ./dev/create-release/spark-rm/Dockerfile
   ./external/docker/spark-test/master/Dockerfile
   ./external/docker/spark-test/worker/Dockerfile
   ./external/docker/spark-test/base/Dockerfile
   ```
   
   BTW, I want to focus on the improvement we can get. As you know, I already tried various approaches in the following PR, but nothing was better than this. This PR reduces the time from `3 hours 14 minutes` to `2 hours 23 minutes`.
   - https://github.com/apache/spark/pull/30012




[GitHub] [spark] dongjoon-hyun commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709547317


   Could you review this please, @srowen , @HyukjinKwon , @maropu , @viirya ?




[GitHub] [spark] SparkQA commented on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709629147


   **[Test build #129854 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129854/testReport)** for PR 30059 at commit [`5ce11dc`](https://github.com/apache/spark/commit/5ce11dc1dae967494809fc2a8b0f9464f2c07dae).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30059: [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30059:
URL: https://github.com/apache/spark/pull/30059#issuecomment-709629882





