Posted to commits@spark.apache.org by do...@apache.org on 2020/11/05 04:45:50 UTC

[spark] branch branch-3.0 updated: [SPARK-33162][INFRA][3.0] Use pre-built image at GitHub Action PySpark jobs

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new b43572e  [SPARK-33162][INFRA][3.0] Use pre-built image at GitHub Action PySpark jobs
b43572e is described below

commit b43572e73b2078cb8c554f023b12c89b0b97bda5
Author: Dongjoon Hyun <dh...@apple.com>
AuthorDate: Wed Nov 4 20:38:22 2020 -0800

    [SPARK-33162][INFRA][3.0] Use pre-built image at GitHub Action PySpark jobs
    
    ### What changes were proposed in this pull request?
    
    This is a backport of https://github.com/apache/spark/pull/30059 .
    
    This PR aims to use a `pre-built image` for the GitHub Action PySpark jobs. To isolate the changes, the `pyspark` jobs are split from the main job. The docker image is built from the following (a minimal usage sketch follows the table):
    
    | Item           | URL |
    | -------------- | --- |
    | Dockerfile     | https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/blob/main/Dockerfile |
    | Builder        | https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/blob/main/.github/workflows/build.yml |
    | Image Location | https://hub.docker.com/r/dongjoon/apache-spark-github-action-image |
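    
    For illustration, here is a minimal sketch of a job running inside the pre-built container. The image tag and the `run-tests` invocation mirror the diff in this commit; the rest is pared down and omits the caching and upload steps, so it is a sketch rather than the full job definition:
    
    ```yaml
    # Minimal sketch: run PySpark tests inside the pre-built image, so the
    # Python interpreters and packages baked into the image are reused per run.
    pyspark:
      runs-on: ubuntu-20.04
      container:
        image: dongjoon/apache-spark-github-action-image:20201025
      steps:
        - name: Checkout Spark repository
          uses: actions/checkout@v2
        - name: Run tests
          run: ./dev/run-tests --parallelism 2 --modules "pyspark-sql,pyspark-mllib"
    ```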
    
    Please note the following:
    1. The community will still use `build_and_test.yml` to add new features, as we have done until now. The `Dockerfile` will be updated regularly.
    2. When Apache Spark gets an official docker repository location, we will use it.
    3. Also, it would be best to keep this Dockerfile and builder script in a new Apache Spark dev branch rather than in an outside GitHub repository (a hypothetical sketch of such a builder workflow follows).
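    
    For reference, a builder workflow along the lines of the `build.yml` linked above might look roughly like the sketch below. This is a hypothetical illustration using the `docker/login-action` and `docker/build-push-action` actions; the secret names and trigger are assumptions, not the actual contents of the linked file:
    
    ```yaml
    # Hypothetical sketch of an image-builder workflow (not the linked build.yml).
    name: Build and push CI image
    on:
      push:
        branches: [ main ]
    jobs:
      build:
        runs-on: ubuntu-20.04
        steps:
          - uses: actions/checkout@v2
          - uses: docker/login-action@v1
            with:
              username: ${{ secrets.DOCKERHUB_USERNAME }}  # assumed secret name
              password: ${{ secrets.DOCKERHUB_TOKEN }}     # assumed secret name
          - uses: docker/build-push-action@v2
            with:
              push: true
              tags: dongjoon/apache-spark-github-action-image:20201025
    ```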
    
    ### Why are the changes needed?
    
    This will reduce the time spent installing Python and its packages.
    
    **BEFORE (branch-3.0)**
    ![Screen Shot 2020-11-04 at 2 28 49 PM](https://user-images.githubusercontent.com/9700541/98174664-17f2e500-1eaa-11eb-9222-018eead9c418.png)
    
    **AFTER (branch-3.0)**
    ![Screen Shot 2020-11-04 at 2 29 43 PM](https://user-images.githubusercontent.com/9700541/98174758-378a0d80-1eaa-11eb-8e6a-929158c2fea3.png)
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the GitHub Action on this PR without the `package installation steps`.
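    
    To poke at the same environment locally, one can pull the image and open a shell in it. This is a sketch; what is preinstalled inside the image is an assumption based on the packages the removed steps used to install:
    
    ```bash
    # Pull the pre-built image used by the new pyspark jobs and inspect it.
    docker pull dongjoon/apache-spark-github-action-image:20201025
    docker run -it dongjoon/apache-spark-github-action-image:20201025 bash
    # e.g., inside the container: python3 -m pip list
    ```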
    
    Closes #30253 from dongjoon-hyun/GHA-3.0.
    
    Authored-by: Dongjoon Hyun <dh...@apple.com>
    Signed-off-by: Dongjoon Hyun <dh...@apple.com>
---
 .github/workflows/build_and_test.yml | 112 +++++++++++++++++++++++++----------
 1 file changed, 82 insertions(+), 30 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index e4c1e84..7956d9e 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -38,10 +38,6 @@ jobs:
             mllib-local, mllib,
             yarn, mesos, kubernetes, hadoop-cloud, spark-ganglia-lgpl
           - >-
-            pyspark-sql, pyspark-mllib
-          - >-
-            pyspark-core, pyspark-streaming, pyspark-ml
-          - >-
             sparkr
         # Here, we split Hive and SQL tests into some of slow ones and the rest of them.
         included-tags: [""]
@@ -121,41 +117,17 @@ jobs:
       uses: actions/setup-java@v1
       with:
         java-version: ${{ matrix.java }}
-    # PySpark
-    - name: Install PyPy3
-      # Note that order of Python installations here matters because default python3 is
-      # overridden by pypy3.
-      uses: actions/setup-python@v2
-      if: contains(matrix.modules, 'pyspark')
-      with:
-        python-version: pypy3
-        architecture: x64
-    - name: Install Python 2.7
-      uses: actions/setup-python@v2
-      if: contains(matrix.modules, 'pyspark')
-      with:
-        python-version: 2.7
-        architecture: x64
     - name: Install Python 3.8
       uses: actions/setup-python@v2
       # We should install one Python that is higher than 3 for SQL and Yarn because:
       # - SQL component also has Python related tests, for example, IntegratedUDFTestUtils.
       # - Yarn has a Python specific test too, for example, YarnClusterSuite.
-      if: contains(matrix.modules, 'yarn') || contains(matrix.modules, 'pyspark') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
+      if: contains(matrix.modules, 'yarn') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
       with:
         python-version: 3.8
         architecture: x64
-    - name: Install Python packages (Python 2.7 and PyPy3)
-      if: contains(matrix.modules, 'pyspark')
-      # PyArrow is not supported in PyPy yet, see ARROW-2651.
-      run: |
-        python2.7 -m pip install numpy 'pyarrow<3.0.0' pandas scipy xmlrunner
-        python2.7 -m pip list
-        # PyPy does not have xmlrunner
-        pypy3 -m pip install numpy pandas scipy
-        pypy3 -m pip list
     - name: Install Python packages (Python 3.8)
-      if: contains(matrix.modules, 'pyspark') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
+      if: (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
       run: |
         python3.8 -m pip install numpy 'pyarrow<3.0.0' pandas scipy xmlrunner
         python3.8 -m pip list
@@ -194,6 +166,86 @@ jobs:
         name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}-${{ matrix.hadoop }}-${{ matrix.hive }}
         path: "**/target/unit-tests.log"
 
+  pyspark:
+    name: "Build modules: ${{ matrix.modules }}"
+    runs-on: ubuntu-20.04
+    container:
+       image: dongjoon/apache-spark-github-action-image:20201025
+    strategy:
+      fail-fast: false
+      matrix:
+        modules:
+          - >-
+            pyspark-sql, pyspark-mllib
+          - >-
+            pyspark-core, pyspark-streaming, pyspark-ml
+    env:
+      MODULES_TO_TEST: ${{ matrix.modules }}
+      HADOOP_PROFILE: hadoop2.7
+      HIVE_PROFILE: hive2.3
+      # GitHub Actions' default miniconda to use in pip packaging test.
+      CONDA_PREFIX: /usr/share/miniconda
+      GITHUB_PREV_SHA: ${{ github.event.before }}
+    steps:
+    - name: Checkout Spark repository
+      uses: actions/checkout@v2
+      # In order to fetch changed files
+      with:
+        fetch-depth: 0
+    # Cache local repositories. Note that GitHub Actions cache has a 2G limit.
+    - name: Cache Scala, SBT, Maven and Zinc
+      uses: actions/cache@v2
+      with:
+        path: |
+          build/apache-maven-*
+          build/zinc-*
+          build/scala-*
+          build/*.jar
+        key: build-${{ hashFiles('**/pom.xml', 'project/build.properties', 'build/mvn', 'build/sbt', 'build/sbt-launch-lib.bash', 'build/spark-build-info') }}
+        restore-keys: |
+          build-
+    - name: Cache Maven local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.m2/repository
+        key: pyspark-maven-${{ hashFiles('**/pom.xml') }}
+        restore-keys: |
+          pyspark-maven-
+    - name: Cache Ivy local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.ivy2/cache
+        key: pyspark-ivy-${{ hashFiles('**/pom.xml', '**/plugins.sbt') }}
+        restore-keys: |
+          pyspark-ivy-
+    - name: Install Python 2.7
+      uses: actions/setup-python@v2
+      with:
+        python-version: 2.7
+        architecture: x64
+    - name: Install Python packages (Python 2.7)
+      run: |
+        python2.7 -m pip install numpy 'pyarrow<3.0.0' pandas scipy xmlrunner
+        python2.7 -m pip list
+    # Run the tests.
+    - name: Run tests
+      run: |
+        mkdir -p ~/.m2
+        ./dev/run-tests --parallelism 2 --modules "$MODULES_TO_TEST"
+        rm -rf ~/.m2/repository/org/apache/spark
+    - name: Upload test results to report
+      if: always()
+      uses: actions/upload-artifact@v2
+      with:
+        name: test-results-${{ matrix.modules }}--1.8-hadoop2.7-hive2.3
+        path: "**/target/test-reports/*.xml"
+    - name: Upload unit tests log files
+      if: failure()
+      uses: actions/upload-artifact@v2
+      with:
+        name: unit-tests-log-${{ matrix.modules }}--1.8-hadoop2.7-hive2.3
+        path: "**/target/unit-tests.log"
+
   # Static analysis, and documentation build
   lint:
     name: Linters, licenses, dependencies and documentation generation


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org