You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "zhengruifeng (via GitHub)" <gi...@apache.org> on 2023/07/26 09:34:20 UTC

[GitHub] [spark] zhengruifeng commented on a diff in pull request #42125: [WIP][SPARK-44098][INFRA] Introduce python breaking change detection

zhengruifeng commented on code in PR #42125:
URL: https://github.com/apache/spark/pull/42125#discussion_r1274578926


##########
dev/aexpy/LICENSE:
##########
@@ -0,0 +1,201 @@
+Apache License

Review Comment:
   since aexpy is also under apache 2.0, do we need this license doc? 
   
   if so, should we move this license file to `./licenses`? 
   
   @srowen @HyukjinKwon 



##########
dev/aexpy/README.md:
##########
@@ -0,0 +1,11 @@
+# ![AexPy](https://socialify.git.ci/StardustDL/aexpy/image?description=1&font=Bitter&forks=1&issues=1&language=1&owner=1&pulls=1&stargazers=1&theme=Light "AexPy")

Review Comment:
   we need to improve this doc, e.g. 
   
   - dependency
   - env and variables
   - example commands
   - how to explain the results
   
   but we can do it later



##########
dev/python-detect-breaking-changes.py:
##########
@@ -0,0 +1,182 @@
+#!/usr/bin/env python3
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# Original repository: https://github.com/StardustDL/aexpy
+# Copyright 2022 StardustDL <st...@163.com>
+#
+
+# Usage: compare current code base with a commit or branch
+# python dev/python-detect-breaking-changes.py a540995345bc935db3b8ccf5cb94cb7caabc4847 (branch name or commit id)
+#
+# Cache dir (detailed result and logs): dev/aexpy/cache
+
+from io import TextIOWrapper
+from pathlib import Path
+import os
+import shutil
+from contextlib import contextmanager
+import subprocess
+import sys
+import json
+import traceback
+
+PYTHON_EXE = sys.executable
+AEXPY_EXE = [PYTHON_EXE, "-m", "aexpy", "-vvv"]
+PYSPARK_RELPATH = "python"
+PYSPARK_TOPMODULES = ["pyspark"]
+CURRENT_CODE = "__current__"
+
+ROOT_PATH = Path(__file__).parent.parent
+AEXPY_DIR = Path(__file__).parent / "aexpy"
+CACHE_DIR = AEXPY_DIR / "cache"
+assert AEXPY_DIR.exists()
+
+
+def prepare():
+    if not CACHE_DIR.exists():
+        os.mkdir(CACHE_DIR)
+
+
+def runWithLog(logFile: TextIOWrapper = None, *args, **kwargs):
+    if logFile:
+        logFile.write(f"\nsubprocess.run({args=}, {kwargs=})\n\n")
+    result = subprocess.run(*args, **kwargs, capture_output=True, text=True, encoding="utf-8")
+    if logFile:
+        logFile.write(result.stdout + "\n")
+        logFile.write(result.stderr + "\n")
+        logFile.flush()
+    result.check_returncode()
+
+
+def runAexPy(cmd: list, logFile: TextIOWrapper = None):
+    runWithLog(
+        logFile,
+        AEXPY_EXE + cmd,
+        cwd=AEXPY_DIR,
+        env={**os.environ, "PYTHONUTF8": "1", "AEXPY_PYTHON_EXE": sys.executable},
+    )
+
+
+@contextmanager
+def checkout(branch: str = CURRENT_CODE, logFile: TextIOWrapper = None):
+    if branch == CURRENT_CODE:
+        yield ROOT_PATH
+    else:
+        srcDir = CACHE_DIR / "src" / branch
+        if srcDir.exists():
+            shutil.rmtree(srcDir)
+        runWithLog(logFile, ["git", "worktree", "prune"], cwd=ROOT_PATH)
+        os.makedirs(srcDir)
+        runWithLog(
+            logFile, ["git", "worktree", "add", str(srcDir.resolve()), branch], cwd=ROOT_PATH
+        )
+
+        yield srcDir
+
+        shutil.rmtree(srcDir)
+        runWithLog(logFile, ["git", "worktree", "prune"], cwd=ROOT_PATH)
+
+
+def extract(branch: str, srcDir: Path, logFile: TextIOWrapper = None):
+    distDir = CACHE_DIR / "dist" / branch
+    if not distDir.exists():
+        os.makedirs(distDir)
+
+    preprocess = distDir / "preprocess.json"
+    runAexPy(
+        ["preprocess", str((srcDir / PYSPARK_RELPATH).resolve())]
+        + PYSPARK_TOPMODULES
+        + ["-r", f"pyspark@{branch}", "-o", str(preprocess.resolve())],
+        logFile,
+    )
+
+    extract = distDir / "extract.json"
+    runAexPy(["extract", str(preprocess.resolve()), "-o", str(extract.resolve())], logFile)
+
+    return extract
+
+
+def diff(old: str, new: str = CURRENT_CODE):
+    distDir = CACHE_DIR / "dist" / new / f"with-{old}"
+    if not distDir.exists():
+        os.makedirs(distDir)
+    logPath = distDir / f"log.log"
+    try:
+        with open(logPath, "w", encoding="utf-8") as log:
+            with checkout(old, log) as src1:
+                log.write(f"Extract API of {old}...\n")
+                extract1 = extract(old, src1, log)
+                with checkout(new) as src2:
+                    log.write(f"Extract API of {new}...\n")
+                    extract2 = extract(new, src2, log)
+
+                    log.write(f"Diff APIs...\n")
+                    diff = distDir / f"diff.json"
+                    runAexPy(
+                        [
+                            "diff",
+                            str(extract1.resolve()),
+                            str(extract2.resolve()),
+                            "-o",
+                            str(diff.resolve()),
+                        ],
+                        log,
+                    )
+
+                    log.write(f"Generate changes...\n")
+                    report = distDir / f"report.json"
+                    runAexPy(["report", str(diff.resolve()), "-o", str(report.resolve())], log)
+
+                    result = json.loads(report.read_text("utf-8"))
+
+                    log.write(f"Result: {result}\n")
+
+                    content = result["content"]
+                    (distDir / "report.txt").write_text(content, encoding="utf-8")
+
+                    log.write(f"\n{content}\n")
+
+                    return content
+    except Exception as ex:
+        traceback.print_exception(ex)
+        print(f"An error occured, view log at {logPath.resolve()} .")
+        exit(1)
+
+
+def main():

Review Comment:
   since this script only works with python 3.10, we should throw a clear error message when python version doesn't match



##########
dev/python-detect-breaking-changes.py:
##########
@@ -0,0 +1,182 @@
+#!/usr/bin/env python3
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# Original repository: https://github.com/StardustDL/aexpy
+# Copyright 2022 StardustDL <st...@163.com>
+#
+
+# Usage: compare current code base with a commit or branch
+# python dev/python-detect-breaking-changes.py a540995345bc935db3b8ccf5cb94cb7caabc4847 (branch name or commit id)

Review Comment:
   normally, the `dev` tools are used without `python` in the commands, like `./dev/python-detect-breaking-changes.py xxx`



##########
dev/python-detect-breaking-changes.py:
##########
@@ -0,0 +1,182 @@
+#!/usr/bin/env python3
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# Original repository: https://github.com/StardustDL/aexpy
+# Copyright 2022 StardustDL <st...@163.com>
+#
+
+# Usage: compare current code base with a commit or branch
+# python dev/python-detect-breaking-changes.py a540995345bc935db3b8ccf5cb94cb7caabc4847 (branch name or commit id)

Review Comment:
   `python dev/python-detect-breaking-changes.py a540995345bc935db3b8ccf5cb94cb7caabc4847` works successfully,
   
   while `python dev/python-detect-breaking-changes.py branch-3.4` fails with 
   ```
   (spark_dev_310) ➜  spark git:(third-aexpy) python dev/python-detect-breaking-changes.py branch-3.4
   Traceback (most recent call last):
     File "/Users/ruifeng.zheng/Dev/spark/dev/python-detect-breaking-changes.py", line 121, in diff
       with checkout(old, log) as src1:
     File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_310/lib/python3.10/contextlib.py", line 135, in __enter__
       return next(self.gen)
     File "/Users/ruifeng.zheng/Dev/spark/dev/python-detect-breaking-changes.py", line 85, in checkout
       runWithLog(
     File "/Users/ruifeng.zheng/Dev/spark/dev/python-detect-breaking-changes.py", line 63, in runWithLog
       result.check_returncode()
     File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_310/lib/python3.10/subprocess.py", line 457, in check_returncode
       raise CalledProcessError(self.returncode, self.args, self.stdout,
   subprocess.CalledProcessError: Command '['git', 'worktree', 'add', '/Users/ruifeng.zheng/Dev/spark/dev/aexpy/cache/src/branch-3.4', 'branch-3.4']' returned non-zero exit status 128.
   An error occured, view log at /Users/ruifeng.zheng/Dev/spark/dev/aexpy/cache/dist/__current__/with-branch-3.4/log.log .
   (spark_dev_310) ➜  spark git:(third-aexpy)     
   (spark_dev_310) ➜  spark git:(third-aexpy) cat /Users/ruifeng.zheng/Dev/spark/dev/aexpy/cache/dist/__current__/with-branch-3.4/log.log
   
   subprocess.run(args=(['git', 'worktree', 'prune'],), kwargs={'cwd': PosixPath('/Users/ruifeng.zheng/Dev/spark')})
   
   
   
   
   subprocess.run(args=(['git', 'worktree', 'add', '/Users/ruifeng.zheng/Dev/spark/dev/aexpy/cache/src/branch-3.4', 'branch-3.4'],), kwargs={'cwd': PosixPath('/Users/ruifeng.zheng/Dev/spark')})
   
   
   fatal: invalid reference: branch-3.4
   
   (spark_dev_310) ➜  spark git:(third-aexpy) 
   
   ```



##########
dev/python-detect-breaking-changes.py:
##########
@@ -0,0 +1,182 @@
+#!/usr/bin/env python3
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# Original repository: https://github.com/StardustDL/aexpy
+# Copyright 2022 StardustDL <st...@163.com>
+#
+
+# Usage: compare current code base with a commit or branch
+# python dev/python-detect-breaking-changes.py a540995345bc935db3b8ccf5cb94cb7caabc4847 (branch name or commit id)

Review Comment:
   `python dev/python-detect-breaking-changes.py a540995345bc935db3b8ccf5cb94cb7caabc4847` works successfully,
   
   while `python dev/python-detect-breaking-changes.py branch-3.4` fails with 
   ```
   (spark_dev_310) ➜  spark git:(third-aexpy) python dev/python-detect-breaking-changes.py branch-3.4
   Traceback (most recent call last):
     File "/Users/ruifeng.zheng/Dev/spark/dev/python-detect-breaking-changes.py", line 121, in diff
       with checkout(old, log) as src1:
     File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_310/lib/python3.10/contextlib.py", line 135, in __enter__
       return next(self.gen)
     File "/Users/ruifeng.zheng/Dev/spark/dev/python-detect-breaking-changes.py", line 85, in checkout
       runWithLog(
     File "/Users/ruifeng.zheng/Dev/spark/dev/python-detect-breaking-changes.py", line 63, in runWithLog
       result.check_returncode()
     File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_310/lib/python3.10/subprocess.py", line 457, in check_returncode
       raise CalledProcessError(self.returncode, self.args, self.stdout,
   subprocess.CalledProcessError: Command '['git', 'worktree', 'add', '/Users/ruifeng.zheng/Dev/spark/dev/aexpy/cache/src/branch-3.4', 'branch-3.4']' returned non-zero exit status 128.
   An error occured, view log at /Users/ruifeng.zheng/Dev/spark/dev/aexpy/cache/dist/__current__/with-branch-3.4/log.log .
   (spark_dev_310) ➜  spark git:(third-aexpy)     
   (spark_dev_310) ➜  spark git:(third-aexpy) cat /Users/ruifeng.zheng/Dev/spark/dev/aexpy/cache/dist/__current__/with-branch-3.4/log.log
   
   subprocess.run(args=(['git', 'worktree', 'prune'],), kwargs={'cwd': PosixPath('/Users/ruifeng.zheng/Dev/spark')})
   
   
   
   
   subprocess.run(args=(['git', 'worktree', 'add', '/Users/ruifeng.zheng/Dev/spark/dev/aexpy/cache/src/branch-3.4', 'branch-3.4'],), kwargs={'cwd': PosixPath('/Users/ruifeng.zheng/Dev/spark')})
   
   
   fatal: invalid reference: branch-3.4
   
   (spark_dev_310) ➜  spark git:(third-aexpy) 
   
   ```



##########
dev/aexpy/aexpy/__init__.py:
##########
@@ -0,0 +1,65 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# Original repository: https://github.com/StardustDL/aexpy
+# Copyright 2022 StardustDL <st...@163.com>

Review Comment:
   if we keep the license file, I think we don't need the two lines. also cc @srowen @HyukjinKwon 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org