Posted to reviews@spark.apache.org by "itholic (via GitHub)" <gi...@apache.org> on 2023/12/06 07:47:21 UTC

[PR] [SPARK-46112][BUILD][PYTHON] Implement lint check for PySpark custom errors [spark]

itholic opened a new pull request, #44203:
URL: https://github.com/apache/spark/pull/44203

   ### What changes were proposed in this pull request?
   
   This PR proposes to introduce a new script, `check_pyspark_custom_errors.py`, integrated into the `dev/lint-python` process. This script is designed to enforce the use of PySpark's custom error classes.
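   
   As a rough sketch of the rule the script applies (the real implementation is `dev/check_pyspark_custom_errors.py` in this PR, which also covers the Spark Connect errors), it walks the target Python files and flags any `raise` of a capitalized exception name that does not come from `pyspark.errors`:
   
   ```python
   # Simplified sketch of the check; see dev/check_pyspark_custom_errors.py
   # in this PR for the actual implementation.
   from pyspark import errors as pyspark_errors

   allowed = {name for name in dir(pyspark_errors) if not name.startswith("__")}

   def is_violation(line: str) -> bool:
       parts = line.strip().split()
       # Flag `raise SomeError(...)` where SomeError is not a PySpark custom error.
       return (
           len(parts) > 1
           and parts[0] == "raise"
           and parts[1][0].isupper()
           and not any(err in line for err in allowed)
       )
   ```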
   
   
   ### Why are the changes needed?
   
   To keep errors raised from PySpark consistent by systematically enforcing the use of PySpark's custom error classes.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. It is a development tool enhancement to ensure code quality and consistency in error handling within the PySpark codebase.
   
   For example, the `lint-python` script detects the exact locations where a PySpark custom error is not used and reports an error like the following:
   
   ```
   (pyspark-dev-env) haejoon.lee@NQ679Q495F spark % ./dev/lint-python
   starting python compilation test...
   python compilation succeeded.
   
   starting black test...
   black checks passed.
   
   starting PySpark custom errors check...
   PySpark custom errors check failed!
   
   PySpark custom errors check found issues in the following files:
   python/pyspark/sql/connect/utils.py:48: raise ImportError(
   python/pyspark/sql/connect/utils.py:52: raise ImportError(
   python/pyspark/sql/connect/utils.py:65: raise ImportError(
   python/pyspark/sql/connect/utils.py:77: raise ImportError(
   python/pyspark/sql/connect/client/retries.py:236: raise RetriesExceeded from exception
   python/pyspark/sql/pandas/serializers.py:741: raise ValueError(
   python/pyspark/sql/pandas/utils.py:35: raise ImportError(
   python/pyspark/sql/pandas/utils.py:39: raise ImportError(
   python/pyspark/sql/pandas/utils.py:60: raise ImportError(
   python/pyspark/sql/pandas/utils.py:65: raise ImportError(
   
   Please use PySpark custom errors defined in pyspark.errors.
   ```
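   
   As a concrete illustration of the fix this message asks for, a flagged built-in `raise` would be rewritten against one of the classes in `pyspark.errors`, roughly as below. The error class name and its parameters are hypothetical placeholders rather than values from the codebase, and the exact constructor keywords should be double-checked against `pyspark.errors`:
   
   ```python
   from pyspark.errors import PySparkValueError

   expected, actual = 3, 5  # example values for illustration only

   # Before (flagged by the check):
   #     raise ValueError(f"Number of columns does not match: {expected} vs {actual}")

   # After (passes the check). "NUM_COLUMNS_MISMATCH" and its parameters are
   # hypothetical placeholders for an entry in PySpark's error-class registry.
   raise PySparkValueError(
       error_class="NUM_COLUMNS_MISMATCH",
       message_parameters={"expected": str(expected), "actual": str(actual)},
   )
   ```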
   
   
   ### How was this patch tested?
   
   Run `dev/lint-python`
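   
   The check can presumably also be run on its own via the `--custom-pyspark-error` flag this PR adds to `dev/lint-python` (for example, `./dev/lint-python --custom-pyspark-error`); treat that invocation as a sketch inferred from the diff rather than documented usage.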
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


Re: [PR] [SPARK-46112][BUILD][PYTHON] Implement lint check for PySpark custom errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #44203:
URL: https://github.com/apache/spark/pull/44203#discussion_r1418176189


##########
dev/check_pyspark_custom_errors.py:
##########
@@ -0,0 +1,100 @@
+import sys
+
+sys.path.insert(0, "python")
+import os
+from pyspark import errors as pyspark_errors
+from pyspark.errors.exceptions import connect as pyspark_connect_errors
+
+
+def find_py_files(path, exclude_paths):
+    """
+    Find all .py files in a directory, excluding files in specified subdirectories.
+
+    Parameters
+    ----------
+    path : str
+        The base directory to search for .py files.
+    exclude_paths : list of str
+        A list of subdirectories to exclude from the search.
+
+    Returns
+    -------
+    list of str
+        A list of paths to .py files.
+    """
+    py_files = []
+    for root, dirs, files in os.walk(path):
+        if any(exclude_path in root for exclude_path in exclude_paths):
+            continue
+        for file in files:
+            if file.endswith(".py"):
+                py_files.append(os.path.join(root, file))
+    return py_files
+
+
+def check_errors_in_file(file_path, pyspark_error_list):
+    """
+    Check if a file uses PySpark-specific errors correctly.
+
+    Parameters
+    ----------
+    file_path : str
+        Path to the file to check.
+    pyspark_error_list : list of str
+        List of PySpark-specific error names.

Review Comment:
   Added



Re: [PR] [SPARK-46112][BUILD][PYTHON] Implement lint check for PySpark custom errors [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #44203:
URL: https://github.com/apache/spark/pull/44203#discussion_r1418161851


##########
dev/check_pyspark_custom_errors.py:
##########
@@ -0,0 +1,100 @@
+import sys
+
+sys.path.insert(0, "python")
+import os
+from pyspark import errors as pyspark_errors
+from pyspark.errors.exceptions import connect as pyspark_connect_errors
+
+
+def find_py_files(path, exclude_paths):
+    """
+    Find all .py files in a directory, excluding files in specified subdirectories.
+
+    Parameters
+    ----------
+    path : str
+        The base directory to search for .py files.
+    exclude_paths : list of str
+        A list of subdirectories to exclude from the search.
+
+    Returns
+    -------
+    list of str
+        A list of paths to .py files.
+    """
+    py_files = []
+    for root, dirs, files in os.walk(path):
+        if any(exclude_path in root for exclude_path in exclude_paths):
+            continue
+        for file in files:
+            if file.endswith(".py"):
+                py_files.append(os.path.join(root, file))
+    return py_files
+
+
+def check_errors_in_file(file_path, pyspark_error_list):
+    """
+    Check if a file uses PySpark-specific errors correctly.
+
+    Parameters
+    ----------
+    file_path : str
+        Path to the file to check.
+    pyspark_error_list : list of str
+        List of PySpark-specific error names.
+    """
+    errors_found = []
+    with open(file_path, "r") as file:
+        for line_num, line in enumerate(file, start=1):
+            if line.strip().startswith("raise"):
+                parts = line.split()
+                if len(parts) > 1 and parts[1][0].isupper():
+                    if not any(pyspark_error in line for pyspark_error in pyspark_error_list):
+                        errors_found.append(f"{file_path}:{line_num}: {line.strip()}")
+    return errors_found
+
+
+def check_pyspark_custom_errors(target_paths, exclude_paths):
+    """
+    Check PySpark-specific errors in multiple paths.
+
+    Parameters
+    ----------
+    target_paths : list of str
+        List of paths to check for PySpark-specific errors.
+    exclude_paths : list of str
+        List of paths to exclude from the check.
+    """
+    all_errors = []
+    for path in target_paths:
+        for py_file in find_py_files(path, exclude_paths):
+            file_errors = check_errors_in_file(py_file, pyspark_error_list)
+            all_errors.extend(file_errors)
+    return all_errors
+
+
+if __name__ == "__main__":
+    # PySpark-specific errors
+    pyspark_error_list = [error for error in dir(pyspark_errors) if not error.startswith("__")]
+    connect_error_list = [
+        error for error in dir(pyspark_connect_errors) if not error.startswith("__")
+    ]
+    internal_error_list = ["RetryException", "StopIteration"]
+    pyspark_error_list += connect_error_list
+    pyspark_error_list += internal_error_list
+
+    # Target paths and exclude paths
+    TARGET_PATHS = ["python/pyspark/sql"]
+    EXCLUDE_PATHS = [
+        "python/pyspark/sql/tests",
+        "python/pyspark/sql/connect/proto",
+    ]
+
+    # Check errors
+    errors_found = check_pyspark_custom_errors(TARGET_PATHS, EXCLUDE_PATHS)
+    if errors_found:
+        print("\nPySpark custom errors check found issues in the following files:")
+        for error in errors_found:
+            print(error)
+        print("\nPlease use PySpark custom errors defined in pyspark.errors.")

Review Comment:
   Maybe rephrase it as something like: "Use or create a new error ..."



Re: [PR] [SPARK-46112][BUILD][PYTHON] Implement lint check for PySpark custom errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #44203:
URL: https://github.com/apache/spark/pull/44203#discussion_r1418176014


##########
dev/check_pyspark_custom_errors.py:
##########
@@ -0,0 +1,100 @@
+import sys
+
+sys.path.insert(0, "python")
+import os
+from pyspark import errors as pyspark_errors
+from pyspark.errors.exceptions import connect as pyspark_connect_errors
+
+
+def find_py_files(path, exclude_paths):
+    """
+    Find all .py files in a directory, excluding files in specified subdirectories.
+
+    Parameters
+    ----------
+    path : str
+        The base directory to search for .py files.
+    exclude_paths : list of str
+        A list of subdirectories to exclude from the search.
+
+    Returns
+    -------
+    list of str
+        A list of paths to .py files.
+    """
+    py_files = []
+    for root, dirs, files in os.walk(path):
+        if any(exclude_path in root for exclude_path in exclude_paths):
+            continue
+        for file in files:
+            if file.endswith(".py"):
+                py_files.append(os.path.join(root, file))
+    return py_files
+
+
+def check_errors_in_file(file_path, pyspark_error_list):
+    """
+    Check if a file uses PySpark-specific errors correctly.
+
+    Parameters
+    ----------
+    file_path : str
+        Path to the file to check.
+    pyspark_error_list : list of str
+        List of PySpark-specific error names.
+    """
+    errors_found = []
+    with open(file_path, "r") as file:
+        for line_num, line in enumerate(file, start=1):
+            if line.strip().startswith("raise"):
+                parts = line.split()
+                if len(parts) > 1 and parts[1][0].isupper():

Review Comment:
   Sure. Added `# Check for 'raise' statement and ensure the error raised is a capitalized word.`



##########
dev/check_pyspark_custom_errors.py:
##########
@@ -0,0 +1,100 @@
+import sys
+
+sys.path.insert(0, "python")
+import os
+from pyspark import errors as pyspark_errors
+from pyspark.errors.exceptions import connect as pyspark_connect_errors
+
+
+def find_py_files(path, exclude_paths):
+    """
+    Find all .py files in a directory, excluding files in specified subdirectories.
+
+    Parameters
+    ----------
+    path : str
+        The base directory to search for .py files.
+    exclude_paths : list of str
+        A list of subdirectories to exclude from the search.
+
+    Returns
+    -------
+    list of str
+        A list of paths to .py files.
+    """
+    py_files = []
+    for root, dirs, files in os.walk(path):
+        if any(exclude_path in root for exclude_path in exclude_paths):
+            continue
+        for file in files:
+            if file.endswith(".py"):
+                py_files.append(os.path.join(root, file))
+    return py_files
+
+
+def check_errors_in_file(file_path, pyspark_error_list):
+    """
+    Check if a file uses PySpark-specific errors correctly.
+
+    Parameters
+    ----------
+    file_path : str
+        Path to the file to check.
+    pyspark_error_list : list of str
+        List of PySpark-specific error names.
+    """
+    errors_found = []
+    with open(file_path, "r") as file:
+        for line_num, line in enumerate(file, start=1):
+            if line.strip().startswith("raise"):
+                parts = line.split()
+                if len(parts) > 1 and parts[1][0].isupper():
+                    if not any(pyspark_error in line for pyspark_error in pyspark_error_list):
+                        errors_found.append(f"{file_path}:{line_num}: {line.strip()}")
+    return errors_found
+
+
+def check_pyspark_custom_errors(target_paths, exclude_paths):
+    """
+    Check PySpark-specific errors in multiple paths.
+
+    Parameters
+    ----------
+    target_paths : list of str
+        List of paths to check for PySpark-specific errors.
+    exclude_paths : list of str
+        List of paths to exclude from the check.

Review Comment:
   Added!



Re: [PR] [WIP][SPARK-46112][BUILD][PYTHON] Implement lint check for PySpark custom errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #44203:
URL: https://github.com/apache/spark/pull/44203#issuecomment-1842431421

   At least from my research before writing this script on my own, it seems there are no existing third-party linters that specifically address our custom error checking requirements in PySpark. IMHO the current implementation appears to be the most viable solution for now, but please let me know if there is any alternative way to support it!


Re: [PR] [SPARK-46112][BUILD][PYTHON] Implement lint check for PySpark custom errors [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #44203:
URL: https://github.com/apache/spark/pull/44203#issuecomment-1845703607

   Merged to master for Apache Spark 4.


Re: [PR] [SPARK-46112][BUILD][PYTHON] Implement lint check for PySpark custom errors [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #44203:
URL: https://github.com/apache/spark/pull/44203#discussion_r1418161417


##########
dev/check_pyspark_custom_errors.py:
##########
@@ -0,0 +1,100 @@
+import sys
+
+sys.path.insert(0, "python")
+import os
+from pyspark import errors as pyspark_errors

Review Comment:
   I think you should describe here that we shouldn't initialize a Spark Session or anything related to the JVM, because Spark is not built at this point and we shouldn't assume a Spark build is available.



##########
dev/check_pyspark_custom_errors.py:
##########
@@ -0,0 +1,100 @@
+import sys

Review Comment:
   add license



Re: [PR] [SPARK-46112][BUILD][PYTHON] Implement lint check for PySpark custom errors [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #44203:
URL: https://github.com/apache/spark/pull/44203#discussion_r1418160463


##########
dev/lint-python:
##########
@@ -46,6 +46,9 @@ while (( "$#" )); do
     --black)
       BLACK_TEST=true
       ;;
+    --custom-pyspark-error)

Review Comment:
   fix the doc above `cl_options`



Re: [PR] [WIP][SPARK-46112][BUILD][PYTHON] Implement lint check for PySpark custom errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #44203:
URL: https://github.com/apache/spark/pull/44203#issuecomment-1842273830

   Let me mark it as a draft for now, because this PR would fail until the remaining migration PRs below are merged:
   - https://github.com/apache/spark/pull/44202
   - https://github.com/apache/spark/pull/44176
   - https://github.com/apache/spark/pull/44147
   
   cc @HyukjinKwon @dongjoon-hyun FYI


Re: [PR] [SPARK-46112][BUILD][PYTHON] Implement lint check for PySpark custom errors [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #44203:
URL: https://github.com/apache/spark/pull/44203#discussion_r1418159534


##########
dev/check_pyspark_custom_errors.py:
##########
@@ -0,0 +1,100 @@
+import sys
+
+sys.path.insert(0, "python")
+import os
+from pyspark import errors as pyspark_errors
+from pyspark.errors.exceptions import connect as pyspark_connect_errors
+
+
+def find_py_files(path, exclude_paths):
+    """
+    Find all .py files in a directory, excluding files in specified subdirectories.
+
+    Parameters
+    ----------
+    path : str
+        The base directory to search for .py files.
+    exclude_paths : list of str
+        A list of subdirectories to exclude from the search.
+
+    Returns
+    -------
+    list of str
+        A list of paths to .py files.
+    """
+    py_files = []
+    for root, dirs, files in os.walk(path):
+        if any(exclude_path in root for exclude_path in exclude_paths):
+            continue
+        for file in files:
+            if file.endswith(".py"):
+                py_files.append(os.path.join(root, file))
+    return py_files
+
+
+def check_errors_in_file(file_path, pyspark_error_list):
+    """
+    Check if a file uses PySpark-specific errors correctly.
+
+    Parameters
+    ----------
+    file_path : str
+        Path to the file to check.
+    pyspark_error_list : list of str
+        List of PySpark-specific error names.
+    """
+    errors_found = []
+    with open(file_path, "r") as file:
+        for line_num, line in enumerate(file, start=1):
+            if line.strip().startswith("raise"):
+                parts = line.split()
+                if len(parts) > 1 and parts[1][0].isupper():
+                    if not any(pyspark_error in line for pyspark_error in pyspark_error_list):
+                        errors_found.append(f"{file_path}:{line_num}: {line.strip()}")
+    return errors_found
+
+
+def check_pyspark_custom_errors(target_paths, exclude_paths):
+    """
+    Check PySpark-specific errors in multiple paths.
+
+    Parameters
+    ----------
+    target_paths : list of str
+        List of paths to check for PySpark-specific errors.
+    exclude_paths : list of str
+        List of paths to exclude from the check.
+    """
+    all_errors = []
+    for path in target_paths:
+        for py_file in find_py_files(path, exclude_paths):
+            file_errors = check_errors_in_file(py_file, pyspark_error_list)
+            all_errors.extend(file_errors)
+    return all_errors
+
+
+if __name__ == "__main__":
+    # PySpark-specific errors
+    pyspark_error_list = [error for error in dir(pyspark_errors) if not error.startswith("__")]
+    connect_error_list = [
+        error for error in dir(pyspark_connect_errors) if not error.startswith("__")
+    ]
+    internal_error_list = ["RetryException", "StopIteration"]
+    pyspark_error_list += connect_error_list
+    pyspark_error_list += internal_error_list
+
+    # Target paths and exclude paths
+    TARGET_PATHS = ["python/pyspark/sql"]
+    EXCLUDE_PATHS = [
+        "python/pyspark/sql/tests",
+        "python/pyspark/sql/connect/proto",
+    ]
+
+    # Check errors
+    errors_found = check_pyspark_custom_errors(TARGET_PATHS, EXCLUDE_PATHS)
+    if errors_found:
+        print("\nPySpark custom errors check found issues in the following files:")

Review Comment:
   print it out to stderr



Re: [PR] [SPARK-46112][BUILD][PYTHON] Implement lint check for PySpark custom errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #44203:
URL: https://github.com/apache/spark/pull/44203#discussion_r1418176822


##########
dev/check_pyspark_custom_errors.py:
##########
@@ -0,0 +1,100 @@
+import sys
+
+sys.path.insert(0, "python")
+import os
+from pyspark import errors as pyspark_errors
+from pyspark.errors.exceptions import connect as pyspark_connect_errors
+
+
+def find_py_files(path, exclude_paths):
+    """
+    Find all .py files in a directory, excluding files in specified subdirectories.
+
+    Parameters
+    ----------
+    path : str
+        The base directory to search for .py files.
+    exclude_paths : list of str
+        A list of subdirectories to exclude from the search.
+
+    Returns
+    -------
+    list of str
+        A list of paths to .py files.
+    """
+    py_files = []
+    for root, dirs, files in os.walk(path):
+        if any(exclude_path in root for exclude_path in exclude_paths):
+            continue
+        for file in files:
+            if file.endswith(".py"):
+                py_files.append(os.path.join(root, file))
+    return py_files
+
+
+def check_errors_in_file(file_path, pyspark_error_list):
+    """
+    Check if a file uses PySpark-specific errors correctly.
+
+    Parameters
+    ----------
+    file_path : str
+        Path to the file to check.
+    pyspark_error_list : list of str
+        List of PySpark-specific error names.
+    """
+    errors_found = []
+    with open(file_path, "r") as file:
+        for line_num, line in enumerate(file, start=1):
+            if line.strip().startswith("raise"):
+                parts = line.split()
+                if len(parts) > 1 and parts[1][0].isupper():
+                    if not any(pyspark_error in line for pyspark_error in pyspark_error_list):
+                        errors_found.append(f"{file_path}:{line_num}: {line.strip()}")
+    return errors_found
+
+
+def check_pyspark_custom_errors(target_paths, exclude_paths):
+    """
+    Check PySpark-specific errors in multiple paths.
+
+    Parameters
+    ----------
+    target_paths : list of str
+        List of paths to check for PySpark-specific errors.
+    exclude_paths : list of str
+        List of paths to exclude from the check.
+    """
+    all_errors = []
+    for path in target_paths:
+        for py_file in find_py_files(path, exclude_paths):
+            file_errors = check_errors_in_file(py_file, pyspark_error_list)
+            all_errors.extend(file_errors)
+    return all_errors
+
+
+if __name__ == "__main__":
+    # PySpark-specific errors
+    pyspark_error_list = [error for error in dir(pyspark_errors) if not error.startswith("__")]
+    connect_error_list = [
+        error for error in dir(pyspark_connect_errors) if not error.startswith("__")
+    ]
+    internal_error_list = ["RetryException", "StopIteration"]
+    pyspark_error_list += connect_error_list
+    pyspark_error_list += internal_error_list
+
+    # Target paths and exclude paths
+    TARGET_PATHS = ["python/pyspark/sql"]
+    EXCLUDE_PATHS = [
+        "python/pyspark/sql/tests",
+        "python/pyspark/sql/connect/proto",
+    ]
+
+    # Check errors
+    errors_found = check_pyspark_custom_errors(TARGET_PATHS, EXCLUDE_PATHS)
+    if errors_found:
+        print("\nPySpark custom errors check found issues in the following files:")
+        for error in errors_found:
+            print(error)
+        print("\nPlease use PySpark custom errors defined in pyspark.errors.")

Review Comment:
   Rephrase:
   
   ```diff
   -        print("\nPlease use PySpark custom errors defined in pyspark.errors.")
   +        print("\nUse existing or create a new custom error from pyspark.errors.", file=sys.stderr)
   ```



Re: [PR] [SPARK-46112][BUILD][PYTHON] Implement lint check for PySpark custom errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #44203:
URL: https://github.com/apache/spark/pull/44203#discussion_r1418175863


##########
dev/check_pyspark_custom_errors.py:
##########
@@ -0,0 +1,100 @@
+import sys
+
+sys.path.insert(0, "python")
+import os
+from pyspark import errors as pyspark_errors
+from pyspark.errors.exceptions import connect as pyspark_connect_errors
+
+
+def find_py_files(path, exclude_paths):
+    """
+    Find all .py files in a directory, excluding files in specified subdirectories.
+
+    Parameters
+    ----------
+    path : str
+        The base directory to search for .py files.
+    exclude_paths : list of str
+        A list of subdirectories to exclude from the search.
+
+    Returns
+    -------
+    list of str
+        A list of paths to .py files.
+    """
+    py_files = []
+    for root, dirs, files in os.walk(path):
+        if any(exclude_path in root for exclude_path in exclude_paths):
+            continue
+        for file in files:
+            if file.endswith(".py"):
+                py_files.append(os.path.join(root, file))
+    return py_files
+
+
+def check_errors_in_file(file_path, pyspark_error_list):
+    """
+    Check if a file uses PySpark-specific errors correctly.
+
+    Parameters
+    ----------
+    file_path : str
+        Path to the file to check.
+    pyspark_error_list : list of str
+        List of PySpark-specific error names.
+    """
+    errors_found = []
+    with open(file_path, "r") as file:
+        for line_num, line in enumerate(file, start=1):
+            if line.strip().startswith("raise"):
+                parts = line.split()
+                if len(parts) > 1 and parts[1][0].isupper():
+                    if not any(pyspark_error in line for pyspark_error in pyspark_error_list):
+                        errors_found.append(f"{file_path}:{line_num}: {line.strip()}")
+    return errors_found
+
+
+def check_pyspark_custom_errors(target_paths, exclude_paths):
+    """
+    Check PySpark-specific errors in multiple paths.
+
+    Parameters
+    ----------
+    target_paths : list of str
+        List of paths to check for PySpark-specific errors.
+    exclude_paths : list of str
+        List of paths to exclude from the check.
+    """
+    all_errors = []
+    for path in target_paths:
+        for py_file in find_py_files(path, exclude_paths):
+            file_errors = check_errors_in_file(py_file, pyspark_error_list)
+            all_errors.extend(file_errors)
+    return all_errors
+
+
+if __name__ == "__main__":
+    # PySpark-specific errors
+    pyspark_error_list = [error for error in dir(pyspark_errors) if not error.startswith("__")]
+    connect_error_list = [
+        error for error in dir(pyspark_connect_errors) if not error.startswith("__")
+    ]
+    internal_error_list = ["RetryException", "StopIteration"]
+    pyspark_error_list += connect_error_list
+    pyspark_error_list += internal_error_list
+
+    # Target paths and exclude paths
+    TARGET_PATHS = ["python/pyspark/sql"]
+    EXCLUDE_PATHS = [
+        "python/pyspark/sql/tests",
+        "python/pyspark/sql/connect/proto",
+    ]
+
+    # Check errors
+    errors_found = check_pyspark_custom_errors(TARGET_PATHS, EXCLUDE_PATHS)
+    if errors_found:
+        print("\nPySpark custom errors check found issues in the following files:")

Review Comment:
   Added `sys.stderr` in all the required places.



Re: [PR] [SPARK-46112][BUILD][PYTHON] Implement lint check for PySpark custom errors [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #44203:
URL: https://github.com/apache/spark/pull/44203#discussion_r1418159779


##########
dev/check_pyspark_custom_errors.py:
##########
@@ -0,0 +1,100 @@
+import sys
+
+sys.path.insert(0, "python")
+import os
+from pyspark import errors as pyspark_errors
+from pyspark.errors.exceptions import connect as pyspark_connect_errors
+
+
+def find_py_files(path, exclude_paths):
+    """
+    Find all .py files in a directory, excluding files in specified subdirectories.
+
+    Parameters
+    ----------
+    path : str
+        The base directory to search for .py files.
+    exclude_paths : list of str
+        A list of subdirectories to exclude from the search.
+
+    Returns
+    -------
+    list of str
+        A list of paths to .py files.
+    """
+    py_files = []
+    for root, dirs, files in os.walk(path):
+        if any(exclude_path in root for exclude_path in exclude_paths):
+            continue
+        for file in files:
+            if file.endswith(".py"):
+                py_files.append(os.path.join(root, file))
+    return py_files
+
+
+def check_errors_in_file(file_path, pyspark_error_list):
+    """
+    Check if a file uses PySpark-specific errors correctly.
+
+    Parameters
+    ----------
+    file_path : str
+        Path to the file to check.
+    pyspark_error_list : list of str
+        List of PySpark-specific error names.
+    """
+    errors_found = []
+    with open(file_path, "r") as file:
+        for line_num, line in enumerate(file, start=1):
+            if line.strip().startswith("raise"):
+                parts = line.split()
+                if len(parts) > 1 and parts[1][0].isupper():

Review Comment:
   Can you add a comment which case this searches for?



Re: [PR] [WIP][SPARK-46112][BUILD][PYTHON] Implement lint check for PySpark custom errors [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #44203:
URL: https://github.com/apache/spark/pull/44203#issuecomment-1842416587

   I wonder if there are other third-party linter integrations for this ...


Re: [PR] [SPARK-46112][BUILD][PYTHON] Implement lint check for PySpark custom errors [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #44203:
URL: https://github.com/apache/spark/pull/44203#discussion_r1418160837


##########
dev/check_pyspark_custom_errors.py:
##########
@@ -0,0 +1,100 @@
+import sys
+
+sys.path.insert(0, "python")
+import os
+from pyspark import errors as pyspark_errors
+from pyspark.errors.exceptions import connect as pyspark_connect_errors
+
+
+def find_py_files(path, exclude_paths):
+    """
+    Find all .py files in a directory, excluding files in specified subdirectories.
+
+    Parameters
+    ----------
+    path : str
+        The base directory to search for .py files.
+    exclude_paths : list of str
+        A list of subdirectories to exclude from the search.
+
+    Returns
+    -------
+    list of str
+        A list of paths to .py files.
+    """
+    py_files = []
+    for root, dirs, files in os.walk(path):
+        if any(exclude_path in root for exclude_path in exclude_paths):
+            continue
+        for file in files:
+            if file.endswith(".py"):
+                py_files.append(os.path.join(root, file))
+    return py_files
+
+
+def check_errors_in_file(file_path, pyspark_error_list):
+    """
+    Check if a file uses PySpark-specific errors correctly.
+
+    Parameters
+    ----------
+    file_path : str
+        Path to the file to check.
+    pyspark_error_list : list of str
+        List of PySpark-specific error names.
+    """
+    errors_found = []
+    with open(file_path, "r") as file:
+        for line_num, line in enumerate(file, start=1):
+            if line.strip().startswith("raise"):
+                parts = line.split()
+                if len(parts) > 1 and parts[1][0].isupper():
+                    if not any(pyspark_error in line for pyspark_error in pyspark_error_list):
+                        errors_found.append(f"{file_path}:{line_num}: {line.strip()}")
+    return errors_found
+
+
+def check_pyspark_custom_errors(target_paths, exclude_paths):
+    """
+    Check PySpark-specific errors in multiple paths.
+
+    Parameters
+    ----------
+    target_paths : list of str
+        List of paths to check for PySpark-specific errors.
+    exclude_paths : list of str
+        List of paths to exclude from the check.

Review Comment:
   describe `Returns`



##########
dev/check_pyspark_custom_errors.py:
##########
@@ -0,0 +1,100 @@
+import sys
+
+sys.path.insert(0, "python")
+import os
+from pyspark import errors as pyspark_errors
+from pyspark.errors.exceptions import connect as pyspark_connect_errors
+
+
+def find_py_files(path, exclude_paths):
+    """
+    Find all .py files in a directory, excluding files in specified subdirectories.
+
+    Parameters
+    ----------
+    path : str
+        The base directory to search for .py files.
+    exclude_paths : list of str
+        A list of subdirectories to exclude from the search.
+
+    Returns
+    -------
+    list of str
+        A list of paths to .py files.
+    """
+    py_files = []
+    for root, dirs, files in os.walk(path):
+        if any(exclude_path in root for exclude_path in exclude_paths):
+            continue
+        for file in files:
+            if file.endswith(".py"):
+                py_files.append(os.path.join(root, file))
+    return py_files
+
+
+def check_errors_in_file(file_path, pyspark_error_list):
+    """
+    Check if a file uses PySpark-specific errors correctly.
+
+    Parameters
+    ----------
+    file_path : str
+        Path to the file to check.
+    pyspark_error_list : list of str
+        List of PySpark-specific error names.

Review Comment:
   Returns



Re: [PR] [SPARK-46112][BUILD][PYTHON] Implement lint check for PySpark custom errors [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #44203: [SPARK-46112][BUILD][PYTHON] Implement lint check for PySpark custom errors
URL: https://github.com/apache/spark/pull/44203


Re: [PR] [SPARK-46112][BUILD][PYTHON] Implement lint check for PySpark custom errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #44203:
URL: https://github.com/apache/spark/pull/44203#discussion_r1418176326


##########
dev/check_pyspark_custom_errors.py:
##########
@@ -0,0 +1,100 @@
+import sys
+
+sys.path.insert(0, "python")
+import os
+from pyspark import errors as pyspark_errors

Review Comment:
   Added:
   
   ```
   # No Spark Session or JVM-related initialization should be done here,
   # as Spark build is not assumed to be available at this point.
   ```



Re: [PR] [SPARK-46112][BUILD][PYTHON] Implement lint check for PySpark custom errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #44203:
URL: https://github.com/apache/spark/pull/44203#discussion_r1418175172


##########
dev/lint-python:
##########
@@ -46,6 +46,9 @@ while (( "$#" )); do
     --black)
       BLACK_TEST=true
       ;;
+    --custom-pyspark-error)

Review Comment:
   Fixed



Re: [PR] [SPARK-46112][BUILD][PYTHON] Implement lint check for PySpark custom errors [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #44203:
URL: https://github.com/apache/spark/pull/44203#discussion_r1418176488


##########
dev/check_pyspark_custom_errors.py:
##########
@@ -0,0 +1,100 @@
+import sys

Review Comment:
   Added license.


