Posted to reviews@spark.apache.org by "bjornjorgensen (via GitHub)" <gi...@apache.org> on 2023/04/08 19:27:13 UTC
[GitHub] [spark] bjornjorgensen commented on pull request #40525: [SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect
bjornjorgensen commented on PR #40525:
URL: https://github.com/apache/spark/pull/40525#issuecomment-1500961292
@itholic Thank you, great work :)
After this PR, `from pyspark import pandas as ps` now fails:
ModuleNotFoundError Traceback (most recent call last)
File /opt/spark/python/pyspark/sql/connect/utils.py:45, in require_minimum_grpc_version()
44 try:
---> 45 import grpc
46 except ImportError as error:
ModuleNotFoundError: No module named 'grpc'
The above exception was the direct cause of the following exception:
ImportError Traceback (most recent call last)
Cell In[1], line 11
9 import pyarrow
10 from pyspark import SparkConf, SparkContext
---> 11 from pyspark import pandas as ps
12 from pyspark.sql import SparkSession
13 from pyspark.sql.functions import col, concat, concat_ws, expr, lit, trim
File /opt/spark/python/pyspark/pandas/__init__.py:59
50 warnings.warn(
51 "'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to "
52 "set this environment variable to '1' in both driver and executor sides if you use "
(...)
55 "already launched."
56 )
57 os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
---> 59 from pyspark.pandas.frame import DataFrame
60 from pyspark.pandas.indexes.base import Index
61 from pyspark.pandas.indexes.category import CategoricalIndex
File /opt/spark/python/pyspark/pandas/frame.py:88
85 from pyspark.sql.window import Window
87 from pyspark import pandas as ps # For running doctests and reference resolution in PyCharm.
---> 88 from pyspark.pandas._typing import (
89 Axis,
90 DataFrameOrSeries,
91 Dtype,
92 Label,
93 Name,
94 Scalar,
95 T,
96 GenericColumn,
97 )
98 from pyspark.pandas.accessors import PandasOnSparkFrameMethods
99 from pyspark.pandas.config import option_context, get_option
File /opt/spark/python/pyspark/pandas/_typing.py:25
22 from pandas.api.extensions import ExtensionDtype
24 from pyspark.sql.column import Column as PySparkColumn
---> 25 from pyspark.sql.connect.column import Column as ConnectColumn
26 from pyspark.sql.dataframe import DataFrame as PySparkDataFrame
27 from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame
File /opt/spark/python/pyspark/sql/connect/column.py:19
1 #
2 # Licensed to the Apache Software Foundation (ASF) under one or more
3 # contributor license agreements. See the NOTICE file distributed with
(...)
15 # limitations under the License.
16 #
17 from pyspark.sql.connect.utils import check_dependencies
---> 19 check_dependencies(__name__)
21 import datetime
22 import decimal
File /opt/spark/python/pyspark/sql/connect/utils.py:35, in check_dependencies(mod_name)
33 require_minimum_pandas_version()
34 require_minimum_pyarrow_version()
---> 35 require_minimum_grpc_version()
File /opt/spark/python/pyspark/sql/connect/utils.py:47, in require_minimum_grpc_version()
45 import grpc
46 except ImportError as error:
---> 47 raise ImportError(
48 "grpc >= %s must be installed; however, " "it was not found." % minimum_grpc_version
49 ) from error
50 if LooseVersion(grpc.__version__) < LooseVersion(minimum_grpc_version):
51 raise ImportError(
52 "gRPC >= %s must be installed; however, "
53 "your version was %s." % (minimum_grpc_version, grpc.__version__)
54 )
ImportError: grpc >= 1.48.1 must be installed; however, it was not found.
`pip install grpc`
Collecting grpc
Downloading grpc-1.0.0.tar.gz (5.2 kB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [6 lines of output]
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-install-vp4d8s4c/grpc_c0f1992ad8f7456b8ac09ecbaeb81750/setup.py", line 33, in <module>
raise RuntimeError(HINT)
RuntimeError: Please install the official package with: pip install grpcio
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
Note: you may need to restart the kernel to use updated packages.
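(As the hint in the output says, the `grpc` name on PyPI is just a placeholder package that raises this error; the actual gRPC Python package is published as `grpcio`.)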
After `pip install grpcio`, it works.
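For anyone hitting the same thing, a quick sanity check (illustration only; 1.48.1 is the minimum version the error message above asks for):
# Illustration only: confirm the import chain works once grpcio is installed.
import grpc
print(grpc.__version__)  # the check in pyspark/sql/connect/utils.py wants >= 1.48.1

from pyspark import pandas as ps

psdf = ps.DataFrame({"x": [1, 2, 3]})  # spins up a default SparkSession if needed
print(psdf.sum())
(I believe `pip install "pyspark[connect]"` would also pull in the right gRPC packages, but I have not verified that here.)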
I don't think every pandas user who tries the pandas API on Spark will use Spark Connect. So can we change this back?
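From the traceback, the import that drags in the grpc requirement is `from pyspark.sql.connect.column import Column as ConnectColumn` in `pyspark/pandas/_typing.py`, which runs `check_dependencies()` at import time. If a full revert is not wanted, maybe something along these lines could work instead (just a sketch from my side, not a patch; the real `GenericColumn` definition may differ):
# Sketch only: make the Spark Connect import in pyspark/pandas/_typing.py
# optional, so plain pandas-on-Spark stays importable without grpcio.
from typing import Union

from pyspark.sql.column import Column as PySparkColumn

try:
    # This is the import that currently fails via check_dependencies() -> grpc.
    from pyspark.sql.connect.column import Column as ConnectColumn
except ImportError:
    ConnectColumn = None  # Spark Connect extras (grpcio etc.) not installed

if ConnectColumn is not None:
    GenericColumn = Union[PySparkColumn, ConnectColumn]
else:
    # Fall back to the classic Column type only.
    GenericColumn = PySparkColumn
That way classic users are not forced to install grpcio just to import `pyspark.pandas`, while Connect users who do have grpcio installed still get both Column types.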
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org